Data Mining I
Topic outline
Important Dates
- Test #1: November 5th, 11:30-13:00 @ FC6 157 and FC6 163; 14:00-15:50 @ FC6 165
- Test #2: December 17th, 11:30-13:00 @ FC6 157 and FC6 163; 14:00-15:50 @ FC6 165
- Project Launch: September 25th
- Project Submission: December 11th (submissions allowed until 23:59)
- Project Presentation: December 13th (place and time to be announced)
Links to Data Mining books used in this course:
- A Gentle Introduction to Data Analytics (João Moreira, André Carvalho, Tomáš Horváth)
Presentations: December 13th, 2:00-4:45pm, place: FC6 137 (S1)
Grades after Exam - normal season:
- Exam results
- Comments and evaluation of homework and practical assignments (max 1 point)
Paper: Data Quality and Integration Issues in Electronic Health Records
by: Ricardo João Cruz-Correia, Pedro Pereira Rodrigues, Alberto Freitas, Filipa Canario Almeida, Rong Chen, and Altamiro Costa-Pereira
- What are the main kinds of data errors discussed in this paper?
- Would it be possible to automatically identify and categorize those errors?
- What are missing values?
- How are missing values categorized?
- What are the main methods used to handle missing values? (see the sketch after this list)
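As a complement to the last question, here is a minimal sketch of common ways to handle missing values (deletion and statistic imputation), using pandas and scikit-learn. The toy DataFrame and its columns are invented for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (columns invented for illustration).
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "income": [3200, 2800, np.nan, 4100],
})

# 1) Listwise deletion: drop every row that has any missing value.
dropped = df.dropna()

# 2) Statistic imputation with pandas: replace NaNs by column means.
filled = df.fillna(df.mean())

# 3) The same idea with scikit-learn, reusable inside a pipeline.
imputed = SimpleImputer(strategy="median").fit_transform(df)

print(dropped, filled, imputed, sep="\n\n")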
Read the three suggested papers and answer:
- What are the problems related to evaluating models in the context of cross-validation? Discuss micro-averaging and macro-averaging and their impact on the evaluation of cross-validated models (see the sketch after this list).
- What is the relationship between ROC and PR curves? Why does the second paper argue that ROC produces overly optimistic results when the class is imbalanced?
- What is the argument of paper #3 when it says that it is OK to use ROC with imbalanced classes?
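The sketch below illustrates, on synthetic scores invented for illustration, how micro- and macro-averaged F1 can disagree and how ROC AUC can look more flattering than the PR-based average precision under class imbalance. It does not reproduce any experiment from the papers.

import numpy as np
from sklearn.metrics import (f1_score, roc_auc_score,
                             average_precision_score)

# Invented ground truth and scores for an imbalanced problem.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, size=1000)              # ~10% positives
y_score = np.clip(0.4 * y_true + rng.normal(0.3, 0.2, 1000), 0, 1)
y_pred = (y_score > 0.5).astype(int)

# Micro-averaging pools all decisions; macro-averaging averages
# per-class scores, so the rare class weighs as much as the common one.
print("micro F1:", f1_score(y_true, y_pred, average="micro"))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))

# With few positives, ROC AUC can look flattering while the
# PR-based summary (average precision) stays more modest.
print("ROC AUC:", roc_auc_score(y_true, y_score))
print("Average precision (PR):", average_precision_score(y_true, y_score))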
Week #1 (Sep 24th)
Week #2 (Oct 1st)
Week #3 (Oct 8th)
Week #4 (Oct 15th)
Week #5 (Oct 22nd)
Week #6 (Oct 31st)
NO CLASS (FCUP activities)
Week #7 (Nov 5th)
- Basic Concepts in Classification
- Review for TEST #1
Week #8 (Nov 12th)
Week #9 (Nov 19th)
Week #10 (Nov 26th)
Week #11 (Dec 3rd)
Section 2.6.1.4 of this dissertation has a detailed and clear explanation of SVMs.
Week #12 (Dec 10th)
Clustering
Week #13 (Dec 17th)
Review for Test #2
Week #1: Introduction to Pandas
Our first practice will be an introduction to Pandas, a Python library for data pre-processing and data analysis. Before you start this tutorial, it may be helpful to have a version of Python installed and to get used to the interface, if you do not wish to use Colab notebooks.
In this class, we will follow exercises 01 to 09. Don't worry if you cannot finish all the exercises during class. You will be able to revisit them in future classes.
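If you want a quick warm-up before the exercises, the sketch below shows a few core pandas operations. The file name data.csv and the columns name and value are hypothetical.

import pandas as pd

# Load a small CSV into a DataFrame (file name is hypothetical).
df = pd.read_csv("data.csv")

# First look at the data: shape, column types, summary statistics.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Select columns, filter rows, and derive a new column
# (column names are hypothetical).
subset = df[df["value"] > 0][["name", "value"]]
df["value_squared"] = df["value"] ** 2

# Group and aggregate.
print(df.groupby("name")["value"].mean())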
Week #2: Exploring a dataset
- This is a colab notebook where you will have a chance of practicing data exploration with pandas.
Week #3: Distances, correlation, entropy, mutual information
- This is a colab notebook where you will have a chance of practicing with the various distance calculations either implementing them yourself or using implementations found in Python libraries.
Some questions you should answer:
- Give examples of suitable domains to apply Euclidean, Minkowski, Mahalanobis and Cosine distances
- What is the advantage of using each one of these distances?
- Why is it convenient to normalize data when calculating distances?
- Which distances are sensitive to non-normalized data?
- What is Bray-Curtis distance and what is it good for?
- What is the meaning and utility of a distance between:
- two objects?
- two features?
- What is the difference between distances, correlation, entropy and mutual information? (see the sketch after this block)
A good guide to the use of correlation between variables of different types: https://datascientest.com/en/calculate-correlation-between-two-variables-how-do-you-measure-dependence
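A minimal sketch of some of these measures, using SciPy and scikit-learn on small arrays invented for illustration (Mahalanobis is omitted since it additionally needs a covariance matrix):

import numpy as np
from scipy.spatial import distance
from sklearn.metrics import mutual_info_score

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 1.0])

# Distances between two objects (feature vectors).
print("Euclidean:", distance.euclidean(x, y))
print("Minkowski (p=3):", distance.minkowski(x, y, p=3))
print("Cosine distance:", distance.cosine(x, y))
print("Bray-Curtis:", distance.braycurtis(x, y))

# Correlation between two features across objects.
print("Pearson r:", np.corrcoef(x, y)[0, 1])

# Entropy and mutual information work on discrete distributions,
# so continuous features are usually discretized first.
a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([0, 1, 1, 1, 2, 0])
p = np.bincount(a) / len(a)
print("Entropy of a (bits):", -(p * np.log2(p)).sum())
print("MI(a, b) (nats):", mutual_info_score(a, b))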
Week #4: Data Visualization, sampling, dimensionality reduction, data imputation, data transformation
Some questions you should answer:
- Why is it necessary to reduce the dimension of the data? What does it mean?
- Is it always convenient to impute missing data? Give examples of when it is not convenient to impute missing data.
- What is the objective of a pairplot?
- What is a boxplot and when should it be used?
- Discuss different ways of transforming data: discretization (various ways), binarization, label encoding, etc. (see the sketch after this list)
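A minimal sketch of dimensionality reduction and two common transformations on the iris dataset, using scikit-learn; the component count and bin count are illustrative choices:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import KBinsDiscretizer, LabelEncoder

X, y = load_iris(return_X_y=True)

# Dimensionality reduction: project the 4 features onto 2 principal
# components while keeping most of the variance.
pca = PCA(n_components=2)
X2 = pca.fit_transform(X)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Discretization: bin each continuous feature into 3 ordinal levels.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
X_binned = disc.fit_transform(X)

# Label encoding: map string categories to integers.
colors = ["red", "green", "red", "blue"]
print(LabelEncoder().fit_transform(colors))   # [2 1 2 0]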
Week #5: Data Visualization, sampling, dimensionality reduction, data imputation, data transformation
(continuation of previous week with focus on data imputation and data visualization)
Some questions you should answer: the same questions as in Week #4, now with particular attention to data imputation and data visualization.
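A minimal visualization sketch with seaborn's built-in iris dataset; saving to PNG files (the names pairplot.png and boxplot.png are arbitrary) keeps it runnable outside notebooks:

import seaborn as sns
import matplotlib.pyplot as plt

# Iris ships with seaborn, so this sketch is self-contained.
iris = sns.load_dataset("iris")

# Pairplot: pairwise scatter plots plus per-feature distributions,
# useful for spotting correlated features and class separation.
sns.pairplot(iris, hue="species")
plt.savefig("pairplot.png")

# Boxplot: median, quartiles and outliers of one feature per class.
plt.figure()
sns.boxplot(data=iris, x="species", y="petal_length")
plt.savefig("boxplot.png")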
Week #6:
NO CLASS (FCUP activities)
Week #7:
TEST #1
Week #8:
Decision Trees
- Apply the Python DecisionTreeClassifier to the iris dataset
- Use Decision Trees on the German credit dataset.
- What is the misclassification error of DT on the training data?
- Take one example and explain how the DT obtains the classification.
- Plot the tree.
- Try different pruning approaches, obtaining trees from 1 node to maximum size.
- Try to identify the possible cutpoints of the duration attribute.
Naive Bayes
Repeat the same steps above, now using a naive Bayes classifier. For the iris dataset, you will need to discretize the variables. For the German credit dataset you have three options: train with only categorical features, with only numerical features, or with all features. For each one you need to use a different Python package (see the corresponding notebook below).
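A minimal sketch of these steps on iris with scikit-learn; max_depth and the ccp_alpha grid are illustrative choices, and GaussianNB is used for the naive Bayes part since it handles continuous features directly (the discretized variant is left for the notebook):

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Decision tree: fit, measure training misclassification error,
# and inspect the learned splits (cutpoints) as text.
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("DT training error:", 1 - dt.score(X, y))
print(export_text(dt, feature_names=load_iris().feature_names))

# Pruning via cost-complexity: larger ccp_alpha gives smaller trees.
for alpha in [0.0, 0.01, 0.05, 0.2]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(f"ccp_alpha={alpha}: {pruned.tree_.node_count} nodes")

# Naive Bayes on the same continuous features: GaussianNB models each
# feature with a per-class normal, so no discretization is needed here.
nb = GaussianNB().fit(X, y)
print("NB training error:", 1 - nb.score(X, y))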
Week #9:
Intermediate assignment presentation.
Week #10: Decision Boundaries and Performance Evaluation of Classifiers
Week #11: Regression and SVMs
Week #12: Clustering and Ensemble models
Here is some Python code applying hierarchical clustering to the iris dataset.
Explore the various options of clustering, including k-means, k-means++, and DBSCAN. Identify the differences between these clustering methods.
Apply these methods and evaluate the quality of the generated clusters using your favorite dataset.
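A minimal sketch comparing these options on standardized iris data; the DBSCAN parameters eps and min_samples are illustrative guesses, and the silhouette score is one simple way to compare cluster quality:

from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)   # distances are scale-sensitive

models = {
    "k-means (random init)": KMeans(n_clusters=3, init="random", n_init=10, random_state=0),
    "k-means++": KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0),
    "hierarchical (Ward)": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.8, min_samples=5),
}

for name, model in models.items():
    labels = model.fit_predict(X)
    # Silhouette needs at least 2 labels; note that DBSCAN's noise
    # label (-1) is counted as its own cluster in this rough comparison.
    if len(set(labels)) > 1:
        print(f"{name}: silhouette = {silhouette_score(X, labels):.3f}")
    else:
        print(f"{name}: a single cluster, silhouette undefined")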
Week #13:
TEST #2
Subjects:
- Machine learning concepts. Supervised and unsupervised learning
- Classification: Decision trees, Logistic Regression, SVMs, Bayesian networks, Neural Networks
- Regression: linear and ridge regression, regularization
- Evaluation metrics for classification and regression
- Model validation
- Cluster analysis
Grades of Test #2