Data Mining, Part 1

Course Title: 

Data Mining: Models and Algorithms

Instructors: 

András A. Benczúr and András Lukács

Duration:

Weeks 1-7, 2x2 hours, 2 credits

Short Description of the Course:

Knowledge discovery in databases (KDD) comprises general techniques for searching data sets and predicting future properties. Data mining concentrates mainly on models for understanding data and for predicting unknown properties, and it is designed to process data of very large volume and complexity (dimension). A special issue of MIT’s Technology Review listed data mining as one of the ten most promising disciplines of the first decade of the new millennium, an opinion repeated in the cover story of Business Week magazine on 26 Jan 2006. This strong interest in data mining, and its growing importance, stem from the increasing number of successful applications in industry, especially in the tertiary (service) and quaternary (knowledge-based) sectors.

This course offers a general introduction to data mining. Standard techniques, models, and algorithms of the field will be discussed. Moreover, the course provides a good base for its follow-up course, Data Mining in Bioinformatics. Lectures are supplemented by computer exercises and student projects in small teams.

Aim of the Course:

The aim of the course is to provide a basic but comprehensive introduction to data mining. By the end of the course, students will be able to build models, choose algorithms, and implement and evaluate them.

Prerequisites:

The course requires basic knowledge of calculus, probability theory, and linear algebra. Familiarity with graphs and basic algorithms is an advantage.

Detailed Program and Class Schedule:

1. Motivations for data mining. Examples of application domains. Methodology of knowledge discovery in databases (KDD) and data mining (DM). Formulation of the main problems of data mining.

2. Understanding data: preparation and exploration. Sampling.

3. Basics of classification. Concepts of training and prediction. Decision trees.

4. Models and algorithms for classification: k-NN, naïve Bayes. Measuring the quality of classification models and comparing them.

5. Introduction to the WEKA data mining software. Classification with WEKA.

6. More models and algorithms for classification: neural networks, linear separation methods, support vector machines (SVM).

7. Feature selection: filter and wrapper methods. Midterm test.

8. Basics of cluster analysis. Types of variables, measuring similarity and distance. Partitioning clustering algorithms: k-means, k-medoids.

9. Hierarchical clustering algorithms. Density-based clustering: DBSCAN, OPTICS. Cluster analysis with WEKA.

10. Introduction to frequent itemset mining. Applications for finding association rules.

11. Level-wise algorithms, APRIORI. Partitioning and Toivonen algorithms.

12. Pattern growth methods, FP-growth. Constraint handling.

13. Hierarchical and general association rules. Pattern mining with WEKA.

14. Sequential and subgraph patterns. Final test.
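
As a small taste of the level-wise approach named in session 11, the following Python sketch mines frequent itemsets in the spirit of APRIORI. It is only an illustration under stated assumptions, not course material: the function name `apriori`, the `min_support` parameter (an absolute count), and the toy baskets are all invented here, and the course exercises themselves use WEKA.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate itemset,
        # then keep only those reaching the support threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Toy data, invented for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
result = apriori(baskets, min_support=2)
```

With these four baskets and a minimum support of 2, every single item and every pair is frequent, while the three-item set appears in only one basket and is discarded. The prune step is what makes the search level-wise: a candidate is counted against the data only if all of its smaller subsets were already found frequent.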

Method of Instruction:

Handouts, PowerPoint presentations, relevant research papers, a course web page, mailing list, and wiki. Regular weekly office hours for consultation.

Textbooks:

Jiawei Han and Micheline Kamber: Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers, 2006.

Pang-Ning Tan, Michael Steinbach, Vipin Kumar: Introduction to Data Mining, Addison-Wesley, 2006.

T. Hastie, R. Tibshirani, J. H. Friedman: The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001.

Instructors’ Bios:

András Benczúr (born 1969) is a senior researcher at the Computer Science and Automation Institute of the Hungarian Academy of Sciences (MTA SZTAKI). He is a co-founder of the Data Mining and Web Search Group and head of the Informatics Laboratory. He has been teaching Algorithms and Web Information Retrieval at Eötvös Loránd University and Statistics at Central European University (CEU), Budapest.

He received his Ph.D. degree at MIT, US in 1997. His primary research areas are information retrieval, data mining and algorithms. He has been awarded the “Young Researcher Award” and the “Béla Gyires Award” of the Hungarian Academy of Sciences. He won a “Yahoo! Faculty Research Grant” in 2006. Benczúr’s group won 1st place at the KDD Cup of the ACM in 1997.

He is the author or co-author of more than 30 refereed research papers with over 200 citations. He has served as coordinator and/or principal researcher of several national and international information retrieval and data mining projects.

András Lukács (born 1968) is a senior researcher at the Computer Science and Automation Institute of the Hungarian Academy of Sciences (MTA SZTAKI). He is a co-founder and head of the Data Mining and Web Search Group. He has been teaching Data Mining at Eötvös Loránd University, Budapest. He also helped create a new major in Applied Mathematics and introduce Data Mining courses.

He received his Candidate degree (~Ph.D.) in Mathematics from the Hungarian Academy of Sciences in 1998, and then took a postdoctoral position at CWI, Amsterdam. His primary research areas are data mining and combinatorics. In 1996-98 he served as managing editor of the international journal Combinatorica. He is also interested in applying data mining and mathematical modeling to the web, telecommunications, social networks, and molecular biology. He is the author or co-author of more than 15 refereed research papers with over 70 citations. He has been coordinator and/or principal researcher of several national and international data mining projects in the pharmacology, telecommunications, finance, and homeland security industries.

 
