Data scientist has been called "the sexiest job of the 21st century" by the Harvard Business Review.
The first part of the course covers the basics of data science and supervised learning: a general introduction to data analysis, modeling, and the algorithms of data science, with a special focus on supervised learning methods.
Lectures are supplemented by problem-solving sessions, Python programming exercises, and student projects in small teams.
Aim of the Course:
The aim of the course is to provide a comprehensive introduction to data science with a focus on machine learning. By the end of the course, students will be able to choose the right algorithms for data science problems and to build, implement, and evaluate machine learning models. The course also aims to provide the knowledge and skills needed to excel in a job interview for a junior data scientist position.
Prerequisites:
Basics of linear algebra (basic matrix operations, solving systems of linear equations, equations of lines and planes)
Basics of multivariate calculus (partial derivatives, gradient, finding maxima and minima of uni- and multivariate functions)
Basics of probability (conditional probability, Bayes' theorem, correlation, covariance, binomial distribution, normal distribution)
Basics of Python programming
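As a quick refresher on the probability prerequisite, Bayes' theorem states that P(A|B) = P(B|A)·P(A) / P(B). A minimal sketch with made-up illustrative numbers (a diagnostic-test scenario, not part of the course material):

```python
# Hypothetical numbers: a test with 99% sensitivity, a 5% false-positive
# rate, and a condition affecting 1% of the population.
p_pos_given_cond = 0.99   # P(positive | condition)
p_pos_given_no = 0.05     # P(positive | no condition)
p_cond = 0.01             # P(condition), the prior

# Law of total probability: overall chance of a positive test
p_pos = p_pos_given_cond * p_cond + p_pos_given_no * (1 - p_cond)

# Bayes' theorem: posterior P(condition | positive)
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos
print(round(p_cond_given_pos, 3))  # about 0.167
```

Despite the accurate test, the posterior is only about 17%, because the prior is small; this kind of reasoning recurs throughout the Naive Bayes material.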
Syllabus:
- Introduction to Data Science: Concept, history, and process (CRISP-DM) of data science, the goal of data science and its applications. Attributes, datasets, Big Data, machine learning tasks.
- Data exploration, preparation and similarity measures: Data preparation, exploratory analysis, data visualization, summary statistics, sampling, attribute aggregation, transformation, and discretization. Minkowski distance, Mahalanobis distance, cosine similarity, SMC, Jaccard index, Hamming distance, DTW.
- kNN and Decision Tree: Method of nearest neighbors and its accelerations (k-d tree), Bayes classifier, decision tree, Hunt's algorithm, split purity, impurity metrics, validation.
- Overfitting, validation: Generalization, training, test, and validation sets. Cross-validation, under- and overfitting, Occam's razor, confusion matrix, performance indicators, ROC, AUC.
- Naive Bayes: Naive Bayes classifier, maximum a posteriori and maximum likelihood estimation, estimation with normal distribution, Laplace and m-estimation.
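A few of the similarity and distance measures from the syllabus can be computed directly with SciPy's distance module. A short sketch on toy vectors (the data is purely illustrative):

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0, 2.0, 3.0])
y = np.array([1.0, 1.0, 0.0, 3.0])

# Minkowski distance: p=1 is the Manhattan distance, p=2 the Euclidean
manhattan = distance.minkowski(x, y, p=1)
euclidean = distance.minkowski(x, y, p=2)

# Cosine similarity = 1 - cosine distance
cos_sim = 1 - distance.cosine(x, y)

# Jaccard index on binary attributes = 1 - Jaccard distance
a = [1, 0, 1, 1]
b = [1, 1, 0, 1]
jaccard = 1 - distance.jaccard(a, b)

# Hamming distance (SciPy returns the fraction of differing positions)
hamming = distance.hamming(a, b) * len(a)
```

Note the convention differences: SciPy implements the *distance* versions, so the cosine and Jaccard similarities are obtained by subtracting from one.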
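Several syllabus items (kNN, decision trees, train/test split, cross-validation, and the confusion matrix) come together in a few lines of scikit-learn. A minimal sketch on the bundled Iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# k nearest neighbors classifier
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Decision tree using Gini impurity as the split criterion
tree = DecisionTreeClassifier(criterion="gini", random_state=42)
tree.fit(X_train, y_train)

# 5-fold cross-validation estimates generalization accuracy
cv_scores = cross_val_score(tree, X, y, cv=5)

# Confusion matrix on the held-out test set
cm = confusion_matrix(y_test, knn.predict(X_test))
```

Holding out a test set and cross-validating are the standard guards against the overfitting discussed above: accuracy on the training data alone says little about generalization.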
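For the Naive Bayes topic, "estimation with normal distribution" corresponds to scikit-learn's GaussianNB, which fits a per-class mean and variance for every attribute. A sketch on a tiny made-up dataset:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Illustrative data only: two attributes, two well-separated classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.0],
              [3.0, 0.9], [3.2, 1.1], [2.9, 1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)

# Posterior class probabilities P(class | x) for a new point
probs = model.predict_proba([[1.1, 2.0]])

# Predicted class label for another point
pred = model.predict([[3.1, 1.0]])
```

The "naive" assumption is that attributes are conditionally independent given the class, which lets the joint likelihood factor into a product of one-dimensional normal densities.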
Python: pandas, scikit-learn, NumPy, SciPy, matplotlib, IPython, Keras, TensorFlow, BeautifulSoup, Selenium.
Topics: arrays, web scraping, data gathering (API), data import (CSV, JSON, XML/HTML), classification and regression tasks
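The data-import and regression topics combine naturally with pandas and scikit-learn. A small self-contained sketch (the column names and values are made up; in practice the CSV would come from a file via `pd.read_csv("somefile.csv")`):

```python
import io
import pandas as pd
from sklearn.linear_model import LinearRegression

# Inline CSV so the example is self-contained; normally read from disk
csv_data = io.StringIO("size,price\n50,150\n70,210\n90,270\n110,330\n")
df = pd.read_csv(csv_data)

# Fit a linear regression: price as a function of size
model = LinearRegression().fit(df[["size"]].values, df["price"].values)

# Predict the price for a new size value
predicted = model.predict([[100]])
```

The same `fit`/`predict` interface carries over to the classification models covered in the course, which is much of scikit-learn's appeal for teaching.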
Method of instruction:
- Lectures (presentations)
- Problem-solving sessions (handouts)
- Programming sessions (IPython notebooks)
Assessment:
Homework assignments (biweekly) in Python (50%)
Final test (50%)
A sample final test is available here.
Recommended literature:
Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to Data Mining. Pearson Addison-Wesley, 2005.
Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, 2014.
Roland Molontay (born 1991) obtained his PhD in network and data science from the Budapest University of Technology and Economics (BME). He was a visiting PhD student at Brown University in 2016. He currently holds a research position at the MTA-BME Stochastics Research Group and teaches mathematics and data science to undergraduate and graduate students at BME. Over the years he has participated in many successful data-intensive R&D projects with renowned companies such as NOKIA-Bell Labs. In 2020 he was awarded the Gyula Farkas Memorial Prize for his outstanding work in applied mathematics. He is the founder and leader of the Human and Social Data Science Lab at BME.