Instructor(s): Roland Molontay
Weeks: 7-14
Contact hours: 2x2 hours
Credit: 2 credits

Short description:
Harvard Business Review called data scientist "the sexiest job of the 21st century."
In the second part of the course, we learn advanced supervised learning techniques, including neural networks and ensemble methods, together with unsupervised learning techniques (especially clustering). Students will have the option to define their own data science projects and work on them in teams during the semester.
Lectures are supplemented by problem-solving sessions, Python programming exercises and student projects in small teams.

Aim of the Course:
The aim of the course is to provide a comprehensive introduction to data science with a focus on machine learning. By the end of the course, students will be able to choose the right algorithms for data science problems to build, implement and evaluate machine learning models. Students will also be able to analyze real-world data sets using complex data science methods.
The course also aims to provide the knowledge and skills needed to excel in a job interview for a junior data scientist position.

Prerequisites:
Basics of linear algebra (basic matrix operations, solving systems of linear equations, equations of lines and planes)
Basics of multivariate calculus (partial derivatives, gradient, finding maxima and minima of uni- and multivariate functions)
Basics of probability (conditional probability, Bayes' theorem, correlation, covariance, binomial distribution, normal distribution)
Basics of Python programming

Syllabus:

  1. Linear regression: Parametric and nonparametric regression, kNN and decision trees for regression tasks, MSE, decomposition of MSE into bias and variance, bias–variance tradeoff, the optimal solution of regression, linear regression, gradient descent, stochastic gradient descent, learning rate, regularization, polynomial regression, interpreting linear regression models.
  2. Logistic regression and SVM: Classification by regression, sigmoid function, logistic regression, linear separability, non-linear decision boundary, logit model, maximal margin, support vectors and SVM.
  3. Neural networks: Biological motivation, activation function, perceptron and its relation to other algorithms, representing Boolean functions with neural networks, deep learning, forward propagation, backpropagation.
  4. Ensemble learning: Ensemble methods, bagging, metamodels, boosting and AdaBoost, gradient boosting, Random Forest, semi-supervised learning, classification of imbalanced data, SMOTE.
  5. Cluster analysis: Concept, types, clustering algorithms, k-means algorithm, hierarchical clustering, distance of clusters, single-linkage and complete-linkage clustering, DBSCAN algorithm, core, border and noise points, validation of clustering (distance matrix, SSE, silhouette).
  6. Recommendation systems: Content-based recommender, collaborative filtering, user-based and k-nearest neighbors recommender, latent factor recommender system, matrix factorization.
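
To give a flavor of the first syllabus topic, the sketch below fits a linear regression by batch gradient descent on the MSE. It is only an illustration and is not taken from the course materials: the function name `fit_linear_regression_gd`, the synthetic data, and the chosen learning rate and iteration count are all assumptions made for this example.

```python
import numpy as np


def fit_linear_regression_gd(X, y, learning_rate=0.1, n_iters=2000):
    """Fit y ~ X @ w + b by batch gradient descent on the mean squared error.

    Hypothetical helper written for this illustration only.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        residual = X @ w + b - y                      # prediction error per sample
        grad_w = 2.0 / n_samples * (X.T @ residual)   # d(MSE)/dw
        grad_b = 2.0 * residual.mean()                # d(MSE)/db
        w -= learning_rate * grad_w                   # step against the gradient
        b -= learning_rate * grad_b
    return w, b


# Synthetic data (an assumption): y = 3x + 1 plus small Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(scale=0.1, size=200)

w, b = fit_linear_regression_gd(X, y)
mse = np.mean((X @ w + b - y) ** 2)
print(f"slope={w[0]:.3f}, intercept={b:.3f}, MSE={mse:.4f}")
```

The learning rate controls the step size: too large a value makes the iteration diverge, while too small a value makes convergence slow. The fitted slope and intercept should land close to the values used to generate the data.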

Technologies:
Python:
pandas, Scikit-learn, NumPy, SciPy, matplotlib, IPython
Topics: classification and regression tasks, gradient descent, ensemble methods
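
As a rough idea of what an exercise with these libraries might look like, the sketch below compares a single decision tree with a Random Forest (a bagging-style ensemble) on a classification task using scikit-learn. The synthetic data set, the hyperparameters, and the variable names are assumptions chosen for illustration and do not come from the course notebooks.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (an assumption for this sketch).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A single tree versus an ensemble of trees trained on bootstrap samples.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Evaluate both models on the held-out test set.
tree_acc = accuracy_score(y_test, tree.predict(X_test))
forest_acc = accuracy_score(y_test, forest.predict(X_test))
print(f"decision tree accuracy: {tree_acc:.3f}")
print(f"random forest accuracy: {forest_acc:.3f}")
```

Evaluating on a held-out test set rather than on the training data is what makes the comparison meaningful, since a fully grown tree can fit its training set almost perfectly.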
Method of instruction:
Lectures (presentations), problem-solving sessions (handouts), programming sessions (IPython notebooks)

Requirements:
Final exam (50%)
A sample final exam is available here.
Team project (50%)

Recommended literature:
Tan, Pang-Ning, Michael Steinbach, and Vipin Kumar. Introduction to data mining. 2005.
Leskovec, Jure, Anand Rajaraman, and Jeffrey David Ullman. Mining of massive datasets. Cambridge University Press, 2014.

Instructor's bio:

Roland Molontay (born 1991) obtained his PhD degree in network and data science from Budapest University of Technology and Economics (BME). He was a visiting PhD student at Brown University in 2016. He currently holds a research position at the MTA-BME Stochastics Research Group and teaches mathematics and data science at BME to undergraduate and graduate students. Over the years he has participated in many successful data-intensive R&D projects with renowned companies (such as NOKIA-Bell Labs). He was awarded the Gyula Farkas Memorial Prize in 2020 for his outstanding work in applied mathematics. He is the founder and leader of the Human and Social Data Science Lab at BME.

Student Reviews of This Course

"The Data Science course solidified my decision to pursue a career in the field. Professor Molontay engaged us with the material really well as we discussed topics from gradient boosting to artificial neural networks. Professor Molontay even mentored my class project group in transforming our final project into a research paper which has been accepted into the journal Applied Network Science."

Tiernon Riesenmy

The University of Kansas

"Data Science was a great introduction to how to gather and manage big datasets! Professor Molontay gave a great overview of all the algorithms one can use to extract information from these datasets and clearly explained how these algorithms manage to do so. It was a super rewarding class and inspired me to explore topics in Machine Learning further!"

Kiersten Campbell

Williams College

"Data Science is a great class. It is taught very well. Prof. Molontay was constantly checking in on our progress and how we were doing. I think that was really helpful. He obviously cared about how much the students were learning and that we were actually grasping the concepts and not just getting by. He was always in very close contact with the students which was good."

Kate Barnes

Colorado College