Machine learning has far-reaching applications in many areas of modern biology, such as genomic analysis, medical image analysis, drug design, and medical diagnostics.
Machine Learning in Biology is a course designed to gradually introduce students to machine learning, neural networks, and their many uses in biology.
The course lasts 1 semester, with 14 theoretical and 14 practical classes.
The course is available to Bachelor's and Master's students in the science faculties at Lomonosov Moscow State University. It introduces students to the mathematical foundations of machine learning and lets them try their hand at real-world problems.
Topics covered by the course:
Unit 1. Basics of Machine Learning
1. Machine Learning. Types of problems, examples. Machine Learning in Biology. Notable success stories. AlphaFold
2. Overfitting, underfitting. K-nearest neighbors method. Its application when a good feature vector is available. Regression. Classification. Assessment of model quality. Train/test split. Linear Regression. Maximum Likelihood Method. Gradient Descent. Regularization. Logistic Regression.
3. The Support Vector Machine. Kernel Trick. Applications in biology (deltaSVM). Hyperparameters. Why selecting hyperparameters on the test set is a bad idea. Cross-validation. The issue of correct partitioning of biological data (TargetFinder: the leakage issue).
4. The Curse of Dimensionality. Dimensionality Reduction Methods. PCA. Kernel PCA. t-SNE. UMAP. Issues with dimensionality reduction methods. The Batch Effect. Visualization of transcriptomic data, specifically single-cell data.
5. Clustering. K-Means. DBSCAN. Affinity propagation. Hierarchical Clustering. Analysis of differential expression data. Clustering of biological objects (structures, drugs).
Unit 2. Tree models and boosting as state-of-the-art methods
6. Decision Trees. Categorical features and how to handle them (label encoding and one-hot encoding). Categorical features in biology. Random Forest. The importance of Random Forest in biology. Random Forest diagnostics.
7. Gradient Boosting. Implementations of Gradient Boosting: XGBoost, LightGBM, CatBoost. Prediction of drug properties with gradient boosting. Assessing feature importance in tree models (mean decrease in Gini impurity and mean decrease in accuracy). Boruta. Permutation importance. SHAP. Using SHAP to identify the functional groups that matter for inhibiting a given enzyme.
Unit 3. Neural Networks
8. Neural Networks. Introduction. Backpropagation. Chain rule. PyTorch framework. Application of multilayer neural networks to biological data.
9. Convolutional Neural Networks. Applications in biology. Diagnosing conditions caused by genomic mutations. Predicting the effects of single nucleotide polymorphisms. Predicting the energy of protein-ligand binding.
10. Training Methods for Deep Neural Networks. Activation functions. Initialization of weights. Batch normalization. Optimizers. Working with categorical features in neural networks. Machine learning in immunology.
11. Autoencoders. Representation learning. Biological image segmentation. Predicting deleterious mutations in coding regions. Using neural networks to work with noisy data in biology. U-Net.
12. Recurrent Neural Networks. Predicting splice sites. Predicting RNA secondary structure. GANs. Drug design.
13. Attention Mechanism. Transformers. Representation learning with transformers. GPT-3. BERT. Biological text mining with BioBERT. DINO.
14. Graph Neural Networks. Message passing. Using graph neural networks to analyze protein-disease interaction graphs.
msubioai@gmail.com
We plan to admit 2 study groups of 20 students each.
Prerequisites: