====== Machine Learning Overview ====== This page is a concise [[ml_overview#overview_of_topics|overview of topics]] in machine learning, with links to readings and other learning materials. Roughly, these topics are the union of topics covered in various ML books and courses. This is a resource to help you get up to speed on various topics if you're trying to learn ML on your own or broaden your ML knowledge. See also [[https://aman.ai/primers/ai/|Aman.ai - AI Fundamentals]] ===== Books ===== * Pattern Recognition and Machine Learning, Bishop, 2006 (Referenced below as Bishop) available [[https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf|here]] or [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf|local copy]] * An Introduction to Statistical Learning (Referenced below as ISL) available [[https://www.statlearning.com/|here]] or [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf|local copy]] * [[https://web.stanford.edu/~hastie/ElemStatLearn/|The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 
2009]] (Referenced below as ESL) available [[https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf|here]] or [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf|local copy]] * [[http://ciml.info/|CIML]] (Referenced below as CIML) available [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf|here]] * [[http://www.cs.cmu.edu/~tom/mlbook.html|Machine Learning, Tom Mitchell, McGraw Hill, 1997]] (Referenced below as MLBook) available [[https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf|here]] * [[https://probml.github.io/pml-book/|Machine Learning Books by Kevin Murphy]] * Machine Learning: A Probabilistic Perspective, 2012 (Referenced below as Murphy) available [[http://noiselab.ucsd.edu/ECE228/Murphy_Machine_Learning.pdf|here]] * Probabilistic Machine Learning: An Introduction, 2021 (Referenced below as PML1) available [[https://github.com/probml/pml-book/releases/latest/download/book1.pdf|here]] or [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:pml-book1.pdf|local copy]] * [[https://www.deeplearningbook.org|Deep Learning Book]] (Referenced below as DLBook) * [[https://www.jair.org/index.php/jair/article/view/11030|A Primer on Neural Network Models for Natural Language Processing, Yoav Goldberg, 2016]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:nn_primer.pdf|local copy]] (Referenced below as NNPrimer) * The Matrix Cookbook, available [[https://svivek.com/teaching/deep-learning-nlp/spring2019/resources/matrixcookbook.pdf|here]] * [[https://www.eleuther.ai/beginners.pdf|Intro to ML]] (note: this was released on April 1st) ===== Courses ===== * [[https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/|Machine Learning for Intelligent Systems @ Cornell]] * Very quick intro to machine learning (slides): [[https://indico.physics.lbl.gov/event/569/contributions/1382/attachments/1268/1407/gs-20170913.pdf|Introduction to Machine Learning]] 
{{papers:quick-ml-intro.pdf|local copy}} ===== Overview of Topics ===== This overview contains links to particular pages in textbooks, lectures, blog posts, and videos covering each topic, listed easiest to hardest to understand, with videos listed at the end. In other words, for each topic, introductory material is listed first with more advanced material afterwards, although you may find more advanced material easier to understand in some cases. **//The blog posts and some of the videos are introductory and give the overall gist of the method, but may contain mathematical or conceptual errors. Videos that are lectures should be fine.//** * **Introduction to Machine Learning** [[https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf#page=13|MLBook p. 1-15]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:pml-book1.pdf#page=29|PML1 p. 1-28]] * "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed." [[https://en.wikipedia.org/wiki/Arthur_Samuel|Arthur Samuel]] [[https://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=45FE379DC2BFEA630F406F16589305D1?doi=10.1.1.368.2254&rep=rep1&type=pdf|1959]] * **Basic Machine Learning Concepts** * **Inductive Bias** [[https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf#page=51|MLBook p. 39-45]] * **Overfitting/Underfitting** * **Approximation error vs estimation error, aka the Bias-Variance Tradeoff** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=71|CIML p. 71-72]] [[https://people.eecs.berkeley.edu/~bartlett/courses/281b-sp08/20.pdf|Bartlett notes]] * **Features** * **Hyperparameters** * **Train/dev/test split** * **Don't look at the test data** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=25|CIML p. 25]] "Do not look at your test data. Even once. Even a tiny peek. Once you do that, it is not test data any more. 
Yes, perhaps your algorithm hasn’t seen it. But you have. And you are likely a better learner than your learning algorithm." * **Additional ML Topics** * **Generative vs Discriminative Classifiers** * **"Generative Story"** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=123|CIML p. 123-124]] * **Bayesian statistics** * **MLE vs MAP estimation** (and examples of MAP in machine learning) [[https://towardsdatascience.com/mle-vs-map-a989f423ae5c|Blog]] * **Classification** * **Naive Bayes** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=120|CIML p. 120-123]], [[http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf#page=5|MLBook p. 5]] * Note: Naive Bayes is a generative classifier: it estimates p(x,y). It can be used for binary or multiclass classification. A Naive Bayes classifier for documents where the input features are words is called a "Bag of Words model". * Logistic regression (LR) and Naive Bayes have the same model form, but Naive Bayes maximizes p(x,y) while LR maximizes p(y|x). See [[https://svivek.com/teaching/lectures/slides/naive-bayes/naive-bayes-linear.pdf|Vivek's NB Note]] or [[http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf#page=14|MLBook p. 14]] * **Logistic Regression** [[http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf#page=7|MLBook p. 7-14]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=138|ESL p. 119-122]] * ** *Decision Trees** [[https://towardsdatascience.com/decision-trees-in-machine-learning-641b9c4e8052|Blog1]] [[https://towardsdatascience.com/a-guide-to-decision-trees-for-machine-learning-and-data-science-fe2607241956|Blog2]] [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=10|CIML p. 10-18]] [[https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf#page=324|ESL p. 305-310]] * ** *Random Forests** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=325|ISL p. 
319-321]] * ** *k Nearest Neighbors (k-NN)** [[https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761|Blog]] [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=29|CIML p. 29-40]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=145|Bishop p. 125-127]] (starting from "We close this chapter by showing how the K-nearest-neighbour technique for density estimation can be extended to the problem of classification...") * **Perceptron** [[https://towardsdatascience.com/perceptron-learning-algorithm-d5db0deab975|Blog]] [[https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote03.html|Lecture (w/ video)]] [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=41|CIML p. 41-54]] [[https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf#page=98|MLBook p. 86]], [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=212|Bishop p. 192-196]], [[https://en.wikipedia.org/wiki/Perceptron|Wikipedia]] * The perceptron algorithm is actually minimizing a function of the data: stochastic gradient descent (SGD) with step size 1 on a particular loss function (called the perceptron loss) is exactly the perceptron algorithm. * There is a multiclass version of the perceptron. * **Neural Networks** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:nn_primer.pdf#page=10|NNPrimer p. 354-379]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=245|Bishop p. 225-272]] [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=129|CIML p. 129-140]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=411|ESL p. 392-401]] * **Bayesian Neural Networks** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=297|Bishop p. 
277-284]] * **Support Vector Machines** [[https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47|Blog1]] [[https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/|Blog2]] [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=100|CIML p. 100-103]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=345|Bishop p.325-345]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=342|ISL p. 337-349]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=436|ESL p. 417-422]] * Either the primal or the dual version of the SVM optimization problem can be used. Historically, the dual version was used. However, the dual must be optimized using a specialized optimization algorithm such as sequential minimal optimization (SMO), while the unconstrained version of the primal can be optimized using any gradient-based optimizer such as stochastic gradient descent (SGD), which is usually faster in practice. For this reason, for large-scale learning, the primal version with a gradient-based optimizer is often preferred. See [[ml:Support Vector Machines|SVM]]. * ** *Kernel Methods** [[https://monkeylearn.com/blog/introduction-to-support-vector-machines-svm/|Blog]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=355|ISL p. 350-354]] [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=141|CIML p. 141-148]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=442|ESL p. 423-438]] Kernel methods can also be used for regression. 
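To make the primal-SVM-with-SGD point above concrete, here is a minimal sketch in plain NumPy (synthetic data; the function name, step size, and regularization constant are illustrative choices, not taken from any of the books above):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=100, lr=0.1, seed=0):
    """Train a linear SVM by SGD on the unconstrained primal objective:
        lam/2 * ||w||^2 + mean(max(0, 1 - y_i * (w . x_i + b)))   (hinge loss)
    Labels y must be in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Subgradient step when the hinge is active
                w -= lr * (lam * w - y[i] * X[i])
                b += lr * y[i]
            else:
                # Only the regularizer contributes a gradient
                w -= lr * lam * w
    return w, b

# Linearly separable toy data: two Gaussian blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = svm_sgd(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

Real large-scale solvers in this style (e.g. Pegasos) add a decaying step size and other refinements, but the core idea is exactly this: a subgradient step on the primal hinge-loss objective instead of a specialized dual solver.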
* **Loss Functions and Training** * **Regression** * ** *Linear Regression** [[https://machinelearningmastery.com/linear-regression-for-machine-learning/|Blog1]] [[https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html|Blog2]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=62|ESL p. 43-51]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=157|Bishop p. 137-147]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=68|ISL p. 59-119]] * MAP vs MLE linear regression (MAP adds a regularizer term) * Bayesian linear regression * **"Non-Linear" Regression** * **Polynomial Regression** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=273|ISL p. 266-268]] * **Splines** [[https://www.stat.cmu.edu/~ryantibs/advmethods/notes/smoothspline.pdf|Notes]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=278|ISL p. 271-280]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=158|ESL p. 139-181]] * **Practicalities** * ** *Hyperparameters and Model Selection** * **Train/dev/test split** [[https://www.youtube.com/watch?v=1waHlpKiNyY|Video]] * The most practical and principled way to select the model and hyperparameters is on a development set. * **Feature Selection and Feature Engineering** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=55|CIML p. 55-62]] * **Regularization** * **Early-stopping** * **L2 regularization** * **L1 regularization** * **Pruning (for decision trees)** * **Evaluation** * **Accuracy** * **Precision, Recall, F1: Macro vs Micro averaging** * ** *Area Under the Curve (AUC)** [[https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc|Blog]] * **Tests of Significance** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=67|CIML p. 
67-69]] * **Data Resampling Methods** * **k-fold Cross Validation** Be careful using this method on NLP datasets! Due to the non-IID nature of NLP datasets, it is generally not recommended to use k-fold cross validation (it can over-estimate performance). Better to use a thoughtfully-chosen train/dev/test split. * ** *Bootstrap Resampling** * **Jackknife** * **Debugging ML** [[http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf#page=69|CIML p. 69-71]] [[http://karpathy.github.io/2019/04/25/recipe/|Blog]] * **Deep Learning** * **NN Architectures** * **Feedforward NNs** * **Convolutional NNs (CNNs)** [[https://www.ibm.com/cloud/learn/convolutional-neural-networks|Blog1]] [[https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-convolutional-neural-networks|Blog2]] [[https://www.deeplearningbook.org/contents/convnets.html|DLBook]] [[https://www.youtube.com/watch?v=YRhxdVk_sIs|Video]] * **Recurrent NNs (RNNs)** [[https://www.ibm.com/cloud/learn/recurrent-neural-networks|Blog1]] [[http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/|Blog2]] [[https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks|Blog3]] [[https://towardsdatascience.com/transformers-141e32e69591|Blog4 (covers RNNs, LSTMs, Attention, Transformers)]] [[https://www.deeplearningbook.org/contents/rnn.html|DLBook]] * ** *Attention** [[https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/|Blog1]] [[https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f|Blog2]] [[https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/|Blog3]] [[https://arxiv.org/pdf/1709.07809.pdf#page=48|NMT p. 
48-52]] [[nlp:Attention Mechanisms|Attention]] * ** *Transformers** [[https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html|Blog1]] [[https://towardsdatascience.com/transformers-141e32e69591|Blog2]] [[https://jalammar.github.io/illustrated-transformer/|Blog3]] [[https://arxiv.org/pdf/1706.03762.pdf|Paper (hard)]] [[https://nlp.seas.harvard.edu/2018/04/03/attention.html|Annotated Transformer (perhaps easier)]] * **Training Methods** [[https://www.youtube.com/watch?v=sZAlS3_dnk0|Video]] * **Generative Adversarial Networks (GANs)** * ** *Reinforcement Learning** * **Graphical Models** * **Bayesian Networks** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=380|Bishop p. 360]] * **Hidden Markov Models (HMMs)** * **Undirected Graphical Models (MRFs and CRFs)** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=403|Bishop p. 383]] * **Linear-chain Conditional Random Fields** * **Factor Graphs** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=419|Bishop p. 399]] * **Inference** * **Variable Elimination** * **Belief Propagation** (Sum-Product and Max-Product Algorithms) [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=422|Bishop p. 402-415]] * **Junction Tree Algorithm** * **Loopy Belief Propagation** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:bishop.pdf#page=437|Bishop p. 417-418]] * **Variational Inference** * **Sampling Methods** * **Combining Models** * **Ensembling** * **Mixture of Experts** * ** *Boosting** * **Bayesian Model Averaging** * **Bagging** [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=322|ISL p. 
316-318]] * **Unsupervised Methods** * **Density Estimation** * ** *EM Algorithm** [[https://machinelearningmastery.com/expectation-maximization-em-algorithm/|Blog]] [[https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf#page=203|MLBook p. 191-196]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:eslii_print12.pdf#page=291|ESL p. 272-279]] [[http://noiselab.ucsd.edu/ECE228/Murphy_Machine_Learning.pdf#page=379|Murphy p. 348-359]] [[https://home.ttic.edu/~dmcallester/ttic101-07/lectures/em/em.pdf|Lecture Notes (covers hard and soft EM and application to HMMs)]] [[https://www.youtube.com/watch?v=REypj2sy_5U|Video (EM for Gaussian Mixture Models)]] * There are both soft EM (soft assignment, the usual version) and hard EM (hard assignment during the E-step). Both versions "work" in that they both converge to a local maximum. The hard EM version can converge faster but sometimes doesn't work as well (see [[https://home.ttic.edu/~dmcallester/ttic101-07/lectures/em/em.pdf|here]] or [[https://www.cs.cmu.edu/~tom/10601_fall2012/recitations/em.pdf|here]]). * There are also [[https://www.aclweb.org/anthology/N09-1069.pdf|online versions of EM]] and other variants, see [[http://noiselab.ucsd.edu/ECE228/Murphy_Machine_Learning.pdf#page=396|Murphy p. 365-369]] * ** *Clustering** * **K-Means Clustering** [[https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1|Blog1]] [[https://stanford.edu/~cpiech/cs221/handouts/kmeans.html|Blog2]] [[https://jlab.soe.ucsc.edu/nlp-wiki/lib/exe/fetch.php?media=book:islr-7th.pdf#page=391|ISL p. 386-390]] [[http://noiselab.ucsd.edu/ECE228/Murphy_Machine_Learning.pdf#page=383|Murphy p. 
352-354]] Video: [[https://www.youtube.com/watch?v=c_oCcpN0_0s|1]] [[https://www.youtube.com/watch?v=WZL4d47hmFs|2]] [[https://www.youtube.com/watch?v=IJt62uaZR-M|3]] * K-Means is an instance of the hard EM algorithm, see [[https://home.ttic.edu/~dmcallester/ttic101-07/lectures/em/em.pdf|Lecture Notes]] * **Hierarchical Clustering** * **Agglomerative Clustering** * ** *Principal Component Analysis (PCA)** * **Structured Prediction** * **Structured Perceptron** * **Structured SVM** * **Conditional Random Fields (CRFs)** * **Probability and Statistics Background** * **Terminology** * **Probability Distribution** (referred to as just a **"Distribution"**) * To **sample** from a probability distribution * **Parameters** * **Random Variable** * **Independent** * **Independent and Identically Distributed (IID)** * **Joint Distribution** * **Marginal Distribution** (referred to as just a **Marginal**). Also to **marginalize** * To compute a marginal, you marginalize (sum) over the other random variables * **Probability Distributions: Uniform, Normal, Poisson, Binomial, etc** * ** *Bias-Variance Decomposition** [[https://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote12.html|Lecture]], [[http://cs229.stanford.edu/summer2020/BiasVarianceAnalysis.pdf|Notes]] This is a statistics term, used when analyzing mean squared error in regression or density estimation, for example. In machine learning, it's more properly called approximation error (≈ bias) and estimation error (≈ variance) because you can't compute the bias (E[ŷ] - y) or the variance (E[(ŷ - E[ŷ])^2]) for non-numeric outputs like classes in multi-class classification. However, these terms are often applied to ML somewhat loosely. * **Density Estimation** * **Histograms** * **Kernel Density Estimators** * **Gaussian Processes** * **Theory** * **Concept Learning** * **Hypothesis Space** * **Inductive Bias** [[https://www.cin.ufpe.br/~cavmj/Machine%20-%20Learning%20-%20Tom%20Mitchell.pdf#page=51|MLBook p. 
39-45]] * **Bias-Variance Tradeoff** * **VC dimension** * **NP-hardness of Learning** * **PAC Learning Theory** * **PAC-Bayesian Learning Theory** * **Information Theory** [[http://noiselab.ucsd.edu/ECE228/Murphy_Machine_Learning.pdf#page=87|Murphy p. 56-61]] * **Entropy** * **Cross-entropy** * **Mutual Information** * **KL-Divergence** * **Software** * **R** * **scikit-learn** * **TensorFlow** * **PyTorch** * **NLTK** * **spaCy** * **OpenCV** ===== Related Pages ===== * [[Deep Learning]] * [[ML Glossary]] (glossary of slightly more advanced terms)
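To make concrete the note above that K-Means is an instance of the hard EM algorithm, here is a minimal NumPy sketch (synthetic data; the function name and the random initialization are illustrative assumptions, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """K-means as hard EM: the assignment step is a hard E-step (each
    point is fully assigned to its nearest centroid) and the centroid
    update is the M-step (the mean of each cluster's points)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Hard E-step: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs; k-means should recover the two groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Replacing the hard assignment with posterior responsibilities under a Gaussian mixture turns this into soft EM, which is the usual version discussed in the EM references above.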