====== Generalization in Deep Learning ======

The theory of generalization in deep learning is not well understood and is an active area of research.

===== Overviews =====

  * [[https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html|Lil'Log - Are Deep Neural Networks Dramatically Overfitted?]] A good summary from 2019.
  * **Overview Papers**
    * [[https://arxiv.org/pdf/2012.10931|He and Tao 2022 - Recent Advances in Deep Learning Theory]]
  * **Textbooks**
    * **[[https://arxiv.org/pdf/2106.10165.pdf|Roberts & Yaida 2021 - The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks]]**

===== Key Papers =====

  * **[[https://arxiv.org/pdf/1611.03530.pdf|Zhang et al 2017 - Understanding deep learning requires rethinking generalization]]** [[https://github.com/pluskid/fitting-random-labels|Code]]
  * [[https://arxiv.org/abs/1803.03635|Frankle & Carbin 2019 - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks]] After training, only a subset of the network's edges matters. The rest can be pruned (keeping only the highest-magnitude weights), and the resulting sparse subnetwork can be trained from scratch to comparable accuracy, provided its remaining weights are reset to their original initialization values.
  * [[https://arxiv.org/pdf/1703.00810.pdf|Shwartz-Ziv & Tishby 2017 - Opening the Black Box of Deep Neural Networks via Information]] Shows that "neural networks get trained in two phases: i) empirical risk minimization (ERM) phase where the stochastic gradient descent (SGD) algorithm generates high SNR gradients, the loss rapidly decreases and ii) a much slower representation compression (RC) phase where efficient representations for intermediate activations are learned." (from [[https://arxiv.org/pdf/2006.07522.pdf|Raj 2020]])
  * **[[https://arxiv.org/pdf/1706.05394.pdf|Arpit et al 2017 - A Closer Look at Memorization in Deep Networks]]**
  * [[https://arxiv.org/pdf/1810.08591.pdf|Neal et al 2018 - A Modern Take on the Bias-Variance Tradeoff in Neural Networks]]
  * [[https://openreview.net/pdf?id=ry_WPG-A-|Saxe et al 2018 - On the Information Bottleneck Theory of Deep Learning]]
  * [[https://arxiv.org/pdf/2002.11448.pdf|Unterthiner et al 2020 - Predicting Neural Network Accuracy from Weights]]
  * [[https://arxiv.org/pdf/2006.07522.pdf|Raj et al 2020 - Understanding Learning Dynamics of Binary Neural Networks via Information Bottleneck]]
  * [[https://www.annualreviews.org/doi/pdf/10.1146/annurev-conmatphys-031119-050745|Bahri et al 2020 - Statistical Mechanics of Deep Learning]]
  * [[https://arxiv.org/pdf/2105.12806.pdf|Bubeck & Sellke 2021 - A Universal Law of Robustness via Isoperimetry]] Best paper at NeurIPS 2021.
  * [[https://arxiv.org/pdf/2205.15549.pdf|Lee & Cherkassky 2022 - VC Theoretical Explanation of Double Descent]] Under their setting (one hidden layer), "during second descent, the norm of weights in the output layer can be used to approximate the VC-dimension of a neural network." Note that this holds only during second descent.

===== Sample Complexity Bounds =====

  * Aarti Singh's work: [[https://www.youtube.com/watch?v=pcuW7llesxs|2019 - Sample-complexity of Estimating Convolutional and Recurrent Neural Networks (video)]]
    * [[https://arxiv.org/pdf/1805.07883.pdf|Du et al 2019 - How Many Samples are Needed to Estimate a Convolutional or Recurrent Neural Network?]] Shows an O(1/sqrt(n)) rate, along with a matching lower bound.

===== Phenomena =====

==== Double Descent ====

From [[https://arxiv.org/pdf/2003.01897.pdf|Nakkiran et al 2020 - Optimal Regularization Can Mitigate Double Descent]]:
Recent works have demonstrated a ubiquitous “double descent” phenomenon present in a range of machine learning models, including decision trees, random features, linear regression, and deep neural networks (Opper, 1995, 2001; Advani & Saxe, 2017; Spigler et al., 2018; Belkin et al., 2018; Geiger et al., 2019b; Nakkiran et al., 2020; Belkin et al., 2019; Hastie et al., 2019; Bartlett et al., 2019; Muthukumar et al., 2019; Bibas et al., 2019; Mitra, 2019; Mei & Montanari, 2019; Liang & Rakhlin, 2018; Liang et al., 2019; Xu & Hsu, 2019; Dereziski et al., 2019; Lampinen & Ganguli, 2018; Deng et al., 2019; Nakkiran, 2019). The phenomenon is that models exhibit a peak of high test risk when they are just barely able to fit the train set, that is, to interpolate. For example, as we increase the size of models, test risk first decreases, then increases to a peak around when effective model size is close to the training data size, and then decreases again in the overparameterized regime. Also surprising is that Nakkiran et al. (2020) observe a double descent as we increase sample size, i.e. for a fixed model, training the model with more data can hurt test performance.
  * [[https://arxiv.org/pdf/1812.11118.pdf|Belkin et al 2018 - Reconciling modern machine learning practice and the bias-variance trade-off]] Introduced the concept of double descent. Note: they use squared-error loss, not cross-entropy! (Does this affect the results?)
  * **[[https://arxiv.org/pdf/1912.02292.pdf|Nakkiran et al 2019 - Deep Double Descent: Where Bigger Models and More Data Hurt]]** Has more convincing experiments than the original Belkin et al 2018 paper.
  * [[https://arxiv.org/pdf/1912.13053.pdf|Xiao et al 2019 - Disentangling Trainability and Generalization in Deep Neural Networks]]
  * [[https://arxiv.org/pdf/2003.01897.pdf|Nakkiran et al 2020 - Optimal Regularization Can Mitigate Double Descent]]
  * **[[https://arxiv.org/pdf/2004.04328.pdf|Loog et al 2020 - A Brief Prehistory of Double Descent]]**
  * [[https://arxiv.org/pdf/2007.10099.pdf|Heckel & Yilmaz 2020 - Early Stopping in Deep Networks: Double Descent and How to Eliminate it]]
  * [[https://arxiv.org/pdf/2011.03321.pdf|Adlam & Pennington 2020 - Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition]]
  * **[[https://arxiv.org/pdf/2107.12685.pdf|Kuzborskij et al 2021 - On the Role of Optimization in Double Descent: A Least Squares Study]]** Shows that for least squares, double descent is related to the condition number of the underlying optimization problem.
  * **Theory Papers**
    * **[[https://arxiv.org/pdf/1911.05822.pdf|Deng et al 2019 - A Model of Double Descent for High-dimensional Binary Linear Classification]]**
    * **[[https://arxiv.org/pdf/2205.15549.pdf|Lee & Cherkassky 2022 - VC Theoretical Explanation of Double Descent]]** See page 3 for the two settings of controlling VC dimension. Under their setup (one hidden layer), "during second descent, the norm of weights in the output layer can be used to approximate the VC-dimension of a neural network." Note that this holds only during second descent.
==== Grokking ====

  * [[https://arxiv.org/pdf/2201.02177.pdf|Power et al 2022 - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets]]
  * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]] "We find that the model’s final solution builds upon, rather than replaces, the heuristics learned in earlier phases. This adds nuance to the traditional narrative about “grokking”, where models are thought to discard superficial heuristics in favor of more systematic solutions. Instead, our model maintains its early-line heuristics while developing additional mechanisms to handle cases where these heuristics fail, suggesting cumulative learning where sophisticated capabilities emerge by augmenting simpler strategies."

===== Related Pages =====

  * [[ml:Regularization]]