ml:theory:generalization_in_deep_learning

===== Overviews =====
  * [[https://lilianweng.github.io/lil-log/2019/03/14/are-deep-neural-networks-dramatically-overfitted.html|Lil'log - Are Deep Neural Networks Dramatically Overfitted?]] Good summary from 2019.
  * **Overview Papers**
    * [[https://arxiv.org/pdf/2012.10931|He and Tao 2022 - Recent Advances in Deep Learning Theory]]
  * **Textbooks**
    * **[[https://arxiv.org/pdf/2106.10165.pdf|Roberts & Yaida 2021 - The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks]]**
  * **Theory Papers**
    * **[[https://arxiv.org/pdf/1911.05822.pdf|Deng et al 2019 - A Model of Double Descent for High-dimensional Binary Linear Classification]]**
    * **[[https://arxiv.org/pdf/2205.15549.pdf|Lee & Cherkassky 2022 - VC Theoretical Explanation of Double Descent]]** See page 3, the two settings of controlling VC dimension. Under their setup (one hidden layer), "during second descent, the norm of weights in the output layer can be used to approximate the VC-dimension of a neural network." Note that this only happens during second descent.
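The double-descent picture these papers describe can be reproduced in a few lines with a frozen-random-features model: test error spikes near the interpolation threshold (hidden width ≈ number of training points) and falls again in the overparameterized regime, while the output-layer weight norm tracks the same curve, loosely matching the Lee & Cherkassky capacity argument. A minimal sketch, with all sizes, names, and the data model being illustrative assumptions rather than any cited paper's setup:

```python
# Illustrative double-descent sketch (not from any cited paper): a one-hidden-layer
# network with a frozen random ReLU first layer; only the output layer is fit.
import numpy as np

rng = np.random.default_rng(0)

d, n_train, n_test, noise = 20, 100, 2000, 0.5
beta = rng.normal(size=d) / np.sqrt(d)          # ground-truth linear signal

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ beta + noise * rng.normal(size=n)   # noisy labels drive the peak
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

def fit_eval(p):
    # p random ReLU features; output layer fit by minimum-norm least squares,
    # which interpolates the training set once p >= n_train.
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    phi_tr = np.maximum(X_tr @ W, 0.0)
    phi_te = np.maximum(X_te @ W, 0.0)
    w = np.linalg.pinv(phi_tr) @ y_tr
    test_mse = float(np.mean((phi_te @ w - y_te) ** 2))
    return test_mse, float(np.linalg.norm(w))   # (test error, output-layer norm)

# Width sweep across the interpolation threshold (p == n_train == 100).
results = {p: fit_eval(p) for p in (10, 100, 1000)}
for p, (mse, wnorm) in results.items():
    print(f"p={p:5d}  test MSE={mse:10.3f}  ||output w||={wnorm:10.3f}")
```

With the fixed seed, both test MSE and the output-layer norm are typically much larger at the threshold width (p = 100) than in the overparameterized regime (p = 1000), which is the "second descent" where the weight norm becomes a meaningful capacity proxy.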
  
==== Grokking ====
  * [[https://arxiv.org/pdf/2201.02177.pdf|Power et al 2022 - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets]]
  * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]]: "We find that the model’s final solution builds upon, rather than replaces, the heuristics learned in earlier phases. This adds nuance to the traditional narrative about “grokking”, where models are thought to discard superficial heuristics in favor of more systematic solutions. Instead, our model maintains its early-line heuristics while developing additional mechanisms to handle cases where these heuristics fail, suggesting cumulative learning where sophisticated capabilities emerge by augmenting simpler strategies."
  
===== Related Pages =====
  * [[ml:Regularization]]
ml/theory/generalization_in_deep_learning.1684300554.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
