Differences

--- ml:theory:generalization_in_deep_learning [2025/03/06 10:18] – [Overviews] jmflanig
+++ ml:theory:generalization_in_deep_learning [2025/05/29 07:00] (current) – [Grokking] jmflanig
@@ Line 58: / Line 58: @@
 ==== Grokking ====
   * [[https://arxiv.org/pdf/2201.02177.pdf|Power et al 2022 - Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets]]
+  * [[https://arxiv.org/pdf/2505.20896|Wu et al 2025 - How Do Transformers Learn Variable Binding in Symbolic Programs?]]: "We find that the model’s final solution builds upon, rather than replaces, the heuristics learned in earlier phases. This adds nuance to the traditional narrative about “grokking”, where models are thought to discard superficial heuristics in favor of more systematic solutions. Instead, our model maintains its early-line heuristics while developing additional mechanisms to handle cases where these heuristics fail, suggesting cumulative learning where sophisticated capabilities emerge by augmenting simpler strategies."
 ===== Related Pages =====
   * [[ml:Regularization]]