====== Optimization in Deep Learning ======

  * Weight normalization
    * Improves the conditioning of the optimization problem ([[https://arxiv.org/pdf/1602.07868.pdf|Salimans & Kingma 2016]])
  * Lipschitz constant
    * [[https://arxiv.org/pdf/2306.09338|Qi et al 2023 - Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant]] discusses how the Lipschitz constant affects the optimization of deep neural networks.
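The reparameterization at the heart of weight normalization can be sketched in a few lines of NumPy (the names ''weight_norm'', ''g'', and ''v'' are illustrative, not from any library): each weight vector ''w'' is rewritten as ''w = g * v / ||v||'', decoupling its scale ''g'' from its direction ''v / ||v||'', which is what improves the conditioning of the loss surface.

```python
import numpy as np

def weight_norm(v, g):
    """Weight-normalized vector: w = g * v / ||v|| (illustrative sketch)."""
    return g * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
v = rng.normal(size=5)   # direction parameter (unnormalized)
g = 2.0                  # scale parameter, learned separately from v

w = weight_norm(v, g)

# The norm of w is carried entirely by g, regardless of the scale of v:
print(np.linalg.norm(w))
```

In training, gradients are taken with respect to ''g'' and ''v'' separately rather than ''w'' directly.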
  
===== On Global Optimization of Neural Networks =====
  * [[https://www.aclweb.org/anthology/P18-1173.pdf|Peng et al 2018 - Backpropagating through Structured Argmax using a SPIGOT]]
  
===== Instabilities =====

==== Instability of Fine-tuning ====

  * [[https://arxiv.org/pdf/2006.04884.pdf|Mosbach 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines]] Catastrophic forgetting and the small size of fine-tuning datasets fail to explain fine-tuning instability, which is instead caused by optimization difficulties and differences in generalization.
===== Implicit Regularization of SGD =====
See also Soheil Feizi's lecture [[https://www.youtube.com/watch?v=hGRssrXF-vI&list=PLHgjs9ncvHi80UCSlSvQe-TK_uOyDv_Jf&index=4|The implicit bias of gradient descent]].
  * [[https://www.jmlr.org/papers/volume19/18-188/18-188.pdf|Soudry et al 2018 - The Implicit Bias of Gradient Descent on Separable Data]]
  * [[https://arxiv.org/pdf/1806.01796.pdf|Nacson et al 2018 - Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate]]
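The Soudry et al result can be illustrated numerically (the data and all names below are made up for the sketch): running plain gradient descent on the logistic loss over linearly separable data, the weight norm keeps growing while the direction ''w / ||w||'' drifts toward the max-margin separator.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two tight, linearly separable clusters in 2D with labels +1 / -1.
X = np.vstack([rng.normal([2, 2], 0.1, size=(20, 2)),
               rng.normal([-2, -2], 0.1, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

w = np.zeros(2)
lr = 0.1
for _ in range(5000):
    margins = y * (X @ w)
    # Gradient of the mean logistic loss log(1 + exp(-margin)).
    grad = -(y[:, None] * X * (1.0 / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad

direction = w / np.linalg.norm(w)
print(direction, np.linalg.norm(w))
```

The loss has no finite minimizer on separable data, so ''||w||'' grows (logarithmically slowly) forever; the interesting object is the limiting direction, which for this symmetric example is close to ''(1, 1) / sqrt(2)''.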
===== Miscellaneous Topics =====

==== Effect of Skip Connections ====
  * [[https://arxiv.org/pdf/1702.08591.pdf|Balduzzi et al 2017 - The Shattered Gradients Problem: If resnets are the answer, then what is the question?]]
  * [[https://arxiv.org/pdf/1701.09175.pdf|Orhan & Pitkow 2017 - Skip Connections Eliminate Singularities]]
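A toy linearized sketch of why skip connections help (illustrative, not from either paper): a residual layer computes ''x + f(x)'', so its Jacobian is ''I + J_f'' rather than ''J_f'' alone. Multiplying Jacobians through many layers, the identity paths keep the end-to-end Jacobian from collapsing, whereas a plain stack of small random layers shrinks gradients toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
depth, dim = 50, 16
# Per-layer Jacobians of f: small random matrices (linearized layers).
layers = [0.05 * rng.normal(size=(dim, dim)) for _ in range(depth)]

plain = np.eye(dim)      # Jacobian of f_L(...f_1(x)...)
residual = np.eye(dim)   # Jacobian of the same stack with x + f(x) layers
for J in layers:
    plain = J @ plain
    residual = (np.eye(dim) + J) @ residual

print(np.linalg.norm(plain), np.linalg.norm(residual))
```

The plain product decays roughly like ''||J||^depth'', while the residual product stays on the order of 1, which is one way to read both the shattered-gradients and singularity-elimination stories.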
===== People =====
  * [[https://scholar.google.com/citations?user=AEBWEm8AAAAJ&hl=en|Daniel Soudry]]
  
===== Related Pages =====
  * [[Optimization]]
  
