====== Optimization Topics in Deep Learning ======

===== Effects on Optimization =====

  * Batch normalization
    * Makes the objective function and gradients smoother (more Lipschitz) ([[https://arxiv.org/pdf/1805.11604.pdf|Santurkar et al 2018 - How Does Batch Normalization Help Optimization?]]; for another perspective, see section 3 of [[https://arxiv.org/pdf/2002.10444.pdf|De 2020]]).
  * Weight normalization
    * Improves the conditioning of the optimization problem ([[https://arxiv.org/pdf/1602.07868.pdf|Salimans & Kingma 2016]]).
  * Lipschitz constant
    * [[https://arxiv.org/pdf/2306.09338|Qi et al 2023 - Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant]] Discusses how the Lipschitz constant affects the optimization of deep neural networks.

===== On Global Optimization of Neural Networks =====

  * [[https://arxiv.org/pdf/1412.0233.pdf|Choromanska et al 2014 - The Loss Surfaces of Multilayer Networks]]
  * [[https://arxiv.org/pdf/1810.02054.pdf|Du et al 2018 - Gradient Descent Provably Optimizes Over-parameterized Neural Networks]] "We show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function, for an m hidden node shallow neural network with ReLU activation and n training data."
  * [[https://arxiv.org/pdf/1910.00359.pdf|Goldblum et al 2020 - Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory]] Examines optimization issues in deep learning and how they relate to claims from ML theory.

===== Symmetries and Local Minima =====

  * [[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.29.6931&rep=rep1&type=pdf|Sussmann 1991 - Uniqueness of the weights for minimal feedforward nets with a given input-output map]] Also points out some discrete symmetries in neural networks: swapping any two neurons in the same layer, negating both the input weights and the output weights of a neuron (when using a tanh activation), etc.

===== Properties of Neural Networks =====

  * [[https://arxiv.org/pdf/1912.07145.pdf|Yao et al 2019 - PyHessian: Neural Networks Through the Lens of the Hessian]]

==== Lipschitz Constant ====

If the weights are unbounded, the Lipschitz constant is also unbounded (this is true even for logistic regression).

  * [[https://arxiv.org/pdf/1906.04893.pdf|2019 - Efficient and Accurate Estimation of Lipschitz Constants for Deep Neural Networks]]
  * [[https://arxiv.org/pdf/1805.10965.pdf|2018 - Lipschitz regularity of deep neural networks: analysis and efficient estimation]]

===== Backpropagating Through Discontinuities =====

  * [[https://arxiv.org/pdf/1308.3432.pdf|Bengio et al 2013 - Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation]] Introduces the straight-through estimator.
  * REINFORCE
  * [[https://www.aclweb.org/anthology/P18-1173.pdf|Peng et al 2018 - Backpropagating through Structured Argmax using a SPIGOT]]

===== Instabilities =====

==== Instability of Fine-tuning ====

  * [[https://arxiv.org/pdf/2006.04884.pdf|Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines]] Argues that catastrophic forgetting and the small size of fine-tuning datasets fail to explain fine-tuning instability, which is instead caused by optimization difficulties and differences in generalization.
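As a concrete illustration of the weight-normalization reparameterization cited above, the idea is to write a weight vector as w = g · v/‖v‖, decoupling its length g from its direction v/‖v‖. A minimal sketch in plain Python (the function name and values here are illustrative, not from any library):

```python
import math

def weight_norm(v, g):
    # Reparameterize: w = g * v / ||v||, so ||w|| = g regardless of v's scale.
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm_v for vi in v]

v = [3.0, 4.0]          # direction parameters, ||v|| = 5
g = 2.0                 # learned scale parameter
w = weight_norm(v, g)   # resulting weights have norm exactly g
```

During training, gradients are taken with respect to g and v separately, which is where the conditioning benefit described by Salimans & Kingma comes from.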
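The discrete symmetries noted in the Sussmann reference above are easy to verify numerically. A minimal sketch on a toy one-hidden-layer tanh network, with all weights and names chosen for illustration:

```python
import math

def mlp(x, W1, b1, W2, b2):
    # Scalar-input, scalar-output network with one tanh hidden layer:
    # y = sum_j W2[j] * tanh(W1[j] * x + b1[j]) + b2
    h = [math.tanh(W1[j] * x + b1[j]) for j in range(len(W1))]
    return sum(W2[j] * h[j] for j in range(len(W2))) + b2

W1, b1, W2, b2 = [0.7, -1.3], [0.1, 0.4], [2.0, -0.5], 0.3
x = 0.9
y = mlp(x, W1, b1, W2, b2)

# Symmetry 1: swapping two hidden neurons (permuting W1, b1, W2 consistently)
# leaves the input-output map unchanged.
y_swap = mlp(x, [W1[1], W1[0]], [b1[1], b1[0]], [W2[1], W2[0]], b2)

# Symmetry 2: negating neuron 0's input weight, bias, and output weight.
# tanh is odd (tanh(-z) = -tanh(z)), so the two sign flips cancel.
y_flip = mlp(x, [-W1[0], W1[1]], [-b1[0], b1[1]], [-W2[0], W2[1]], b2)

assert abs(y - y_swap) < 1e-12 and abs(y - y_flip) < 1e-12
```

Each such symmetry maps any local minimum to another parameter point with identical loss, which is one reason the loss surface contains many equivalent minima.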
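The straight-through estimator introduced by Bengio et al (cited above) can be sketched for a single hard-threshold unit with manual differentiation of a toy scalar model; all names and values here are illustrative:

```python
def hard_threshold(x):
    # Forward pass: non-differentiable binarization (gradient is 0 a.e.).
    return 1.0 if x > 0 else 0.0

def ste_backward(grad_output):
    # Backward pass: the straight-through estimator pretends the threshold
    # was the identity, so the incoming gradient passes through unchanged.
    return grad_output

# Toy model: y = hard_threshold(w * x), loss = (y - target)**2
x, w, target = 0.5, -0.2, 1.0
pre = w * x                           # pre-activation
y = hard_threshold(pre)               # forward uses the true, discontinuous function
dloss_dy = 2 * (y - target)
dloss_dpre = ste_backward(dloss_dy)   # STE: bypass the step's zero gradient
dloss_dw = dloss_dpre * x             # chain rule continues as usual
```

The estimator is biased, but it gives a usable descent signal where the true gradient of the threshold is zero almost everywhere.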
===== Implicit Regularization of SGD =====

See also Soheil Feizi's lecture [[https://www.youtube.com/watch?v=hGRssrXF-vI&list=PLHgjs9ncvHi80UCSlSvQe-TK_uOyDv_Jf&index=4|The implicit bias of gradient descent]].

  * [[https://www.jmlr.org/papers/volume19/18-188/18-188.pdf|Soudry et al 2018 - The Implicit Bias of Gradient Descent on Separable Data]]
  * [[https://arxiv.org/pdf/1806.01796.pdf|Nacson et al 2018 - Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate]]

===== Miscellaneous Topics =====

==== Effect of Skip Connections ====

  * [[https://arxiv.org/pdf/1702.08591.pdf|Balduzzi et al 2017 - The Shattered Gradients Problem: If resnets are the answer, then what is the question?]]
  * [[https://arxiv.org/pdf/1701.09175.pdf|Orhan & Pitkow 2017 - Skip Connections Eliminate Singularities]]

===== People =====

  * [[https://scholar.google.com/citations?user=AEBWEm8AAAAJ&hl=en|Daniel Soudry]]

===== Related Pages =====

  * [[Optimization]]