Optimization Topics in Deep Learning
Effects on Optimization
- Batch normalization
- Makes the objective function and gradients smoother, i.e. more Lipschitz (Santurkar et al 2018 - How Does Batch Normalization Help Optimization?). For another perspective, see section 3 of De 2020.
- Weight normalization
- Improves the conditioning of the optimization problem (Salimans & Kingma 2016)
- Lipschitz Constant
- Qi et al 2023 - Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant. Discusses the effect of the Lipschitz constant on optimizing deep neural networks.
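The weight normalization reparameterization of Salimans & Kingma can be sketched in a few lines of numpy. This is only an illustrative sketch (the function name and sizes are made up here): the point is that w = g * v / ||v|| decouples the magnitude of a weight vector from its direction, which is what improves conditioning.

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||, decoupling the
    magnitude g of the weight vector from its direction v."""
    return g * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
v = rng.normal(size=5)
g = 2.0
w = weight_norm(v, g)

# The norm of w is controlled entirely by g, regardless of v's scale.
assert np.isclose(np.linalg.norm(w), g)
# Rescaling v leaves w unchanged -- only the direction of v matters.
assert np.allclose(weight_norm(10 * v, g), w)
```

Gradients with respect to g and v then separate cleanly, which is the mechanism behind the improved conditioning claimed in the paper.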
On Global Optimization of Neural Networks
- Du et al 2018 - Gradient Descent Provably Optimizes Over-parameterized Neural Networks “We show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function, for an m hidden node shallow neural network with ReLU activation and n training data.”
- Goldblum et al 2020 - Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory Examines optimization issues in deep learning and their relation to ML theory
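The over-parameterization result can be illustrated with a tiny numpy experiment in the spirit of Du et al's setting (fixed ±1/√m output layer, trained first layer, ReLU activations, quadratic loss). All sizes and the learning rate below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 10, 500             # n training points, input dim d, width m >> n
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm, pairwise non-parallel
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                        # trained first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed output layer

lr = 0.3
losses = []
for _ in range(500):
    h = X @ W.T                          # pre-activations, shape (n, m)
    r = np.maximum(h, 0.0) @ a - y       # residuals
    losses.append(0.5 * np.sum(r ** 2))
    # d(0.5*||r||^2)/dW_k = sum_i r_i * a_k * 1[h_ik > 0] * x_i
    grad = ((h > 0) * r[:, None] * a[None, :]).T @ X
    W -= lr * grad

# With large enough width, gradient descent drives the training
# loss toward zero despite the non-convex objective.
assert losses[-1] < 1e-2 * losses[0]
```

In the wide regime the activation patterns barely change during training, so the dynamics stay close to those of a linear system; that is the intuition behind the linear convergence rate in the paper.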
Symmetries and Local Minima
- Sussmann 1991 - Uniqueness of the weights for minimal feedforward nets with a given input-output map. Points out discrete symmetries in neural networks: swapping any two neurons in the same layer, or negating both the input weights and output weights of a neuron (for odd activations such as tanh).
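Both symmetries are easy to verify numerically for a one-hidden-layer tanh network (a minimal sketch; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))     # input -> hidden weights
W2 = rng.normal(size=(1, 4))     # hidden -> output weights

def net(x, W1, W2):
    return W2 @ np.tanh(W1 @ x)  # one-hidden-layer tanh network

x = rng.normal(size=3)
y = net(x, W1, W2)

# Symmetry 1: permute hidden neurons (rows of W1, columns of W2).
perm = [1, 0, 2, 3]
assert np.allclose(net(x, W1[perm], W2[:, perm]), y)

# Symmetry 2: negate one neuron's input and output weights
# (uses the fact that tanh(-z) = -tanh(z)).
W1n, W2n = W1.copy(), W2.copy()
W1n[0] *= -1
W2n[:, 0] *= -1
assert np.allclose(net(x, W1n, W2n), y)
```

Each symmetry maps a weight configuration to a different point in parameter space with identical loss, which is one reason minima come in large equivalent families.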
Properties of Neural Networks
Lipschitz Constant
If the weights are unbounded, the Lipschitz constant of the network is also unbounded (true even for logistic regression).
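For the logistic regression case this is immediate: for f(x) = sigmoid(w·x), the gradient at the origin is w/4, so the Lipschitz constant is at least ||w||/4 and grows without bound with the weights. A quick finite-difference check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f(x) = sigmoid(w . x) has slope ||w||/4 at the origin along the
# direction w/||w||, so its Lipschitz constant scales with ||w||.
for scale in [1.0, 10.0, 100.0]:
    w = scale * np.ones(2) / np.sqrt(2)     # ||w|| = scale
    u = w / scale                           # unit direction of w
    eps = 1e-6
    slope = (sigmoid(w @ (eps * u)) - sigmoid(0.0)) / eps
    assert np.isclose(slope, scale / 4.0, rtol=1e-3)
```

This is why Lipschitz bounds for networks are typically stated in terms of products of layer weight norms: without a bound on the weights there is no bound on the constant.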
Backpropagating Through Discontinuities
- Bengio et al 2013 - Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. Introduces the straight-through estimator.
- REINFORCE (the score-function estimator), an unbiased alternative for gradients through discrete sampling.
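The straight-through estimator fits in a few lines: the forward pass applies the hard nonlinearity, while the backward pass pretends it was the identity and passes the upstream gradient through unchanged. A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def binarize_forward(x):
    # Hard threshold: discontinuous, so the true gradient is 0 almost
    # everywhere -- useless for training.
    return (x > 0).astype(float)

def binarize_backward_ste(upstream):
    # Straight-through: treat the threshold as the identity in the
    # backward pass, so gradients flow as if y = x.
    return upstream

x = np.array([-0.5, 0.2, 1.3])
y = binarize_forward(x)                     # -> [0., 1., 1.]
g = binarize_backward_ste(np.ones_like(x))  # -> [1., 1., 1.]
```

In an autodiff framework the same trick is commonly written as `y = x + stop_gradient(hard(x) - x)`, which makes the forward value hard(x) while the gradient sees only the identity term x.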
Instabilities
Instability of Fine-tuning
- Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. Argues that catastrophic forgetting and the small size of fine-tuning datasets fail to explain fine-tuning instability, which instead stems from optimization difficulties and differences in generalization.
Implicit Regularization of SGD
See also Soheil Feizi's lecture "The implicit bias of gradient descent."
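A concrete instance of implicit bias that is easy to check: on an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-norm interpolating solution (the pseudoinverse solution), even though infinitely many solutions fit the data. A minimal numpy sketch (problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))    # underdetermined: many interpolating solutions
b = rng.normal(size=5)

# Gradient descent on 0.5*||A w - b||^2 from w = 0 only ever moves
# within the row space of A, so it converges to the minimum-norm
# interpolant rather than an arbitrary one.
w = np.zeros(20)
lr = 0.01
for _ in range(20000):
    w -= lr * A.T @ (A @ w - b)

w_min_norm = np.linalg.pinv(A) @ b
assert np.allclose(w, w_min_norm, atol=1e-6)
```

The same row-space argument underlies many implicit-regularization results: the algorithm, not the loss, selects which of the many zero-loss solutions is reached.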
Miscellaneous Topics
Effect of Skip Connections
People
Related Pages
ml/optimization_in_deep_learning.txt · Last modified: 2025/03/25 00:49 by jmflanig