Optimization Topics in Deep Learning
Effects on Optimization
- Batch normalization
- Makes the objective function and gradients smoother, i.e. more Lipschitz (Santurkar et al 2018 - How Does Batch Normalization Help Optimization?). For another perspective, see section 3 of De 2020.
- Weight normalization
- Improves the conditioning of the optimization problem (Salimans & Kingma 2016)
- Lipschitz Constant
- Qi et al 2023 - Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant. Discusses the effect of the Lipschitz constant on optimizing deep neural networks.
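The weight normalization reparameterization of Salimans & Kingma can be sketched in a few lines of numpy. This is only an illustrative sketch (the function name and sizes are made up here): the point is that w = g * v / ||v|| decouples the magnitude of a weight vector from its direction, which is what improves conditioning.

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization: w = g * v / ||v||, decoupling the
    magnitude g of the weight vector from its direction v."""
    return g * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
v = rng.normal(size=5)
g = 2.0
w = weight_norm(v, g)

# The norm of w is controlled entirely by g, regardless of v's scale.
assert np.isclose(np.linalg.norm(w), g)
# Rescaling v leaves w unchanged -- only the direction of v matters.
assert np.allclose(weight_norm(10 * v, g), w)
```

Gradients with respect to g and v then separate cleanly, which is the mechanism behind the improved conditioning claimed in the paper.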
On Global Optimization of Neural Networks
- Du et al 2018 - Gradient Descent Provably Optimizes Over-parameterized Neural Networks “We show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function, for an m hidden node shallow neural network with ReLU activation and n training data.”
- Goldblum et al 2020 - Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory Examines optimization issues in deep learning and their relation to ML theory
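The over-parameterization result can be illustrated with a tiny numpy experiment in the spirit of Du et al's setting (fixed ±1/√m output layer, trained first layer, ReLU activations, quadratic loss). All sizes and the learning rate below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 8, 10, 500             # n training points, input dim d, width m >> n
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm, pairwise non-parallel
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                        # trained first layer
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed output layer

lr = 0.3
losses = []
for _ in range(500):
    h = X @ W.T                          # pre-activations, shape (n, m)
    r = np.maximum(h, 0.0) @ a - y       # residuals
    losses.append(0.5 * np.sum(r ** 2))
    # d(0.5*||r||^2)/dW_k = sum_i r_i * a_k * 1[h_ik > 0] * x_i
    grad = ((h > 0) * r[:, None] * a[None, :]).T @ X
    W -= lr * grad

# With large enough width, gradient descent drives the training
# loss toward zero despite the non-convex objective.
assert losses[-1] < 1e-2 * losses[0]
```

In the wide regime the activation patterns barely change during training, so the dynamics stay close to those of a linear system; that is the intuition behind the linear convergence rate in the paper.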
Symmetries and Local Minima
- Sussmann 1991 - Uniqueness of the weights for minimal feedforward nets with a given input-output map. Points out discrete symmetries in neural networks: swapping any two neurons in the same layer, or negating both the input weights and output weights of a neuron (for odd activations such as tanh).
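Both symmetries are easy to verify numerically for a one-hidden-layer tanh network (a minimal sketch; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))     # input -> hidden weights
W2 = rng.normal(size=(1, 4))     # hidden -> output weights

def net(x, W1, W2):
    return W2 @ np.tanh(W1 @ x)  # one-hidden-layer tanh network

x = rng.normal(size=3)
y = net(x, W1, W2)

# Symmetry 1: permute hidden neurons (rows of W1, columns of W2).
perm = [1, 0, 2, 3]
assert np.allclose(net(x, W1[perm], W2[:, perm]), y)

# Symmetry 2: negate one neuron's input and output weights
# (uses the fact that tanh(-z) = -tanh(z)).
W1n, W2n = W1.copy(), W2.copy()
W1n[0] *= -1
W2n[:, 0] *= -1
assert np.allclose(net(x, W1n, W2n), y)
```

Each symmetry maps a weight configuration to a different point in parameter space with identical loss, which is one reason minima come in large equivalent families.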
Properties of Neural Networks
Lipschitz Constant
If the weights are unbounded, the Lipschitz constant of the network is also unbounded (true even for logistic regression).
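For the logistic regression case this is immediate: for f(x) = sigmoid(w·x), the gradient at the origin is w/4, so the Lipschitz constant is at least ||w||/4 and grows without bound with the weights. A quick finite-difference check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# f(x) = sigmoid(w . x) has slope ||w||/4 at the origin along the
# direction w/||w||, so its Lipschitz constant scales with ||w||.
for scale in [1.0, 10.0, 100.0]:
    w = scale * np.ones(2) / np.sqrt(2)     # ||w|| = scale
    u = w / scale                           # unit direction of w
    eps = 1e-6
    slope = (sigmoid(w @ (eps * u)) - sigmoid(0.0)) / eps
    assert np.isclose(slope, scale / 4.0, rtol=1e-3)
```

This is why Lipschitz bounds for networks are typically stated in terms of products of layer weight norms: without a bound on the weights there is no bound on the constant.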
Backpropagating Through Discontinuities
- Bengio et al 2013 - Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. Introduces the straight-through estimator.
- REINFORCE (the score-function estimator), an unbiased alternative for gradients through discrete sampling.
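The straight-through estimator fits in a few lines: the forward pass applies the hard nonlinearity, while the backward pass pretends it was the identity and passes the upstream gradient through unchanged. A minimal numpy sketch (function names are illustrative):

```python
import numpy as np

def binarize_forward(x):
    # Hard threshold: discontinuous, so the true gradient is 0 almost
    # everywhere -- useless for training.
    return (x > 0).astype(float)

def binarize_backward_ste(upstream):
    # Straight-through: treat the threshold as the identity in the
    # backward pass, so gradients flow as if y = x.
    return upstream

x = np.array([-0.5, 0.2, 1.3])
y = binarize_forward(x)                     # -> [0., 1., 1.]
g = binarize_backward_ste(np.ones_like(x))  # -> [1., 1., 1.]
```

In an autodiff framework the same trick is commonly written as `y = x + stop_gradient(hard(x) - x)`, which makes the forward value hard(x) while the gradient sees only the identity term x.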
Instabilities
Instability of Fine-tuning
- Mosbach et al 2020 - On the Stability of Fine-tuning BERT: Misconceptions, Explanations, and Strong Baselines. Argues that catastrophic forgetting and the small size of fine-tuning datasets fail to explain fine-tuning instability, which instead stems from optimization difficulties and differences in generalization.
Implicit Regularization of SGD
See also Soheil Feizi's lecture "The implicit bias of gradient descent."
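A concrete instance of implicit bias that is easy to check: on an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-norm interpolating solution (the pseudoinverse solution), even though infinitely many solutions fit the data. A minimal numpy sketch (problem sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 20))    # underdetermined: many interpolating solutions
b = rng.normal(size=5)

# Gradient descent on 0.5*||A w - b||^2 from w = 0 only ever moves
# within the row space of A, so it converges to the minimum-norm
# interpolant rather than an arbitrary one.
w = np.zeros(20)
lr = 0.01
for _ in range(20000):
    w -= lr * A.T @ (A @ w - b)

w_min_norm = np.linalg.pinv(A) @ b
assert np.allclose(w, w_min_norm, atol=1e-6)
```

The same row-space argument underlies many implicit-regularization results: the algorithm, not the loss, selects which of the many zero-loss solutions is reached.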
Miscellaneous Topics
Effect of Skip Connections
People
Related Pages
ml/optimization_in_deep_learning.txt · Last modified: 2025/03/25 00:49 by jmflanig