Optimization Topics in Deep Learning
Effects on Optimization
- Batch normalization
- Makes the objective function and its gradients smoother (more Lipschitz) (Santurkar et al. 2018 - How Does Batch Normalization Help Optimization?; see also section 3 of De 2020)
- Weight normalization
- Improves the conditioning of the optimization problem (Salimans & Kingma 2016)
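The batch-norm transform whose smoothing effect Santurkar et al. analyze can be sketched in a few lines; this is a minimal forward-pass illustration (training-mode statistics only, no running averages), not a full layer implementation:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization over the batch axis (axis 0).

    Normalizes each feature to zero mean / unit variance, then applies a
    learnable scale (gamma) and shift (beta). Santurkar et al. argue this
    reparameterization makes the loss landscape smoother (more Lipschitz).
    """
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 10))  # badly scaled activations
y = batch_norm(x, gamma=np.ones(10), beta=np.zeros(10))
print(y.mean(axis=0).round(6))  # ~0 per feature
print(y.std(axis=0).round(2))   # ~1 per feature
```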
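Weight normalization itself is a one-line reparameterization; a minimal sketch of the forward computation from Salimans & Kingma:

```python
import numpy as np

def weight_norm(v, g):
    """Weight normalization (Salimans & Kingma 2016).

    Reparameterizes a weight vector as w = g * v / ||v||, decoupling the
    direction (v / ||v||) from the magnitude (g); gradients are then taken
    with respect to v and g, which improves conditioning.
    """
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
w = weight_norm(v, g=2.0)
print(w)                  # points along v
print(np.linalg.norm(w))  # 2.0: the norm is carried entirely by g
```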
On Global Optimization of Neural Networks
- Du et al. 2018 - Gradient Descent Provably Optimizes Over-parameterized Neural Networks “We show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function, for an m hidden node shallow neural network with ReLU activation and n training data.”
- Goldblum et al. 2020 - Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory Examines optimization issues in deep learning and their relation to ML theory.
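The Du et al. setting can be illustrated numerically (this is only a toy demonstration under assumed hyperparameters, not the theorem): a wide one-hidden-layer ReLU network with a fixed random output layer, trained by full-batch gradient descent on the quadratic loss over unit-norm inputs. The loss should shrink steadily toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 2000  # samples, input dim, hidden width (m >> n)
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit inputs; no two parallel (a.s.)
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                        # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)   # fixed random output layer

lr = 0.1
losses = []
for _ in range(500):
    H = X @ W.T                      # pre-activations, shape (n, m)
    pred = np.maximum(H, 0.0) @ a    # shallow ReLU net output
    r = pred - y
    losses.append(0.5 * float(r @ r))          # quadratic loss
    # dL/dW[k] = sum_i r_i * a_k * 1[H_ik > 0] * x_i
    grad = ((r[:, None] * (H > 0)) * a).T @ X
    W -= lr * grad

print(losses[0], losses[-1])  # loss decreases substantially
```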
Properties of Neural Networks
Lipschitz Constant
If the weights are unbounded, the Lipschitz constant is also unbounded (true even for logistic regression).
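For logistic regression this is easy to see directly: the Lipschitz constant of x → sigmoid(w·x) is ||w|| · max|sigmoid'| = ||w||/4, attained near the decision boundary, so it grows without bound with the weights. A small numerical check:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

slopes = []
for scale in (1.0, 10.0, 100.0):
    w = scale * np.array([1.0, 0.0])
    # Two nearby points straddling the decision boundary, where the slope peaks.
    x1, x2 = np.array([1e-3, 0.0]), np.array([-1e-3, 0.0])
    slope = abs(sigmoid(w @ x1) - sigmoid(w @ x2)) / np.linalg.norm(x1 - x2)
    slopes.append(slope)
    print(scale, slope)  # observed slope approaches scale / 4
```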
Backpropagating Through Discontinuities
- Bengio et al. 2013 - Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation Introduces the straight-through estimator
- REINFORCE
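Both estimators can be compared on a toy stochastic binary neuron (this is an illustrative sketch with an assumed loss, not code from either paper): the sampling step b ~ Bernoulli(sigmoid(theta)) has no pathwise gradient, so the straight-through estimator pretends the threshold is the identity on the backward pass, while REINFORCE uses the score function. For this problem the true gradient of the expected loss is -p(1-p) = -0.25 at theta = 0, and both estimators recover it:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(b):
    # Toy loss: pushes the neuron toward b = 1.
    return (b - 1.0) ** 2

theta = 0.0
p = sigmoid(theta)
samples = rng.random(100_000) < p  # b in {0, 1}, b ~ Bernoulli(p)

# Straight-through estimator (Bengio et al. 2013): treat the discontinuous
# sampling step as identity on the backward pass, so
# dL/dtheta ≈ E[dL/db] * dp/dtheta  (pretending db/dp = 1).
ste_grad = np.mean(2.0 * (samples - 1.0)) * p * (1.0 - p)

# REINFORCE / score-function estimator:
# dL/dtheta = E[L(b) * d log P(b|theta)/dtheta], and for a Bernoulli-sigmoid
# parameterization d log P / dtheta = b - p.
reinforce_grad = np.mean(loss(samples) * (samples - p))

print(ste_grad, reinforce_grad)  # both near -0.25: increase theta to reduce loss
```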
ml/optimization_in_deep_learning.1623142275.txt.gz · Last modified: 2023/06/15 07:36 (external edit)