Table of Contents

Optimization Topics in Deep Learning

Optimization Topics in Deep Learning

Effects on Optimization

Batch normalization
- Makes the objective function and gradients smoother (more Lipschitz) (Santurkar 2018 - How Does Batch Normalization Help Optimization?. Another perspective: see section 3 of De 2020.)
Weight normalization
- Improves the conditioning of the optimization problem (Salimans & Kingma 2016)
Lipschitz Constant
- Qi et al 2023 - Understanding Optimization of Deep Learning via Jacobian Matrix and Lipschitz Constant Talks about the effect of the Lipschitz constant on optimizing deep neural networks.

On Global Optimization of Neural Networks

Choromanska et al 2014 - The Loss Surfaces of Multilayer Networks
Du et al 2018 - Gradient Descent Provably Optimizes Over-parameterized Neural Networks “We show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function, for an m hidden node shallow neural network with ReLU activation and n training data.”
Goldblum et al 2020 - Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory Examines optimization issues in deep learning and their relation to ML theory

Symmetries and Local Minima

Sussman 1991 - Uniqueness of the weights for minimal feedforward nets with a given input-output map Also points out some discrete symmetries in neural networks: swapping any two neurons in the same layer, negating the input weights and output weights for any neuron (if using tahn activation), etc.

Properties of Neural Networks

Yao et al 2019 - PYHESSIAN: Neural Networks Through the Lens of the Hessian

Lipschitz Constant

If the weights are unbounded, the Lipschitz constant will be unbouned (true even for logistic regression).

Backpropagating Through Discontinuities

Bengio et al 2013 - Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation Introduces straight-through estimator
REINFORCE
Peng et al 2018 - Backpropagating through Structured Argmax using a SPIGOT

Instabilities

Instability of Fine-tuning

Mosbach 2020 - ON THE STABILITY OF FINE-TUNING BERT: MISCONCEPTIONS, EXPLANATIONS, AND STRONG BASELINES catastrophic forgetting and small size of the fine-tuning datasets fail to explain the fine-tuning instability, which is caused by optimization difficulties and differences in generalization.

Implicit Regularization of SGD

See also Soheil Feizi's lecture The implicit bias of gradient descent.

Miscellaneous Topics

Effect of Skip Connections

People

Daniel Soudry

Related Pages

Optimization