  
===== Papers =====
  * [[https://direct.mit.edu/neco/article-abstract/6/3/469/5795/Alopex-A-Correlation-Based-Learning-Algorithm-for?redirectedFrom=fulltext|Unnikrishnan & Venugopal 1994 - Alopex: A Correlation-Based Learning Algorithm for Feedforward and Recurrent Neural Networks]] [[http://laspp.fri.uni-lj.si/brane/unnikrishnan94alopex.pdf|pdf]] A stochastic training algorithm that does not use gradients; instead, it looks at how stochastic changes in the weights correlate with changes in the loss function.  The paper claims it can be used with discontinuous activation functions.  This is a local search optimization method, similar to simulated annealing.  It is very simple to implement and can be parallelized.  Experiments in the paper show it is comparable to backprop in the number of iterations required.  Used in {{papers:forcada97p.pdf|Forcada 1997}}.
  * [[https://arxiv.org/pdf/1712.06567.pdf|Such et al 2017 - Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning]] Training neural networks with genetic algorithms instead of backprop
  * [[https://arxiv.org/pdf/1703.01785.pdf|Franceschi et al 2017 - Forward and Reverse Gradient-Based Hyperparameter Optimization]]
  * [[https://arxiv.org/pdf/2202.08587.pdf|Baydin et al 2022 - Gradients without Backpropagation]] [[https://github.com/orobix/fwdgrad|github]] Uses forward mode automatic differentiation to compute a "forward gradient" (no backward pass as in backprop).  Essentially it computes the directional derivative of the loss along a random direction.  When the random direction is scaled by this directional derivative, the result is an unbiased estimate of the true gradient, which they plug into stochastic gradient descent.  This has a number of important implications:
    * Because this doesn't require storing the whole computation graph like backprop does, computation nodes can be removed from the graph once they are no longer needed in further computation.  For example, each layer of a transformer can be thrown away once it has been used.  This could save GPU memory and perhaps allow much deeper networks.
    * They could have computed the finite differences approximation to the gradient by taking a small step in the random direction.  This would allow computing the change in loss for discontinuous functions.
    * The direction doesn't have to be sampled from a random normal - the components only need to be independent with zero mean and unit variance.  They could have sampled the components from {-1,1} (two discrete values).  This would allow them to optimize [[model_compression#binarized_neural_networks|binary neural networks]] with their technique.
    * Follow-up work: [[https://arxiv.org/pdf/2209.06302.pdf|Belouze 2022 - Optimization without Backpropagation]]
  * [[https://arxiv.org/pdf/2212.13345.pdf|Hinton 2022 - The Forward-Forward Algorithm: Some Preliminary Investigations]]
  * [[https://arxiv.org/pdf/2305.17333.pdf|Malladi et al 2023 - Fine-Tuning Language Models with Just Forward Passes]]

===== Related Pages =====
  * [[NN Training|Neural Network Training]]
  * [[ml:optimizers#Gradient-Free Optimizers]]
  
  