====== History of Machine Learning ======

===== Early History of ML (prior to 1980) =====

  * First use of the term "machine learning": [[https://en.wikipedia.org/wiki/Arthur_Samuel|Arthur Samuel]] [[http://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf|1959]] "Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed." Caveat: see [[https://hsm.stackexchange.com/questions/6423/who-coined-the-term-machine-learning|here]]
  * See also Machine Learning: An Artificial Intelligence Approach vol 1 by Michalski, Carbonell, and Mitchell, Chapter 1, Sec 1.4 and bibliography p. 20-23, and the Comprehensive Bibliography of ML p. 511-556.
  * Overviews
    * [[https://ntrs.nasa.gov/api/citations/19680014786/downloads/19680014786.pdf|Ho & Agrawala 1968 - On pattern classification algorithms: introduction and survey]]
    * [[https://web.archive.org/web/20190428055803id_/https://commons.erau.edu/cgi/viewcontent.cgi?article=2975&context=space-congress-proceedings|Fu 1969 - Learning Control Systems - Review and Outlook]]
    * [[https://www.sciencedirect.com/science/article/abs/pii/S0076539208604907|Fu 1970 - Statistical Pattern Recognition]]
  * Books
    * [[https://www.amazon.com/Learning-Machines-Foundations-Trainable-Pattern-Classifying/dp/0070465703|Nilsson 1965 - Learning Machines: Foundations of Trainable Pattern-Classifying Systems]]
    * [[https://mitpress.mit.edu/books/perceptrons|Minsky & Papert 1969 - Perceptrons: An Introduction to Computational Geometry]] [[https://mitpress.mit.edu/books/perceptrons-expanded-edition|1987 Expanded edition]]
    * [[https://www.amazon.com/Pattern-Recognition-N-Bongard/dp/0333116666|Bongard 1970 - Pattern Recognition]] [[http://neolobe.com/files/pattern_recognition-bongard.pdf|pdf]] (translated from Russian). Cited by Carbonell p. 518.
    * [[https://www.google.com/books/edition/Machine_Learning/BSbSBwAAQBAJ?hl=en&gbpv=0|Carbonell et al 1986 - Machine Learning: A Guide to Current Research]] Duplicate [[https://www.google.com/books/edition/_/I5GcpyPx3RwC?hl=en&sa=X&ved=2ahUKEwjmnrO6hdb4AhVSFTQIHdWrAE0Qre8FegQIERAG|here]] ISBN: 0-89838-214-9
  * Papers
    * [[https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-1809.1936.tb02137.x|Fisher 1936 - The Use of Multiple Measurements in Taxonomic Problems]] Introduced [[https://en.wikipedia.org/wiki/Linear_discriminant_analysis|Linear Discriminant Analysis (LDA)]] and the famous **[[https://archive.ics.uci.edu/ml/datasets/iris|Iris dataset]]** (see [[https://arxiv.org/pdf/1906.09436.pdf|Ghojogh 2019]]). This is the **classic supervised binary machine learning setup, where the features are assumed to be normally distributed**. More on LDA {{papers:lecture-note-lda.pdf|here}}.
    * {{papers:McCulloch_and_Pitts.pdf|McCulloch & Pitts 1943 - A Logical Calculus of the Ideas Immanent in Nervous Activity}}
    * [[https://uberty.org/wp-content/uploads/2015/07/Norbert_Wiener_Cybernetics.pdf|Wiener 1948 - Cybernetics]] See also [[http://web.mit.edu/esd.83/www/notebook/Cybernetics.PDF|here]]. Mostly control theory. Doesn't talk about learning, although chapters from the 1961 2nd edition do.
    * [[https://ieeexplore.ieee.org/document/6444817|McCulloch 1949 - The Brain As A Computing Machine]] Cited by [[http://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf|Samuel 1959]]
    * [[https://apps.dtic.mil/sti/citations/ADA800276|Fix & Hodges 1951 - Discriminatory Analysis - Nonparametric Discrimination - Consistency Properties]] [[https://apps.dtic.mil/sti/pdfs/ADA800276.pdf|pdf]] Introduces the **binary classification task** (with unknown non-parametric distributions) into the statistics literature. According to the abstract of [[https://apps.dtic.mil/sti/citations/ADA800391|Fix 1952]], which uses the same setup, "A classification procedure is worked out for the following situations. Two large samples, one from each of two populations, have been observed. An individual of unknown origin is to be classified..."
    * [[https://apps.dtic.mil/sti/citations/ADA800391|Fix & Hodges 1952 - Discriminatory Analysis - Nonparametric Discrimination: Small Sample Performance]] [[https://apps.dtic.mil/sti/pdfs/ADA800391.pdf|pdf]] First introduces the **k-nearest neighbors classifier** (cited as [53] by survey paper [[https://ntrs.nasa.gov/api/citations/19680014786/downloads/19680014786.pdf|Ho 1968]]). Was it implemented on a computer? Need to check.
    * [[https://www.tandfonline.com/doi/abs/10.1080/14786441208520256|Oettinger 1952 - Programming a digital computer to learn]] [[https://www.semanticscholar.org/paper/CXXIV.-Programming-a-digital-computer-to-learn-Oettinger/06f93ebcb1797f853b90c4954bc822d11d62e4d3|Semantic Scholar]] [[https://drive.google.com/uc?export=view&id=1RTKNUS-nR_ksFW9_PhfGksHwIZPKXSXQ|pdf (UCSC only)]] Perhaps one of the first experiments with a learning algorithm on a computer (besides [[https://apps.dtic.mil/sti/citations/ADA800391|Fix 1952]]). Experiments on the [[https://en.wikipedia.org/wiki/EDSAC|EDSAC computer]]. The idea was suggested to the author by Wilks. Introduces a "response-learning s-machine" (Sec 3), which is a **reinforcement learning** machine in the multi-armed bandit setting. The equations governing the machine are given in Sec 5 and trade off exploration against exploitation. Limitation: there are no input observations into the machine besides the reinforcement signal, so the learning is very limited. I believe there may be earlier experiments with electro-mechanical machines that do a similar thing (see the references).
    * Bush, R., and Mosteller, F., "Stochastic Models for Learning". John Wiley and Sons, 1955. Cited as [8] in [[https://web.archive.org/web/20190428055803id_/https://commons.erau.edu/cgi/viewcontent.cgi?article=2975&context=space-congress-proceedings|Fu 1969]].
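The Fix & Hodges entries above describe the two-sample setup (two labeled samples, one from each population, and an individual of unknown origin to classify) and the k-nearest-neighbors rule. A minimal sketch of the idea; the data, the Euclidean distance, and the choice k=3 are invented for illustration, not taken from the papers:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest
    training points under Euclidean distance."""
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Two small samples, one from each population, as in the Fix & Hodges setup;
# the query point plays the role of the "individual of unknown origin."
train = [((0.0, 0.0), "A"), ((0.2, 0.1), "A"), ((0.1, 0.3), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.2), "B"), ((1.1, 0.8), "B")]
print(knn_classify(train, (0.15, 0.1)))  # -> A
print(knn_classify(train, (1.0, 0.9)))   # -> B
```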
    * {{papers:rosenblatt-1957.pdf|Rosenblatt 1957 - The Perceptron - A Perceiving and Recognizing Automaton}} Introduces the **Perceptron algorithm**.
    * [[https://ieeexplore.ieee.org/document/5222035|Chow 1957 - An optimum character recognition system using decision functions]] [[https://drive.google.com/uc?export=view&id=1DvllYZNQF0haXx-2dtPG59Pg3U_6zdLP|pdf (UCSC only)]] Uses Bayes rule and minimizes the expected risk. Built on the theory of statistical decision functions, see citations 3-5. I believe they did not actually implement the system; the paper contains only theoretical results.
    * {{papers:rosenblatt-1958.pdf|Rosenblatt 1958 - The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain}} See also [[https://apps.dtic.mil/sti/pdfs/AD0256582.pdf|1961 book]].
    * [[https://dl.acm.org/doi/10.1147/rd.21.0002|Friedberg 1958 - A Learning Machine: Part I]] January 1958.
    * Andrews, A. M., "Learning Machines", Proceedings of the Symposium on the Mechanisation of Thought Processes, H.M. Stationery Office, London, England, 1959. [[https://www.google.com/books/edition/Mechanisation_of_Thought_Processes/3Kk5yAEACAAJ?hl=en|vol1]] [[https://www.google.com/books/edition/Mechanisation_of_Thought_Processes/Ip8hAQAAIAAJ?hl=en|vol2]] Symposium held Nov 24-27, 1958. Cited by Carbonell p. 515.
    * [[http://gpbib.cs.ucl.ac.uk/gp-html/DBLP_conf_ifip_KilburnGS59.html|Kilburn, Grimsdale & Sumner 1959 - Experiments in Machine Learning and Thinking]] {{papers:Kilburn-1959.pdf|pdf}} (June 15-20, 1959). Cited by Carbonell p. 530. Work started in Feb 1957 on the Mark I computer. Perhaps the first mention of the term "**machine learning**" in a paper that has experiments on a computer. First instance of **genetic programming**?
    * [[http://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf|Samuel 1959 - Some Studies in Machine Learning Using the Game of Checkers]] (July, 1959).
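The Rosenblatt entries above introduce the Perceptron. A minimal sketch of the mistake-driven update rule (the toy data and epoch cap are illustrative assumptions, not from the papers):

```python
def perceptron_train(data, epochs=20):
    """Rosenblatt-style perceptron: on each mistake, nudge the weights
    toward (or away from) the misclassified input.
    `data` is a list of (x, y) pairs with x a tuple and y in {-1, +1}."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:   # converged; guaranteed for separable data (Novikoff 1963)
            break
    return w, b

# Linearly separable toy data: label is +1 iff x1 + x2 > 1.
data = [((0.0, 0.0), -1), ((0.3, 0.2), -1),
        ((1.0, 1.0), +1), ((0.8, 0.9), +1)]
w, b = perceptron_train(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
print([predict(x) for x, _ in data])  # -> [-1, -1, 1, 1]
```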
    * [[https://www.sciencedirect.com/science/article/pii/S0019995859800140|Martens 1959 - Two notes on machine “Learning”]] {{papers:Martens-1959.pdf|pdf}}
    * [[https://apps.dtic.mil/sti/pdfs/AD0298258.pdf|Novikoff 1963 - On convergence proofs for perceptrons]]
    * [[https://www.sciencedirect.com/science/article/pii/S1474667017689444|Tsypkin 1968 - Generalized Linear Learning Algorithms and Their Applications]] Also Ya. Z. Tsypkin, "Generalized learning algorithms," Avtomatika i telemekhanika, No. 1, Moscow (1970).
    * [[https://ieeexplore.ieee.org/abstract/document/1099015|Tsypkin 1968 - Self-learning--What is it?]]
    * [[https://www.tandfonline.com/doi/abs/10.1080/01969727308545852?journalCode=ucbs19|Tsypkin 1972 - A Class of Learning Systems]]
    * [[https://www.ijcai.org/Proceedings/77-1/Papers/048.pdf|Mitchell 1977 - Version Spaces: A Candidate Elimination Approach to Rule Learning]]
    * [[https://www.cs.cmu.edu/~tom/pubs/VERSION_SPACES_MitchellPhDthesis.pdf|Mitchell 1978 - Version Spaces: An Approach to Concept Learning]] PhD thesis
  * Naive-Bayes Classifier
    * Todo: When was the Naive-Bayes classifier first introduced? When was the term "Naive-Bayes" first used to refer to this classifier?
    * See [[https://link.springer.com/content/pdf/10.1023/A:1007413511361.pdf|Domingos & Pazzani 1997 - On the Optimality of the Simple Bayesian Classifier under Zero-One Loss]] for a history. Says this very simple classifier has not been discussed much, presumably because of its simplicity.
    * [[https://ieeexplore.ieee.org/abstract/document/4082113|1968 - Adaptive Bayes Classification Model with Incompletely Specified Experiment]] Does this describe the NB classifier? Don't have access.
    * [[https://link.springer.com/content/pdf/10.1007/BF00116835.pdf|Clark & Niblett 1989 - The CN2 Induction Algorithm]] Perhaps the first description of the Naive-Bayes classifier in the machine learning literature? Calls it a "simple Bayes classifier."
    * I believe early versions of Duda and Hart may talk about NB

===== Theory (prior to 2000) =====

==== Early Theory (prior to 1980) ====

  * [[https://apps.dtic.mil/sti/pdfs/AD0298258.pdf|Novikoff 1963 - On convergence proofs for perceptrons]]
  * [[https://www.csee.umbc.edu/~lomonaco/f08/643/hwk643/Hoeffding.pdf|Hoeffding 1963 - Probability Inequalities for Sums of Bounded Random Variables]]
  * [[https://isl.stanford.edu/~cover/papers/transIT/0021cove.pdf|Cover & Hart 1967 - Nearest Neighbor Pattern Classification]] Shows that as the sample size goes to infinity, the nearest-neighbor rule has an error less than twice the error of the Bayes-optimal classifier.
  * [[https://isl.stanford.edu/~cover/papers/transIT/0050cove.pdf|Cover 1968 - Estimation by the Nearest Neighbor Rule]]
  * [[https://epubs.siam.org/doi/10.1137/1116025|Vapnik & Chervonenkis 1971 - On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities]] [[https://courses.engr.illinois.edu/ece544na/fa2014/vapnik71.pdf|pdf]]
  * Vapnik & Chervonenkis 1974 - Theory of Pattern Recognition. (Book in Russian)
  * [[https://ieeexplore.ieee.org/document/1056018|Devroye & Wagner 1976 - A distribution-free performance bound in error estimation]] [[http://luc.devroye.org/devroye_wagner_1976_a_distribution_free_performance_bound_in_error_estimation.pdf|pdf]] Cited by Blumer 1986
  * [[https://www.tandfonline.com/doi/abs/10.1080/03081077808960690|Pearl 1978 - On the connection between the complexity and credibility of inferred models]] [[https://ftp.cs.ucla.edu/pub/stat_ser/pearl-1978-connection.pdf|pdf]] Cited by Blumer 1986

==== Theory (1980-2000) ====

  * [[http://www.cs.cmu.edu/~tom/pubs/NeedForBias_1980.pdf|Mitchell 1980 - The Need for Biases in Learning Generalizations]] Introduces the concept of bias in machine learning
  * [[https://dl.acm.org/doi/10.1145/1968.1972|Valiant 1984 - A theory of the learnable]] [[https://people.mpi-inf.mpg.de/~mehlhorn/SeminarEvolvability/ValiantLearnable.pdf|pdf]] Introduced PAC learning

=== VC Theory ===

  * [[https://dl.acm.org/doi/pdf/10.1145/12130.12158|Blumer et al 1986 - Classifying Learnable Geometric Concepts with the Vapnik-Chervonenkis Dimension]] First to apply the VC dimension to PAC learning theory
  * {{papers:VC_MIT_notes.pdf|VC Theory notes from 1988}} From [[https://www.ccs.neu.edu/home/vip/teach/MLcourse/8_advanced/learning_theory/VC_MIT_notes.pdf|here]]
  * [[http://www2.denizyuret.com/ref/blumer/ft_gateway.cfm.pdf|Blumer et al 1989 - Learnability and the Vapnik-Chervonenkis Dimension]]
  * [[https://authors.library.caltech.edu/11996/1/ABUnc89.pdf|Abu-Mostafa 1989 - The Vapnik-Chervonenkis Dimension: Information versus Complexity in Learning]]

===== Neural Networks (prior to 2010) =====

  * Overviews: History of Neural Networks
    * [[https://arxiv.org/pdf/1404.7828.pdf|Schmidhuber 2014 - Deep Learning in Neural Networks: An Overview]] (Comprehensive history)
  * Early Neural Models
    * Sometimes also called connectionist models or artificial neural networks
    * [[https://onlinelibrary.wiley.com/doi/pdf/10.1207/s15516709cog0603_1|Feldman & Ballard 1982 - Connectionist Models and Their Properties]]
    * [[https://www.cs.utoronto.ca/~hinton/absps/pdp3.pdf|Hinton et al 1984 - Distributed Representations]] Based on Hinton 1982 CMU Tech Report CS-84-157.
  * Backpropagation
    * [[https://people.idsia.ch/~juergen/who-invented-backpropagation-2014.html|Schmidhuber 2014 - Who invented backprop]] (see this [[https://arxiv.org/pdf/1404.7828.pdf|arxiv paper]] for references)
    * [[https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470231616.app7|Widrow & Lehr 1990 - Thirty Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation]] "The first major extension of the feedforward neural network beyond Madaline I took place in 1971 when Werbos developed a backpropagation training algorithm which, in 1974, he first published in his doctoral dissertation [39]. Unfortunately, Werbos’s work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [41] and in 1985, published a report on it at M.I.T. [42]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [43,44] also rediscovered the technique and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known."
    * [[http://e.guigon.free.fr/rsc/book/BrysonHo75.pdf|Bryson & Ho 1969 - Applied Optimal Control]] Russell & Norvig (AI: A Modern Approach) cites this as an example of early work that used the concept of backprop
    * {{papers:werbos_thesis_1974.pdf|Werbos 1974 thesis}} Invented backprop, according to [[https://www.werbos.com/AD2004.pdf|Werbos 2004]] and Schmidhuber 2014
    * {{papers:Werbos-1981.pdf|Werbos 1981 - Applications of advances in nonlinear sensitivity analysis}} ([[https://link.springer.com/chapter/10.1007/BFb0006203|Springer]]) First paper to suggest backprop for neural nets (according to Schmidhuber 2014).
    * For a history, see [[https://www.werbos.com/AD2004.pdf|Werbos 2004 - Backwards Differentiation in AD and Neural Nets: Past Links and New Opportunities]] (Fascinating retrospective history)
    * Parker, 1985
    * LeCun, 1985
    * The paper that popularized it: [[http://www.cs.toronto.edu/~hinton/absps/naturebp.pdf|Rumelhart, Hinton & Williams 1986 - Learning Representations by Back-Propagating Errors (Letters to Nature)]]
      * [[https://apps.dtic.mil/sti/pdfs/ADA164453.pdf|1985 tech report]] (published in [[http://cognet.mit.edu/book/parallel-distributed-processing-volume-1|PDP vol 1 1986]]) Talks about recurrent neural networks
  * Universal approximation theorem
    * [[https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.441.7873&rep=rep1&type=pdf|Cybenko 1989 - Approximation by superpositions of a sigmoidal function]] First paper to show that neural networks with a single hidden layer can approximate any continuous function on a compact set
  * **Resources**
    * [[https://www.cs.utoronto.ca/~hinton/papers.html|Hinton's publications]] Contains papers that are hard to find.
  * **Old conferences and journals**
    * [[https://www.inns.org/ijcnn-past-conferences|International Joint Conference on Neural Networks (IJCNN) since 1987]]
    * Workshop on Natural Language Processing and Neural Networks (NLPNN): [[https://web.archive.org/web/20030814225052/https://www.math.ryukoku.ac.jp/~qma/activity/NLPNN99/index.html|1999]]
    * [[https://link.springer.com/journal/10559/volumes-and-issues|Cybernetics]] Translated from Kibernetika (in Russian). See [[http://www.kibernetika.org/aboutEU.html|here]].
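As a companion to the backpropagation entries above, here is a minimal sketch of back-propagating errors through a one-hidden-layer sigmoid network on XOR. The architecture, learning rate, step count, and random seed are illustrative assumptions, not drawn from any of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer, 4 units
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def mse():
    h = sigmoid(X @ W1 + b1)
    return float(((sigmoid(h @ W2 + b2) - y) ** 2).mean())

loss_before = mse()
for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: chain rule from the squared error back to each weight
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= h.T @ d_out;  b2 -= d_out.sum(axis=0)  # learning rate 1.0
    W1 -= X.T @ d_h;    b1 -= d_h.sum(axis=0)

print(f"MSE: {loss_before:.3f} -> {mse():.3f}")  # the error decreases with training
```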
===== Neural Networks in NLP =====

See [[nlp:History of NLP#Neural Networks in NLP]]

===== Misc Topics =====

===== Other Methods (Prior to 2010) =====

  * Support Vector Machines
    * {{papers:multi-classsupportvectormachines.pdf|Weston & Watkins 1998 - Multi-class Support Vector Machines}} Tech Report CSD-TR-98-04 [[https://sites.google.com/site/jeisongutierrez/Multi-ClassSupportVectorMachines.pdf|link]]

===== Optimization (prior to 1980) =====

Focusing on papers related to machine learning.

  * Stochastic Gradient Descent ([[https://en.wikipedia.org/wiki/Stochastic_gradient_descent|SGD]])
    * Classic overview in [[https://leon.bottou.org/publications/pdf/mlss-2003.pdf|Bottou 2003 - Stochastic Learning]]
    * **Note**: //early Russian papers use M instead of E to denote expectation.//
    * **History**: As far as I can tell, SGD as we know it was first introduced by [[https://link.springer.com/article/10.1007/BF01074504|Ermol'ev & Nekrylova 1966]], and stochastic sub-gradient descent (SSGD) was introduced by [[https://link.springer.com/article/10.1007/BF01071091|Ermol'ev 1969]]. These works built upon the stochastic approximation method proposed in [[https://projecteuclid.org/euclid.aoms/1177729586|Robbins & Monro 1951]], which was extended to one-dimensional maximization problems in [[https://www.jstor.org/stable/pdf/2236690.pdf?refreqid=excelsior%3Ab51dff976a6dd1adf20d42f807e627d1&ab_segments=&origin=|Kiefer & Wolfowitz 1952]] and to the multidimensional case in [[https://www.jstor.org/stable/pdf/2236657.pdf?casa_token=4A2-9PtFvfkAAAAA:eq2HNs8YA-VVzr1G6QTMfXbfG3FBPlDMXzcuCOVri9lFbafMK6LuS1JDed_1ploaH7DEd36QkEt2G_XCKpfKzcwnbB2rk3gZQIWH_aXnmX_bG1j1NAY|Blum 1954]] and [[https://www.ams.org/journals/proc/1958-009-03/S0002-9939-1958-0098426-7/S0002-9939-1958-0098426-7.pdf|Blum 1958]]. However, these methods used a finite-difference approximation to the stochastic gradient, in contrast to [[https://link.springer.com/article/10.1007/BF01074504|Ermol'ev & Nekrylova 1966]] (history from [[https://link.springer.com/article/10.1007/BF01071403|Ermol'ev 1966, Sec 12]]).
    * [[https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-23/issue-3/Stochastic-Estimation-of-the-Maximum-of-a-Regression-Function/10.1214/aoms/1177729392.full|Kiefer & Wolfowitz 1952 - Stochastic Estimation of the Maximum of a Regression Function]] [[https://www.jstor.org/stable/pdf/2236690.pdf?refreqid=excelsior%3Ab51dff976a6dd1adf20d42f807e627d1&ab_segments=&origin=|pdf]]
    * [[https://www.jstor.org/stable/pdf/2236657.pdf?casa_token=4A2-9PtFvfkAAAAA:eq2HNs8YA-VVzr1G6QTMfXbfG3FBPlDMXzcuCOVri9lFbafMK6LuS1JDed_1ploaH7DEd36QkEt2G_XCKpfKzcwnbB2rk3gZQIWH_aXnmX_bG1j1NAY|Blum 1954 - Multidimensional Stochastic Approximation Methods]] Not quite SGD (computes a finite-difference approximation to the gradient in a random direction). Does not assume access to a noisy gradient. See Sec 6.
    * [[https://www.ams.org/journals/proc/1958-009-03/S0002-9939-1958-0098426-7/S0002-9939-1958-0098426-7.pdf|Blum 1958 - A note on stochastic approximation]]
    * [[http://www.mathnet.ru/php/archive.phtml?wshow=paper&jrnid=tvp&paperid=523&option_lang=eng|Gladyshev 1965 - On the stochastic approximation]] [[http://www.mathnet.ru/links/146dcb1adeb625489b4a7704733b0920/tvp523.pdf|pdf]] In Russian. Cited by [[https://link.springer.com/article/10.1007/BF01074504|Ermol'ev & Nekrylova 1966]]. Perhaps this is SGD applied to least-squares regression?
    * [[https://link.springer.com/article/10.1007/BF01071403|Ermol'ev 1966 - Methods of solution of nonlinear extremal problems]] [[https://link.springer.com/content/pdf/10.1007/BF01071403.pdf|pdf]] See Sec 12 for the early history of SGD. Its overview of Blum 1954 shows that it is not quite SGD (it computes a finite-difference approximation to the gradient).
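To make the finite-difference distinction above concrete, here is a hedged sketch of a Kiefer-Wolfowitz-style scheme: it maximizes a regression function using only noisy //function values//, estimating the gradient by a central finite difference. The objective, noise level, and gain sequences are invented for illustration:

```python
import random

random.seed(0)
f = lambda x: -(x - 3.0) ** 2                      # unknown regression function, max at x = 3
noisy_f = lambda x: f(x) + random.gauss(0.0, 0.1)  # we only observe noisy values of f

x = 0.0
for n in range(1, 20001):
    a_n = 1.0 / n             # stepsize sequence
    c_n = 1.0 / n ** 0.25     # finite-difference width, shrinking slowly
    grad_est = (noisy_f(x + c_n) - noisy_f(x - c_n)) / (2.0 * c_n)
    x += a_n * grad_est       # ascent step; no gradient oracle is ever queried
print(round(x, 2))            # close to the maximizer x = 3
```

The contrast with SGD is in `grad_est`: two noisy function evaluations per step instead of one noisy gradient.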
    * **[[https://link.springer.com/article/10.1007/BF01074504|Ermol'ev & Nekrylova 1966 - Some methods of stochastic optimization]] [[https://link.springer.com/content/pdf/10.1007/BF01074504.pdf|pdf]]** Stochastic gradient descent for differentiable, convex functions. $\xi^{(s)}$ is the stochastic gradient, $\rho_s$ is the stepsize. Cited by [[https://link.springer.com/article/10.1007/BF01069037|Nekrylova 1974]]. I believe this is the introduction of SGD, since Blum 1954 uses a finite-differences method.
    * [[https://www.researchgate.net/publication/265400868_A_General_Method_for_Solving_Extremum_Problems|Polyak 1967 - A General Method for Solving Extremum Problems]] Talks about sub-gradient descent. Calls sub-gradients "support functionals." Cited by [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971]].
    * Ermoliev, Yu.M., and Z.V. Nekrylova. 1967. The Method of Stochastic Subgradients and Its Applications. Notes, Seminar on the Theory of Optimal Solution. Academy of Sciences of the U.S.S.R., Kiev. (Cited as [1] in [[http://pure.iiasa.ac.at/id/eprint/1759/1/WP-81-002.pdf|Ermoliev 1981]].)
    * [[https://link.springer.com/article/10.1007/BF01074757|Ermol'ev & Shor 1968 - Method of random walk for the two-stage problem of stochastic programming and its generalization]] [[https://link.springer.com/content/pdf/10.1007/BF01074757.pdf|pdf]]
    * [[https://link.springer.com/article/10.1007/BF01071485|Ermol'ev & Tuniev 1968 - Direct methods of solving some stochastic programming problems]] [[https://link.springer.com/content/pdf/10.1007/BF01071485.pdf|pdf]] Cited by [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971]]. Minimization of a linear program with stochastic constraints.
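In the notation of the Ermol'ev & Nekrylova entry above, the update is presumably $x^{(s+1)} = x^{(s)} - \rho_s \xi^{(s)}$, with $\xi^{(s)}$ the stochastic gradient and $\rho_s$ the stepsize. A toy sketch (the objective, data, and stepsize schedule are illustrative assumptions): minimizing $f(x) = E[(x-z)^2]$, whose minimizer is the sample mean:

```python
import random

random.seed(0)
data = [1.0, 2.0, 3.0, 4.0]   # f(x) = E[(x - z)^2] is minimized at mean(data) = 2.5

x = 0.0
for s in range(1, 10001):
    z = random.choice(data)   # draw one sample
    xi = 2.0 * (x - z)        # stochastic gradient xi^(s) of (x - z)^2
    rho = 1.0 / s             # stepsize rho_s: sum(rho) = inf, sum(rho^2) < inf
    x -= rho * xi
print(round(x, 2))            # close to the mean 2.5
```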
    * **[[https://link.springer.com/article/10.1007/BF01071091|Ermol'ev 1969 - On the method of generalized stochastic gradients and quasi-Féjer sequences]]** [[https://link.springer.com/content/pdf/10.1007/BF01071091.pdf|pdf]] I believe this may be the introduction of SSGD as we know it (stochastic sub-gradient descent). Calls sub-gradients "generalized gradient vectors," and calls the stochastic sub-gradient a "generalized stochastic gradient vector, or briefly, the stochastic quasi-gradient vector." Assumes convexity, since it works with sub-gradients. Cited by [[https://link.springer.com/article/10.1007/BF01071541|Nurminskii 1974]] and [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971]].
    * [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971 - Convergence rate of the method of generalized stochastic gradients]] [[https://link.springer.com/content/pdf/10.1007/BF01071048.pdf|pdf]] First proof of the rate of convergence of an SGD-like algorithm. (I have to look closer; I'm not sure it's actually SGD.) Assumes convexity. Notes: g(x) is the subgradient of f(x), called the reference function; it is called the support functional in [[https://www.researchgate.net/publication/265400868_A_General_Method_for_Solving_Extremum_Problems|Polyak 1967]] (citation [7]).
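A sketch of the stochastic sub-gradient idea on a convex but nondifferentiable objective (the objective, data, and stepsize are illustrative assumptions, not from Ermol'ev's paper): minimizing $f(x) = E|x - z|$, whose minimizer is the sample median, with sign(x - z) playing the role of the "generalized stochastic gradient vector":

```python
import random

random.seed(0)
data = [1.0, 2.0, 2.0, 3.0, 10.0]  # median 2.0; the mean (3.6) is pulled up by the outlier

x = 0.0
for s in range(1, 20001):
    z = random.choice(data)
    g = (x > z) - (x < z)          # a subgradient of |x - z|: sign(x - z), 0 at the kink
    x -= (1.0 / s ** 0.5) * g      # stepsize 1/sqrt(s), standard for subgradient methods
print(round(x, 1))                 # near the sample median 2.0, not the mean
```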
    * [[https://link.springer.com/article/10.1007/BF01071541|Nurminskii 1974 - Minimization of nondifferentiable functions in the presence of noise]] [[https://link.springer.com/content/pdf/10.1007/BF01071541.pdf|pdf]]
    * [[https://link.springer.com/article/10.1007/BF01069037|Nekrylova 1974 - Stochastic gradient method in infinite-dimensional space]] [[https://link.springer.com/content/pdf/10.1007/BF01069037.pdf|pdf]]
  * Stochastic Approximation Method
    * [[https://projecteuclid.org/euclid.aoms/1177729586|Robbins & Monro 1951 - A Stochastic Approximation Method]] {{papers:Robbins_Monro-1951.pdf|local copy}} This paper is often cited as the paper that introduced SGD (see for example [[https://leon.bottou.org/publications/pdf/mlss-2003.pdf|Bottou 2003]]). However, I would advocate against citing this paper as the originator of SGD, since it has the following limitations: it only treats one-dimensional problems, it makes a monotonicity assumption, it is a root-finding method, and they only apply it to the least-squares minimization problem of linear regression. This is a long way from SGD as formulated by [[https://leon.bottou.org/publications/pdf/nimes-1991.pdf|Bottou 1991]]. There must be a much better citation for the origin of SGD; I suggest [[https://leon.bottou.org/publications/pdf/nimes-1991.pdf|Bottou 1991]] or [[http://pure.iiasa.ac.at/id/eprint/1759/1/WP-81-002.pdf|Ermoliev 1981]].
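To illustrate the point above that Robbins & Monro is a one-dimensional root-finding scheme rather than gradient descent: the goal is to solve M(x) = α for a monotone regression function M observed only through noise. The particular function, noise level, and gain sequence below are made up for illustration:

```python
import random

random.seed(0)
M = lambda x: 2.0 * x + 1.0     # unknown monotone regression function
alpha = 5.0                     # we seek the root of M(x) = alpha, i.e. x = 2

x = 0.0
for n in range(1, 50001):
    y = M(x) + random.gauss(0.0, 1.0)  # a noisy observation of M at the current x
    x -= (1.0 / n) * (y - alpha)       # gain a_n = 1/n satisfies the Robbins-Monro conditions
print(round(x, 2))                     # converges to the root x = 2
```

Note that no gradient (noisy or otherwise) appears anywhere; the scheme chases a level crossing of M, which is why citing this paper as the origin of SGD is a stretch.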
    * [[https://projecteuclid.org/proceedings/berkeley-symposium-on-mathematical-statistics-and-probability/Proceedings-of-the-Third-Berkeley-Symposium-on-Mathematical-Statistics-and/Chapter/On-Stochastic-Approximation/bsmsp/1200501645|Dvoretzky 1956 - On Stochastic Approximation]] {{papers:Dvoretzky-1956.pdf|local copy}} More general convergence proofs for the stochastic approximation method
    * [[https://leon.bottou.org/publications/pdf/nimes-1991.pdf|Bottou 1991 - Stochastic Gradient Learning in Neural Networks]] Gives general proofs that SGD converges for non-convex problems such as neural networks
    * There is earlier work on the "stochastic quasi-gradient method," which is a generalization of SGD
      * Quasi-gradient method: [[https://link.springer.com/article/10.1007/BF01068677|Nurminskii 1973]]
      * [[http://pure.iiasa.ac.at/id/eprint/1759/1/WP-81-002.pdf|Ermoliev 1981 - Stochastic quasi-gradient method]], see [[https://web.archive.org/web/20171126210238/https://core.ac.uk/download/pdf/33893934.pdf|Ermoliev & Gaivoronski 1984]] The most general version of SGD (that Jeff knows about); works with constrained problems and biased gradient estimates. Reduces to SGD under normal circumstances.
      * See also the overview here: [[https://link.springer.com/article/10.1007/s10559-011-9363-x|2011 - Some scientific results of Yu. M. Ermoliev and his school in modern stochastic optimization theory]] [[https://link.springer.com/content/pdf/10.1007/s10559-011-9363-x.pdf|pdf]]
  * Second-order methods
    * {{papers:firstsecondordermethodsforlearning.pdf|Battiti 1992 - First- and second-order methods for learning: Between steepest descent and Newton’s method}}
    * [[http://yann.lecun.com/exdb/publis/pdf/becker-lecun-89.pdf|Becker & Le Cun 1988 - Improving the convergence of back-propagation learning with second-order methods]]

===== Related Pages =====

  * [[nlp:History of NLP]]