First use of the term “machine learning”: Arthur Samuel, 1959: “Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed.” Caveat: see here.
See also Machine Learning: An Artificial Intelligence Approach, vol. 1, by Michalski, Carbonell, and Mitchell: Chapter 1, Section 1.4 and its bibliography (pp. 20-23), and the Comprehensive Bibliography of ML (pp. 511-556).
Fix & Hodges 1951 - Discriminatory Analysis - Nonparametric Discrimination - Consistency Propertiespdf Introduces the binary classification task (with unknown non-parametric distributions) into the statistics literature. According to the abstract of Fix 1952, which uses the same setup, “A classification procedure is worked out for the following situations. Two large samples, one from each of two populations, have been observed. An individual of unknown origin is to be classified…”
Oettinger 1952 - Programming a digital computer to learn (Semantic Scholar; pdf, UCSC only) Perhaps one of the first experiments with a learning algorithm on a computer (besides Fix 1952). The experiments were run on the EDSAC computer. The idea was suggested to the author by Wilkes. Introduces a “response-learning s-machine” (Sec 3), which is a reinforcement learning machine in the multi-armed bandit setting. The equations governing the machine are given in Sec 5 and trade off exploration against exploitation. Limitation: the machine receives no input observations besides the reinforcement signal, so the learning is very limited. I believe there may be earlier experiments with electro-mechanical machines that do a similar thing (see the references). A modern sketch of this setting follows below.
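To make the setting concrete, here is a minimal sketch of a learner in a Bernoulli multi-armed bandit. To be clear, this is not Oettinger's machine or his update equations (those are in Sec 5 of the paper); the ε-greedy rule, the Bernoulli arms, and the function name are all my own choices. Like the s-machine, the learner sees only the reinforcement signal and must trade off exploration against exploitation.

```python
import random


def epsilon_greedy_bandit(arm_probs, steps=1000, eps=0.1, seed=0):
    """Epsilon-greedy learner for a Bernoulli multi-armed bandit.

    The learner sees only the reward signal (no other observations),
    mirroring the limitation noted above.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_probs)    # times each arm has been pulled
    values = [0.0] * len(arm_probs)  # running mean reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < eps:       # explore: pull a random arm
            arm = rng.randrange(len(arm_probs))
        else:                        # exploit: pull the best-looking arm
            arm = max(range(len(arm_probs)), key=lambda i: values[i])
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return values, total_reward


if __name__ == "__main__":
    est, total = epsilon_greedy_bandit([0.3, 0.7])
    print("estimated arm values:", est, "total reward:", total)
```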
Bush, R., and Mosteller, F., “Stochastic Models for Learning”. John Wiley and Sons, 1955. Cited as [8] in Fu 1969.
Andrew, A. M., “Learning Machines”, Proceedings of the Symposium on the Mechanisation of Thought Processes, H.M. Stationery Office, London, England, 1959. vol 1, vol 2. Symposium held Nov 24-27, 1958. Cited by Carbonell p. 515.
Woodrow & Lehr 1990 - Thirty Years of Adaptive Neural Networks: Perceptron, Madaline, and Backpropagation “The first major extension of the feedforward neural network beyond Madaline I took place in 1971 when Werbos developed a backpropagation training algorithm which, in 1974, he first published in his doctoral dissertation [39]. Unfortunately, Werbos’s work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [41] and in 1985, published a report on it at M.I.T. [42]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [43,44] also rediscovered the technique and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known.”
Note: early Russian papers use M instead of E to denote expectation (M for “mathematical expectation”; e.g., Mξ rather than E[ξ]).
History: As far as I can tell, SGD as we know it was first introduced by Ermol'ev & Nekrylova 1966, and stochastic sub-gradient descent (SSGD) was introduced by Ermol'ev 1969. These works built upon the stochastic approximation method proposed in Robbins & Monro 1951, which was extended to one-dimensional maximization problems in Kiefer & Wolfowitz 1952 and to the multidimensional case in Blum 1954 and Blum 1958. However, these methods used a finite-difference approximation to compute the stochastic gradient, in contrast to Ermol'ev & Nekrylova 1966. (History from Ermol'ev 1966, Sec 12.)
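To make the distinction concrete, here is my reading of the three update rules. The notation is mine, not taken from the original papers: $a_n$ and $c_n$ are step-size sequences, $y_n$ a noisy observation, $\xi_n$ a random sample, and $g$ a stochastic gradient.

```latex
\begin{align*}
&\text{Robbins--Monro (1-D root finding, solve } M(x) = \alpha\text{):}
  & x_{n+1} &= x_n + a_n\,(\alpha - y_n), \quad \mathbb{E}[y_n \mid x_n] = M(x_n) \\
&\text{Kiefer--Wolfowitz (finite-difference gradient estimate):}
  & x_{n+1} &= x_n + a_n\,\frac{y(x_n + c_n) - y(x_n - c_n)}{2 c_n} \\
&\text{SGD (direct stochastic gradient):}
  & x_{n+1} &= x_n - a_n\, g(x_n, \xi_n), \quad \mathbb{E}[g(x_n, \xi_n)] = \nabla f(x_n)
\end{align*}
```

The first two rules only ever see function values; the last assumes a sampled gradient is available directly, which is the step Ermol'ev & Nekrylova take.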
Ermoliev, Yu. M., and Z. V. Nekrylova. 1967. The Method of Stochastic Subgradients and Its Applications. Notes, Seminar on the Theory of Optimal Solution. Academy of Sciences of the U.S.S.R., Kiev. (Cited as [1] in Ermoliev 1981.)
Robins & Monroe 1951 - A Stochastic Approximation Methodlocal copy This paper is often cited as the paper that introduced SGD (see for example Bottou 2003). However, I would advocate against citing this paper as the originator of SGD, since it has the following limitations: it only treats 1 dimensional problems, it makes a monotonicity assumption, it is a root finding method, and they only apply it to the minimization problem of linear regression with least squares. This is a long way away from SGD as formulated by Bottou 1991. I think there must be a much better citation than this for the origination of SGD and I suggest Bottou 1991 or Ermoliev 1981 as perhaps a better citation.