===== Early History of ML (prior to 1980) =====
    * [[https://mitpress.mit.edu/books/perceptrons|Minsky & Papert 1969 - Perceptrons: An Introduction to Computational Geometry]] [[https://mitpress.mit.edu/books/perceptrons-expanded-edition|1987 Expanded edition]]
    * [[https://www.amazon.com/Pattern-Recognition-N-Bongard/dp/0333116666|Bongard 1970 - Pattern Recognition]] [[http://neolobe.com/files/pattern_recognition-bongard.pdf|pdf]] (translated from Russian). Cited by Carbonell p. 518.
    * [[https://www.google.com/books/edition/Machine_Learning/BSbSBwAAQBAJ?hl=en&gbpv=0|Carbonell et al 1986 - Machine Learning: A Guide to Current Research]] Duplicate [[https://www.google.com/books/edition/_/I5GcpyPx3RwC?hl=en&sa=X&ved=2ahUKEwjmnrO6hdb4AhVSFTQIHdWrAE0Qre8FegQIERAG|here]] ISBN: 0-89838-214-9
  * Papers
    * [[https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1469-1809.1936.tb02137.x|Fisher 1937 - The Use of Multiple Measurements in Taxonomic Problems]] Introduced [[https://en.wikipedia.org/wiki/Linear_discriminant_analysis|Linear Discriminant Analysis (LDA)]] and the famous **[[https://archive.ics.uci.edu/ml/datasets/iris|Iris dataset]]** (see [[https://arxiv.org/pdf/1906.09436.pdf|Ghojogh 2019]]). This is the **classic supervised binary machine learning setup, where the features are assumed to be normally distributed**. More on LDA {{papers:lecture-note-lda.pdf|here}}.
    * [[https://dl.acm.org/doi/10.1147/rd.21.0002|Friedberg 1958 - A Learning Machine: Part I]] January 1958.
    * Andrews, A. M., "Learning Machines", Proceedings of the Symposium on the Mechanisation of Thought Processes, H.M. Stationery Office, London, England, 1959. [[https://www.google.com/books/edition/Mechanisation_of_Thought_Processes/3Kk5yAEACAAJ?hl=en|vol1]] [[https://www.google.com/books/edition/Mechanisation_of_Thought_Processes/Ip8hAQAAIAAJ?hl=en|vol2]] Symposium held Nov 24-27, 1958. Cited by Carbonell p. 515.
    * [[http://gpbib.cs.ucl.ac.uk/gp-html/DBLP_conf_ifip_KilburnGS59.html|Kilburn, Grimsdale & Sumner 1959 - Experiments in Machine Learning and Thinking]] {{papers:Kilburn-1959.pdf|pdf}} (June 15-20, 1959). Cited by Carbonell p. 530. Work started in Feb 1957 on the Mark I computer. Perhaps the first mention of the term "**machine learning**" in a paper that has experiments on a computer. First instance of **genetic programming**?
    * [[http://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf|Samuel 1959 - Some Studies in Machine Learning Using the Game of Checkers]] (July 1959).
    * [[https://www.sciencedirect.com/science/article/pii/S0019995859800140|Martens 1959 - Two notes on machine “Learning”]] {{papers:Martens-1959.pdf|pdf}}
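Fisher's classic binary setup noted above (two classes whose features are assumed normally distributed) can be sketched as a minimal two-class LDA. This is only an illustrative sketch on synthetic data, not code from any of the papers; all numbers and names below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes with a shared covariance -- Fisher's classic setup.
# (Synthetic stand-in for a dataset like Iris.)
n = 200
cov = [[1.0, 0.3], [0.3, 1.0]]
X0 = rng.multivariate_normal([0.0, 0.0], cov, n)
X1 = rng.multivariate_normal([2.0, 2.0], cov, n)
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

# LDA discriminant direction w = Sigma^{-1} (mu1 - mu0), with the decision
# threshold at the midpoint of the projected class means.
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
pooled = 0.5 * (np.cov(X0.T) + np.cov(X1.T))   # pooled covariance estimate
w = np.linalg.solve(pooled, mu1 - mu0)
threshold = w @ (mu0 + mu1) / 2.0

pred = (X @ w > threshold).astype(float)
print(f"training accuracy: {(pred == y).mean():.3f}")
```

When the Gaussian assumption with shared covariance holds, this linear rule is the Bayes-optimal classifier up to estimation error in the means and covariance.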
  * [[https://isl.stanford.edu/~cover/papers/transIT/0021cove.pdf|Cover & Hart 1967 - Nearest Neighbor Pattern Classification]] Shows that as the sample size goes to infinity, the KNN error is less than twice the Bayes optimal classifier error.
  * [[https://isl.stanford.edu/~cover/papers/transIT/0050cove.pdf|Cover 1968 - Estimation by the Nearest Neighbor Rule]]
  * [[https://epubs.siam.org/doi/10.1137/1116025|Vapnik & Chervonenkis 1971 - On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities]] [[https://courses.engr.illinois.edu/ece544na/fa2014/vapnik71.pdf|pdf]]
  * Vapnik & Chervonenkis 1974 - Theory of Pattern Recognition. (Book in Russian)
  * [[https://ieeexplore.ieee.org/document/1056018|Devroye & Wagner 1976 - A distribution-free performance bound in error estimation]] [[http://luc.devroye.org/devroye_wagner_1976_a_distribution_free_performance_bound_in_error_estimation.pdf|pdf]] Cited by Blumer 1986
  * [[https://www.tandfonline.com/doi/abs/10.1080/03081077808960690|Pearl 1978 - On the connection between the complexity and credibility of inferred models]] [[https://ftp.cs.ucla.edu/pub/stat_ser/pearl-1978-connection.pdf|pdf]] Cited by Blumer 1986
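The Cover & Hart result above (asymptotically, 1-NN error is at most twice the Bayes error) can be checked on a toy problem where the Bayes error is known in closed form. The simulation below is illustrative only: the bound is asymptotic, and a finite-sample estimate merely tends to respect it.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Two 1-D Gaussian classes, N(0,1) vs N(2,1), with equal priors.
# The Bayes rule thresholds at x = 1, so the Bayes error is Phi(-1).
bayes_err = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))

def sample(n):
    y = rng.integers(0, 2, n)
    x = rng.normal(2.0 * y, 1.0)
    return x, y

xtr, ytr = sample(2000)
xte, yte = sample(4000)

# 1-NN rule: label each test point with the label of its nearest training point.
nearest = np.abs(xte[:, None] - xtr[None, :]).argmin(axis=1)
knn_err = (ytr[nearest] != yte).mean()

print(f"Bayes error: {bayes_err:.3f}")
print(f"1-NN error : {knn_err:.3f}  (Cover-Hart bound: {2 * bayes_err:.3f})")
```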
  
===== Misc Topics =====

===== Other Methods (Prior to 2010) =====
  * Support Vector Machines
    * {{papers:multi-classsupportvectormachines.pdf|Weston & Watkins 1998 - Multi-class Support Vector Machines}} Tech Report CSD-TR-98-04 [[https://sites.google.com/site/jeisongutierrez/Multi-ClassSupportVectorMachines.pdf|link]]

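The Weston & Watkins formulation trains all class scorers jointly: every wrong class must score at least a margin of 2 below the true class, or it contributes to the loss. A minimal sketch of that hinge, trained by full-batch subgradient descent; the margin of 2 follows their tech report, while the data, step size, and regularization strength are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Three linearly separable 2-D blobs (synthetic, illustrative data).
n, k = 60, 3
centers = np.array([[0.0, 4.0], [4.0, 0.0], [-4.0, -4.0]])
X = np.vstack([rng.normal(c, 0.5, size=(n, 2)) for c in centers])
X = np.hstack([X, np.ones((len(X), 1))])   # constant feature absorbs the bias
y = np.repeat(np.arange(k), n)
N = len(y)

W = np.zeros((k, X.shape[1]))
lr, lam, margin = 0.01, 1e-3, 2.0
for _ in range(300):
    scores = X @ W.T                                 # (N, k) class scores
    gap = margin - (scores[np.arange(N), y][:, None] - scores)
    gap[np.arange(N), y] = 0.0                       # true class never violates
    active = (gap > 0).astype(float)                 # active margin violations
    G = active.T @ X                                 # subgradient w.r.t. wrong classes
    counts = active.sum(axis=1)
    for c in range(k):                               # ...and w.r.t. true classes
        G[c] -= (counts * (y == c)) @ X
    W -= lr * (G / N + 2 * lam * W)

pred = (X @ W.T).argmax(axis=1)
print(f"training accuracy: {(pred == y).mean():.3f}")
```

The more common one-vs-rest reduction trains k independent binary SVMs instead; the "all together" loss above couples the scorers through shared margin constraints.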
===== Optimization (prior to 1980) =====
Focusing on papers related to machine learning.
  * Stochastic Gradient Descent ([[https://en.wikipedia.org/wiki/Stochastic_gradient_descent|SGD]])
    * Classic overview in [[https://leon.bottou.org/publications/pdf/mlss-2003.pdf|Bottou 2003 - Stochastic Learning]]
    * **Note**: //early Russian papers use M instead of E to denote expectation.//
    * **History**: As far as I can tell, SGD as we know it was first introduced by [[https://link.springer.com/article/10.1007/BF01074504|Ermol'ev & Nekrylova 1966]], and stochastic sub-gradient descent (SSGD) was introduced by [[https://link.springer.com/article/10.1007/BF01071091|Ermol'ev 1969]]. These works built upon the stochastic approximation method proposed in [[https://projecteuclid.org/euclid.aoms/1177729586|Robbins & Monro 1951]], which was extended to one-dimensional maximization problems in [[https://www.jstor.org/stable/pdf/2236690.pdf?refreqid=excelsior%3Ab51dff976a6dd1adf20d42f807e627d1&ab_segments=&origin=|Kiefer & Wolfowitz 1952]] and to the multidimensional case in [[https://www.jstor.org/stable/pdf/2236657.pdf?casa_token=4A2-9PtFvfkAAAAA:eq2HNs8YA-VVzr1G6QTMfXbfG3FBPlDMXzcuCOVri9lFbafMK6LuS1JDed_1ploaH7DEd36QkEt2G_XCKpfKzcwnbB2rk3gZQIWH_aXnmX_bG1j1NAY|Blum 1954]] and [[https://www.ams.org/journals/proc/1958-009-03/S0002-9939-1958-0098426-7/S0002-9939-1958-0098426-7.pdf|Blum 1958]]. However, these methods used a finite-difference approximation to compute the stochastic gradient, in contrast to [[https://link.springer.com/article/10.1007/BF01074504|Ermol'ev & Nekrylova 1966]]. (History from [[https://link.springer.com/article/10.1007/BF01071403|Ermol'ev 1966, Sec 12]].)
    * [[https://projecteuclid.org/journals/annals-of-mathematical-statistics/volume-23/issue-3/Stochastic-Estimation-of-the-Maximum-of-a-Regression-Function/10.1214/aoms/1177729392.full|Kiefer & Wolfowitz 1952 - Stochastic Estimation of the Maximum of a Regression Function]] [[https://www.jstor.org/stable/pdf/2236690.pdf?refreqid=excelsior%3Ab51dff976a6dd1adf20d42f807e627d1&ab_segments=&origin=|pdf]]
    * [[https://www.jstor.org/stable/pdf/2236657.pdf?casa_token=4A2-9PtFvfkAAAAA:eq2HNs8YA-VVzr1G6QTMfXbfG3FBPlDMXzcuCOVri9lFbafMK6LuS1JDed_1ploaH7DEd36QkEt2G_XCKpfKzcwnbB2rk3gZQIWH_aXnmX_bG1j1NAY|Blum 1954 - Multidimensional Stochastic Approximation Methods]] Not quite SGD (computes a finite-differences approximation to the gradient in a random direction). Does not assume access to a noisy gradient. See Sect. 6.
    * [[https://link.springer.com/article/10.1007/BF01074757|Ermol'ev & Shor 1968 - Method of random walk for the two-stage problem of stochastic programming and its generalization]] [[https://link.springer.com/content/pdf/10.1007/BF01074757.pdf|pdf]]
    * [[https://link.springer.com/article/10.1007/BF01071485|Ermol'ev & Tuniev 1968 - Direct methods of solving some stochastic programming problems]] [[https://link.springer.com/content/pdf/10.1007/BF01071485.pdf|pdf]] Cited by [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971]]. Minimization of a linear program with stochastic constraints.
    * **[[https://link.springer.com/article/10.1007/BF01071091|Ermol'ev 1969 - On the method of generalized stochastic gradients and quasi-Féjer sequences]]** [[https://link.springer.com/content/pdf/10.1007/BF01071091.pdf|pdf]] I believe this may be the introduction of SSGD (stochastic sub-gradient descent) as we know it. Calls sub-gradients "generalized gradient vectors," and calls the stochastic sub-gradient a "generalized stochastic gradient vector, or briefly, the stochastic quasi-gradient vector." Assumes convexity, since it assumes a sub-gradient. Cited by [[https://link.springer.com/article/10.1007/BF01071541|Nurminskii 1974]] and [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971]].
    * [[https://link.springer.com/article/10.1007/BF01071048|Guseva 1971 - Convergence rate of the method of generalized stochastic gradients]] [[https://link.springer.com/content/pdf/10.1007/BF01071048.pdf|pdf]] First proof of a rate of convergence of an SGD-like algorithm. (Have to look closer; I'm not sure if it's actually SGD.) Assumes convexity. Notes: g(x) is the subgradient of f(x), called the reference function. It is called the support functional in [[https://www.researchgate.net/publication/265400868_A_General_Method_for_Solving_Extremum_Problems|Polyak 1967]] (citation [7]).
    * [[https://link.springer.com/article/10.1007/BF01071541|Nurminskii 1974 - Minimization of nondifferentiable functions in the presence of noise]] [[https://link.springer.com/content/pdf/10.1007/BF01071541.pdf|pdf]]
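The history above distinguishes finite-difference stochastic approximation (Kiefer & Wolfowitz, Blum) from descent along a directly observed stochastic gradient (Ermol'ev & Nekrylova). A minimal sketch of both on a noisy least-squares problem; the step-size and probe-width schedules are illustrative assumptions, not taken from the papers.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy objective f(w) = E[(x.w - y)^2], with y = x.w* + noise.
d = 5
w_star = rng.normal(size=d)

def sample():
    x = rng.normal(size=d)
    return x, x @ w_star + 0.1 * rng.normal()

def sgd(steps=5000):
    """SGD: descend along the exact gradient of the per-sample loss."""
    w = np.zeros(d)
    for t in range(1, steps + 1):
        x, y = sample()
        w -= (0.5 / (10 + t)) * 2 * (x @ w - y) * x
    return w

def kiefer_wolfowitz(steps=5000):
    """Kiefer-Wolfowitz style: estimate each partial derivative from noisy
    function values by finite differences (no gradient oracle at all)."""
    w = np.zeros(d)
    for t in range(1, steps + 1):
        c = (1 + t) ** -0.25                      # shrinking probe width
        g = np.zeros(d)
        for i in range(d):
            e = np.zeros(d)
            e[i] = c
            xp, yp = sample()                     # noisy loss at w + e
            xm, ym = sample()                     # noisy loss at w - e
            g[i] = ((xp @ (w + e) - yp) ** 2 - (xm @ (w - e) - ym) ** 2) / (2 * c)
        w -= (0.5 / (10 + t)) * g
    return w

w_sgd = sgd()
w_kw = kiefer_wolfowitz()
print("SGD error:", np.linalg.norm(w_sgd - w_star))
print("K-W error:", np.linalg.norm(w_kw - w_star))
```

Both converge toward w*, but the finite-difference scheme pays d function-value probes per step and suffers extra variance from the shrinking probe width, which is precisely the contrast the history above draws.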
ml/history_of_ml.1656621834.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
