====== Experimental Method and Reproducibility ======

===== Reproducibility =====

  * [[https://www.aaai.org/GuideBook2018/17248-73943-GB.pdf|Gundersen & Kjensmo 2018 - State of the Art: Reproducibility in Artificial Intelligence]]
  * [[https://www.nature.com/articles/s41562-016-0021.pdf|2017 - A Manifesto for Reproducible Science]] Nice overview [[https://arxiv.org/pdf/2103.06944.pdf|here]]
  * [[https://arxiv.org/pdf/1909.03004.pdf|Dodge et al 2019 - Show Your Work: Improved Reporting of Experimental Results]] Introduces **reproducibility checklists** (see [[experimental_method#reproducibility_checklists_datasheets_and_model_cards|below]]) and also gives a procedure for estimating whether one model is better than another at various hyper-parameter tuning budgets (**expected validation performance**). Their suggested estimator is a good one to use; see [[https://aclanthology.org/2021.findings-emnlp.342.pdf|Dodge et al 2021]]. Follow-up work in this area:
    * [[https://arxiv.org/pdf/2002.06305.pdf|Dodge et al 2020 - Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping]] The instability they report can largely be mitigated by training for more epochs; see [[https://arxiv.org/pdf/2006.04884.pdf|Mosbach 2020]]
    * [[https://arxiv.org/pdf/2004.13705.pdf|Tang et al 2020 - Showing Your Work Doesn’t Always Work]] Don't use. Introduces an unbiased estimator that has high variance; see the follow-up work: [[https://aclanthology.org/2021.findings-emnlp.342.pdf|Dodge 2021]]
    * [[https://aclanthology.org/2021.findings-emnlp.342.pdf|Dodge et al 2021 - Expected Validation Performance and Estimation of a Random Variable’s Maximum]] Compares to [[https://arxiv.org/pdf/2004.13705.pdf|Tang 2020]] and recommends using the procedure and estimator from [[https://arxiv.org/pdf/1909.03004.pdf|Dodge 2019]].
  * [[https://arxiv.org/pdf/2103.06944.pdf|Miltenburg et al 2021 - Preregistering NLP research]] There are issues with this idea.
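To make expected validation performance concrete, here is a minimal sketch of the ECDF plug-in estimator from Dodge et al 2019 (the validation scores below are made up for illustration):

```python
import numpy as np

def expected_max_performance(scores, k):
    """Expected max validation score over k hyperparameter trials,
    using the ECDF plug-in estimator from Dodge et al 2019."""
    v = np.sort(np.asarray(scores, dtype=float))  # ascending order statistics
    n = len(v)
    i = np.arange(1, n + 1)
    # P(max of k draws from the ECDF equals the i-th order statistic)
    p = (i / n) ** k - ((i - 1) / n) ** k
    return float(np.sum(p * v))

# Hypothetical validation accuracies from 10 random hyperparameter trials
scores = [0.81, 0.79, 0.84, 0.78, 0.82, 0.80, 0.85, 0.77, 0.83, 0.80]
curve = [expected_max_performance(scores, k) for k in (1, 5, 10)]
```

With ''k=1'' this reduces to the mean score, and as ''k'' grows it approaches the best observed score, so plotting the curve against ''k'' shows how each model's best-found performance scales with tuning budget.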
  * [[https://arxiv.org/pdf/2106.15195.pdf|Marie et al 2021 - Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers]]

===== Statistical Significance =====

See also [[stats:Statistical Tests]].

For an overview of applying tests of statistical significance to NLP, see:
  * **NLP 203 slides on statistical significance**: [[https://drive.google.com/file/d/1e4qtAgF_xAtMUSR7xLKzEdaRT9tOzizo/view|Spring 2021]]
  * Section 11.3 from [[http://www.phontron.com/class/mtandseq2seq2018/assets/slides/mt-fall2018.chapter11.pdf|here]] (applied to MT, but the same techniques are used elsewhere in NLP)
  * [[https://cs.stanford.edu/people/wmorgan/sigtest.pdf|Slides from Stanford NLP Group]]
  * [[https://www.aclweb.org/anthology/P18-1128.pdf|Dror et al 2018 - The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing]] and [[https://arxiv.org/pdf/1809.01448.pdf|Appendix - Recommended Statistical Significance Tests for NLP Tasks]]
  * Section 4.4.3 (Classifier comparison and statistical significance) from [[https://github.com/jacobeisenstein/gt-nlp-class/raw/master/notes/eisenstein-nlp-notes.pdf|Eisenstein's book]]
  * Appendix 3 from [[https://www.morganclaypool.com/doi/abs/10.2200/S00361ED1V01Y201105HLT013?journalCode=hlt&|Noah Smith's book]] (available for free through the UCSC library)
  * [[https://aclanthology.org/P19-1266.pdf|Dror et al 2019 - Deep Dominance - How to Properly Compare Deep Neural Models]] Caveat: some researchers have advocated tuning the random seed as a hyperparameter; see [[nlp:experimental_method#Effects of the Random Seed]]
  * [[https://arxiv.org/pdf/2204.06815.pdf|Ulmer et al 2022 - Deep-Significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks]]
  * [[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html|Wilcoxon Signed-Rank Test Docs in SciPy]]. An issue to consider is how to handle outcomes in which systems A and B produce the same prediction/score; see the ''zero_method'' parameter and the associated links.

==== Bootstrap Resampling and Permutation Tests ====

  * [[https://www.jeffreycjohnson.org/app/download/764734156/cimeth.PDF|Noreen 1989 - Computer Intensive Methods for Testing Hypotheses: An Introduction]] Excellent overview
  * [[https://statweb.stanford.edu/~tibs/stat315a/Supplements/bootstrap.pdf|Bootstrap Methods and Permutation Tests]]
  * [[http://www-stat.wharton.upenn.edu/~stine/research/spida_2005.pdf|Bootstrap Resampling Slides]]
  * Chapter 3 of [[https://web.stanford.edu/class/ee378a/books/book2.pdf|All of Nonparametric Statistics]]

==== Papers ====

See also [[https://www.aclweb.org/anthology/search/?q=statistical+significance|ACL Anthology - statistical significance]].
  * [[https://www.aclweb.org/anthology/W04-3250.pdf|Koehn 2004 - Statistical Significance Tests for Machine Translation Evaluation]] Advocates resampling the test set to estimate statistical significance. Widely used in MT.
  * [[https://www.aclweb.org/anthology/P11-2031.pdf|Clark et al 2011 - Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability]] Great paper. Someone should redo this paper for the deep learning era (and take into account [[https://arxiv.org/pdf/1909.03004.pdf|Dodge 2019]]).
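A minimal sketch of Koehn-style paired bootstrap resampling, operating on per-example scores (the function name and the use of per-example accuracies rather than resampled metric sufficient statistics are illustrative assumptions, not the procedure from any one paper):

```python
import numpy as np

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Paired bootstrap on per-example scores of systems A and B.
    Returns the fraction of resampled test sets on which A outscores B;
    a fraction >= 0.95 is commonly read as significance at p < 0.05."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    assert a.shape == b.shape, "scores must be paired on the same test set"
    rng = np.random.default_rng(seed)
    n = len(a)
    wins = 0
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        if a[idx].mean() > b[idx].mean():
            wins += 1
    return wins / n_resamples
```

Note that for corpus-level metrics such as BLEU you would resample per-sentence sufficient statistics and recompute the metric on each resample, rather than averaging per-sentence scores as done here.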
  * [[http://www.aclweb.org/anthology/D/D12/D12-1091.pdf|Berg-Kirkpatrick et al 2012 - An Empirical Investigation of Statistical Significance in NLP]]
  * [[https://www.aclweb.org/anthology/Q17-1033.pdf|Dror et al 2017 - Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets]]
  * [[https://www.aclweb.org/anthology/P18-1128.pdf|Dror et al 2018 - The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing]] and [[https://arxiv.org/pdf/1809.01448.pdf|Appendix - Recommended Statistical Significance Tests for NLP Tasks]]
  * [[https://www.aclweb.org/anthology/2020.aacl-demo.7.pdf|Zhu et al 2020 - NLPStatTest: A Toolkit for Comparing NLP System Performance]]

==== Software ====

  * [[https://github.com/rtmdrr/testSignificanceNLP|testSignificanceNLP]] (Recommended) [[https://aclanthology.org/P18-1128.pdf|Paper]] and [[https://arxiv.org/pdf/1809.01448.pdf|Summary of recommended tests]]
  * [[https://nlpstats.ling.washington.edu/home|NLPStatTest]]. [[https://www.aclweb.org/anthology/2020.aacl-demo.7.pdf|Paper]]
  * [[https://github.com/allenai/HyBayes|HyBayes]]. [[https://arxiv.org/pdf/1911.03850.pdf|Paper]]

==== Below is from an email I sent to a student Jan 20, 2019 ====

It is recommended to use a non-parametric test, such as the permutation test or paired bootstrap, rather than a t-test, since they make no distributional assumptions. An example of how to do this (using the R package mentioned at the end) is: [[https://thomasleeper.com/Rcourse/Tutorials/permutationtests.html]]

Other references:
[[https://cs.stanford.edu/people/wmorgan/sigtest.pdf]]
[[http://www.aclweb.org/anthology/D/D12/D12-1091.pdf]]

There are other tests which also resample the test data, which is necessary if the test data is small. A script to do all this is: [[https://github.com/mgormley/sigtest]]

You only need 3-5 different runs for each experiment. If you don't find significance with those runs but believe the effect is real, you can do more runs.
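A minimal randomized version of the paired permutation test recommended in the email above might look like the following (the sign-flipping formulation on per-example scores and the add-one smoothing are illustrative choices, not the linked script):

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_permutations=10_000, seed=0):
    """Approximate (randomized) paired permutation test.
    Randomly swaps the A/B labels within each pair and reports how often
    the permuted mean difference is at least as extreme as the observed one."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(n_permutations):
        signs = rng.choice([-1.0, 1.0], size=len(diffs))  # swap labels per pair
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    # add-one smoothing: the observed labeling counts as one permutation
    return (count + 1) / (n_permutations + 1)
```

Since only a random subset of the 2^n sign assignments is sampled, the returned p-value is a Monte Carlo estimate; with small test sets you could enumerate all assignments exactly instead.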
Significance testing can be daunting since there are so many methods. To keep it simple, I recommend just doing 3-5 runs for each experiment and using the permutation test in the first link. You can also report the sample standard deviation as error bars in the table (you can do this with just 3-5 samples).

===== Reproducibility Checklists, Datasheets and Model Cards =====

  * Reproducibility Checklists
    * ACL conferences now require a reproducibility checklist when submitting papers. See for example [[https://2021.emnlp.org/call-for-papers#reproducibility-criteria|EMNLP 2021]].
    * [[https://aclanthology.org/D19-1224.pdf|Dodge et al 2019 - Show Your Work: Improved Reporting of Experimental Results]]
    * [[https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf|The Machine Learning Reproducibility Checklist]]
  * Datasheets (aka data cards)
    * [[https://arxiv.org/pdf/1803.09010.pdf|Gebru et al 2018 - Datasheets for Datasets]]
    * Examples: [[https://quac.ai/datasheet.pdf|QuAC Datasheet]], [[https://gem-benchmark.com/data_cards/WebNLG|WebNLG Data card]], [[https://gem-benchmark.com/data_cards|GEM Data cards]]
  * Model cards
    * [[https://arxiv.org/pdf/1810.03993.pdf|Mitchell et al 2018 - Model Cards for Model Reporting]]
    * Examples: an AllenNLP [[https://demo.allennlp.org/reading-comprehension/transformer-qa|model card]], the InstructGPT [[https://github.com/openai/following-instructions-human-feedback/blob/main/model-card.md|model card]]

===== Other Topics in Experimental Design =====

==== Effects of the Random Seed ====

For many common tasks and neural architectures, the choice of random seed has only a small effect on the accuracy or BLEU score (a standard deviation across random seeds of, say, 0.1-0.5). For this reason, many software packages fix the random seed in advance. However, for some tasks or models, it is possible for the random seed to have a larger effect.
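Summarizing this variability is straightforward: train with a handful of seeds and report the mean plus the sample standard deviation (the accuracies below are hypothetical):

```python
import numpy as np

# Hypothetical test accuracies from training the same model with 5 random seeds
accuracies = [0.912, 0.908, 0.915, 0.903, 0.910]
mean = float(np.mean(accuracies))
std = float(np.std(accuracies, ddof=1))  # ddof=1 gives the sample standard deviation
print(f"accuracy: {mean:.3f} +/- {std:.3f}")
```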
For example, Rongwen has found it has a large effect on neural models for [[Compositional Generalization]].

**Overview**: [[https://openreview.net/forum?id=0GzHjrL4Vq0|2021 - We Need to Talk About Random Seeds]] Advocates tuning the random seed
  * [[https://arxiv.org/pdf/1909.10447.pdf|Madhyastha & Jain 2019 - On Model Stability as a Function of Random Seed]]
  * [[https://link.springer.com/chapter/10.1007/978-3-030-64580-9_8|2021 - Effects of Random Seeds on the Accuracy of Convolutional Neural Networks]]
  * [[https://arxiv.org/pdf/2103.04514.pdf|Summers & Dinneen 2021 - Nondeterminism and Instability in Neural Network Optimization]]

===== Resources and Tutorials =====

  * Tutorials
    * ACL 2022 Tutorial: [[https://underline.io/events/284/sessions?eventSessionId=10736|Towards Reproducible Machine Learning Research in Natural Language Processing]] (link to conference video)

===== Related Pages =====

  * [[Evaluation]]
  * [[ml:Hyperparameter Tuning]]
  * [[stats:Statistical Tests]]