  * [[https://www.aaai.org/GuideBook2018/17248-73943-GB.pdf|Gundersen & Kjensmo 2018 - State of the Art: Reproducibility in Artificial Intelligence]]
  * [[https://www.nature.com/articles/s41562-016-0021.pdf|2017 - A Manifesto for Reproducible Science]] Nice overview [[https://arxiv.org/pdf/2103.06944.pdf|here]]
  * [[https://arxiv.org/pdf/1909.03004.pdf|Dodge et al 2019 - Show Your Work: Improved Reporting of Experimental Results]] Introduces **reproducibility checklists**, see [[experimental_method#reproducibility_checklists_datasheets_and_model_cards|below]], and also gives a procedure for estimating whether one model is better than another at various hyper-parameter tuning budgets (**expected validation performance**). Their suggested estimator is a good one to use, see [[https://aclanthology.org/2021.findings-emnlp.342.pdf|Dodge et al 2021]]. Follow-up work in this area:
    * [[https://arxiv.org/pdf/2002.06305.pdf|Dodge et al 2020 - Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping]] The results can largely be mitigated by training for more epochs, see [[https://arxiv.org/pdf/2006.04884.pdf|Mosbach 2020]]
    * [[https://arxiv.org/pdf/2004.13705.pdf|Tang et al 2020 - Showing Your Work Doesn’t Always Work]] Don't use. Introduces an unbiased estimator that has high variance, see the follow-up work: [[https://aclanthology.org/2021.findings-emnlp.342.pdf|Dodge 2021]]
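The expected validation performance estimator from Dodge et al 2019 is compact enough to sketch in a few lines. The function below paraphrases their closed-form estimator for the expected maximum of ''k'' i.i.d. draws from the empirical distribution of observed trial scores; the accuracy numbers are made up for illustration.

```python
def expected_max_performance(scores, budget):
    """Expected maximum validation score over `budget` random
    hyper-parameter trials, estimated from the n trials actually run
    (the estimator of Dodge et al. 2019).  For sorted scores
    v_1 <= ... <= v_n, P(max of k draws <= v_i) = (i/n)^k, so
    E[max] = sum_i v_i * ((i/n)^k - ((i-1)/n)^k)."""
    v = sorted(scores)
    n = len(v)
    return sum(
        ((i / n) ** budget - ((i - 1) / n) ** budget) * v[i - 1]
        for i in range(1, n + 1)
    )

# Hypothetical validation accuracies from 10 random hyper-parameter trials
scores = [0.71, 0.74, 0.68, 0.80, 0.77, 0.73, 0.79, 0.70, 0.75, 0.72]
for k in (1, 5, 10):
    print(f"budget {k:2d}: expected best accuracy {expected_max_performance(scores, k):.3f}")
```

At budget 1 the estimate equals the mean trial score, and it rises toward the best observed score as the budget grows, which is how the "performance vs. tuning budget" curves in the paper are drawn.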
  
For an overview of applying tests of statistical significance to NLP, see:
  * **NLP 203 slides on statistical significance**: [[https://drive.google.com/file/d/1e4qtAgF_xAtMUSR7xLKzEdaRT9tOzizo/view|Spring 2021]]
  * Section 11.3 from [[http://www.phontron.com/class/mtandseq2seq2018/assets/slides/mt-fall2018.chapter11.pdf|here]] (applied to MT, but the same techniques are used elsewhere in NLP)
  * [[https://cs.stanford.edu/people/wmorgan/sigtest.pdf|Slides from Stanford NLP Group]]
  * Section 4.4.3 Classifier comparison and statistical significance from [[https://github.com/jacobeisenstein/gt-nlp-class/raw/master/notes/eisenstein-nlp-notes.pdf|Eisenstein's book]]
  * Appendix 3 from [[https://www.morganclaypool.com/doi/abs/10.2200/S00361ED1V01Y201105HLT013?journalCode=hlt&|Noah Smith's book]] (available for free through UCSC library)
  * [[https://aclanthology.org/P19-1266.pdf|Dror et al 2019 - Deep Dominance - How to Properly Compare Deep Neural Models]] Caveat: some researchers have advocated tuning the random seed as a hyper-parameter, see [[nlp:experimental_method#effects_of_the_random_seed|below]]
  * [[https://arxiv.org/pdf/2204.06815.pdf|Ulmer et al 2022 - Deep-Significance - Easy and Meaningful Statistical Significance Testing in the Age of Neural Networks]]
  * [[https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wilcoxon.html|Wilcoxon Signed-Rank Test Docs in SciPy]]. An issue to consider is how to handle outcomes in which systems A & B produce the same prediction/score. See the ''zero_method'' parameter and associated links.
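As a concrete (hypothetical) example of the tie issue, here is the SciPy call on made-up per-example scores for two systems; ''zero_method="pratt"'' keeps the tied pairs in the ranking instead of silently dropping them.

```python
from scipy.stats import wilcoxon

# Hypothetical per-example scores for systems A and B on the same test
# set; two of the pairs are exact ties (zero difference).
a = [0.91, 0.85, 0.78, 0.90, 0.62, 0.88, 0.75, 0.80, 0.95, 0.70]
b = [0.89, 0.85, 0.72, 0.87, 0.60, 0.84, 0.75, 0.78, 0.93, 0.66]

# zero_method="wilcox" (the default) drops zero differences entirely,
# "pratt" keeps them when ranking and then discards their ranks, and
# "zsplit" splits their ranks between the positive and negative sums.
stat, p = wilcoxon(a, b, zero_method="pratt")
print(f"statistic={stat}, p={p:.4f}")
```

With many ties the choice of ''zero_method'' can change the p-value noticeably, so it is worth reporting which one was used.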
  
==== Bootstrap Resampling and Permutation Tests ====
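A paired permutation (approximate randomization) test, widely used for NLP system comparison, is short enough to sketch here. The implementation below is illustrative rather than taken from any specific paper.

```python
import random

def paired_permutation_test(scores_a, scores_b, trials=10_000, seed=0):
    """Monte Carlo paired permutation test on per-example scores of two
    systems evaluated on the same test set.  Under the null hypothesis
    the systems are exchangeable, so the sign of each paired difference
    can be flipped at random; the p-value is the fraction of sign-flipped
    replicates at least as extreme as the observed total difference."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = sum(
        abs(sum(d if rng.random() < 0.5 else -d for d in diffs)) >= observed
        for _ in range(trials)
    )
    return (hits + 1) / (trials + 1)  # add-one smoothing avoids p == 0
```

Bootstrap resampling is analogous but resamples test examples with replacement instead of flipping the signs of paired differences.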
  * Model cards
    * [[https://arxiv.org/pdf/1810.03993.pdf|Mitchell et al 2018 - Model Cards for Model Reporting]]
    * Examples: An AllenNLP [[https://demo.allennlp.org/reading-comprehension/transformer-qa|model card]], InstructGPT [[https://github.com/openai/following-instructions-human-feedback/blob/main/model-card.md|model card]]
  
===== Other Topics in Experimental Design =====
==== Effects of the Random Seed ====
For many common tasks and neural architectures, the choice of random seed has only a small effect on the accuracy or BLEU score (a standard deviation across random seeds of roughly 0.1-0.5).  For this reason, many software packages fix the random seed in advance.  However, for some tasks or models, the random seed can have a much larger effect.  For example, Rongwen has found it has a large effect on neural models for [[Compositional Generalization]].\\
**Overview**: [[https://openreview.net/forum?id=0GzHjrL4Vq0|2021 - We Need to Talk About Random Seeds]] Advocates tuning the random seed
  * [[https://arxiv.org/pdf/1909.10447.pdf|Madhyastha & Jain 2019 - On Model Stability as a Function of Random Seed]]
  * [[https://link.springer.com/chapter/10.1007/978-3-030-64580-9_8|2021 - Effects of Random Seeds on the Accuracy of Convolutional Neural Networks]]
  * [[https://arxiv.org/pdf/2103.04514.pdf|Summers & Dinneen 2021 - Nondeterminism and Instability in Neural Network Optimization]]
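Whatever one's position on tuning the seed, results that are sensitive to it should be reported as a mean and standard deviation over several seeds rather than as a single run. A minimal sketch, with made-up accuracies:

```python
import statistics

# Hypothetical test accuracies of the same model trained with 5 seeds
accuracies = [0.842, 0.839, 0.845, 0.851, 0.838]

mean = statistics.mean(accuracies)
std = statistics.stdev(accuracies)  # sample (n-1) standard deviation
print(f"accuracy: {mean:.3f} +/- {std:.3f} over {len(accuracies)} seeds")
```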

===== Resources and Tutorials =====
  * Tutorials
    * ACL 2022 Tutorial: [[https://underline.io/events/284/sessions?eventSessionId=10736|Towards Reproducible Machine Learning Research in Natural Language Processing]] (link to conference video)
  
===== Related Pages =====