Experimental Method and Reproducibility

Reproducibility

Gundersen & Kjensmo 2018 - State of the Art: Reproducibility in Artificial Intelligence
2017 - A Manifesto for Reproducible Science. Nice overview here.
Dodge et al 2019 - Show Your Work: Improved Reporting of Experimental Results. Introduces reproducibility checklists (see below) and also gives a procedure for estimating whether one model is better than another at various hyper-parameter tuning budgets (expected validation performance). Their suggested estimator is a good one to use; see Dodge et al 2021. Follow-up work in this area:
- Dodge et al 2020 - Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping. The variance across random seeds they report can largely be mitigated by training for more epochs; see Mosbach 2020.
- Tang et al 2020 - Showing Your Work Doesn’t Always Work. Don't use. Introduces an unbiased estimator, but it has high variance; see the follow-up work, Dodge 2021.
- Dodge et al 2021 - Expected Validation Performance and Estimation of a Random Variable’s Maximum. Compares to Tang 2020 and recommends using the procedure and estimator from Dodge 2019.
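The expected validation performance idea can be sketched in a few lines. This is a minimal illustration, assuming the validation scores come from independent random hyper-parameter trials. The function name and the example scores are hypothetical and are not taken from the papers' released code. It computes the expected maximum of `budget` draws (with replacement) from the empirical distribution of the observed scores.

```python
def expected_max_performance(scores, budget):
    """Expected best validation score after `budget` random trials.

    Treats the observed scores as an empirical distribution and returns
    the expected maximum of `budget` i.i.d. draws from it.
    """
    if not scores or budget < 1:
        raise ValueError("need at least one score and budget >= 1")
    ordered = sorted(scores)
    n = len(ordered)
    expected = 0.0
    for rank, value in enumerate(ordered, start=1):
        # P(max of `budget` draws equals the rank-th smallest score)
        # = F(v_rank)^budget - F(v_{rank-1})^budget, with F the empirical CDF.
        prob_is_max = (rank / n) ** budget - ((rank - 1) / n) ** budget
        expected += value * prob_is_max
    return expected


# Hypothetical validation accuracies from a random hyper-parameter search.
val_scores = [0.71, 0.74, 0.69, 0.78, 0.73, 0.76]
for budget in (1, 2, 4, 8):
    print(budget, round(expected_max_performance(val_scores, budget), 4))
```

With `budget=1` the estimate is simply the mean of the observed scores, and it approaches the best observed score as the budget grows. Plotting this curve for two models shows at which tuning budgets one is expected to beat the other.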
Miltenburg et al 2021 - Preregistering NLP research. There are issues with this idea.
Marie et al 2021 - Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

Statistical Significance

Below is from an email I sent to a student Jan 20, 2019

It is recommended to use a non-parametric test, such as the permutation test or the paired bootstrap, rather than a t-test, since non-parametric tests make no distributional assumptions. An example of how to do this (use the R package mentioned at the end of the page):

https://thomasleeper.com/Rcourse/Tutorials/permutationtests.html

Other references: https://cs.stanford.edu/people/wmorgan/sigtest.pdf http://www.aclweb.org/anthology/D/D12/D12-1091.pdf

There are other tests which also re-sample the test data, which is necessary if the test data is small. A script to do all this is:

https://github.com/mgormley/sigtest
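For intuition, re-sampling the test data can be sketched as a paired bootstrap test. This is a minimal sketch and not the code from the repository linked above. It assumes the metric is a mean of per-example scores, that system A scored higher than system B on the full test set, and it uses the "greater than twice the observed difference" criterion commonly described for the paired bootstrap in NLP (compare the D12-1091 reference above). All names and data are hypothetical.

```python
import random


def paired_bootstrap_pvalue(scores_a, scores_b, num_resamples=10000, seed=0):
    """One-sided paired bootstrap p-value for 'A is better than B'.

    scores_a and scores_b are per-example scores on the same test
    examples, in the same order.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n  # observed gain of A over B
    exceed = 0
    for _ in range(num_resamples):
        # Draw a test set of the same size, with replacement.
        resampled = sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        # Count resamples whose gain exceeds twice the observed gain.
        if resampled > 2 * observed:
            exceed += 1
    return exceed / num_resamples


# Hypothetical per-example correctness (1 = correct, 0 = wrong).
system_a = [1] * 80 + [0] * 20
system_b = [1] * 70 + [0] * 30
print(paired_bootstrap_pvalue(system_a, system_b))
```

Because the same resampled indices are used for both systems (via the per-example differences), the test stays paired, which matters when both systems tend to fail on the same examples.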

You only need 3-5 different runs for each experiment. If you don't get significance but want to show it, you can do more runs.

Significance testing can be daunting since there are so many methods. To keep it simple, I recommend just doing 3-5 runs for each experiment and using the permutation test in the first link. You can also report the sample standard deviation as error bars in the table (you can do this with just 3-5 samples).
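As a rough Python counterpart to the R tutorial, the recipe above can be sketched as an exact two-sample permutation test over runs, plus the sample standard deviation for error bars. This is a sketch under assumptions: it treats each run's score as one observation, uses the difference in means as the test statistic, and enumerates every split, which is only feasible for a handful of runs. The function name and the scores are hypothetical.

```python
from itertools import combinations
from statistics import mean, stdev


def permutation_test(runs_a, runs_b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(runs_a) + list(runs_b)
    n_a, n_b = len(runs_a), len(runs_b)
    total = sum(pooled)
    observed = abs(mean(runs_a) - mean(runs_b))
    at_least_as_extreme = 0
    num_splits = 0
    # Reassign the pooled scores to the two systems in every possible way.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance so floating-point ties count as ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
        num_splits += 1
    return at_least_as_extreme / num_splits


# Hypothetical test scores from 5 runs (different seeds) per system.
baseline = [0.712, 0.705, 0.721, 0.709, 0.716]
proposed = [0.731, 0.727, 0.740, 0.722, 0.735]

print("baseline: %.3f +/- %.3f" % (mean(baseline), stdev(baseline)))
print("proposed: %.3f +/- %.3f" % (mean(proposed), stdev(proposed)))
print("p-value:", permutation_test(baseline, proposed))
```

One property of this run-level version is worth knowing: with 3 runs per system there are only 20 possible splits, so the two-sided p-value cannot go below 2/20 = 0.1, while with 5 runs per system it can reach 2/252.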

Reproducibility Checklists, Datasheets and Model Cards

Reproducibility Checklists
- ACL conferences now require a reproducibility checklist when submitting papers. See, for example, EMNLP 2021.
- Dodge et al 2019 - Show Your Work: Improved Reporting of Experimental Results
- The Machine Learning Reproducibility Checklist
Datasheets (aka data cards)
- Gebru et al 2018 - Datasheets for Datasets
- Examples: QuAC Datasheet, WebNLG Data card, GEM Data cards
Model cards
- Mitchell et al 2018 - Model Cards for Model Reporting
- Examples: An AllenNLP model card, InstructGPT model card

Resources and Tutorials

Tutorials
- ACL 2022 Tutorial: Towards Reproducible Machine Learning Research in Natural Language Processing (link to conference video)
