====== NLP Datasets ======

See also [[http://nlpprogress.com/|NLP Progress]], [[https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research#Text_data|Wikipedia List of datasets]], and [[https://github.com/niderhoff/nlp-datasets|nlp-datasets]]. Also [[data preparation]].

===== Language Modeling Corpora =====

  * BNC corpus
  * Gigaword
  * Common Crawl
  * [[https://arxiv.org/pdf/1506.06724.pdf|BookCorpus]] (used in BERT)

===== General Benchmarks or Multi-Task Benchmarks =====

  * [[https://gluebenchmark.com/|GLUE]]: [[https://arxiv.org/pdf/1804.07461.pdf|paper]] - Warning: QQP and WNLI have an issue because their dev and test sets do not come from the same distribution. See the FAQ [[https://gluebenchmark.com/faq/|here]].
  * [[https://super.gluebenchmark.com/|SuperGLUE]]: [[https://arxiv.org/pdf/1905.00537.pdf|paper]] - A more difficult version of GLUE.
  * [[https://www.cluebenchmarks.com/en/index.html|CLUE]]: [[https://arxiv.org/pdf/2004.05986.pdf|paper]] - Like GLUE, but for Chinese.
  * [[https://github.com/hendrycks/test|MMLU]]: [[https://arxiv.org/pdf/2009.03300.pdf|Hendrycks et al 2020 - Measuring Massive Multitask Language Understanding]] - A popular evaluation dataset for LLMs (e.g., GPT-4). However, it has two serious issues: 1) the test set is available on the web, so LLMs are likely contaminated with it, and 2) the dataset has no in-domain training data and can only be evaluated in a few-shot manner, which makes it impossible to properly compare against prior fine-tuned methods.

===== Multilingual =====

  * Survey on Multilingual NLP Datasets: [[https://multilingual-dataset-survey.github.io/full-survey/|List of Datasets]] and [[https://arxiv.org/pdf/2211.15649.pdf|Paper]]

===== Dialog =====

===== Semantic Parsing =====

===== Machine Translation =====

===== Question Answering =====

===== Summarization =====

===== Multimodal =====

===== Natural Language Inference =====

===== Seq2seq =====

Some standard seq2seq datasets.
===== Compositional Generalization =====

===== Commonsense Reasoning =====

===== Paraphrase =====

  * [[https://www.microsoft.com/en-us/download/details.aspx?id=52398|Microsoft Research Paraphrase Corpus]], [[https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/I05-50025B15D.pdf|paper]]
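Because MMLU (noted above) ships no in-domain training split, models are typically scored few-shot: a handful of solved examples is prepended to each test question. A minimal sketch of that prompt construction is below; the questions and choices here are invented placeholders for illustration, not real MMLU items.

```python
# Sketch: building a k-shot prompt for an MMLU-style multiple-choice question.
# The example items below are made up; real ones come from the dataset itself.

def format_item(question, choices, answer=None):
    """Render one question with lettered choices; append the answer if given."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(shots, target_question, target_choices):
    """Concatenate k solved examples followed by the unsolved target question."""
    blocks = [format_item(q, c, a) for q, c, a in shots]
    blocks.append(format_item(target_question, target_choices))
    return "\n\n".join(blocks)

# One-shot example with a placeholder question.
shots = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
prompt = build_few_shot_prompt(shots, "What is 3 + 3?", ["5", "6", "7", "8"])
```

The model is then asked to continue the prompt after the final "Answer:", and its predicted letter is compared to the gold label.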