NLP Datasets
See also NLP Progress, Wikipedia List of datasets, nlp-datasets, and data preparation.
Language Modeling Corpora
- BNC corpus
- Gigaword
- Common Crawl
- BookCorpus (used in BERT)
General Benchmarks or Multi-Task Benchmarks
- MMLU: Hendrycks et al 2020 - Measuring Massive Multitask Language Understanding. This is a popular benchmark for evaluating LLMs (for example, GPT-4). However, it has two serious issues: 1) the test set is available on the web, which means LLMs are likely contaminated, and 2) the dataset has no in-domain training data and can only be evaluated in a few-shot manner. This makes it impossible to properly compare to prior fine-tuned methods.
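The contamination concern above can be probed with a simple word n-gram overlap check between a test item and crawled text. The following is only a minimal sketch of that general heuristic, not the procedure used by any particular paper or model provider. The function names, the choice of n=8, and the example strings are all hypothetical and chosen for illustration.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that also occur in any corpus document."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        # Item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical data for illustration only (not taken from any real benchmark).
question = (
    "Which of the following is the capital of France? "
    "A. Paris B. Rome C. Madrid D. Berlin"
)
crawl = [
    "Quiz dump: which of the following is the capital of France? "
    "A. Paris B. Rome C. Madrid D. Berlin",
    "Unrelated page about cooking pasta.",
]

score = overlap_fraction(question, crawl, n=8)
print(f"overlap fraction: {score:.2f}")
```

A high overlap fraction only suggests that the item may appear verbatim in the crawled text. Paraphrased or translated leaks would not be caught by this kind of exact-match check, and a real pretraining corpus would need a more memory-efficient index than an in-memory set.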
Multilingual
- Survey on Multilingual NLP Datasets: List of Datasets and Paper
Dialog
Semantic Parsing
Machine Translation
Question Answering
Summarization
Multimodal
Natural Language Inference
Seq2seq
Some standard seq2seq datasets.
Compositional Generalization
Commonsense Reasoning
Paraphrase
nlp/datasets.txt · Last modified: 2023/11/29 21:14 by jmflanig