NLP Datasets

Language Modeling Corpora

  • BNC corpus
  • Gigaword
  • Common Crawl
  • BookCorpus (used in BERT)

General Benchmarks or Multi-Task Benchmarks

  • GLUE: paper - Warning: QQP and WNLI have an issue because their dev and test sets do not come from the same distribution. See FAQ here.
  • SuperGLUE: paper - A more difficult version of GLUE.
  • CLUE: paper - Like GLUE, but for Chinese
  • MMLU: Hendrycks et al 2020 - Measuring Massive Multitask Language Understanding. This is a popular dataset for evaluating LLMs (for example, GPT-4). However, it has two serious issues: 1) the test set is available on the web, which means LLMs are likely contaminated, and 2) the dataset has no in-domain training data, so models can only be evaluated in a few-shot manner. This makes it impossible to properly compare to prior fine-tuned methods.
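
For readers unfamiliar with few-shot evaluation on multiple-choice benchmarks, the following is a minimal sketch of how a k-shot prompt might be assembled. The `MultipleChoiceItem` class, the helper names, and the prompt layout are illustrative assumptions, not the official format of any benchmark listed above.

```python
from dataclasses import dataclass

CHOICE_LABELS = "ABCD"  # assumed four-way multiple choice


@dataclass
class MultipleChoiceItem:
    """Hypothetical container for one multiple-choice question."""
    question: str
    choices: list[str]  # answer options, in order
    answer: int         # index of the correct option in `choices`


def format_item(item: MultipleChoiceItem, include_answer: bool) -> str:
    """Render one question; the answer letter is shown only for demonstrations."""
    lines = [item.question]
    for label, choice in zip(CHOICE_LABELS, item.choices):
        lines.append(f"{label}. {choice}")
    answer = f" {CHOICE_LABELS[item.answer]}" if include_answer else ""
    lines.append(f"Answer:{answer}")
    return "\n".join(lines)


def build_few_shot_prompt(demos: list[MultipleChoiceItem],
                          test_item: MultipleChoiceItem) -> str:
    """Concatenate k solved demonstrations followed by the unsolved test item."""
    blocks = [format_item(d, include_answer=True) for d in demos]
    blocks.append(format_item(test_item, include_answer=False))
    return "\n\n".join(blocks)


demos = [MultipleChoiceItem("2 + 2 = ?", ["3", "4", "5", "6"], answer=1)]
test_item = MultipleChoiceItem("3 + 3 = ?", ["5", "6", "7", "8"], answer=1)
prompt = build_few_shot_prompt(demos, test_item)
print(prompt)
```

The model's continuation after the final "Answer:" is then compared against the gold letter. Because the demonstrations come from a small held-out pool and not from in-domain training data, scores obtained this way are not directly comparable to those of fine-tuned models.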

Multilingual

Dialog

Semantic Parsing

Machine Translation

Question Answering

Summarization

Multimodal

Natural Language Inference

Seq2seq

Some standard seq2seq datasets.

Compositional Generalization

Commonsense Reasoning

Paraphrase

nlp/datasets.txt · Last modified: 2023/11/29 21:14 by jmflanig
