

NLP Datasets

Language Modeling Corpora

  • British National Corpus (BNC)
  • Gigaword
  • Common Crawl
  • BookCorpus (used in BERT)

Multi-Task

  • GLUE: paper. Warning: QQP and WNLI have a known issue: their dev and test sets do not come from the same distribution. See the FAQ here.

Dialog

Semantic Parsing

Machine Translation

Question Answering

Summarization

Multimodal

Natural Language Inference

Seq2seq

Some standard seq2seq datasets.

Compositional Generalization

Commonsense Reasoning

Paraphrase

nlp/datasets.1646249537.txt.gz · Last modified: 2023/06/15 07:36 (external edit)
