Language Models
Traditional definition of a language model (LM): a language model is a probability distribution over sentences, that is, it assigns a probability to every sentence. Autoregressive language models compute the probability of the next word given the preceding words; masked language models compute the probability of a word given its surrounding context.
Note: unlike autoregressive language models, masked language models usually can't be used to compute the probability of a sentence, and so they aren't really “language models” in the traditional sense.
To experiment with an autoregressive or masked language model, see the online demos below.
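To make the traditional definition concrete, here is a minimal sketch of an autoregressive bigram model: it estimates next-word probabilities from counts and scores a whole sentence with the chain rule. The toy corpus and boundary markers are invented for illustration.

```python
from collections import Counter

# Toy corpus; <s> and </s> mark sentence boundaries.
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"],
          ["<s>", "the", "cat", "ran", "</s>"]]

bigrams = Counter()
context = Counter()
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[(w1, w2)] += 1
        context[w1] += 1

def p_next(word, prev):
    """P(word | prev) from bigram counts (maximum likelihood, no smoothing)."""
    return bigrams[(prev, word)] / context[prev]

def p_sentence(words):
    """Chain rule: P(w1..wn) = prod_i P(w_i | w_{i-1})."""
    p = 1.0
    for prev, w in zip(["<s>"] + words, words + ["</s>"]):
        p *= p_next(w, prev)
    return p
```

For example, `p_next("cat", "the")` is 2/3 here, since "the" is followed by "cat" in two of its three occurrences; a neural autoregressive LM replaces the count table with a learned model, but scores sentences the same way.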
Overviews
- Introductory Material
- Basic intro, and n-gram language modeling
- Language modeling by Mike Collins
- Probabilistic Language Models by Noah Smith
- Neural language models
- Section 7.5 of Chapter 7 of Speech and Language Processing
- Large language models
- 2021 - AMMUS: A Survey of Transformer-based Pretrained Models in Natural Language Processing. A comprehensive overview at the time.
- For another nice introduction, see related work of Taylor 2022 (p. 3)
- 2025 - International AI Safety Report (Has a good non-technical overview of AI, ML & LLMs)
- Language models in the news, etc.
- Twitter GPT-3 code example (Sharif Shameem) “I only had to write 2 samples to give GPT-3 context for what I wanted it to do. It then properly formatted all of the other samples… If I wanted it to write output plain HTML/CSS instead of JSX, all I would have to do would be to re-write my 2 initial samples in HTML/CSS. Then all of GPT-3's outputs would be in plain HTML/CSS.”
- We Might See A 100T Language Model In 2022 archive.org Nice overview of some large language models in 2022
- Bibliographies
Papers
- n-gram Models: Old classic papers, and recent papers
- Fill-In-the-Middle
- See also Starcoder Fill-In-The-Middle
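The core trick of fill-in-the-middle (FIM) training is a pure data transformation: split a document into (prefix, middle, suffix), then reorder it with sentinel tokens so an ordinary left-to-right model learns to generate the middle last. A minimal sketch of the prefix-suffix-middle (PSM) reordering; the sentinel strings below are placeholders, and real models (e.g. StarCoder) use their own special tokens.

```python
def to_fim_psm(doc, start, end, pre="<PRE>", suf="<SUF>", mid="<MID>"):
    """Reorder a document into prefix-suffix-middle (PSM) format for FIM training.

    start/end are character offsets delimiting the span to be "filled in".
    """
    prefix, middle, suffix = doc[:start], doc[start:end], doc[end:]
    # The model is conditioned on prefix and suffix, then generates the middle.
    return f"{pre}{prefix}{suf}{suffix}{mid}{middle}"
```

For example, `to_fim_psm("abcdef", 2, 4)` yields `"<PRE>ab<SUF>ef<MID>cd"`; at inference time the model is prompted up to the `<MID>` sentinel and asked to complete the span.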
- Memory, Cache and Retrieval-Augmented Language Models
- Wu et al 2022 - Memorizing Transformers Uses k-NN lookup with fixed embeddings to retrieve relevant examples
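The k-NN lookup in Memorizing Transformers can be sketched independently of the transformer itself: keep a memory of past (key, value) pairs and, for each query, retrieve the k entries whose keys have the highest dot-product similarity. A minimal pure-Python sketch (the example vectors are invented; real implementations use approximate nearest-neighbor indexes over cached attention keys and values):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def knn_lookup(query, memory, k=2):
    """Return the k (key, value) pairs whose keys best match the query."""
    scored = sorted(memory, key=lambda kv: dot(query, kv[0]), reverse=True)
    return scored[:k]

# Toy memory of (key vector, payload) pairs.
memory = [([1.0, 0.0], "alpha"),
          ([0.0, 1.0], "beta"),
          ([0.9, 0.1], "gamma")]
```

In the paper's setting, the retrieved entries are attended to alongside the local context, which is what lets the model use information from far outside its attention window.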
Large Language Models
See also Ecosystem Graphs for a more complete list.
This is a list of large, GPT-style autoregressive LMs. See also pretraining for another list of large language models and GPT-3 alternatives.
- Jozefowicz et al 2016 - Exploring the Limits of Language Modeling It's interesting to see how far we've come since 2016.
- GPT-2: Radford et al 2019 - Language Models are Unsupervised Multitask Learners original github Annotated GPT-2 Illustrated GPT-2 Interestingly, GPT-2 does not include a bias term in the final linear layer for the vocab, see here and here.
- Chinchilla: Hoffmann et al 2022 - Training Compute-Optimal Large Language Models Argues that most LLMs are undertrained, and trains a compute-optimally sized model using the same dataset as Gopher. blog1 blog2
- GPT-NeoX-20B: Black et al 2022 - GPT-NeoX-20B: An Open-Source Autoregressive Language Model Has an interesting description of the hardware they used
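The Chinchilla result above can be turned into a back-of-the-envelope calculation using two common rules of thumb (approximations, not exact fits from the paper): training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly D ≈ 20·N tokens per parameter.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Estimate compute-optimal parameter count N and token count D.

    Assumes C ~= 6 * N * D and D ~= tokens_per_param * N, hence
    N = sqrt(C / (6 * tokens_per_param)).
    """
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n
```

Plugging in Chinchilla's approximate training compute of 5.88e23 FLOPs recovers roughly 70B parameters and 1.4T tokens, matching the model in the table below.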
| Model | Year | Parameters | Training Data | Public? | Link |
|---|---|---|---|---|---|
| GPT | 2018 | 117M | BooksCorpus | Yes | github Huggingface |
| GPT-2 | 2019 | 1.5B | Webtext (closed, see datasets below) | Yes | github Huggingface |
| GPT-3 | 2020 | 175B | CommonCrawl, Webtext2, Books 1&2, Wikipedia | API | OpenAI cookbook |
| MoE | 2021 | 1.1T (13B) | CC100, CC-News, CC-Stories, OpenWebText, BookCorpus, Wikipedia | Yes | github HuggingFace |
| Gopher | 2021 | 280B | MassiveText | No | blog |
| Jurassic-1 | 2021 | 178B | | API | AI21 studio |
| Megatron-Turing NLG | 2022 | 530B | Pile, CommonCrawl, Realnews, CC-Stories | Researcher access | blog1 blog2 github |
| Chinchilla | 2022 | 70B | MassiveText | No | blog |
| GPT-NeoX-20B | 2022 | 20B | Pile | Yes | github |
| YaLM 100B | 2022 | 100B | Pile + lots of Russian text | Yes | github HuggingFace |
| PaLM | 2022 | 540B | Social media, web, books, Github, Wikipedia | No? | blog |
| OPT | 2022 | 66B, 175B | Pile subset: CommonCrawl, OpenWebtext2, Gutenberg, Wikipedia | Yes | demo models |
| UL2 | 2022 | 20B | | Yes | blog github |
| Bloom | 2022 | 176B | Multilingual BigScienceCorpus paper | Yes | HuggingFace demo |
| GLM-130B | 2022 | 130B | Pile, Chinese WudaoCorpora, more | Yes | github |
| Galactica | 2022 | 120B | Scientific papers, code, reference material, prompts | Yes | github HuggingFace |
| ChatGPT | 2022 | ? | ? | API | demo ShareGPT |
| LLaMA | 2023 | 65B | CommonCrawl, C4, Github, Wikipedia, Books3, ArXiv, StackExchange | Yes | blog github |
| GPT-4 | 2023 | ? | ? (multi-modal) | API | website |
| Alpaca | 2023 | 7B | 52k instructions from Self-Instruct w/ text-davinci-003 | Yes | github demo |
| Vicuna | 2023 | 7B/13B | ShareGPT conversations (chatbot) | Yes | github demo |
| Koala | 2023 | 13B | | Yes | github demo |
| StackLLaMA | 2023 | 7B | | Yes | demo |
| LIMA | 2023 | 65B | 1,000 curated instruction examples | No | |
| PaLM 2 | 2023 | 14.7B | | API | website api |
| Llama 2 | 2023 | 7B, 13B, 70B | | Yes | website blog |
| Mistral 7B, Mixtral 8x7B | 2023 | 7B, 8x7B | | Yes, API | |
| Orca 2 | 2023 | 7B, 13B | | | |
| OLMo | 2024 | 7B | dolma | Yes, open data | blog github huggingface |
| Gemma | 2024 | 7B, 2B | | Yes | blog |
| Jamba | 2024 | 52B | | Yes | blog HuggingFace |
| OpenELM | 2024 | 1.1B | | Yes | |
| Kimi K2 | 2025 | 1T (32B) | | Yes | |
Abilities and Analysis of LLMs
- ChatGPT
- For ChatGPT, see also ChatGPT.
- Creativity
- Self-Correction
- Use of Context
- Khandelwal et al 2018 - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context (Old, no longer applies to transformer models)
Origin of Capabilities
- Machine Translation
Evaluation of LLMs and Benchmarks
- Overviews
- lm-evaluation-harness: LM Evaluation Harness (EleutherAI) (Released May 2021)
- Small-scale Evaluations
- Effects of Length and Irrelevant Context
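Many intrinsic LM evaluations reduce to perplexity, which is straightforward to compute once a model exposes per-token probabilities: it is the exponential of the average negative log-probability per token. A minimal sketch (the probabilities in the usage example are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)
```

A model that assigns uniform probability 1/4 to each of four tokens has perplexity 4, i.e. it is "as confused as" a uniform choice among four options at every step.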
Limitations of Current LLMs
Questions and Critiques of LLMs
Adapting Language Models
To Domains
To Other Languages
- Language Adaptive Fine-Tuning (LAFT):
- Recycling: Vries & Nissim 2021 - As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages Retrains the lexical embeddings while keeping the Transformer layers fixed; this approach works best so far.
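The recycling recipe of Vries & Nissim boils down to a parameter-selection step: re-initialize only the embedding matrix for the new language's vocabulary, and mark everything else as frozen. A framework-free sketch of that logic (the parameter names and initialization scale are invented for illustration; a real implementation would set `requires_grad` flags in a deep-learning framework):

```python
import random

def recycle(params, new_vocab_size, dim, seed=0):
    """Replace the embedding matrix for a new language; freeze the rest.

    params: dict mapping parameter names to nested lists of floats.
    Returns (new_params, set of trainable parameter names).
    """
    rng = random.Random(seed)
    new_params = dict(params)
    # Fresh, small random embeddings for the new language's vocabulary.
    new_params["embeddings"] = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                                for _ in range(new_vocab_size)]
    # Only the embeddings are updated; transformer layers stay fixed.
    trainable = {"embeddings"}
    return new_params, trainable
```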
Temporal Language Modeling
Extracting Knowledge from Language Models
See also Dense Document Retrieval with LLMs.
- Extracting Training Data
- Membership Inference for Training Data
- (Decide whether a given sample was part of the training data)
- Related page: Membership Inference Attacks
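A simple baseline for membership inference is a loss threshold: samples on which the model achieves unusually low loss (low perplexity) are guessed to be training members. The sketch below uses made-up loss values and a crude midpoint calibration; real attacks calibrate much more carefully, e.g. against shadow or reference models.

```python
def loss_threshold_attack(losses, threshold):
    """Flag samples whose loss falls below the threshold as likely training members."""
    return [loss < threshold for loss in losses]

def calibrate_threshold(member_losses, nonmember_losses):
    """Pick the midpoint between the two groups' mean losses (a crude calibration)."""
    mean = lambda xs: sum(xs) / len(xs)
    return 0.5 * (mean(member_losses) + mean(nonmember_losses))
```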
Knowledge Editing
Personalization
LLM Personality and Writing Style
- Personality
- Vocabulary Overuse
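Vocabulary overuse can be quantified with a simple frequency ratio: compare each word's smoothed relative frequency in model-generated text against a reference corpus. A minimal sketch with toy counts (the counts and smoothing constant are illustrative; published analyses use larger corpora and log-odds-style statistics):

```python
def overuse_ratio(model_counts, ref_counts, alpha=1.0):
    """Per-word ratio of add-alpha-smoothed relative frequencies: model vs. reference."""
    m_total = sum(model_counts.values())
    r_total = sum(ref_counts.values())
    vocab = set(model_counts) | set(ref_counts)
    v = len(vocab)
    ratios = {}
    for w in vocab:
        p_model = (model_counts.get(w, 0) + alpha) / (m_total + alpha * v)
        p_ref = (ref_counts.get(w, 0) + alpha) / (r_total + alpha * v)
        ratios[w] = p_model / p_ref
    return ratios
```

Words with ratios far above 1 are candidates for model-overused vocabulary.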
Detecting Generated Text
See also Fake News Detection.
Adversarial Attacks
Steering
Applications
- Evaluation, see Evaluation with Large Language Models
- Creating data, or replacing crowdsourcing; see Data Augmentation (Synthetic Data Generation)
Copyright Issues
See Copyright Issues.
Theoretical and Foundational Papers
Emergent Abilities
See also Scaling Laws - Emergent Abilities
Acceleration and Efficiency
See paper list here. See also Model Compression.
- Overviews
- Wan et al 2023 - Efficient Large Language Models: A Survey Updated continuously. See paper list here
Economics of LLMs
Miscellaneous
Concept or Semantic LLMs
Consciousness of LLMs
Historical Papers
Historical papers that may or may not be applicable today.
- Melis et al 2017 - On the State of the Art of Evaluation in Neural Language Models Shows that LSTMs, when properly tuned, outperform other models (as of 2017, i.e. before the Transformer)
Datasets
- Standard benchmark datasets
- PG-19 Uses books published before 1919. Good for long sequences.
- Large datasets
- Common Crawl
- WebText: Introduced in GPT-2 (paper).
- OpenWebText2 Open re-implementation, widely used. Use this one. On HuggingFace here.
- The Pile Diverse set of data for building language models. Individual components, see readme. Older individual components Paper: The Pile: An 800GB Dataset of Diverse Text for Language Modeling Pile check tool
- RedPajama-Data An Open Source Recipe to Reproduce LLaMA training dataset
- Code datasets
- Small datasets
- Minipile: paper Huggingface
Software and Demos
- Training and/or Inference Frameworks for LLMs
- For an overview, see table 4 and section 3.7 of Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference
- (Historical) n-gram LM toolkits
- Deep learning toolkits
- GPT Neo An open-source implementation of model & data parallel GPT-3-like models using the mesh-tensorflow library.
- NVidia's Megatron-LM Used for example, by BLOOM
- Online demos
- AI21's Jurassic-1 language model
- GPT-3: web interface is free after signing up
- Hugging Face Fill Mask demo: Fill-Mask Demo Text Generation Demo