Traditional definition of a language model (LM): a probability distribution over sentences, i.e., a model that assigns a probability to every sentence. Autoregressive language models compute the probability of the next word given the preceding words; masked language models compute the probability of a word given its surrounding context.
Note: unlike autoregressive language models, masked language models usually can't be used to compute the probability of a sentence, and so they aren't really “language models” in the traditional sense.
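The chain-rule factorization behind autoregressive scoring can be shown with a toy model. This is an illustrative sketch, not any model from the table: a bigram model (context truncated to one previous word) with add-one smoothing, which scores a whole sentence as a product of next-word probabilities. Masked language models have no such factorization, which is why they can't directly assign sentence probabilities.

```python
from collections import Counter, defaultdict

# Toy autoregressive LM: P(w1..wn) = prod_i P(w_i | context).
# Here the context is truncated to the single previous word (bigram model).

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat saw the dog",
]

counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1

VOCAB = {w for c in counts.values() for w in c}

def next_word_prob(prev, cur):
    """P(cur | prev) with add-one smoothing over the observed vocabulary."""
    return (counts[prev][cur] + 1) / (sum(counts[prev].values()) + len(VOCAB))

def sentence_prob(sentence):
    """Chain-rule probability of a whole sentence, including the end marker."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= next_word_prob(prev, cur)
    return p

# An in-corpus word order scores higher than a shuffled one.
print(sentence_prob("the cat sat on the mat") > sentence_prob("mat the on sat cat the"))
```

Modern autoregressive LMs (GPT-style models in the table below) do the same thing with a neural network conditioned on the full left context instead of bigram counts.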
To experiment with an autoregressive language model or masked language model, see online demos below.
See also Ecosystem Graphs for a more complete list.
This is a list of large, GPT-style autoregressive LMs. See also pretraining for another list of large language models and GPT-3 alternatives.
| Model | Year | Parameters | Training Data | Public? | Link |
|---|---|---|---|---|---|
| GPT | 2018 | 117M | BooksCorpus | Yes | github Huggingface |
| GPT-2 | 2019 | 1.5B | Webtext (closed, see datasets below) | Yes | github Huggingface |
| GPT-3 | 2020 | 175B | CommonCrawl, Webtext2, Books 1&2, Wikipedia | API | OpenAI cookbook |
| MoE | 2021 | 1.1T (13B) | CC100, CC-News, CC-Stories, OpenWebText, BookCorpus, Wikipedia | Yes | github HuggingFace |
| Gopher | 2021 | 280B | MassiveText | No | blog |
| Megatron-Turing NLG | 2022 | 530B | Pile, CommonCrawl, Realnews, CC-Stories | Researcher access | blog1 blog2 github |
| Chinchilla | 2022 | 70B | MassiveText | No | blog |
| GPT-NeoX-20B | 2022 | 20B | Pile | Yes | github |
| Jurassic-1 | 2021 | 178B | ? | API | AI21 studio |
| YaLM 100B | 2022 | 100B | Pile + lots of Russian text | Yes | github HuggingFace |
| PaLM | 2022 | 540B | Social media, web, books, Github, Wikipedia | No? | blog |
| OPT | 2022 | 66B, 175B | Pile subset: CommonCrawl, OpenWebtext2, Gutenberg, Wikipedia | Yes | demo models |
| UL2 | 2022 | 20B | C4 | Yes | blog github |
| Bloom | 2022 | 176B | Multilingual BigScienceCorpus paper | Yes | HuggingFace demo |
| GLM-130B | 2022 | 130B | Pile, Chinese WudaoCorpora, more | Yes | github |
| Galactica | 2022 | 120B | Scientific papers, code, reference material, prompts | Yes | github HuggingFace |
| ChatGPT | 2022 | ? | ? | API | demo ShareGPT |
| LLaMA | 2023 | 65B | CommonCrawl, C4, Github, Wikipedia, Books3, ArXiv, StackExchange | Yes | blog github |
| GPT-4 | 2023 | ? | ? (multi-modal) | API | website |
| Alpaca | 2023 | 7B | 52k instructions from Self-Instruct w/ text-davinci-003 | Yes | github demo |
| Vicuna | 2023 | 7B/13B | ShareGPT conversations (chatbot) | Yes | github demo |
| Koala | 2023 | 13B | Web dialogue data (incl. ShareGPT) | Yes | github demo |
| StackLLaMA | 2023 | 7B | Stack Exchange Q&A (RLHF) | Yes | demo |
| LIMA | 2023 | 65B | 1k curated prompt-response pairs | No | |
| PaLM 2 | 2023 | 14.7B | ? | API | website api |
| Llama 2 | 2023 | 70B | 2T tokens from public sources | Yes | website blog |
| Mistral 7B, Mixtral 8x7B | 2023 | 7B, 8x7B | ? | Yes, API | |
| Orca 2 | 2023 | 7B, 13B | Synthetic data distilled from GPT-4 | Yes | |
| OLMo | 2024 | 7B | Dolma | Yes, open data | blog github huggingface |
| Gemma | 2024 | 7B, 2B | Yes | blog | |
| Jamba | 2024 | 52B | Yes | blog HuggingFace | |
| OpenELM | 2024 | 1.1B | Public datasets | Yes | |
| Kimi K2 | 2025 | 1T (MoE, 32B active) | ? | Yes | |
See also Chained or Tool-based Prompting.
See also Dense Document Retrieval with LLMs.
See also Fake News Detection.
See Copyright Issues.
See also Scaling Laws - Emergent Abilities
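The Gopher vs. Chinchilla rows above are the canonical scaling-laws example: at a fixed compute budget, the 70B Chinchilla (trained on ~1.4T tokens) outperformed the 280B Gopher. A rough rule of thumb popularized from that work is about 20 training tokens per parameter; the sketch below uses that approximate constant, not the fitted scaling-law coefficients.

```python
# Rough "Chinchilla" compute-optimal rule of thumb: train on roughly
# 20 tokens per model parameter. The constant 20 is an approximation,
# not the exact fitted scaling law from Hoffmann et al. (2022).

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def optimal_training_tokens(n_params: float) -> float:
    """Approximate compute-optimal number of training tokens."""
    return TOKENS_PER_PARAM * n_params

# Chinchilla: 70B parameters -> 1.4e12 tokens, matching its ~1.4T-token
# training set; a compute-matched 280B model gets far fewer tokens per param.
print(f"{optimal_training_tokens(70e9):.2e}")  # -> 1.40e+12
```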
See paper list here. See also Model Compression.
Historical papers that may or may not be applicable today.