====== Language Models ====== Traditional definition of a language model (LM): //a language model is a probability distribution over sentences//, that is, it assigns probabilities to sentences. Language models can usually compute the probability of the next word given the preceding words (//autoregressive language models//), or in the case of //masked language models//, the probability of a word given its surrounding context. Note: unlike autoregressive language models, masked language models usually can't be used to compute the probability of a sentence, so they aren't really "language models" in the traditional sense. To experiment with an autoregressive or masked language model, see **online demos** below. ===== Overviews ===== * **Introductory Material** * Basic intro and n-gram language modeling * [[http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lm.pdf|Language modeling]] by Mike Collins * [[https://homes.cs.washington.edu/~nasmith/papers/plm.17.pdf|Probabilistic Language Models]] by Noah Smith * [[https://web.stanford.edu/~jurafsky/slp3/3.pdf|Chapter 3]] of [[https://web.stanford.edu/~jurafsky/slp3/|Speech and Language Processing]] * Neural language models * Section 7.5 of [[https://web.stanford.edu/~jurafsky/slp3/7.pdf|Chapter 7]] of [[https://web.stanford.edu/~jurafsky/slp3/|Speech and Language Processing]] * Large language models * [[https://arxiv.org/pdf/2108.05542|2021 - AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing]] Comprehensive overview at the time * **[[https://arxiv.org/pdf/2303.05759.pdf|Wei et al 2023 - An Overview on Language Models: Recent Developments and Outlook]]** * **[[https://arxiv.org/pdf/2401.02038.pdf|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]]** Wow, really good * **[[https://arxiv.org/pdf/2307.06435|Naveed et al 2024 - A Comprehensive Overview of Large Language Models]]** * For another nice introduction, see related
work of [[https://arxiv.org/pdf/2211.09085.pdf|Taylor 2022]] (p. 3) * [[https://arxiv.org/pdf/2304.00612.pdf|Bowman 2023 - Eight Things to Know about Large Language Models]] * **[[https://arxiv.org/pdf/2402.06196|Minaee et al 2024 - Large Language Models: A Survey]]** * [[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]] * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]] * **[[https://arxiv.org/pdf/2303.18223.pdf|Zhao et al 2023 - A Survey of Large Language Models]]** * [[https://arxiv.org/pdf/2404.09022|Weng 2024 - Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies]] * **[[https://arxiv.org/pdf/2501.17805|2025 - International AI Safety Report]]** (Has a good non-technical overview of AI, ML & LLMs) * **Language models in the news, etc** * [[https://www.wired.com/story/ai-text-generator-gpt-3-learning-language-fitfully/|Wired - GPT-3]] * [[https://twitter.com/sharifshameem/status/1282676454690451457|Twitter GPT-3 code example]] (Sharif Shameem) "I only had to write 2 samples to give GPT-3 context for what I wanted it to do. It then properly formatted all of the other samples... If I wanted it to write output plain HTML/CSS instead of JSX, all I would have to do would be to re-write my 2 initial samples in HTML/CSS. Then all of GPT-3's outputs would be in plain HTML/CSS." 
* [[https://analyticsindiamag.com/we-might-see-a-100t-language-model-in-2022/|We Might See A 100T Language Model In 2022]] [[https://web.archive.org/web/20220814084814/https://analyticsindiamag.com/we-might-see-a-100t-language-model-in-2022/|archive.org]] Nice overview of some large language models in 2022 * **Bibliographies** * **[[https://github.com/Hannibal046/Awesome-LLM|Awesome-LLM]]** ===== Papers ===== * n-gram Models: Old classic papers, and recent papers * [[https://www.stats.ox.ac.uk/~teh/research/compling/hpylm.pdf|Teh 2006 - A Bayesian Interpretation of Interpolated Kneser-Ney]] * [[https://aclanthology.org/2024.naacl-long.382.pdf|Malagutti et al 2024 - The Role of n-gram Smoothing in the Age of Neural Networks]] * Fill-In-the-Middle * [[https://arxiv.org/pdf/2207.14255.pdf|Bavarian et al 2022 - Efficient Training of Language Models to Fill in the Middle]] * See also [[https://huggingface.co/bigcode/starcoder#fill-in-the-middle|Starcoder Fill-In-The-Middle]] * Memory, Cache and Retrieval-Augmented Language Models * [[https://arxiv.org/pdf/1911.00172.pdf|Khandelwal et al 2019 - Generalization through Memorization: Nearest Neighbor Language Models]] * [[https://arxiv.org/pdf/2102.02557.pdf|Yogatama et al 2021 - Adaptive Semiparametric Language Models]] * [[https://arxiv.org/pdf/2112.04426.pdf|2021 - Improving language models by retrieving from trillions of tokens]] ([[https://vaclavkosar.com/ml/DeepMinds-RETRO-Transformer-Model|blog]]) * [[https://arxiv.org/pdf/2203.08913.pdf|Wu et al 2022 - Memorizing Transformers]] Uses k-NN lookup with fixed embeddings to retrieve relevant examples * [[https://arxiv.org/pdf/2202.01169.pdf|Clark et al 2022 - Unified Scaling Laws for Routed Language Models]] ===== Large Language Models ===== See also [[https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table|Ecosystem Graphs]] for a more complete list. This is a list of large, GPT-style autoregressive LMs. 
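To make "autoregressive" concrete: these models score a sentence with the chain rule, p(w_1 ... w_n) = ∏_t p(w_t | w_1 ... w_{t-1}). A minimal sketch with a count-based bigram model (the three-sentence corpus is invented for illustration; a neural LM replaces the count-based conditional with a learned one, but the factorization is the same):

```python
import math
from collections import Counter

# Tiny invented corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the cat sat </s>",
    "<s> the cat ran </s>",
    "<s> the dog sat </s>",
]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    toks = sent.split()
    for prev, cur in zip(toks, toks[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def sentence_logprob(sentence):
    """Chain rule: log p(w_1..w_n) = sum_t log p(w_t | w_{t-1})."""
    toks = sentence.split()
    return sum(
        math.log(bigram_counts[(prev, cur)] / context_counts[prev])
        for prev, cur in zip(toks, toks[1:])
    )

# p(the|<s>)=1, p(cat|the)=2/3, p(sat|cat)=1/2, p(</s>|sat)=1
print(sentence_logprob("<s> the cat sat </s>"))  # log(1/3)
```

Under this toy model each of the three training sentences gets probability exactly 1/3, which is a nice sanity check that the distribution over (seen) sentences sums sensibly.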
See also [[pretraining]] for another list of large language models and [[https://gpt3demo.com/category/alternative-language-models|GPT-3 alternatives]]. * [[https://arxiv.org/pdf/1602.02410.pdf|Jozefowicz et al 2016 - Exploring the Limits of Language Modeling]] It's interesting to see how far we've come since 2016. * GPT: [[https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf|Radford et al 2018 - Improving Language Understanding by Generative Pre-Training]] [[https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf|old link]] * GPT-2: [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|Radford et al 2019 - Language Models are Unsupervised Multitask Learners]] [[https://github.com/openai/gpt-2|original github]] [[https://amaarora.github.io/2020/02/18/annotatedGPT2.html|Annotated GPT-2]] [[https://jalammar.github.io/illustrated-gpt2/|Illustrated GPT-2]] Interestingly, GPT-2 does //not// include a bias term in the final linear layer for the vocab, see [[https://github.com/openai/gpt-2/blob/master/src/model.py#L171|here]] and [[https://github.com/huggingface/transformers/blob/v4.19.2/src/transformers/models/gpt2/modeling_gpt2.py#L951|here]]. 
* GPT-3: [[https://arxiv.org/pdf/2005.14165.pdf|Brown et al 2020 - Language Models are Few-Shot Learners]] [[https://beta.openai.com/|OpenAI]] [[https://github.com/openai/openai-cookbook|cookbook]] * [[https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/|Turing-NLG: A 17-billion-parameter language model by Microsoft]] * [[https://arxiv.org/pdf/2101.03961.pdf|Fedus et al 2021 - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity]] * Gopher: [[https://arxiv.org/pdf/2112.11446.pdf|Rae et al 2021 - Scaling Language Models: Methods, Analysis & Insights from Training Gopher]] [[https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval|blog]] * Jurassic-1: [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Lieber et al 2021 - Jurassic-1: Technical Details and Evaluation]] [[https://studio.ai21.com/docs/jurassic1-language-models/|model]] [[https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1|blog]] * Megatron-Turing NLG: [[https://arxiv.org/pdf/2201.11990.pdf|Smith et al 2022 - Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]] [[https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|Microsoft blog]] [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|NVidia blog]] [[https://developer.nvidia.com/megatron-turing-natural-language-generation|Researcher access]] [[https://github.com/NVIDIA/Megatron-LM|code]] [[https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints|models]] * Chinchilla: [[https://arxiv.org/pdf/2203.15556.pdf|Hoffmann et al 2022 - Training Compute-Optimal Large Language
Models]] Argues that most LLMs are undertrained, and trains a compute-optimal model using the same dataset as [[https://arxiv.org/pdf/2112.11446.pdf|Gopher]]. [[https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training|blog1]] [[https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510|blog2]] * PaLM: [[https://arxiv.org/pdf/2204.02311.pdf|Chowdhery et al 2022 - PaLM: Scaling Language Modeling with Pathways]] [[https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html|blog]] * GPT-NeoX-20B: [[https://arxiv.org/pdf/2204.06745.pdf|Black et al 2022 - GPT-NeoX-20B: An Open-Source Autoregressive Language Model]] Has an interesting description of the hardware they used * OPT: [[https://arxiv.org/pdf/2205.01068.pdf|Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models]] [[https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/|blog]] [[https://github.com/facebookresearch/metaseq/tree/main/projects/OPT|models]] * [[https://huggingface.co/docs/transformers/model_doc/bloom|Bloom]]: [[https://huggingface.co/bigscience/bloom|model card]] [[https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme|Training readme]] [[https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs|Tensorboard log]] * [[https://arxiv.org/pdf/2210.02414.pdf|Zeng et al 2022 - GLM-130B: An Open Bilingual Pre-trained Model]] * [[https://arxiv.org/pdf/2309.03852.pdf|Li et al 2023 - FLM-101B: An Open LLM and How to Train It with $100K Budget]] ^ Model ^ Year ^ Parameters ^ Training Data ^ Public?
^ Link ^ | [[https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf|GPT]] | 2018 | | BooksCorpus | Yes | [[https://github.com/openai/finetune-transformer-lm|github]] [[https://huggingface.co/openai-gpt|Huggingface]] | | [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|GPT-2]] | 2019 | 1.5B | Webtext (closed, see [[language_model#datasets]] below) | Yes | [[https://github.com/openai/gpt-2|github]] [[https://huggingface.co/gpt2|Huggingface]] | | [[https://arxiv.org/pdf/2005.14165.pdf|GPT-3]] | 2020 | 175B | CommonCrawl, Webtext2, Books 1&2, Wikipedia | API | [[https://beta.openai.com/|OpenAI]] [[https://github.com/openai/openai-cookbook|cookbook]] | | [[https://arxiv.org/pdf/2112.10684.pdf|MoE]] | 2021 | 1.1T (13B) | CC100, CC-News, CC-Stories, OpenWebText, BookCorpus, Wikipedia | Yes | [[https://github.com/facebookresearch/fairseq/tree/main/examples/moe_lm|github]] [[https://huggingface.co/KoboldAI/fairseq-dense-6.7B|HuggingFace]] | | [[https://arxiv.org/pdf/2112.11446.pdf|Gopher]] | 2021 | 280B | [[https://arxiv.org/pdf/2112.11446.pdf|MassiveText]] | No | [[https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval|blog]] | | [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] | 2022 | 530B | [[https://pile.eleuther.ai/|Pile]], CommonCrawl, Realnews, CC-Stories | [[https://developer.nvidia.com/megatron-turing-natural-language-generation|Researcher access]] | [[https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog1]] [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog2]] [[https://github.com/NVIDIA/Megatron-LM|github]] | | [[https://arxiv.org/pdf/2203.15556.pdf|Chinchilla]] | 2022 | 70B | 
[[https://arxiv.org/pdf/2112.11446.pdf|MassiveText]] | No | [[https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training|blog]] | | [[https://arxiv.org/pdf/2204.06745.pdf|GPT-NeoX-20B]] | 2022 | 20B | [[https://pile.eleuther.ai/|Pile]] | Yes | [[https://github.com/EleutherAI/gpt-neox|github]] | | [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Jurassic-1]] | 2022 | 178B | | API | [[https://studio.ai21.com/docs/jurassic1-language-models/|AI21 studio]] | | [[https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6|YaLM 100B]] | 2022 | 100B | [[https://pile.eleuther.ai/|Pile]] + lots of Russian text | Yes | [[https://github.com/yandex/YaLM-100B|github]] [[https://huggingface.co/yandex/yalm-100b|HuggingFace]] | | [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]] | 2022 | 540B | Social media, web, books, Github, Wikipedia | No? | [[https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html|blog]] | | [[https://arxiv.org/pdf/2205.01068.pdf|OPT]] | 2022 | 66B, 175B | Pile subset: CommonCrawl, OpenWebtext2, Gutenberg, Wikipedia | Yes | [[https://opt.alpa.ai/|demo]] [[https://github.com/facebookresearch/metaseq/tree/main/projects/OPT|models]] | | [[https://arxiv.org/pdf/2205.05131.pdf|UL2]] | 2022 | 20B | | Yes | [[https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html|blog]] [[https://github.com/google-research/google-research/tree/master/ul2|github]] | | [[https://arxiv.org/pdf/2211.05100.pdf|Bloom]] | 2022 | 176B | Multilingual [[https://huggingface.co/spaces/bigscience/BigScienceCorpus|BigScienceCorpus]] [[https://openreview.net/pdf?id=UoEw6KigkUn|paper]] | Yes | [[https://huggingface.co/docs/transformers/model_doc/bloom|HuggingFace]] [[https://huggingface.co/bigscience/bloom|demo]] | | [[https://arxiv.org/pdf/2210.02414.pdf|GLM-130B]] | 2022 | 130B | Pile, Chinese 
WudaoCorpora, more | Yes | [[https://github.com/THUDM/GLM-130B|github]] | |[[https://arxiv.org/pdf/2211.09085.pdf|Galactica]] | 2022 | 120B | Scientific papers, code, reference material, prompts | Yes |[[https://github.com/paperswithcode/galai|github]] [[https://huggingface.co/models?other=galactica|HuggingFace]] | | [[https://openai.com/blog/chatgpt|ChatGPT]] | 2022 | ? | | API | [[https://chat.openai.com/|demo]] [[https://sharegpt.com/|ShareGPT]] | | [[https://arxiv.org/pdf/2302.13971.pdf|LLaMA]] | 2023 | 65B | CommonCrawl, C4, Github, Wikipedia, Books3, ArXiv, StackExchange | Yes | [[https://ai.facebook.com/blog/large-language-model-llama-meta-ai/|blog]] [[https://github.com/facebookresearch/llama|github]] | | [[https://arxiv.org/pdf/2303.08774.pdf|GPT-4]] | 2023 | ? | ? (multi-modal) | API | [[https://openai.com/research/gpt-4|website]] | | [[https://crfm.stanford.edu/2023/03/13/alpaca.html|Alpaca]] | 2023 | 7B | [[https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json|52k instructions]] from [[https://arxiv.org/pdf/2212.10560.pdf|Self-Instruct]] w/ text-davinci-003 | Yes | [[https://github.com/tatsu-lab/stanford_alpaca|github]] [[https://crfm.stanford.edu/alpaca/|demo]] | | [[https://vicuna.lmsys.org/|Vicuna]] | 2023 | 7B/13B | (Chatbot) | Yes | [[https://github.com/lm-sys/FastChat|github]] [[https://chat.lmsys.org/|demo]] | | [[https://bair.berkeley.edu/blog/2023/04/03/koala/|Koala]] | 2023 | 13B | | Yes | [[https://github.com/young-geng/EasyLM|github]] [[https://chat.lmsys.org/?model=koala-13b/|demo]] | | [[https://huggingface.co/blog/stackllama| StackLLaMA]] | 2023 | 7B | | Yes |[[https://huggingface.co/spaces/trl-lib/stack-llama|demo]] | | [[https://arxiv.org/pdf/2305.11206.pdf|LIMA]] | 2023 | 65B | | | | | [[https://ai.google/static/documents/palm2techreport.pdf|PaLM 2]] | 2023 | 14.7B | | API | [[https://ai.google/discover/palm2|website]] [[https://developers.generativeai.google/|api]] | | 
[[https://arxiv.org/pdf/2307.09288.pdf|Llama 2]] | 2023 | 70B | | Yes | [[https://ai.meta.com/llama/|website]] [[https://about.fb.com/news/2023/07/llama-2/|blog]] | | [[https://arxiv.org/pdf/2310.06825.pdf|Mistral 7B]], [[https://mistral.ai/news/mixtral-of-experts/|Mixtral 8x7B]] | 2023 | 7B | | Yes, API | | | [[https://arxiv.org/pdf/2311.11045.pdf|Orca 2]] | 2023 | | | | | | [[https://arxiv.org/pdf/2402.00838.pdf|OLMo]] | 2024 | 7B | [[https://huggingface.co/datasets/allenai/dolma|dolma]] | Yes, open data | [[https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e7359222|blog]] [[https://github.com/allenai/OLMo|github]] [[https://huggingface.co/allenai/OLMo-7B|huggingface]] | | [[https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf|Gemma]] | 2024 | 7B, 2B | | Yes | [[https://blog.google/technology/developers/gemma-open-models/|blog]] | | [[https://arxiv.org/pdf/2403.19887|Jamba]] | 2024 | 52B | | Yes | [[https://www.ai21.com/blog/announcing-jamba|blog]] [[https://huggingface.co/ai21labs/Jamba-v0.1|HuggingFace]] | | [[https://arxiv.org/pdf/2404.14619|OpenELM]] | 2024 | 1.1B | | Yes | | | [[https://arxiv.org/pdf/2507.20534|Kimi K2]] | 2025 | 1T | | Yes | | ===== Abilities and Analysis of LLMs ===== * **ChatGPT** * For ChatGPT, see also [[ChatGPT]].
* **[[https://arxiv.org/pdf/2305.18486.pdf|Laskar et al 2023 - A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets]]** * [[https://arxiv.org/pdf/2311.04939.pdf|Ronan & Schneider 2023 - Can ChatGPT solve a Linguistics Exam?]] * [[https://arxiv.org/pdf/2301.13867|Frieder et al 2023 - Mathematical Capabilities of ChatGPT]] * **Creativity** * [[https://arxiv.org/pdf/2401.12491.pdf|Zhao et al 2024 - Assessing and Understanding Creativity in Large Language Models]] * [[https://arxiv.org/pdf/2311.09682|Tian et al 2024 - MacGyver: Are Large Language Models Creative Problem Solvers?]] * **Self-Correction** * [[https://arxiv.org/pdf/2406.15673|Liu et al 2024 - Large Language Models have Intrinsic Self-Correction Ability]] * **Use of Context** * [[https://arxiv.org/pdf/1805.04623.pdf|Khandelwal et al 2018 - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context]] (Old, no longer applies to transformer models) * [[https://arxiv.org/pdf/2307.03172|Liu et al 2023 - Lost in the Middle: How Language Models Use Long Contexts]] * [[https://link.springer.com/chapter/10.1007/978-3-031-88708-6_16|Hutter et al 2025 - Lost but Not Only in the Middle]] ==== Origin of Capabilities ==== * [[https://arxiv.org/pdf/2505.23323|Madabushi et al 2025 - Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors]] * **Machine Translation** * [[https://arxiv.org/pdf/2305.10266|Briakou et al 2023 - Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability]] * [[https://arxiv.org/pdf/2505.23548|Balashov 2025 - Translation in the Wild]] ===== Evaluation of LLMs and Benchmarks ===== * **Overviews** * [[https://arxiv.org/pdf/2307.03109|Chang et al 2023 - A Survey on Evaluation of Large Language Models]] * For common evaluation datasets for LLMs, see recent LLM system description papers such as the [[https://arxiv.org/pdf/2407.21783|Llama 3 paper]] (table 2) or
[[https://www.anthropic.com/news/claude-sonnet-4-5|Claude Sonnet 4.5]] (evaluation table). * lm-evaluation-harness: [[https://github.com/EleutherAI/lm-evaluation-harness|LM Evaluation Harness (EleutherAI)]] (Released May 2021) * [[https://arxiv.org/pdf/2401.00595|Mizrahi et al 2024 - State of What Art? A Call for Multi-Prompt LLM Evaluation]] * lm-eval: [[https://arxiv.org/pdf/2405.14782|Biderman et al 2024 - Lessons from the Trenches on Reproducible Evaluation of Language Models]] * **Small-scale Evaluations** * [[https://arxiv.org/pdf/2402.14992|Polo et al 2024 - tinyBenchmarks: evaluating LLMs with fewer examples]] * **Effects of Length and Irrelevant Context** * [[https://arxiv.org/pdf/2402.14848|Levy et al 2024 - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models]] ===== Tool-Use in LLMs ===== See also [[prompting#Chained or Tool-based Prompting]]. * **Overviews and Background** * [[https://modelcontextprotocol.io/docs/getting-started/intro|Model Context Protocol]] ===== Retrieval-Augmented Generation (RAG) ===== See [[Retrieval-Augmented Methods]].
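The RAG recipe can be sketched in a few lines: retrieve the stored passages most similar to the query, then prepend them to the prompt that goes to the LLM. A toy sketch with bag-of-words cosine similarity standing in for a dense retriever (the passages, query, and prompt template are invented, and no actual LLM is called):

```python
import math
import re
from collections import Counter

# Invented mini document store; a real system would index a large corpus
# with a dense encoder instead of bag-of-words vectors.
passages = [
    "Chinchilla was trained on the MassiveText dataset.",
    "GPT-NeoX-20B was trained on the Pile.",
    "The Eiffel Tower is in Paris.",
]

def bow(text):
    """Bag-of-words vector as a token->count mapping."""
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, k=1):
    """Return the k passages most similar to the query."""
    q = bow(query)
    return sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]

def build_prompt(query):
    # Prepend the retrieved context; this string is what would be
    # sent to the language model for generation.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What dataset was GPT-NeoX-20B trained on?"))
```

The design point is that the LM itself is unchanged; only the prompt is augmented, which is why retrieval can inject knowledge the model was never trained on.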
===== Limitations of Current LLMs ===== * [[https://aclanthology.org/2025.acl-long.1016.pdf|Shaikh et al 2025 - Navigating Rifts in Human-LLM Grounding: Study and Benchmark]] ===== Questions and Critiques of LLMs ===== * [[https://s10251.pcdn.co/pdf/2021-bender-parrots.pdf|Bender et al 2021 - On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?]] * [[https://arxiv.org/pdf/2308.07120|Rogers & Luccioni 2023 - Position: Key Claims in LLM Research Have a Long Tail of Footnotes]] ===== Adapting Language Models ===== ==== To Domains ==== * [[https://arxiv.org/pdf/2302.03169.pdf|Xie et al 2023 - Data Selection for Language Models via Importance Resampling]] ==== To Other Languages ==== * **Language Adaptive Fine-Tuning (LAFT)**: * [[https://arxiv.org/pdf/1910.11856.pdf|Artetxe et al 2019 - On the Cross-lingual Transferability of Monolingual Representations]] * [[https://arxiv.org/pdf/2012.15562.pdf|Pfeiffer et al 2020 - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts]] * Recycling: **[[https://aclanthology.org/2021.findings-acl.74.pdf|de Vries & Nissim 2021 - As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages]]** This one works the best so far.
Retrains the embeddings while keeping the Transformer layers fixed * [[https://aclanthology.org/2021.emnlp-main.672.pdf|Zhao & Schütze 2021 - Discrete and Soft Prompting for Multilingual Models]] * Multi-lingual version: [[https://arxiv.org/pdf/2204.06487.pdf|Alabi et al 2022 - Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning]] * [[https://aclanthology.org/2022.emnlp-main.616.pdf|Lin et al 2022 - Few-shot Learning with Multilingual Generative Language Models]] * [[https://arxiv.org/pdf/2210.03057.pdf|Shi et al 2022 - Language Models are Multilingual Chain-of-Thought Reasoners]] * [[https://arxiv.org/pdf/2212.10503.pdf|Marchisio et al 2022 - Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training]] * [[https://arxiv.org/pdf/2304.01922.pdf|Štefánik et al 2023 - Resources and Few-shot Learners for In-context Learning in Slavic Languages]] (Dataset) * [[https://arxiv.org/pdf/2401.01055.pdf|Zhao et al 2024 - LLaMA Beyond English: An Empirical Study on Language Capability Transfer]] ==== Temporal Language Modeling ==== * [[https://arxiv.org/pdf/2102.01951.pdf|Lazaridou et al 2021 - Mind the Gap: Assessing Temporal Generalization in Neural Language Models]] * [[https://arxiv.org/pdf/2404.10297|Li & Flanigan 2024 - Future Language Modeling from Temporal Document History]] * [[https://arxiv.org/pdf/2404.18543|Drinkall et al 2024 - Time Machine GPT]] ===== Extracting Knowledge from Language Models ===== See also [[nlp:information_retrieval#Dense Document Retrieval with LLMs]].
* Extracting Training Data * [[https://arxiv.org/pdf/2012.07805.pdf|Carlini et al 2020 - Extracting Training Data from Large Language Models]] [[https://github.com/ftramer/LM_Memorization|github]] * [[https://arxiv.org/pdf/2601.02671|Ahmed et al 2026 - Extracting Books from Production Language Models]] * Membership Inference for Training Data * (Decide whether a given sample was in the training data or not) * Related page: [[ml:Privacy#Membership Inference Attacks]] * [[https://arxiv.org/pdf/1811.00513.pdf|Song & Shmatikov 2018 - Auditing Data Provenance in Text-Generation Models]] * [[https://arxiv.org/pdf/1909.01066.pdf|Petroni et al 2019 - Language Models as Knowledge Bases?]] * [[https://arxiv.org/pdf/1909.00505.pdf|Feldman et al 2019 - Commonsense Knowledge Mining from Pretrained Models]] * [[https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00324|Jiang et al 2020 - How Can We Know What Language Models Know?]] * [[https://arxiv.org/pdf/2106.09231.pdf|Cao et al 2021 - Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases]] * [[https://arxiv.org/pdf/2110.08387.pdf|Liu et al 2022 - Generated Knowledge Prompting for Commonsense Reasoning]] * [[https://arxiv.org/pdf/2201.07207.pdf|Huang et al 2022 - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents]] * [[https://arxiv.org/pdf/2205.11482.pdf|Akyürek et al 2022 - Tracing Knowledge in Language Models Back to the Training Data]] * [[https://arxiv.org/pdf/2404.15146|Schwarzschild et al 2024 - Rethinking LLM Memorization through the Lens of Adversarial Compression]] ===== Knowledge Editing ===== See [[Knowledge Editing]] and [[ml:Model Editing and Unlearning]].
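The cloze-style probing idea behind "Language Models as Knowledge Bases?" and its follow-ups can be sketched as: fill a [MASK] template with each candidate answer and keep the candidate the model scores highest. In this toy sketch, bigram counts over an invented corpus stand in for the language model's scoring function; a real probe would rank candidates by masked-LM log-probability:

```python
from collections import Counter

# Invented "training corpus" standing in for what a model has absorbed.
corpus = ("paris is the capital of france . "
          "germany has the capital berlin . "
          "the louvre is in paris .")
toks = corpus.split()
bigram_counts = Counter(zip(toks, toks[1:]))

def lm_score(sentence):
    # Stand-in scorer: total bigram evidence for the sentence.
    words = sentence.lower().split()
    return sum(bigram_counts[bg] for bg in zip(words, words[1:]))

def probe(template, candidates):
    """Fill [MASK] with each candidate and return the best-scoring filler."""
    return max(candidates, key=lambda c: lm_score(template.replace("[MASK]", c)))

print(probe("paris is the capital of [MASK]", ["germany", "france", "italy"]))  # france
```

Note the toy scorer can only surface facts stated (nearly) verbatim in its corpus; the papers above study how far real LMs go beyond such literal memorization.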
===== Personalization ===== * [[https://arxiv.org/pdf/2304.11406|Salemi et al 2023 - LaMP: When Large Language Models Meet Personalization]] * [[https://arxiv.org/pdf/2401.05459|Li et al 2024 - Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security]] ===== LLM Personality and Writing Style ===== * **Personality** * [[https://arxiv.org/pdf/2305.02547|Jiang et al 2023 - PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits]] * [[https://arxiv.org/pdf/2307.16180|Pan & Zeng 2023 - Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models]] * [[https://arxiv.org/pdf/2310.01386|Huang et al 2023 - Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench]] * [[https://arxiv.org/pdf/2310.02168|Mao et al 2023 - Editing Personality for Large Language Models]] * [[https://arxiv.org/pdf/2311.05297|Suhr et al 2023 - Challenging the Validity of Personality Tests for Large Language Models]] * **Vocabulary Overuse** * [[https://arxiv.org/pdf/2406.07016|Kobak et al 2024 - Delving into ChatGPT usage in academic writing through excess vocabulary]] ===== Detecting Generated Text ===== See also [[nlp:automatic_fact_checking#Fake News Detection]]. * [[https://aclanthology.org/2022.naacl-main.88.pdf|Rodriguez - Cross-Domain Detection of GPT-2-Generated Technical Text]] * [[https://arxiv.org/abs/2301.11305|Mitchell et al 2023 - DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature]] * [[https://arxiv.org/pdf/2305.09859|Mireshghallah et al 2023 - Smaller Language Models are Better Black-box Machine-Generated Text Detectors]] ===== Adversarial Attacks ===== * [[https://arxiv.org/pdf/2311.04235.pdf|Mu et al 2023 - Can LLMs Follow Simple Rules?]] ===== Steering ===== * [[https://arxiv.org/pdf/2501.17148|Wu et al 2025 - AxBench: Steering LLMs? 
Even Simple Baselines Outperform Sparse Autoencoders]] * [[https://arxiv.org/pdf/2505.20809|Wu et al 2025 - Improved Representation Steering for Language Models]] ===== Applications ===== * **Evaluation**, see [[Evaluation#Evaluation with Large Language Models]] * **Creating Data** or as a replacement for crowdsourcing, see [[Data Augmentation]] (Synthetic Data Generation) ===== Copyright Issues ===== See [[Copyright Issues]]. * [[https://arxiv.org/pdf/2303.15715.pdf|Henderson et al 2023 - Foundation Models and Fair Use]] * [[https://arxiv.org/pdf/2310.13771.pdf|Karamolegkou et al 2023 - Copyright Violations and Large Language Models]] ===== Theoretical and Foundational Papers ===== See also [[Prompting#Analysis of In-Context-Learning]] and [[Language Model#Origin of Capabilities|Language Model - Origin of Capabilities]]. ==== Emergent Abilities ==== See also [[ml:Scaling Laws#Emergent Abilities|Scaling Laws - Emergent Abilities]]. * [[https://arxiv.org/pdf/2309.01809.pdf|Lu et al 2023 - Are Emergent Abilities in Large Language Models just In-Context Learning?]] ===== Acceleration and Efficiency ===== See paper list [[https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey|here]]. See also [[ml:Model Compression]]. * **Overviews** * [[https://arxiv.org/pdf/2202.07105|Xu & McAuley 2022 - A Survey on Model Compression and Acceleration for Pretrained Language Models]] * **[[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]]** Updated continuously.
===== Economics of LLMs ===== * [[https://arxiv.org/pdf/2306.07402|Howell et al 2023 - The Economic Trade-offs of Large Language Models: A Case Study]] ===== Miscellaneous ===== ==== Concept or Semantic LLMs ==== * [[https://arxiv.org/pdf/2412.08821|Meta 2024 - Large Concept Models: Language Modeling in a Sentence Representation Space]] * [[https://arxiv.org/pdf/2501.05487|Ahmad & Goel 2025 - The Future of AI: Exploring the Potential of Large Concept Models]] ==== Consciousness of LLMs ==== * **Overviews** * [[https://arxiv.org/pdf/2505.19806|Chen et al 2025 - Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks]] ===== Historical Papers ===== Historical papers that may or may not be applicable today. * [[https://www.stats.ox.ac.uk/~teh/research/compling/hpylm.pdf|Teh 2006 - A Bayesian Interpretation of Interpolated Kneser-Ney]] * [[https://arxiv.org/pdf/1707.05589.pdf|Melis et al 2017 - On the State of the Art of Evaluation in Neural Language Models]] Shows that LSTMs, when properly tuned, outperform other models (as of 2017, so before the Transformer) * [[https://arxiv.org/pdf/1805.04623.pdf|Khandelwal et al 2018 - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context]] ===== Datasets ===== * Standard benchmark datasets * [[https://developer.ibm.com/exchanges/data/all/wikitext-103/|Wikitext 103]] * [[https://catalog.ldc.upenn.edu/LDC99T42|Penn Treebank]] * [[https://github.com/deepmind/pg19|PG-19]] Uses books published before 1919. Good for long sequences. * Large datasets * [[https://github.com/soskek/bookcorpus|Bookcorpus]], also reproduced in the Pile, see [[https://github.com/soskek/bookcorpus/issues/27|here]].
[[https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz|reproduction]] [[https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2|original dataset]] * Common Crawl * [[https://openwebtext2.readthedocs.io/en/latest/background/|WebText and OpenWebText]]: * WebText: Introduced in GPT-2 ([[https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf|paper]]). * OpenWebText: Various implementations [[https://github.com/jcpeterson/openwebtext|here]] and [[https://github.com/yet-another-account/openwebtext|here]] and [[https://skylion007.github.io/OpenWebTextCorpus/|here]] (on HuggingFace [[https://huggingface.co/datasets/Skylion007/openwebtext|here]]). Used in MegatronLM [[https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext|here]]. * [[https://openwebtext2.readthedocs.io/|OpenWebText2]] Open re-implementation, widely used. Use this one. On HuggingFace [[https://huggingface.co/datasets/the_pile_openwebtext2|here]]. * [[https://www.tensorflow.org/datasets/catalog/c4|Colossal Clean Crawled Corpus (C4)]]: [[https://arxiv.org/pdf/1910.10683.pdf|paper]] AI2 reimplementation at [[https://huggingface.co/datasets/c4|HuggingFace]] * [[https://pile.eleuther.ai/|The Pile]] A diverse set of data for building language models. [[https://the-eye.eu/public/AI/pile_neox/|Individual components]], see readme.
[[https://the-eye.eu/public/AI/pile_preliminary_components/|Older individual components]] Paper: [[https://arxiv.org/pdf/2101.00027.pdf|The Pile: An 800GB Dataset of Diverse Text for Language Modeling]] [[https://pile.dataportraits.org/|Pile check tool]] * The Pile has been removed due to this [[https://actionnetwork.org/petitions/authors-guild-open-letter-to-generative-ai-leaders|letter]], see [[https://huggingface.co/datasets/EleutherAI/pile/discussions/15|here]] and [[https://techcrunch.com/2023/07/18/thousands-of-authors-sign-letter-urging-ai-makers-to-stop-stealing-books/|here]] * [[https://huggingface.co/spaces/bigscience/BigScienceCorpus|BigScienceCorpus]]: [[https://openreview.net/pdf?id=UoEw6KigkUn|2022 - The BigScience Corpus: A 1.6TB Composite Multilingual Dataset]] * [[https://github.com/togethercomputer/RedPajama-Data|RedPajama-Data]] An open-source recipe to reproduce the LLaMA training dataset * Code datasets * [[https://huggingface.co/datasets/bigcode/the-stack|The Stack]] Used in [[https://huggingface.co/blog/starcoder|StarCoder]].
Has two membership test websites: [[https://huggingface.co/spaces/bigcode/in-the-stack|Am I in the stack]] and [[https://stack.dataportraits.org/|DataPortraits]] * Small datasets * [[https://babylm.github.io/|BabyLM Challenge]] * TinyStories: [[https://arxiv.org/pdf/2305.07759.pdf|paper]] [[https://huggingface.co/datasets/roneneldan/TinyStories|dataset]] * MiniPile: [[https://arxiv.org/pdf/2304.08442.pdf|paper]] [[https://huggingface.co/datasets/JeanKaddour/minipile|Huggingface]] ===== Software and Demos ===== * **Training and/or Inference Frameworks for LLMs** * For an overview, see table 4 and section 3.7 of [[https://arxiv.org/pdf/2401.02038|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]] * (Historical) n-gram LM toolkits * The best, highly optimized toolkit: [[https://kheafield.com/code/kenlm/|KenLM]] * Industry standard toolkit with many options: [[http://www.speech.sri.com/projects/srilm/download.html|SRILM]] * [[https://www.nltk.org/|NLTK]] also implements n-gram LMs * Deep learning toolkits * [[https://github.com/EleutherAI/gpt-neo|GPT-Neo]] An open-source implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.
* [[https://github.com/NVIDIA/Megatron-LM|NVidia's Megatron-LM]] Used, for example, by BLOOM * **Online demos** * AI21's Jurassic-1 language model [[https://studio.ai21.com/docs/jurassic1-language-models/|Jurassic-1]] * GPT-3: [[https://openai.com/api/|web interface]] is free after signing up * Hugging Face task demos: [[https://huggingface.co/tasks/fill-mask|Fill-Mask Demo]] [[https://huggingface.co/tasks/text-generation|Text Generation Demo]] ===== Related Pages ===== * [[Autonomous Language Agents]] * [[BERT and Friends]] * [[ChatGPT]] * [[nlp:information_retrieval#Dense Document Retrieval with LLMs]] * [[Hallucination and Factivity]] * [[Instruction-Tuning]] * [[Large Reasoning Models]] (such as OpenAI o1 or DeepSeek R1) * [[ml:Mixture of Expert Models]] * [[Perplexity]] * [[Pretraining]] * [[Prompting]] * [[ml:Scaling Laws]] * [[Supertasks]]