====== Language Models ====== Traditional definition of a language model (LM): //a language model is a probability distribution over sentences//, that is, it assigns probabilities to sentences. Language models can usually compute the probability of the next word given the preceding words (//autoregressive language models//), or in the case of //masked language models//, the probability of a word given its surrounding context. Note: unlike autoregressive language models, masked language models usually can't be used to compute the probability of a sentence, so they aren't really "language models" in the traditional sense. To experiment with an autoregressive or masked language model, see **online demos** below. ===== Overviews ===== * **Introductory Material** * Basic intro and n-gram language modeling * [[http://www.cs.columbia.edu/~mcollins/courses/nlp2011/notes/lm.pdf|Language modeling]] by Mike Collins * [[https://homes.cs.washington.edu/~nasmith/papers/plm.17.pdf|Probabilistic Language Models]] by Noah Smith * [[https://web.stanford.edu/~jurafsky/slp3/3.pdf|Chapter 3]] of [[https://web.stanford.edu/~jurafsky/slp3/|Speech and Language Processing]] * Neural language models * Section 7.5 of [[https://web.stanford.edu/~jurafsky/slp3/7.pdf|Chapter 7]] of [[https://web.stanford.edu/~jurafsky/slp3/|Speech and Language Processing]] * Large language models * [[https://arxiv.org/pdf/2108.05542|2021 - AMMUS : A Survey of Transformer-based Pretrained Models in Natural Language Processing]] Comprehensive overview at the time * **[[https://arxiv.org/pdf/2303.05759.pdf|Wei et al 2023 - An Overview on Language Models: Recent Developments and Outlook]]** * **[[https://arxiv.org/pdf/2401.02038.pdf|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]]** Wow, really good * **[[https://arxiv.org/pdf/2307.06435|Naveed et al 2024 - A Comprehensive Overview of Large Language Models]]** * For another nice introduction, see related
work of [[https://arxiv.org/pdf/2211.09085.pdf|Taylor 2022]] (p. 3) * [[https://arxiv.org/pdf/2304.00612.pdf|Bowman 2023 - Eight Things to Know about Large Language Models]] * **[[https://arxiv.org/pdf/2402.06196|Minaee et al 2024 - Large Language Models: A Survey]]** * [[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]] * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]] * **[[https://arxiv.org/pdf/2303.18223.pdf|Zhao et al 2023 - A Survey of Large Language Models]]** * [[https://arxiv.org/pdf/2404.09022|Weng 2024 - Navigating the Landscape of Large Language Models: A Comprehensive Review and Analysis of Paradigms and Fine-Tuning Strategies]] * **[[https://arxiv.org/pdf/2501.17805|2025 - International AI Safety Report]]** (Has a good non-technical overview of AI, ML & LLMs) * **Language models in the news, etc** * [[https://www.wired.com/story/ai-text-generator-gpt-3-learning-language-fitfully/|Wired - GPT-3]] * [[https://twitter.com/sharifshameem/status/1282676454690451457|Twitter GPT-3 code example]] (Sharif Shameem) "I only had to write 2 samples to give GPT-3 context for what I wanted it to do. It then properly formatted all of the other samples... If I wanted it to write output plain HTML/CSS instead of JSX, all I would have to do would be to re-write my 2 initial samples in HTML/CSS. Then all of GPT-3's outputs would be in plain HTML/CSS." 
* [[https://analyticsindiamag.com/we-might-see-a-100t-language-model-in-2022/|We Might See A 100T Language Model In 2022]] [[https://web.archive.org/web/20220814084814/https://analyticsindiamag.com/we-might-see-a-100t-language-model-in-2022/|archive.org]] Nice overview of some large language models in 2022 * **Bibliographies** * **[[https://github.com/Hannibal046/Awesome-LLM|Awesome-LLM]]** ===== Papers ===== * n-gram Models: Old classic papers, and recent papers * [[https://www.stats.ox.ac.uk/~teh/research/compling/hpylm.pdf|Teh 2006 - A Bayesian Interpretation of Interpolated Kneser-Ney]] * [[https://aclanthology.org/2024.naacl-long.382.pdf|Malagutti et al 2024 - The Role of n-gram Smoothing in the Age of Neural Networks]] * Fill-In-the-Middle * [[https://arxiv.org/pdf/2207.14255.pdf|Bavarian et al 2022 - Efficient Training of Language Models to Fill in the Middle]] * See also [[https://huggingface.co/bigcode/starcoder#fill-in-the-middle|Starcoder Fill-In-The-Middle]] * Memory, Cache and Retrieval-Augmented Language Models * [[https://arxiv.org/pdf/1911.00172.pdf|Khandelwal et al 2019 - Generalization through Memorization: Nearest Neighbor Language Models]] * [[https://arxiv.org/pdf/2102.02557.pdf|Yogatama et al 2021 - Adaptive Semiparametric Language Models]] * [[https://arxiv.org/pdf/2112.04426.pdf|2021 - Improving language models by retrieving from trillions of tokens]] ([[https://vaclavkosar.com/ml/DeepMinds-RETRO-Transformer-Model|blog]]) * [[https://arxiv.org/pdf/2203.08913.pdf|Wu et al 2022 - Memorizing Transformers]] Uses k-NN lookup with fixed embeddings to retrieve relevant examples * [[https://arxiv.org/pdf/2202.01169.pdf|Clark et al 2022 - Unified Scaling Laws for Routed Language Models]] ===== Large Language Models ===== See also [[https://crfm.stanford.edu/ecosystem-graphs/index.html?mode=table|Ecosystem Graphs]] for a more complete list. This is a list of large, GPT-style autoregressive LMs. 
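To make "autoregressive" concrete: these models score a sentence with the chain rule, p(w_1 ... w_n) = ∏_t p(w_t | w_1 ... w_{t-1}). A minimal sketch with a count-based bigram model (the three-sentence corpus is invented for illustration; a neural LM replaces the count-based conditional with a learned one, but the factorization is the same):

```python
import math
from collections import Counter

# Tiny invented corpus; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> the cat sat </s>",
    "<s> the cat ran </s>",
    "<s> the dog sat </s>",
]

bigram_counts = Counter()
context_counts = Counter()
for sent in corpus:
    toks = sent.split()
    for prev, cur in zip(toks, toks[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def sentence_logprob(sentence):
    """Chain rule: log p(w_1..w_n) = sum_t log p(w_t | w_{t-1})."""
    toks = sentence.split()
    return sum(
        math.log(bigram_counts[(prev, cur)] / context_counts[prev])
        for prev, cur in zip(toks, toks[1:])
    )

# p(the|<s>)=1, p(cat|the)=2/3, p(sat|cat)=1/2, p(</s>|sat)=1
print(sentence_logprob("<s> the cat sat </s>"))  # log(1/3)
```

Under this toy model each of the three training sentences gets probability exactly 1/3, which is a nice sanity check that the distribution over (seen) sentences sums sensibly.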
See also [[pretraining]] for another list of large language models and [[https://gpt3demo.com/category/alternative-language-models|GPT-3 alternatives]]. * [[https://arxiv.org/pdf/1602.02410.pdf|Jozefowicz et al 2016 - Exploring the Limits of Language Modeling]] It's interesting to see how far we've come since 2016. * GPT: [[https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf|Radford et al 2018 - Improving Language Understanding by Generative Pre-Training]] [[https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf|old link]] * GPT-2: [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|Radford et al 2019 - Language Models are Unsupervised Multitask Learners]] [[https://github.com/openai/gpt-2|original github]] [[https://amaarora.github.io/2020/02/18/annotatedGPT2.html|Annotated GPT-2]] [[https://jalammar.github.io/illustrated-gpt2/|Illustrated GPT-2]] Interestingly, GPT-2 does //not// include a bias term in the final linear layer for the vocab, see [[https://github.com/openai/gpt-2/blob/master/src/model.py#L171|here]] and [[https://github.com/huggingface/transformers/blob/v4.19.2/src/transformers/models/gpt2/modeling_gpt2.py#L951|here]]. 
* GPT-3: [[https://arxiv.org/pdf/2005.14165.pdf|Brown et al 2020 - Language Models are Few-Shot Learners]] [[https://beta.openai.com/|OpenAI]] [[https://github.com/openai/openai-cookbook|cookbook]] * [[https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/|Turing-NLG: A 17-billion-parameter language model by Microsoft]] * [[https://arxiv.org/pdf/2101.03961.pdf|Fedus et al 2021 - Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity]] * Gopher: [[https://arxiv.org/pdf/2112.11446.pdf|Rae et al 2021 - Scaling Language Models: Methods, Analysis & Insights from Training Gopher]] [[https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval|blog]] * Jurassic-1: [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Lieber et al 2021 - Jurassic-1: Technical Details and Evaluation]] [[https://studio.ai21.com/docs/jurassic1-language-models/|model]] [[https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1|blog]] * Megatron-Turing NLG: [[https://arxiv.org/pdf/2201.11990.pdf|Smith et al 2022 - Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model]] [[https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|Microsoft blog]] [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|NVidia blog]] [[https://developer.nvidia.com/megatron-turing-natural-language-generation|Researcher access]] [[https://github.com/NVIDIA/Megatron-LM|code]] [[https://github.com/NVIDIA/Megatron-LM#downloading-checkpoints|models]] * Chinchilla: [[https://arxiv.org/pdf/2203.15556.pdf|Hoffmann et al 2022 - Training Compute-Optimal Large Language
Models]] Argues that most LLMs are undertrained, and trains a compute-optimal model using the same dataset as [[https://arxiv.org/pdf/2112.11446.pdf|Gopher]]. [[https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training|blog1]] [[https://towardsdatascience.com/a-new-ai-trend-chinchilla-70b-greatly-outperforms-gpt-3-175b-and-gopher-280b-408b9b4510|blog2]] * PaLM: [[https://arxiv.org/pdf/2204.02311.pdf|Chowdhery et al 2022 - PaLM: Scaling Language Modeling with Pathways]] [[https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html|blog]] * GPT-NeoX-20B: [[https://arxiv.org/pdf/2204.06745.pdf|Black et al 2022 - GPT-NeoX-20B: An Open-Source Autoregressive Language Model]] Has an interesting description of the hardware they used * OPT: [[https://arxiv.org/pdf/2205.01068.pdf|Zhang et al 2022 - OPT: Open Pre-trained Transformer Language Models]] [[https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/|blog]] [[https://github.com/facebookresearch/metaseq/tree/main/projects/OPT|models]] * [[https://huggingface.co/docs/transformers/model_doc/bloom|Bloom]]: [[https://huggingface.co/bigscience/bloom|model card]] [[https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml#readme|Training readme]] [[https://huggingface.co/tensorboard/bigscience/tr11-176B-ml-logs|Tensorboard log]] * [[https://arxiv.org/pdf/2210.02414.pdf|Zeng et al 2022 - GLM-130B: An Open Bilingual Pre-trained Model]] * [[https://arxiv.org/pdf/2309.03852.pdf|Li et al 2023 - FLM-101B: An Open LLM and How to Train It with $100K Budget]] ^ Model ^ Year ^ Parameters ^ Training Data ^ Public?
^ Link ^ | [[https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf|GPT]] | 2018 | | BooksCorpus | Yes | [[https://github.com/openai/finetune-transformer-lm|github]] [[https://huggingface.co/openai-gpt|Huggingface]] | | [[https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf|GPT-2]] | 2019 | 1.5B | Webtext (closed, see [[language_model#datasets]] below) | Yes | [[https://github.com/openai/gpt-2|github]] [[https://huggingface.co/gpt2|Huggingface]] | | [[https://arxiv.org/pdf/2005.14165.pdf|GPT-3]] | 2020 | 175B | CommonCrawl, Webtext2, Books 1&2, Wikipedia | API | [[https://beta.openai.com/|OpenAI]] [[https://github.com/openai/openai-cookbook|cookbook]] | | [[https://arxiv.org/pdf/2112.10684.pdf|MoE]] | 2021 | 1.1T (13B) | CC100, CC-News, CC-Stories, OpenWebText, BookCorpus, Wikipedia | Yes | [[https://github.com/facebookresearch/fairseq/tree/main/examples/moe_lm|github]] [[https://huggingface.co/KoboldAI/fairseq-dense-6.7B|HuggingFace]] | | [[https://arxiv.org/pdf/2112.11446.pdf|Gopher]] | 2021 | 280B | [[https://arxiv.org/pdf/2112.11446.pdf|MassiveText]] | No | [[https://www.deepmind.com/blog/language-modelling-at-scale-gopher-ethical-considerations-and-retrieval|blog]] | | [[https://arxiv.org/pdf/2201.11990.pdf|Megatron-Turing NLG]] | 2022 | 530B | [[https://pile.eleuther.ai/|Pile]], CommonCrawl, Realnews, CC-Stories | [[https://developer.nvidia.com/megatron-turing-natural-language-generation|Researcher access]] | [[https://www.microsoft.com/en-us/research/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog1]] [[https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/|blog2]] [[https://github.com/NVIDIA/Megatron-LM|github]] | | [[https://arxiv.org/pdf/2203.15556.pdf|Chinchilla]] | 2022 | 70B | 
[[https://arxiv.org/pdf/2112.11446.pdf|MassiveText]] | No | [[https://www.deepmind.com/blog/an-empirical-analysis-of-compute-optimal-large-language-model-training|blog]] | | [[https://arxiv.org/pdf/2204.06745.pdf|GPT-NeoX-20B]] | 2022 | 20B | [[https://pile.eleuther.ai/|Pile]] | Yes | [[https://github.com/EleutherAI/gpt-neox|github]] | | [[https://uploads-ssl.webflow.com/60fd4503684b466578c0d307/61138924626a6981ee09caf6_jurassic_tech_paper.pdf|Jurassic-1]] | 2022 | 178B | | API | [[https://studio.ai21.com/docs/jurassic1-language-models/|AI21 studio]] | | [[https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6|YaLM 100B]] | 2022 | 100B | [[https://pile.eleuther.ai/|Pile]] + lots of Russian text | Yes | [[https://github.com/yandex/YaLM-100B|github]] [[https://huggingface.co/yandex/yalm-100b|HuggingFace]] | | [[https://arxiv.org/pdf/2204.02311.pdf|PaLM]] | 2022 | 540B | Social media, web, books, Github, Wikipedia | No? | [[https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html|blog]] | | [[https://arxiv.org/pdf/2205.01068.pdf|OPT]] | 2022 | 66B, 175B | Pile subset: CommonCrawl, OpenWebtext2, Gutenberg, Wikipedia | Yes | [[https://opt.alpa.ai/|demo]] [[https://github.com/facebookresearch/metaseq/tree/main/projects/OPT|models]] | | [[https://arxiv.org/pdf/2205.05131.pdf|UL2]] | 2022 | 20B | | Yes | [[https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html|blog]] [[https://github.com/google-research/google-research/tree/master/ul2|github]] | | [[https://arxiv.org/pdf/2211.05100.pdf|Bloom]] | 2022 | 176B | Multilingual [[https://huggingface.co/spaces/bigscience/BigScienceCorpus|BigScienceCorpus]] [[https://openreview.net/pdf?id=UoEw6KigkUn|paper]] | Yes | [[https://huggingface.co/docs/transformers/model_doc/bloom|HuggingFace]] [[https://huggingface.co/bigscience/bloom|demo]] | | [[https://arxiv.org/pdf/2210.02414.pdf|GLM-130B]] | 2022 | 130B | Pile, Chinese 
WudaoCorpora, more | Yes | [[https://github.com/THUDM/GLM-130B|github]] | |[[https://arxiv.org/pdf/2211.09085.pdf|Galactica]] | 2022 | 120B | Scientific papers, code, reference material, prompts | Yes |[[https://github.com/paperswithcode/galai|github]] [[https://huggingface.co/models?other=galactica|HuggingFace]] | | [[https://openai.com/blog/chatgpt|ChatGPT]] | 2022 | ? | | API | [[https://chat.openai.com/|demo]] [[https://sharegpt.com/|ShareGPT]] | | [[https://arxiv.org/pdf/2302.13971.pdf|LLaMA]] | 2023 | 65B | CommonCrawl, C4, Github, Wikipedia, Books3, ArXiv, StackExchange | Yes | [[https://ai.facebook.com/blog/large-language-model-llama-meta-ai/|blog]] [[https://github.com/facebookresearch/llama|github]] | | [[https://arxiv.org/pdf/2303.08774.pdf|GPT-4]] | 2023 | ? | ? (multi-modal) | API | [[https://openai.com/research/gpt-4|website]] | | [[https://crfm.stanford.edu/2023/03/13/alpaca.html|Alpaca]] | 2023 | 7B | [[https://raw.githubusercontent.com/tatsu-lab/stanford_alpaca/main/alpaca_data.json|52k instructions]] from [[https://arxiv.org/pdf/2212.10560.pdf|Self-Instruct]] w/ text-davinci-003 | Yes | [[https://github.com/tatsu-lab/stanford_alpaca|github]] [[https://crfm.stanford.edu/alpaca/|demo]] | | [[https://vicuna.lmsys.org/|Vicuna]] | 2023 | 7B/13B | (Chatbot) | Yes | [[https://github.com/lm-sys/FastChat|github]] [[https://chat.lmsys.org/|demo]] | | [[https://bair.berkeley.edu/blog/2023/04/03/koala/|Koala]] | 2023 | 13B | | Yes | [[https://github.com/young-geng/EasyLM|github]] [[https://chat.lmsys.org/?model=koala-13b/|demo]] | | [[https://huggingface.co/blog/stackllama| StackLLaMA]] | 2023 | 7B | | Yes |[[https://huggingface.co/spaces/trl-lib/stack-llama|demo]] | | [[https://arxiv.org/pdf/2305.11206.pdf|LIMA]] | 2023 | 65B | | | | | [[https://ai.google/static/documents/palm2techreport.pdf|PaLM 2]] | 2023 | 14.7B | | API | [[https://ai.google/discover/palm2|website]] [[https://developers.generativeai.google/|api]] | | 
[[https://arxiv.org/pdf/2307.09288.pdf|Llama 2]] | 2023 | 70B | | Yes | [[https://ai.meta.com/llama/|website]] [[https://about.fb.com/news/2023/07/llama-2/|blog]] | | [[https://arxiv.org/pdf/2310.06825.pdf|Mistral 7B]], [[https://mistral.ai/news/mixtral-of-experts/|Mixtral 8x7B]] | 2023 | 7B | | Yes, API | | | [[https://arxiv.org/pdf/2311.11045.pdf|Orca 2]] | 2023 | | | | | | [[https://arxiv.org/pdf/2402.00838.pdf|OLMo]] | 2024 | 7B | [[https://huggingface.co/datasets/allenai/dolma|dolma]] | Yes, open data | [[https://blog.allenai.org/hello-olmo-a-truly-open-llm-43f7e7359222|blog]] [[https://github.com/allenai/OLMo|github]] [[https://huggingface.co/allenai/OLMo-7B|huggingface]] | | [[https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf|Gemma]] | 2024 | 7B, 2B | | Yes | [[https://blog.google/technology/developers/gemma-open-models/|blog]] | | [[https://arxiv.org/pdf/2403.19887|Jamba]] | 2024 | 52B | | Yes | [[https://www.ai21.com/blog/announcing-jamba|blog]] [[https://huggingface.co/ai21labs/Jamba-v0.1|HuggingFace]] | | [[https://arxiv.org/pdf/2404.14619|OpenELM]] | 2024 | 1.1B | | Yes | | | [[https://arxiv.org/pdf/2507.20534|Kimi K2]] | 2025 | 1T | | Yes | | ===== Abilities and Analysis of LLMs ===== * **ChatGPT** * For ChatGPT, see also [[ChatGPT]].
* **[[https://arxiv.org/pdf/2305.18486.pdf|Laskar et al 2023 - A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets]]** * [[https://arxiv.org/pdf/2311.04939.pdf|Ronan & Schneider 2023 - Can ChatGPT solve a Linguistics Exam?]] * [[https://arxiv.org/pdf/2301.13867|Frieder et al 2023 - Mathematical Capabilities of ChatGPT]] * **Creativity** * [[https://arxiv.org/pdf/2401.12491.pdf|Zhao et al 2024 - Assessing and Understanding Creativity in Large Language Models]] * [[https://arxiv.org/pdf/2311.09682|Tian et al 2024 - MacGyver: Are Large Language Models Creative Problem Solvers?]] * **Self-Correction** * [[https://arxiv.org/pdf/2406.15673|Liu et al 2024 - Large Language Models have Intrinsic Self-Correction Ability]] * **Use of Context** * [[https://arxiv.org/pdf/1805.04623.pdf|Khandelwal et al 2018 - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context]] (Old, no longer applies to transformer models) * [[https://arxiv.org/pdf/2307.03172|Liu et al 2023 - Lost in the Middle: How Language Models Use Long Contexts]] * [[https://link.springer.com/chapter/10.1007/978-3-031-88708-6_16|Hutter et al 2025 - Lost but Not Only in the Middle]] ==== Origin of Capabilities ==== * [[https://arxiv.org/pdf/2505.23323|Madabushi et al 2025 - Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors]] * **Machine Translation** * [[https://arxiv.org/pdf/2305.10266|Briakou et al 2023 - Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability]] * [[https://arxiv.org/pdf/2505.23548|Balashov 2025 - Translation in the Wild]] ===== Evaluation of LLMs and Benchmarks ===== * **Overviews** * [[https://arxiv.org/pdf/2307.03109|Chang et al 2023 - A Survey on Evaluation of Large Language Models]] * For common evaluation datasets for LLMs, see recent LLM system description papers such as the [[https://arxiv.org/pdf/2407.21783|Llama 3 paper]] (table 2) or
[[https://www.anthropic.com/news/claude-sonnet-4-5|Claude Sonnet 4.5]] (evaluation table). * lm-evaluation-harness: [[https://github.com/EleutherAI/lm-evaluation-harness|LM Evaluation Harness (EleutherAI)]] (Released May 2021) * [[https://arxiv.org/pdf/2401.00595|Mizrahi et al 2024 - State of What Art? A Call for Multi-Prompt LLM Evaluation]] * lm-eval: [[https://arxiv.org/pdf/2405.14782|Biderman et al 2024 - Lessons from the Trenches on Reproducible Evaluation of Language Models]] * **Small-scale Evaluations** * [[https://arxiv.org/pdf/2402.14992|Polo et al 2024 - tinyBenchmarks: evaluating LLMs with fewer examples]] * **Effects of Length and Irrelevant Context** * [[https://arxiv.org/pdf/2402.14848|Levy et al 2024 - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models]] ===== Tool-Use in LLMs ===== See also [[prompting#Chained or Tool-based Prompting]]. * **Overviews and Background** * [[https://modelcontextprotocol.io/docs/getting-started/intro|Model Context Protocol]] ===== Retrieval-Augmented Generation (RAG) ===== See [[Retrieval-Augmented Methods]].
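The RAG recipe can be sketched in a few lines: retrieve the stored passages most similar to the query, then prepend them to the prompt that goes to the LLM. A toy sketch with bag-of-words cosine similarity standing in for a dense retriever (the passages, query, and prompt template are invented, and no actual LLM is called):

```python
import math
import re
from collections import Counter

# Invented mini document store; a real system would index a large corpus
# with a dense encoder instead of bag-of-words vectors.
passages = [
    "Chinchilla was trained on the MassiveText dataset.",
    "GPT-NeoX-20B was trained on the Pile.",
    "The Eiffel Tower is in Paris.",
]

def bow(text):
    """Bag-of-words vector as a token->count mapping."""
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, k=1):
    """Return the k passages most similar to the query."""
    q = bow(query)
    return sorted(passages, key=lambda p: cosine(q, bow(p)), reverse=True)[:k]

def build_prompt(query):
    # Prepend the retrieved context; this string is what would be
    # sent to the language model for generation.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What dataset was GPT-NeoX-20B trained on?"))
```

The design point is that the LM itself is unchanged; only the prompt is augmented, which is why retrieval can inject knowledge the model was never trained on.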
===== Limitations of Current LLMs ===== * [[https://aclanthology.org/2025.acl-long.1016.pdf|Shaikh et al 2025 - Navigating Rifts in Human-LLM Grounding: Study and Benchmark]] ===== Questions and Critiques of LLMs ===== * [[https://s10251.pcdn.co/pdf/2021-bender-parrots.pdf|Bender et al 2021 - On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?]] * [[https://arxiv.org/pdf/2308.07120|Rogers & Luccioni 2023 - Position: Key Claims in LLM Research Have a Long Tail of Footnotes]] ===== Adapting Language Models ===== ==== To Domains ==== * [[https://arxiv.org/pdf/2302.03169.pdf|Xie et al 2023 - Data Selection for Language Models via Importance Resampling]] ==== To Other Languages ==== * **Language Adaptive Fine-Tuning (LAFT)**: * [[https://arxiv.org/pdf/1910.11856.pdf|Artetxe et al 2019 - On the Cross-lingual Transferability of Monolingual Representations]] * [[https://arxiv.org/pdf/2012.15562.pdf|Pfeiffer et al 2020 - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts]] * Recycling: **[[https://aclanthology.org/2021.findings-acl.74.pdf|de Vries & Nissim 2021 - As Good as New. How to Successfully Recycle English GPT-2 to Make Models for Other Languages]]** This one works the best so far.
Retrains the embeddings while keeping the Transformer layers fixed * [[https://aclanthology.org/2021.emnlp-main.672.pdf|Zhao & Schütze 2021 - Discrete and Soft Prompting for Multilingual Models]] * Multi-lingual version: [[https://arxiv.org/pdf/2204.06487.pdf|Alabi et al 2022 - Adapting Pre-trained Language Models to African Languages via Multilingual Adaptive Fine-Tuning]] * [[https://aclanthology.org/2022.emnlp-main.616.pdf|Lin et al 2022 - Few-shot Learning with Multilingual Generative Language Models]] * [[https://arxiv.org/pdf/2210.03057.pdf|Shi et al 2022 - Language Models are Multilingual Chain-of-Thought Reasoners]] * [[https://arxiv.org/pdf/2212.10503.pdf|Marchisio et al 2022 - Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training]] * [[https://arxiv.org/pdf/2304.01922.pdf|Štefánik et al 2023 - Resources and Few-shot Learners for In-context Learning in Slavic Languages]] (Dataset) * [[https://arxiv.org/pdf/2401.01055.pdf|Zhao et al 2024 - LLaMA Beyond English: An Empirical Study on Language Capability Transfer]] ==== Temporal Language Modeling ==== * [[https://arxiv.org/pdf/2102.01951.pdf|Lazaridou et al 2021 - Mind the Gap: Assessing Temporal Generalization in Neural Language Models]] * [[https://arxiv.org/pdf/2404.10297|Li & Flanigan 2024 - Future Language Modeling from Temporal Document History]] * [[https://arxiv.org/pdf/2404.18543|Drinkall et al 2024 - Time Machine GPT]] ===== Extracting Knowledge from Language Models ===== See also [[nlp:information_retrieval#Dense Document Retrieval with LLMs]].
* Extracting Training Data * [[https://arxiv.org/pdf/2012.07805.pdf|Carlini et al 2020 - Extracting Training Data from Large Language Models]] [[https://github.com/ftramer/LM_Memorization|github]] * [[https://arxiv.org/pdf/2601.02671|Ahmed et al 2026 - Extracting Books from Production Language Models]] * Membership Inference for Training Data * (Decide whether a given sample was in the training data or not) * Related page: [[ml:Privacy#Membership Inference Attacks]] * [[https://arxiv.org/pdf/1811.00513.pdf|Song & Shmatikov 2018 - Auditing Data Provenance in Text-Generation Models]] * [[https://arxiv.org/pdf/1909.01066.pdf|Petroni et al 2019 - Language Models as Knowledge Bases?]] * [[https://arxiv.org/pdf/1909.00505.pdf|Feldman et al 2019 - Commonsense Knowledge Mining from Pretrained Models]] * [[https://www.mitpressjournals.org/doi/pdf/10.1162/tacl_a_00324|Jiang et al 2020 - How Can We Know What Language Models Know?]] * [[https://arxiv.org/pdf/2106.09231.pdf|Cao et al 2021 - Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases]] * [[https://arxiv.org/pdf/2110.08387.pdf|Liu et al 2022 - Generated Knowledge Prompting for Commonsense Reasoning]] * [[https://arxiv.org/pdf/2201.07207.pdf|Huang et al 2022 - Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents]] * [[https://arxiv.org/pdf/2205.11482.pdf|Akyürek et al 2022 - Tracing Knowledge in Language Models Back to the Training Data]] * [[https://arxiv.org/pdf/2404.15146|Schwarzschild et al 2024 - Rethinking LLM Memorization through the Lens of Adversarial Compression]] ===== Knowledge Editing ===== See [[Knowledge Editing]] and [[ml:Model Editing and Unlearning]].
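The cloze-style probing idea behind "Language Models as Knowledge Bases?" and its follow-ups can be sketched as: fill a [MASK] template with each candidate answer and keep the candidate the model scores highest. In this toy sketch, bigram counts over an invented corpus stand in for the language model's scoring function; a real probe would rank candidates by masked-LM log-probability:

```python
from collections import Counter

# Invented "training corpus" standing in for what a model has absorbed.
corpus = ("paris is the capital of france . "
          "germany has the capital berlin . "
          "the louvre is in paris .")
toks = corpus.split()
bigram_counts = Counter(zip(toks, toks[1:]))

def lm_score(sentence):
    # Stand-in scorer: total bigram evidence for the sentence.
    words = sentence.lower().split()
    return sum(bigram_counts[bg] for bg in zip(words, words[1:]))

def probe(template, candidates):
    """Fill [MASK] with each candidate and return the best-scoring filler."""
    return max(candidates, key=lambda c: lm_score(template.replace("[MASK]", c)))

print(probe("paris is the capital of [MASK]", ["germany", "france", "italy"]))  # france
```

Note the toy scorer can only surface facts stated (nearly) verbatim in its corpus; the papers above study how far real LMs go beyond such literal memorization.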
===== Personalization ===== * [[https://arxiv.org/pdf/2304.11406|Salemi et al 2023 - LaMP: When Large Language Models Meet Personalization]] * [[https://arxiv.org/pdf/2401.05459|Li et al 2024 - Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security]] ===== LLM Personality and Writing Style ===== * **Personality** * [[https://arxiv.org/pdf/2305.02547|Jiang et al 2023 - PersonaLLM: Investigating the Ability of Large Language Models to Express Personality Traits]] * [[https://arxiv.org/pdf/2307.16180|Pan & Zeng 2023 - Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models]] * [[https://arxiv.org/pdf/2310.01386|Huang et al 2023 - Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench]] * [[https://arxiv.org/pdf/2310.02168|Mao et al 2023 - Editing Personality for Large Language Models]] * [[https://arxiv.org/pdf/2311.05297|Suhr et al 2023 - Challenging the Validity of Personality Tests for Large Language Models]] * **Vocabulary Overuse** * [[https://arxiv.org/pdf/2406.07016|Kobak et al 2024 - Delving into ChatGPT usage in academic writing through excess vocabulary]] ===== Detecting Generated Text ===== See also [[nlp:automatic_fact_checking#Fake News Detection]]. * [[https://aclanthology.org/2022.naacl-main.88.pdf|Rodriguez - Cross-Domain Detection of GPT-2-Generated Technical Text]] * [[https://arxiv.org/abs/2301.11305|Mitchell et al 2023 - DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature]] * [[https://arxiv.org/pdf/2305.09859|Mireshghallah et al 2023 - Smaller Language Models are Better Black-box Machine-Generated Text Detectors]] ===== Adversarial Attacks ===== * [[https://arxiv.org/pdf/2311.04235.pdf|Mu et al 2023 - Can LLMs Follow Simple Rules?]] ===== Steering ===== * [[https://arxiv.org/pdf/2501.17148|Wu et al 2025 - AxBench: Steering LLMs? 
Even Simple Baselines Outperform Sparse Autoencoders]] * [[https://arxiv.org/pdf/2505.20809|Wu et al 2025 - Improved Representation Steering for Language Models]] ===== Applications ===== * **Evaluation**, see [[Evaluation#Evaluation with Large Language Models]] * **Creating Data** or as a replacement for crowdsourcing, see [[Data Augmentation]] (Synthetic Data Generation) ===== Copyright Issues ===== See [[Copyright Issues]]. * [[https://arxiv.org/pdf/2303.15715.pdf|Henderson et al 2023 - Foundation Models and Fair Use]] * [[https://arxiv.org/pdf/2310.13771.pdf|Karamolegkou et al 2023 - Copyright Violations and Large Language Models]] ===== Theoretical and Foundational Papers ===== See also [[Prompting#Analysis of In-Context-Learning]] and [[Language Model#Origin of Capabilities|Language Model - Origin of Capabilities]]. ==== Emergent Abilities ==== See also [[ml:Scaling Laws#Emergent Abilities|Scaling Laws - Emergent Abilities]]. * [[https://arxiv.org/pdf/2309.01809.pdf|Lu et al 2023 - Are Emergent Abilities in Large Language Models just In-Context Learning?]] ===== Acceleration and Efficiency ===== See paper list [[https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey|here]]. See also [[ml:Model Compression]]. * **Overviews** * [[https://arxiv.org/pdf/2202.07105|Xu & McAuley 2022 - A Survey on Model Compression and Acceleration for Pretrained Language Models]] * **[[https://arxiv.org/pdf/2312.03863|Wan et al 2023 - Efficient Large Language Models: A Survey]]** Updated continuously.
===== Economics of LLMs ===== * [[https://arxiv.org/pdf/2306.07402|Howell et al 2023 - The Economic Trade-offs of Large Language Models: A Case Study]] ===== Miscellaneous ===== ==== Concept or Semantic LLMs ==== * [[https://arxiv.org/pdf/2412.08821|Meta 2024 - Large Concept Models: Language Modeling in a Sentence Representation Space]] * [[https://arxiv.org/pdf/2501.05487|Ahmad & Goel 2025 - The Future of AI: Exploring the Potential of Large Concept Models]] ==== Consciousness of LLMs ==== * **Overviews** * [[https://arxiv.org/pdf/2505.19806|Chen et al 2025 - Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks]] ===== Historical Papers ===== Historical papers that may or may not be applicable today. * [[https://www.stats.ox.ac.uk/~teh/research/compling/hpylm.pdf|Teh 2006 - A Bayesian Interpretation of Interpolated Kneser-Ney]] * [[https://arxiv.org/pdf/1707.05589.pdf|Melis et al 2017 - On the State of the Art of Evaluation in Neural Language Models]] Shows that LSTMs, when properly tuned, outperform other models (as of 2017, so before the Transformer) * [[https://arxiv.org/pdf/1805.04623.pdf|Khandelwal et al 2018 - Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context]] ===== Datasets ===== * Standard benchmark datasets * [[https://developer.ibm.com/exchanges/data/all/wikitext-103/|Wikitext 103]] * [[https://catalog.ldc.upenn.edu/LDC99T42|Penn Treebank]] * [[https://github.com/deepmind/pg19|PG-19]] Uses books published before 1919. Good for long sequences. * Large datasets * [[https://github.com/soskek/bookcorpus|Bookcorpus]], also reproduced in the Pile, see [[https://github.com/soskek/bookcorpus/issues/27|here]].
[[https://the-eye.eu/public/AI/pile_preliminary_components/books1.tar.gz|reproduction]] [[https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2|original dataset]] * Common Crawl * [[https://openwebtext2.readthedocs.io/en/latest/background/|WebText and OpenWebText]]: * WebText: Introduced in GPT-2 ([[https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf|paper]]). * OpenWebText: Various implementations [[https://github.com/jcpeterson/openwebtext|here]] and [[https://github.com/yet-another-account/openwebtext|here]] and [[https://skylion007.github.io/OpenWebTextCorpus/|here]] (on HuggingFace [[https://huggingface.co/datasets/Skylion007/openwebtext|here]]). Used in MegatronLM [[https://github.com/NVIDIA/Megatron-LM/tree/main/tools/openwebtext|here]]. * [[https://openwebtext2.readthedocs.io/|OpenWebText2]] Open re-implementation, widely used. Use this one. On HuggingFace [[https://huggingface.co/datasets/the_pile_openwebtext2|here]]. * [[https://www.tensorflow.org/datasets/catalog/c4|Colossal Clean Crawled Corpus (C4)]]: [[https://arxiv.org/pdf/1910.10683.pdf|paper]] AI2 reimplementation at [[https://huggingface.co/datasets/c4|HuggingFace]] * [[https://pile.eleuther.ai/|The Pile]] A diverse set of data for building language models. [[https://the-eye.eu/public/AI/pile_neox/|Individual components]], see readme.
[[https://the-eye.eu/public/AI/pile_preliminary_components/|Older individual components]] Paper: [[https://arxiv.org/pdf/2101.00027.pdf|The Pile: An 800GB Dataset of Diverse Text for Language Modeling]] [[https://pile.dataportraits.org/|Pile check tool]] * The Pile has been removed due to this [[https://actionnetwork.org/petitions/authors-guild-open-letter-to-generative-ai-leaders|letter]], see [[https://huggingface.co/datasets/EleutherAI/pile/discussions/15|here]] and [[https://techcrunch.com/2023/07/18/thousands-of-authors-sign-letter-urging-ai-makers-to-stop-stealing-books/|here]] * [[https://huggingface.co/spaces/bigscience/BigScienceCorpus|BigScienceCorpus]]: [[https://openreview.net/pdf?id=UoEw6KigkUn|2022 - The BigScience Corpus: A 1.6TB Composite Multilingual Dataset]] * [[https://github.com/togethercomputer/RedPajama-Data|RedPajama-Data]] An open-source recipe to reproduce the LLaMA training dataset * Code datasets * [[https://huggingface.co/datasets/bigcode/the-stack|The Stack]] Used in [[https://huggingface.co/blog/starcoder|StarCoder]].
Has two membership test websites: [[https://huggingface.co/spaces/bigcode/in-the-stack|Am I in the stack]] and [[https://stack.dataportraits.org/|DataPortraits]] * Small datasets * [[https://babylm.github.io/|BabyLM Challenge]] * TinyStories: [[https://arxiv.org/pdf/2305.07759.pdf|paper]] [[https://huggingface.co/datasets/roneneldan/TinyStories|dataset]] * MiniPile: [[https://arxiv.org/pdf/2304.08442.pdf|paper]] [[https://huggingface.co/datasets/JeanKaddour/minipile|Huggingface]] ===== Software and Demos ===== * **Training and/or Inference Frameworks for LLMs** * For an overview, see table 4 and section 3.7 of [[https://arxiv.org/pdf/2401.02038|Liu et al 2024 - Understanding LLMs: A Comprehensive Overview from Training to Inference]] * (Historical) n-gram LM toolkits * The best, highly optimized toolkit: [[https://kheafield.com/code/kenlm/|KenLM]] * Industry standard toolkit with many options: [[http://www.speech.sri.com/projects/srilm/download.html|SRILM]] * [[https://www.nltk.org/|NLTK]] also implements n-gram LMs * Deep learning toolkits * [[https://github.com/EleutherAI/gpt-neo|GPT-Neo]] An open-source implementation of model & data parallel GPT3-like models using the mesh-tensorflow library.
* [[https://github.com/NVIDIA/Megatron-LM|NVidia's Megatron-LM]] Used, for example, by BLOOM * **Online demos** * AI21's Jurassic-1 language model [[https://studio.ai21.com/docs/jurassic1-language-models/|Jurassic-1]] * GPT-3: [[https://openai.com/api/|web interface]] is free after signing up * Hugging Face task demos: [[https://huggingface.co/tasks/fill-mask|Fill-Mask Demo]] [[https://huggingface.co/tasks/text-generation|Text Generation Demo]] ===== Related Pages ===== * [[Autonomous Language Agents]] * [[BERT and Friends]] * [[ChatGPT]] * [[nlp:information_retrieval#Dense Document Retrieval with LLMs]] * [[Hallucination and Factivity]] * [[Instruction-Tuning]] * [[Large Reasoning Models]] (such as OpenAI o1 or DeepSeek R1) * [[ml:Mixture of Expert Models]] * [[Perplexity]] * [[Pretraining]] * [[Prompting]] * [[ml:Scaling Laws]] * [[Supertasks]]