====== Prompting and In-Context Learning ====== ===== Overviews ===== * [[https://arxiv.org/pdf/2107.13586.pdf|Liu et al 2021 - Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing]] * [[https://arxiv.org/pdf/2301.00234.pdf|Dong et al 2022 - A Survey on In-context Learning]] * [[https://arxiv.org/pdf/2212.09597.pdf|Qiao et al 2022 - Reasoning with Language Model Prompting: A Survey]] Very good * [[https://arxiv.org/pdf/2402.07927|Sahoo et al 2024 - A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications]] Not that great * **Tutorials, Courses, Slides and Guides** * Guides * [[https://www.promptingguide.ai/|Prompt Engineering Guide]] This one is pretty good * Slides * UMass Amherst: [[https://people.cs.umass.edu/~miyyer/cs685/slides/prompt_learning.pdf|Prompt-based learning]] * Stanford: [[https://web.stanford.edu/class/cs224n/slides/cs224n-2023-lecture11-prompting-rlhf.pdf|Prompting, Instruction Finetuning, and RLHF]] * Blog: [[https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/|Lil'log Prompt Engineering]] * Github: [[https://github.com/brexhq/prompt-engineering|BREX's Prompt Engineering Guide]] * Github: [[https://github.com/dair-ai/Prompt-Engineering-Guide|DAIR AI's Prompt Engineering Guide]] * Course: [[https://learnprompting.org/docs/intro|learnprompting.org]] ===== Prompting Language Models ===== ==== Zero-shot ==== * [[https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf|Radford et al 2019 - Language Models Are Unsupervised Multitask Learners]] [[https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf|old link]] GPT-2 * [[https://arxiv.org/pdf/2109.01652.pdf|Wei et al 2021 - Finetuned Language Models Are Zero-Shot Learners]] * [[https://arxiv.org/pdf/2212.09865.pdf|Lyu et al 2022 - Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations]] 
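The zero-shot setup above can be sketched in a few lines: the task is posed as a natural-language instruction with no demonstrations, and the model's completion is read off as the answer. A minimal sketch — `query_model` is a hypothetical placeholder for whatever LLM completion API is in use, not part of any of the papers above:

```python
def build_zero_shot_prompt(instruction: str, text: str) -> str:
    """Format a zero-shot prompt: task instruction plus input, no demonstrations."""
    return f"{instruction}\n\nInput: {text}\nAnswer:"


def query_model(prompt: str) -> str:
    # Placeholder: in practice this would call an LLM completion endpoint.
    raise NotImplementedError


prompt = build_zero_shot_prompt(
    "Classify the sentiment of the input as 'positive' or 'negative'.",
    "The movie was a delight from start to finish.",
)
print(prompt)
```

Few-shot prompting (next section) differs only in that worked input/output demonstrations are inserted between the instruction and the final input.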
==== Few-shot aka In-Context Learning ==== * [[https://arxiv.org/pdf/2009.07118.pdf|Schick & Schütze 2020 - It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners]] * [[https://arxiv.org/pdf/2012.11926.pdf|Schick & Schütze 2020 - Few-Shot Text Generation with Natural Language Instructions]] GenPET, prompting for natural language generation * **[[https://arxiv.org/pdf/2001.07676.pdf|Schick & Schütze 2021 - Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference]]** Introduces PET, pre-dates GPT-3 * [[https://arxiv.org/pdf/2005.14165.pdf|Brown et al 2020 - Language Models are Few-Shot Learners]] GPT-3 * [[https://arxiv.org/pdf/2012.15723.pdf|Gao et al 2021 - Making Pre-trained Language Models Better Few-shot Learners]] ==== Many-Shot In-Context Learning ==== Prompting with a large context of many shots. * [[https://arxiv.org/pdf/2404.11018|Agarwal et al 2024 - Many-Shot In-Context Learning]] ==== Soft-Prompting, etc ==== * See Soft-prompting overview on p.3 of [[https://aclanthology.org/2021.emnlp-main.672.pdf|Zhao & Schütze 2021]] * [[https://arxiv.org/pdf/2010.15980.pdf|Shin et al 2020 - AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts]] * **P-Tuning**: [[https://arxiv.org/pdf/2103.10385.pdf|Liu et al 2021 - GPT Understands, Too]] [[https://aclanthology.org/2021.emnlp-main.672.pdf|Zhao 2021]] finds this method to be the best.
* [[https://arxiv.org/pdf/2104.06599.pdf|Qin & Eisner 2021 - Learning How to Ask: Querying LMs with Mixtures of Soft Prompts]] * **Prompt Tuning**: [[https://arxiv.org/pdf/2104.08691.pdf|Lester et al 2021 - The Power of Scale for Parameter-Efficient Prompt Tuning]] Can be seen as a "simplification of the recently proposed “prefix tuning” of Li and Liang (2021)" * [[https://aclanthology.org/2021.emnlp-main.672.pdf|Zhao & Schütze 2021 - Discrete and Soft Prompting for Multilingual Models]] They find that soft prompting with an LSTM like [[https://arxiv.org/pdf/2103.10385.pdf|Liu et al 2021]] is best, both for English and cross-lingually. * [[https://arxiv.org/pdf/2110.07602.pdf|Liu et al 2021 - P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks]] * [[https://arxiv.org/pdf/2111.06719.pdf|Su et al 2021 - On Transferability of Prompt Tuning for Natural Language Processing]] * [[https://arxiv.org/pdf/2112.08348.pdf|Khashabi et al 2021 - Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts]] * [[https://arxiv.org/pdf/2201.08670.pdf|Tang et al 2022 - Context-Tuning: Learning Contextualized Prompts for Natural Language Generation]] * [[https://aclanthology.org/2022.acl-long.346.pdf|Vu et al 2022 - SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer]] - Multi-task, uses a library of learned soft prompts Prompt tuning can be slower than fine-tuning. See the figure below.\\ {{nlp:media:fine-tuning_vs_p-tuning.png?0x150}}\\ Figure from [[https://aclanthology.org/2022.naacl-main.290.pdf|Su et al 2022]]. See also figures 6-8 from [[https://arxiv.org/pdf/2203.06904.pdf|Ding et al 2022]]. ==== Prompt Design / Prompt Engineering ==== See [[Prompt Engineering]]. 
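The core mechanism shared by the soft-prompting papers above can be illustrated concretely. A minimal numpy sketch in the spirit of prompt tuning (Lester et al 2021) — all names and shapes here are illustrative, not code from any of these papers: a small matrix of "virtual token" embeddings is prepended to the frozen input embeddings, and only those k×d parameters receive gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k_prompt, seq_len = 16, 4, 10

# Trainable soft prompt: k_prompt virtual tokens, one embedding vector each.
soft_prompt = rng.normal(scale=0.02, size=(k_prompt, d_model))

# Frozen embeddings of the actual input tokens (stand-in for a real LM's lookup).
token_embeddings = rng.normal(size=(seq_len, d_model))

# The model consumes the concatenation; during training, gradients would flow
# only into soft_prompt while all model weights stay frozen.
model_input = np.concatenate([soft_prompt, token_embeddings], axis=0)

print(model_input.shape)  # (k_prompt + seq_len, d_model)
print(soft_prompt.size)   # number of tunable parameters: k_prompt * d_model
```

This is why the method is parameter-efficient: only k_prompt × d_model values are tuned per task, versus billions for full fine-tuning — though, as the figure below notes, convergence can be slower.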
==== Calibration and Scoring ==== * [[https://arxiv.org/pdf/2104.08315.pdf|Holtzman et al 2021 - Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right]] * [[https://arxiv.org/pdf/2309.17249.pdf|Zhou et al 2023 - Batch Calibration: Rethinking Calibration for In-Context Learning and Prompt Engineering]] ==== Data-Augmentation Prompting ==== * [[https://arxiv.org/pdf/2202.12499.pdf|Wang et al 2022 - PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks]] ==== Chain of Thought Prompting ==== See also [[Reasoning#Reasoning Chains|Reasoning - Reasoning Chains]]. * **Overviews** * [[https://arxiv.org/pdf/2309.15402.pdf|Chu et al 2023 - A Survey of Chain of Thought Reasoning: Advances, Frontiers and Future]] * [[https://arxiv.org/pdf/2401.14295.pdf|Besta et al 2024 - Topologies of Reasoning: Demystifying Chains, Trees, and Graphs of Thoughts]] * **[[https://arxiv.org/pdf/2201.11903.pdf|Wei et al 2022 - Chain of Thought Prompting Elicits Reasoning in Large Language Models]]** Introduced chain of thought prompting * [[https://arxiv.org/pdf/2205.11916.pdf|Kojima et al 2022 - Large Language Models are Zero-Shot Reasoners]] Introduced the prompt "Let's think step by step." 
* [[https://arxiv.org/pdf/2203.11171.pdf|Wang et al 2022 - Self-Consistency Improves Chain of Thought Reasoning in Language Models]] Samples multiple chain-of-thought reasoning paths and takes a majority vote over the final answers * [[https://arxiv.org/pdf/2203.08383.pdf|Wang et al 2022 - Iteratively Prompt Pre-trained Language Models for Chain of Thought]] * [[https://arxiv.org/pdf/2203.14465.pdf|Zelikman et al 2022 - STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning]] * [[https://arxiv.org/pdf/2207.10342.pdf|Dohan et al 2022 - Language Model Cascades]] * [[https://arxiv.org/pdf/2209.07686.pdf|Madaan & Yazdanbakhsh et al 2022 - Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango]] * [[https://arxiv.org/pdf/2210.01240.pdf|Saparov & He 2022 - Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought]] * [[https://arxiv.org/pdf/2210.03629.pdf|Yao et al 2022 - ReAct: Synergizing Reasoning and Acting in Language Models]] - The basis of LangChain * **[[https://arxiv.org/pdf/2211.12588|Chen et al 2022 - Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks]]** * [[https://arxiv.org/pdf/2305.04091.pdf|Wang et al 2023 - Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models]] * [[https://arxiv.org/pdf/2305.14992|Hao et al 2023 - Reasoning with Language Model is Planning with World Model]] * **Tree of Thought and Tree Search** * [[https://arxiv.org/pdf/2305.10601.pdf|Yao et al 2023 - Tree of Thoughts: Deliberate Problem Solving with Large Language Models]] * [[https://arxiv.org/pdf/2305.08291.pdf|Long 2023 - Large Language Model Guided Tree-of-Thought]] * [[https://arxiv.org/pdf/2309.17179.pdf|Feng et al 2023 - Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training]] * [[https://arxiv.org/pdf/2404.05966.pdf|Chi et al 2024 - THOUGHTSCULPT: Reasoning with Intermediate Revision and Search]] * 
[[https://arxiv.org/pdf/2306.14050.pdf|Li et al 2023 - Symbolic Chain-of-Thought Distillation: Small Models Can Also “Think” Step-by-Step]] * [[https://arxiv.org/abs/2308.05342|Wang & Zhao 2023 - Metacognitive Prompting Improves Understanding in Large Language Models]] * **[[https://arxiv.org/pdf/2310.01714.pdf|Yasunaga et al 2023 - Large Language Models as Analogical Reasoners]]** Adds to the prompt "# Instruction: ## Recall relevant exemplars: ## Solve the initial problem:", which helps more than "Let's think step by step." * [[https://arxiv.org/pdf/2402.10200.pdf|Wang & Zhou et al 2024 - Chain-of-Thought Reasoning Without Prompting]] * [[https://arxiv.org/pdf/2403.02178|Chen et al 2024 - Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models]] Masks the CoT to get better results * [[https://arxiv.org/pdf/2502.15589|Zhang et al 2025 - LightThinker: Thinking Step-by-Step Compression]] * [[https://arxiv.org/pdf/2505.24217|Leng et al 2025 - Semi-structured LLM Reasoners Can Be Rigorously Audited]] William Cohen paper * **Analysis of Chain of Thought** * [[https://arxiv.org/pdf/2310.07923|Merrill & Sabharwal 2024 - The Expressive Power of Transformers with Chain of Thought]] * [[https://arxiv.org/pdf/2502.21212|Huang et al 2025 - Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought]] See the related-work section for more ==== Cross-lingual Prompting ==== * [[https://aclanthology.org/2021.emnlp-main.672.pdf|Zhao & Schütze 2021 - Discrete and Soft Prompting for Multilingual Models]] ==== Miscellaneous Prompting Papers ==== * [[https://arxiv.org/pdf/2103.08493.pdf|Scao & Rush 2021 - How Many Data Points is a Prompt Worth?]] Prompts are very helpful in small data regimes, and are worth hundreds of data points. * [[https://arxiv.org/pdf/2112.08348.pdf|Khashabi et al 2021 - Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts]]. 
See also [[https://arxiv.org/pdf/2109.01247.pdf|Webson & Pavlick 2021]] * [[https://arxiv.org/pdf/2109.01247.pdf|Webson & Pavlick 2021 - Do Prompt-Based Models Really Understand the Meaning of Their Prompts?]] ==== Chained or Tool-based Prompting ==== For an overview see [[https://github.com/thunlp/ToolLearningPapers|Tool Learning Papers]] * **Overviews** * [[https://arxiv.org/pdf/2304.08354.pdf|Qin et al 2023 - Tool Learning with Foundation Models]] * [[https://modelcontextprotocol.io/docs/getting-started/intro|Model Context Protocol]] A standard introduced by Anthropic in 2024 * [[https://arxiv.org/pdf/2210.03629.pdf|Yao et al 2022 - ReAct: Synergizing Reasoning and Acting in Language Models]]. This kind of thing is implemented in [[https://github.com/hwchase17/langchain|LangChain]] * [[https://arxiv.org/abs/2302.04761|Schick et al 2023 - Toolformer: Language Models Can Teach Themselves to Use Tools]] * [[https://arxiv.org/pdf/2307.16789.pdf|Qin et al 2023 - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs]] * Uses [[https://rapidapi.com/|RapidAPI]] * [[https://arxiv.org/pdf/2402.01869|2024 - InferCept: Efficient Intercept Support for Augmented Large Language Model Inference]] * [[https://arxiv.org/pdf/2409.00920|Liu et al 2024 - ToolACE: Winning the Points of LLM Function Calling]] ==== Prompt Compression ==== * [[https://arxiv.org/pdf/2304.08467|Mu et al 2024 - Learning to Compress Prompts with Gist Tokens]] ==== Retrieval-Based Methods (Retrieval-Augmented) ==== See [[Retrieval-Augmented Methods]]. ==== Data Contamination Issues ==== See also [[ml: Membership Inference]]. * **Overviews** * [[https://arxiv.org/pdf/2404.00699.pdf|Ravaut et al 2024 - How Much are LLMs Contaminated? 
A Comprehensive Survey and the LLMSanitize Library]] * [[https://arxiv.org/pdf/2305.10160.pdf|Jacovi et al 2023 - Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks]] * [[https://arxiv.org/pdf/2312.16337|Li & Flanigan 2023 - Task Contamination: Language Models May Not Be Few-Shot Anymore]] * [[https://arxiv.org/pdf/2404.18543|Drinkall et al 2024 - Time Machine GPT]] * GSM1k: [[https://arxiv.org/pdf/2405.00332|Zhang et al 2024 - A Careful Examination of Large Language Model Performance on Grade School Arithmetic]] Re-evaluates GSM8K with a new dataset ==== Dependence on Number of Examples ==== * [[https://arxiv.org/pdf/2103.08493|Scao & Rush 2021 - How Many Data Points is a Prompt Worth?]] * [[https://arxiv.org/pdf/2404.11018|Agarwal et al 2024 - Many-Shot In-Context Learning]] ==== Comparison to Fine-Tuning ==== * [[https://arxiv.org/pdf/2305.16938|Mosbach et al 2023 - Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation]] * [[https://arxiv.org/pdf/2401.08406.pdf|Balaguer et al 2024 - RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture]] ==== Analysis of In-Context Learning ==== * [[https://arxiv.org/pdf/2109.01247.pdf|Webson & Pavlick 2021 - Do Prompt-Based Models Really Understand the Meaning of Their Prompts?]] * [[https://arxiv.org/pdf/2202.12837.pdf|Min et al 2022 - Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?]] * [[https://arxiv.org/pdf/2208.01066.pdf|Garg et al 2022 - What Can Transformers Learn In-Context? A Case Study of Simple Function Classes]] * [[https://arxiv.org/pdf/2211.15661.pdf|Akyürek et al 2022 - What learning algorithm is in-context learning? 
Investigations with linear models]] * [[https://arxiv.org/pdf/2310.15916.pdf|Hendel et al 2023 - In-Context Learning Creates Task Vectors]] * [[https://arxiv.org/pdf/2505.05145|Hu et al 2025 - Understanding In-context Learning of Addition via Activation Subspaces]] Great paper. Fig 1 is awesome. * [[https://arxiv.org/pdf/2504.00132|Bakalova et al 2025 - Contextualize-then-Aggregate: Circuits for In-Context Learning in Gemma-2 2B]] ===== Datasets ===== * Datasets with Prompts for Evaluating Language Models * **PromptSource**: [[https://github.com/bigscience-workshop/promptsource|github]] [[https://arxiv.org/pdf/2202.01279.pdf|Bach et al 2022 - PromptSource: An Integrated Development Environment and Repository for Natural Language Prompts]] 2,000 prompts for 170 datasets * **BIG-Bench**: [[https://github.com/google/BIG-bench|github]] [[https://arxiv.org/pdf/2206.04615.pdf|Srivastava et al 2022 - Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models]] Growing list of user-submitted tasks. Contains languages other than English * **Super-NaturalInstructions**: [[https://arxiv.org/pdf/2204.07705.pdf|Wang et al 2022 - SUPER-NATURALINSTRUCTIONS: Generalization via Declarative Instructions on 1600+ NLP Tasks]] 1,600+ tasks spanning 76 task types and 55 languages * **BIG-Bench-Hard** * **LM-Evaluation Harness**: [[https://github.com/EleutherAI/lm-evaluation-harness|github]] ===== Software ===== * [[https://github.com/hwchase17/langchain|LangChain]] Framework for building applications with prompting (chaining prompts, etc). 
This paper was the basis for it: [[https://arxiv.org/pdf/2210.03629.pdf|Yao et al 2022 - ReAct: Synergizing Reasoning and Acting in Language Models]] ===== Talks and Lectures ===== * [[https://underline.io/events/122/sessions?eventSessionId=4313|Invited Talk @ NAACL 2021: Humans Learn From Task Descriptions and So Should Our Models - Hinrich Schütze]] ===== People ===== * [[https://scholar.google.com/citations?user=k8CKy5UAAAAJ&hl=en|Timo Schick]] ===== Related Pages ===== * [[Instruction-Tuning]] * [[ml:Few-Shot Learning]] * [[ml:Fine-Tuning]] * [[Language Model]] * [[ml:Meta-Learning]] * [[Pretraining]] * [[Prompt Engineering]] * [[Retrieval-Augmented Methods]] * [[Task Descriptions|Natural Language Task Descriptions]] * [[ml:Zero-Shot Learning]]