====== Instruction Tuning ======

===== Overviews =====

  * [[https://github.com/SinclairCoder/Instruction-Tuning-Papers|Instruction-Tuning Papers]]
  * [[https://arxiv.org/abs/2308.10792|Zhang et al 2023 - Instruction Tuning for Large Language Models: A Survey]]

===== Papers =====

  * [[https://aclanthology.org/2022.acl-long.244.pdf|Mishra et al 2021 - Cross-Task Generalization via Natural Language Crowdsourcing Instructions]]
  * [[https://arxiv.org/pdf/2109.01652.pdf|Wei et al 2021 - Finetuned Language Models Are Zero-Shot Learners]]
  * Multitask Prompted Training Enables Zero-Shot Task Generalization
  * ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
  * InstructGPT paper: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback]] This is essentially inverse reinforcement learning (such as [[https://www.ri.cmu.edu/pub_files/2009/7/learch.pdf|this]]) applied to LMs
  * [[https://arxiv.org/pdf/2204.07705.pdf|Wang et al 2022 - Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks]]
  * FLAN-T5: [[https://arxiv.org/pdf/2210.11416.pdf|Chung et al 2022 - Scaling Instruction-Finetuned Language Models]]
  * **[[https://arxiv.org/pdf/2212.09689.pdf|Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor]]** Problem with this paper: it may be extracting instructions that were used to train davinci-002, so it is indirectly reusing the human labor that went into creating those instructions.
  * [[https://arxiv.org/pdf/2301.13688.pdf|Longpre et al 2023 - The Flan Collection: Designing Data and Methods for Effective Instruction Tuning]] Two-column version [[https://openreview.net/pdf?id=ZX4uS605XV|here]]
  * [[https://arxiv.org/pdf/2305.11206.pdf|Zhou et al 2023 - LIMA: Less Is More for Alignment]] Demonstrates that strong performance can be achieved by fine-tuning on only 1,000 carefully curated training examples.
  * **[[https://aclanthology.org/2023.acl-long.754.pdf|Wang et al 2023 - Self-Instruct: Aligning Language Models with Self-Generated Instructions]]**
  * RSO: [[https://arxiv.org/pdf/2309.06657|Liu et al 2023 - Statistical Rejection Sampling Improves Preference Optimization]] Uses rejection sampling with CE loss: sample outputs, accept or reject them based on the reward, then fine-tune on the accepted ones with CE loss. Very principled and easy to implement. Reports a benefit over DPO from using a reward model.
  * [[https://arxiv.org/pdf/2310.00492.pdf|Wu et al 2023 - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning]]
  * **[[https://arxiv.org/pdf/2310.05910.pdf|Sun et al 2023 - SALMON: Self-Alignment with Principle-Following Reward Models]]**
  * DPO: [[https://arxiv.org/pdf/2305.18290.pdf|Rafailov et al 2023 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model]]
  * [[https://arxiv.org/pdf/2311.09677|Zhang et al 2023 - R-Tuning: Instructing Large Language Models to Say `I Don't Know']]
  * [[https://arxiv.org/pdf/2312.11456|Xiong et al 2023 - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint]] Notes that "it is also common to query human feedback during the training process. For instance, Bai et al. (2022); Touvron et al. (2023) typically iterate the RLHF process on a weekly cadence, where the fresh RLHF models are deployed to interact with crowdworkers and to collect new human preference data."
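The RSO recipe summarized above (sample outputs, accept or reject by reward, fine-tune on the accepted ones with CE loss) can be sketched as follows. This is a simplified illustration, not the paper's exact algorithm: `generate` and `reward` are hypothetical stand-ins for a policy sampler and a trained reward model, and a fixed reward threshold replaces RSO's statistical accept/reject rule.

```python
def rejection_sample(prompt, generate, reward, n_candidates=8, threshold=0.0):
    """Rejection sampling, simplified: draw candidate completions from the
    current policy and keep those whose reward clears a threshold.

    RSO proper accepts/rejects samples so that the kept set approximates the
    reward-optimal policy; this fixed-threshold filter is a crude stand-in.
    """
    candidates = [generate(prompt) for _ in range(n_candidates)]
    accepted = [(prompt, c) for c in candidates if reward(prompt, c) >= threshold]
    # The accepted (prompt, completion) pairs are then used for ordinary
    # supervised fine-tuning with cross-entropy loss on the completion tokens.
    return accepted
```

With a deterministic toy sampler and a reward that prefers the correct answer, only the correct completions survive the filter and would reach the CE fine-tuning stage.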
  * [[https://arxiv.org/pdf/2402.01306|Ethayarajh et al 2024 - KTO: Model Alignment as Prospect Theoretic Optimization]]
  * **[[https://aclanthology.org/2024.acl-long.662.pdf|Ahmadian et al 2024 - Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs]]** [[https://arxiv.org/pdf/2402.14740|arXiv version]] Argues that "PPO is not the right tool for doing RL in RLHF" and that "PPO is unnecessarily complicated for a pre-trained LLM environment."
  * [[https://arxiv.org/pdf/2404.10719|Xu et al 2024 - Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study]]
  * [[https://arxiv.org/pdf/2404.09656|Gorbatovski et al 2024 - Learn Your Reference Model for Real Good Alignment]] Shows that you can update the reference model even in DPO (making DPO similar to PPO)
  * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]]
  * Group Relative Policy Optimization (GRPO): [[https://arxiv.org/pdf/2402.03300|Shao et al 2024 - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models]]
  * [[https://arxiv.org/pdf/2505.21178|Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning]] Gives a nice overview of problems with GRPO, and some extensions
  * [[https://arxiv.org/pdf/2403.00409|Chowdhury 2024 - Provably Robust DPO: Aligning Language Models with Noisy Feedback]]
  * [[https://arxiv.org/pdf/2403.07691|Hong et al 2024 - ORPO: Monolithic Preference Optimization without Reference Model]] Similar to SimPO, below
  * [[https://arxiv.org/pdf/2405.14734|Meng et al 2024 - SimPO: Simple Preference Optimization with a Reference-Free Reward]] Similar to ORPO, above
  * [[https://arxiv.org/pdf/2406.09279|Ivison et al 2024 - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback]]
  * [[https://arxiv.org/pdf/2407.18248|Wang et al 2024 - Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning]] Does iterative DPO training; [[https://arxiv.org/pdf/2407.21783|Llama 3.1]] does this as well (see post-training section 4, Figure 7)
  * [[https://arxiv.org/pdf/2401.10020|Yuan et al 2024 - Self-Rewarding Language Models]] Starting from a seed instruction-tuned model, generates additional instruction-tuning data
  * [[https://arxiv.org/pdf/2505.20809|Wu et al 2025 - Improved Representation Steering for Language Models]] Called steering, but actually instruction tuning
  * **Multi-Dimensional Rewards**
    * [[https://arxiv.org/pdf/2311.09528|Wang et al 2023 - HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM]] A very high-quality dataset (10k examples) that outperforms much larger (700K-example) datasets of lower quality.
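GRPO (Shao et al 2024, listed above) replaces PPO's learned value-function baseline by normalizing each completion's reward against the other completions sampled for the same prompt. A minimal sketch of just that group-relative advantage computation, assuming scalar rewards; the function name is illustrative and the clipped policy-gradient update is omitted:

```python
def group_relative_advantages(rewards):
    """GRPO-style advantages for one prompt's group of sampled completions:
    standardize each scalar reward against the group mean and std, so each
    completion is scored relative to its siblings rather than a critic."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:  # all rewards equal: no learning signal for this group
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

For example, a group with rewards `[1.0, 0.0, 1.0, 0.0]` yields advantages `[1.0, -1.0, 1.0, -1.0]`: the correct completions are pushed up and the incorrect ones pushed down, with no value network involved.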
    * [[https://arxiv.org/pdf/2402.18571|Wang et al 2024 - Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards]]
  * **Analyzing, Filtering, or Improving Preference Data**
    * [[https://arxiv.org/pdf/2505.23114|Lee et al 2025 - Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data]] Applies dataset cartography ([[https://arxiv.org/pdf/2009.10795|Swayamdipta 2020]]) to preference data

===== Datasets =====

  * Alpaca: [[https://tatsu-lab.github.io/alpaca_eval/|Leaderboard]]
  * LIMA
  * Super-NaturalInstructions: [[https://instructions.apps.allenai.org/|website]]
  * Links to (almost) all instruction tuning datasets: [[https://github.com/raunak-agarwal/instruction-datasets|website]]
  * Aya: [[https://arxiv.org/abs/2402.06619|paper]]
  * ShareGPT: [[https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered|HuggingFace]]
  * OpenHermes: [[https://huggingface.co/datasets/teknium/openhermes|HuggingFace]]
  * Tulu (this one is really good): [[https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture|HuggingFace]]

===== Models =====

  * FLAN-T5
  * Alpaca
  * LIMA

===== People =====

  * [[https://scholar.google.com/citations?user=-7LK2SwAAAAJ&hl=en|Swaroop Mishra]]

===== Related Pages =====

  * [[Alignment]]
  * [[Human-In-The-Loop]]
  * [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
  * [[human-in-the-loop#RLHF]]