Instruction Tuning
Overviews
Papers
- Multitask Prompted Training Enables Zero-Shot Task Generalization
- ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
- InstructGPT: Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback. This is essentially inverse reinforcement learning (such as this) applied to LMs.
- Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. Caveat: this method may be extracting instructions that were used to train davinci-002, so it is indirectly reusing the human labor that went into creating the davinci-002 instructions.
- Zhou et al 2023 - LIMA: Less Is More for Alignment. Demonstrates that strong performance can be achieved by fine-tuning on only 1,000 carefully curated training examples.
- RSO: Liu et al 2023 - Statistical Rejection Sampling Improves Preference Optimization. Uses rejection sampling with CE loss: sample outputs, accept or reject them based on the reward, then fine-tune on the accepted ones using CE loss. Very principled and easy to implement. Reports a benefit over DPO from using an explicit reward model.
- Xiong et al 2023 - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint. Notes that “it is also common to query human feedback during the training process. For instance, Bai et al. (2022); Touvron et al. (2023) typically iterate the RLHF process on a weekly cadence, where the fresh RLHF models are deployed to interact with crowdworkers and to collect new human preference data.”
- Ahmadian et al 2024 - Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs (arXiv version). Argues that “PPO is not the right tool for doing RL in RLHF” and that “PPO is unnecessarily complicated for a pre-trained LLM environment.”
- Gorbatovski et al 2024 - Learn Your Reference Model for Real Good Alignment. Shows that you can update the reference model even in DPO (making DPO more similar to PPO).
- Group Relative Policy Optimization (GRPO): Shao et al 2024 - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning. Gives a nice overview of problems with GRPO, along with some extensions.
- Hong et al 2024 - ORPO: Monolithic Preference Optimization without Reference Model. Similar to SimPO, below.
- Meng et al 2024 - SimPO: Simple Preference Optimization with a Reference-Free Reward. Similar to ORPO, above.
- Wang et al 2024 - Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning. Does iterative DPO training; Llama 3.1 does this as well (see post-training section 4, Figure 7).
- Yuan et al 2024 - Self-Rewarding Language Models. Starting from a seed instruction-tuned model, can create additional instruction-tuning data.
- Wu et al 2025 - Improved Representation Steering for Language Models. Called steering, but it is actually instruction tuning.
- Multi-Dimensional Rewards
- Wang et al 2023 - HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM. A very high-quality 10K-example dataset; training on it beats training on much larger (700K-example) datasets of lower quality.
- Analyzing, Filtering, or Improving Preference Data
- Lee et al 2025 - Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data. Applies dataset cartography (Swayamdipta et al. 2020) to preference data.
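The RSO entry above describes a simple recipe: sample outputs, accept or reject them based on the reward, then fine-tune on the accepted ones with CE loss. A minimal sketch of the sample-and-filter step, with toy stand-in `sample_fn`/`reward_fn` (the real method samples from an LLM policy and scores with a learned reward model):

```python
def rso_accept(prompts, sample_fn, reward_fn, num_samples=4, threshold=0.0):
    """Rejection-sampling step in the style of RSO (Liu et al. 2023):
    draw candidate responses per prompt, keep those whose reward clears
    the threshold, and return the accepted (prompt, response) pairs
    for ordinary cross-entropy fine-tuning."""
    accepted = []
    for prompt in prompts:
        for _ in range(num_samples):
            response = sample_fn(prompt)                # sample from the policy
            if reward_fn(prompt, response) > threshold:  # accept high-reward samples
                accepted.append((prompt, response))
    return accepted
```

The accepted pairs are then used as targets for standard supervised fine-tuning, which is what makes the method easy to implement on top of an existing SFT pipeline.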
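The core idea of GRPO (Shao et al 2024, above) is to replace a learned value baseline with group-relative rewards: sample a group of responses for each prompt and standardize each response's reward against the group. A sketch of just that advantage computation (a simplification; the full objective also includes PPO-style clipping and a KL penalty):

```python
def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: each response's advantage is
    its reward standardized against the mean and std of the rewards of all
    responses sampled for the same prompt, so no value network is needed."""
    n = len(group_rewards)
    mean = sum(group_rewards) / n
    var = sum((r - mean) ** 2 for r in group_rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Responses that beat the group average get positive advantages and are reinforced; below-average responses are pushed down, with the group itself acting as the baseline.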
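SimPO's reference-free reward (Meng et al 2024, above) is the length-normalized average log-probability of a response under the policy itself, plugged into a Bradley-Terry-style loss with a target margin. A toy version of the per-pair loss; the `beta` and `gamma` defaults here are illustrative, not the paper's tuned values:

```python
import math

def simpo_loss(avg_logp_chosen, avg_logp_rejected, beta=2.0, gamma=0.5):
    """SimPO-style pairwise loss: the implicit reward of a response is its
    average (length-normalized) token log-probability under the policy,
    with no reference model; the loss pushes the chosen response's reward
    above the rejected one's by at least a margin gamma."""
    margin = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
```

Dropping the reference model is what distinguishes SimPO (and ORPO) from DPO: the loss depends only on the policy's own log-probabilities of the chosen and rejected responses.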
Datasets
- Alpaca: Leaderboard
- LIMA
- Super-NaturalInstructions: website
- Links to (almost) all instruction tuning datasets website
- Aya paper
- ShareGPT: HuggingFace
- OpenHermes: HuggingFace
- Tulu (this one is really good): HuggingFace
Models
- FLAN-T5
- Alpaca
- LIMA
People
Related Pages
nlp/instruction-tuning.txt · Last modified: 2025/06/01 22:58 by jmflanig