Instruction Tuning
- Overviews
- Papers
- Datasets
- Models
- People
- Related Pages

Instruction Tuning

Overviews

Papers

Mishra et al 2021 - Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Wei et al 2021 - Finetuned Language Models Are Zero-Shot Learners
Sanh et al 2021 - Multitask Prompted Training Enables Zero-Shot Task Generalization
Xu et al 2022 - ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
InstructGPT: Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback. This is essentially inverse reinforcement learning applied to LMs.
Wang et al 2022 - Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
FLAN-T5: Chung et al 2022 - Scaling Instruction-Finetuned Language Models
Honovich et al 2022 - Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. Problem with this paper: it might be extracting instructions that were used to train davinci-002, in which case it is actually relying on the human labor that went into creating the davinci-002 instructions.
Longpre et al 2023 - The Flan Collection: Designing Data and Methods for Effective Instruction Tuning (a two-column version is also available)
Zhou et al 2023 - LIMA: Less Is More for Alignment. Demonstrates that strong performance can be achieved by fine-tuning on 1,000 carefully curated training examples.
Wang et al 2023 - Self-Instruct: Aligning Language Models with Self-Generated Instructions
RSO: Liu et al 2023 - Statistical Rejection Sampling Improves Preference Optimization. Uses rejection sampling with a cross-entropy (CE) loss: sample outputs, accept or reject them based on the reward, then fine-tune on the accepted ones using CE loss. Very principled and easy to implement. Reports a benefit over DPO from using a reward model.
Wu et al 2023 - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Sun et al 2023 - SALMON: Self-Alignment with Principle-Following Reward Models
DPO: Rafailov et al 2023 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Zhang et al 2023 - R-Tuning: Instructing Large Language Models to Say ‘I Don’t Know’
Xiong et al 2023 - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint. Notes that “it is also common to query human feedback during the training process. For instance, Bai et al. (2022); Touvron et al. (2023) typically iterate the RLHF process on a weekly cadence, where the fresh RLHF models are deployed to interact with crowdworkers and to collect new human preference data.”
Ethayarajh et al 2024 - KTO: Model Alignment as Prospect Theoretic Optimization
Ahmadian et al 2024 - Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs (arXiv version). Shows that “PPO is not the right tool for doing RL in RLHF” and that “PPO is unnecessarily complicated for a pre-trained LLM environment.”
Xu et al 2024 - Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Gorbatovski et al 2024 - Learn Your Reference Model for Real Good Alignment. Says you can update the reference model even in DPO (making DPO similar to PPO).
Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
Group Relative Policy Optimization (GRPO): Shao et al 2024 - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning. Gives a nice overview of problems with GRPO and of some extensions.
Chowdhury et al 2024 - Provably Robust DPO: Aligning Language Models with Noisy Feedback
Hong et al 2024 - ORPO: Monolithic Preference Optimization without Reference Model. Similar to SimPO, below.
Meng et al 2024 - SimPO: Simple Preference Optimization with a Reference-Free Reward. Similar to ORPO, above.
Ivison et al 2024 - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Wang et al 2024 - Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning. Does iterative DPO training; Llama 3.1 does this as well (see post-training section 4, Figure 7).
Yuan et al 2024 - Self-Rewarding Language Models. Starting from a seed instruction-tuned model, the model can create more instruction tuning data.
Wu et al 2025 - Improved Representation Steering for Language Models. Called steering, but it is actually instruction tuning.
Multi-Dimensional Rewards
- Wang et al 2023 - HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM. Very high-quality dataset (10k examples); better than lower-quality datasets with 700K examples.
- Wang et al 2024 - Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards
Analyzing, Filtering, or Improving Preference Data
- Lee et al 2025 - Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data. Applies dataset cartography (Swayamdipta et al 2020) to preference data.
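
The accept/reject step described in the RSO entry above (sample outputs, keep or drop each one based on its reward, then fine-tune on what was kept) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `rejection_sample`, the temperature `beta`, and the acceptance probability exp((r - r_max) / beta) are choices made for this sketch, and the rewards would come from a reward model in practice.

```python
import math
import random


def rejection_sample(candidates, rewards, beta=0.5, num_keep=2, seed=0):
    """Select up to num_keep candidates by reward-based rejection sampling.

    Assumed acceptance rule (illustrative): a candidate with reward r is
    accepted with probability exp((r - r_max) / beta), where r_max is the
    highest reward among the candidates still in the pool.
    """
    rng = random.Random(seed)
    pool = list(zip(candidates, rewards))
    accepted = []
    while pool and len(accepted) < num_keep:
        r_max = max(r for _, r in pool)
        remaining = []
        for text, r in pool:
            accept_prob = math.exp((r - r_max) / beta)
            if len(accepted) < num_keep and rng.random() < accept_prob:
                accepted.append(text)  # kept as a fine-tuning target
            else:
                remaining.append((text, r))  # may be retried next pass
        pool = remaining
    return accepted


# Hypothetical sampled responses and reward-model scores for one prompt.
responses = ["answer A", "answer B", "answer C"]
scores = [0.1, 2.0, 0.5]
kept = rejection_sample(responses, scores, beta=0.5, num_keep=2)
```

The accepted responses would then be paired with their prompt and used as targets for ordinary CE-loss fine-tuning. A small `beta` makes the selection close to picking the highest-reward outputs, while a large `beta` accepts almost everything.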

Datasets

Alpaca: Leaderboard
LIMA
Super-NaturalInstructions: website
Links to (almost) all instruction tuning datasets: website
Aya: paper
ShareGPT: HuggingFace
OpenHermes: HuggingFace
Tulu (this one is really good): HuggingFace

Models

FLAN-T5
Alpaca
LIMA

People

Swaroop Mishra
