This is an old revision of the document!

Instruction Tuning

Overviews

Papers

Mishra et al 2021 - Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Wei et al 2021 - Finetuned Language Models Are Zero-Shot Learners
Multitask Prompted Training Enables Zero-Shot Task Generalization
ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization
InstructGPT paper: Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback This is essentially inverse-reinforcement learning (such as this) applied to LMs
Wong et al 2022 - Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
FLAN-T5: Chung et al 2022 - Scaling Instruction-Finetuned Language Models
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor - Problem with this paper: it might be extracting instructions that were used to train davinci-002, so it's actually using the human labor that was used to create the davinci-002 instructions.
Longpre et al 2023 - The Flan Collection: Designing Data and Methods for Effective Instruction Tuning Two-column version here
Zhou et al 2023 - LIMA: Less Is More for Alignment
Wang et al 2023 - Self-Instruct: Aligning Language Models with Self-Generated Instructions
RSO: Liu et al 2023 - Statistical Rejection Sampling Improves Preference Optimization Uses rejection sampling with CE loss. Sample outputs, and accept or reject them based on the reward. Then fine-tune on the accepted ones use CE loss. Very principled, easy to implement. Says they get a benefit over DPO by using a reward model.
Wu et al 2023 - From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Sun et al 2023 - SALMON: Self-Alignment with Principle-Following Reward Models
DPO: Rafailov et al 2023 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Zhang et al 2023 - R-Tuning: Instructing Large Language Models to Say `I Don't Know'
Xiong et al 2023 - Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-constraint Talks about “it is also common to query human feedback during the training process. For instance, Bai et al. (2022); Touvron et al. (2023) typically iterate the RLHF process on a weekly cadence, where the fresh RLHF models are deployed to interact with crowdworkers and to collect new human preference data.”
Ethayarajh et al 2024 - KTO: Model Alignment as Prospect Theoretic Optimization
Xu et al 2024 - Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study
Gorbatovski et al 2024 - Learn Your Reference Model for Real Good Alignment Says you can update the reference model even in DPO (making DPO similar to PPO)
Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
Group Relative Policy Optimization (GRPO): Shao et al 2024 - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Chowdhury 2024 - Provably Robust DPO: Aligning Language Models with Noisy Feedback
Hong et al 2024 - ORPO: Monolithic Preference Optimization without Reference Model Similar to SimPO, below
Meng et al 2024 - SimPO: Simple Preference Optimization with a Reference-Free Reward Similar to ORPO, above
Ivison et al 2024 - Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Wang et al 2024 - Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning Does iterative DPO training, Llama 3.1 does this as well (see post-training section 4, Figure 7)
Yuan et al 2024 - Self-Rewarding Language Models From a seed instruction-tuned model, can create more instruction tuning data
Multi-Dimensional Rewards
- Wang et al 2023 - HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM Very high quality dataset (10k examples), better than 700K datasets that are not as good.
- Wang et al 2024 - Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards

Datasets

Alpaca: Leaderboard
LIMA
Super-NaturalInstructions: website
Links to (almost) all instruction tuning datasets website
Aya paper
ShareGPT: HuggingFace
OpenHermes: HuggingFace
Tulu (this one is really good): HuggingFace

NLP Wiki

Table of Contents

Instruction Tuning

Overviews

Papers

Datasets

Models

People

Related Pages