===== Papers =====
  * [[https://arxiv.org/pdf/2309.16583|Zheng et al 2023 - GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond]]
  * Group Relative Policy Optimization (GRPO): [[https://arxiv.org/pdf/2402.03300|Shao et al 2024 - DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models]] (a sketch of the group-normalized advantage appears after this list)
    * [[https://arxiv.org/pdf/2505.21178|Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning]] Gives a nice overview of problems with GRPO, and some extensions
  * [[https://arxiv.org/pdf/2403.00409|Chowdhury 2024 - Provably Robust DPO: Aligning Language Models with Noisy Feedback]] (the standard DPO loss that this and the following variants build on is sketched after this list)
  * [[https://arxiv.org/pdf/2403.07691|Hong et al 2024 - ORPO: Monolithic Preference Optimization without Reference Model]] Similar to SimPO, below
  * [[https://arxiv.org/pdf/2407.18248|Wang et al 2024 - Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning]] Does iterative DPO training; [[https://arxiv.org/pdf/2407.21783|Llama 3.1]] does this as well (see post-training section 4, Figure 7)
  * [[https://arxiv.org/pdf/2401.10020|Yuan et al 2024 - Self-Rewarding Language Models]] From a seed instruction-tuned model, can create more instruction tuning data
  * [[https://arxiv.org/pdf/2505.20809|Wu et al 2025 - Improved Representation Steering for Language Models]] Called steering, but actually instruction tuning
  * **Multi-Dimensional Rewards**
    * [[https://arxiv.org/pdf/2311.09528|Wang et al 2023 - HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM]] Very high quality dataset (10k examples); outperforms much larger (700K-example) datasets of lower quality
    * [[https://arxiv.org/pdf/2402.18571|Wang et al 2024 - Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards]]
  * **Analyzing, Filtering, or Improving Preference Data**
    * [[https://arxiv.org/pdf/2505.23114|Lee et al 2025 - Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data]] Applies dataset cartography ([[https://arxiv.org/pdf/2009.10795|Swayamdipta 2020]]) to preference data (a sketch of the cartography statistics appears after this list)
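
The GRPO entry above refers to the group-normalized advantage from the DeepSeekMath paper: for each prompt, G completions are sampled, and each completion's advantage is its reward standardized against the group's mean and standard deviation, which removes the need for a learned value function. A minimal sketch of just that normalization step (variable names are my own; the full method also optimizes a PPO-style clipped objective with a KL penalty against a reference model):

<code python>
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages as in DeepSeekMath (Shao et al 2024).

    rewards: shape (G,), scalar rewards for G completions sampled for the
    *same* prompt. Each completion's advantage is its reward standardized
    against the group statistics, so no learned critic is needed.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions of one prompt, scored by a reward model or verifier.
advantages = grpo_advantages(torch.tensor([1.0, 0.0, 0.0, 1.0]))
# Completions scoring above the group mean get positive advantage,
# those below get negative advantage.
</code>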
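Several of the entries above (Robust DPO, ORPO, iterative DPO) modify the standard DPO loss of [[https://arxiv.org/pdf/2305.18290|Rafailov et al 2023]], so a compact sketch of that baseline may be useful for comparison. It assumes per-sequence log-probabilities for the chosen and rejected responses have already been computed under the policy and a frozen reference model:

<code python>
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss (Rafailov et al 2023).

    Each argument is a (batch,) tensor of summed token log-probs for a
    response under the policy or the frozen reference model. beta scales
    the implicit reward; the loss pushes the policy to prefer the chosen
    response more strongly than the reference model does.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (pi_logratios - ref_logratios)).mean()
</code>

ORPO and SimPO can be read against this baseline: both drop the reference-model terms, ORPO by adding an odds-ratio penalty to the supervised loss and SimPO by using a length-normalized policy log-probability as the implicit reward.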
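For the dataset cartography entry above: the original method ([[https://arxiv.org/pdf/2009.10795|Swayamdipta 2020]]) tracks each training example's gold-label probability across epochs and summarizes it with two statistics, confidence (the mean) and variability (the standard deviation), which separate easy-to-learn, hard-to-learn, and ambiguous examples. A sketch of those two statistics, applied here to the probability that the chosen response beats the rejected one (one plausible adaptation to preference pairs; Lee et al 2025 may define their map differently):

<code python>
import torch

def cartography_stats(probs_over_epochs: torch.Tensor):
    """Per-example cartography statistics (Swayamdipta et al 2020).

    probs_over_epochs: shape (epochs, n_examples). For preference data,
    entry [e, i] could be the model's probability at epoch e that example
    i's chosen response is preferred over the rejected one, e.g.
    sigmoid(beta * reward margin) -- an assumed adaptation, not
    necessarily Lee et al's exact formulation.
    """
    confidence = probs_over_epochs.mean(dim=0)   # high = easy-to-learn
    variability = probs_over_epochs.std(dim=0)   # high = ambiguous
    return confidence, variability
</code>

Low-confidence, low-variability examples are candidates for label noise, which connects this line of work to the Robust DPO entry above.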
  
===== Datasets =====
===== Related =====
  * [[Alignment]]
  * [[Human-In-The-Loop]]
  * [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
  * [[human-in-the-loop#RLHF]]
  