====== nlp:human-in-the-loop ======

  * InstructGPT: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback]]
  * [[https://arxiv.org/pdf/2310.13522|Yu et al 2023 - Teaching Language Models to Self-Improve through Interactive Demonstrations]]
  * **[[https://arxiv.org/pdf/2310.01627|VAL: Interactive Task Learning with GPT Dialog Parsing]]** Published at an HCI conference
  
==== Classification ====
  
=== Overviews ===
  * Quick overview: section 3 of [[https://arxiv.org/pdf/2305.18290.pdf|Rafailov 2023]] or section 1.1 of [[https://arxiv.org/pdf/2403.00409|Chowdhury 2024]]
  * [[https://lilianweng.github.io/posts/2018-04-08-policy-gradient/|Lil'Log 2018 - Policy Gradient Algorithms]]
  * [[https://arxiv.org/pdf/2312.14925.pdf|Kaufmann et al 2023 - A Survey of Reinforcement Learning from Human Feedback]]
  * [[https://arxiv.org/pdf/2307.04964|Zheng et al 2023 - Secrets of RLHF in Large Language Models Part I: PPO]]
  * [[https://arxiv.org/pdf/2401.06080|Wang et al 2024 - Secrets of RLHF in Large Language Models Part II: Reward Modeling]]
  * [[https://arxiv.org/pdf/2404.08555|Chaudhari et al 2024 - RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs]] Really good explanation of PPO in section 6 (why each piece is necessary); see the sketch after this list
  * [[https://arxiv.org/pdf/2406.11191|Jiang et al 2024 - A Survey on Human Preference Learning for Large Language Models]]
  * [[https://rlhfbook.com/book.pdf|Lambert - Reinforcement Learning from Human Feedback]] Nathan Lambert's RLHF book; the [[https://rlhfbook.com/|website]] is also very good
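Since several of these overviews center on PPO, a minimal sketch of the clipped surrogate objective from Schulman et al 2017 may help as a reference point. The function name and tensor shapes are illustrative assumptions, not code from any listed paper.

<code python>
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Illustrative sketch of PPO's clipped surrogate objective
    # (Schulman et al 2017, eq. 7); names and shapes are assumptions.
    # logp_new / logp_old: log-probs of the sampled tokens under the
    # current policy and the rollout (old) policy.
    # advantages: estimates of how much better each sampled token
    # was than the policy's expected value.
    ratio = torch.exp(logp_new - logp_old)                    # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Pessimistic (elementwise minimum) bound; negated because
    # optimizers minimize. In RLHF the per-token reward typically also
    # subtracts a KL penalty against a frozen SFT model (not shown).
    return -torch.min(unclipped, clipped).mean()
</code>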
  
=== Papers ===
  * [[https://arxiv.org/pdf/2009.01325.pdf|Stiennon et al 2020 - Learning to summarize from human feedback]]
  * [[https://arxiv.org/pdf/2112.00861.pdf|Askell et al 2021 - A General Language Assistant as a Laboratory for Alignment]] Introduced the acronym RLHF
  * InstructGPT: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training language models to follow instructions with human feedback]] This is essentially inverse-reinforcement learning (such as [[https://www.ri.cmu.edu/pub_files/2009/7/learch.pdf|this]]) applied to LMs; a sketch of the pairwise reward-model loss follows this list. Background papers:
    * Used PPO: [[https://arxiv.org/pdf/1707.06347.pdf|Schulman et al 2017 - Proximal Policy Optimization Algorithms]]
    * [[https://arxiv.org/pdf/1909.08593|Ziegler 2019 - Fine-Tuning Language Models from Human Preferences]]
    * [[https://arxiv.org/pdf/2009.01325|Stiennon et al 2020 - Learning to Summarize from Human Feedback]]
  * [[https://arxiv.org/pdf/2204.05862.pdf|Bai et al 2022 - Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback]]
  * [[https://arxiv.org/pdf/2305.18290.pdf|Rafailov et al 2023 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model]] See the DPO loss sketch after this list
  * [[https://arxiv.org/pdf/2306.01693.pdf|Wu et al 2023 - Fine-Grained Human Feedback Gives Better Rewards for Language Model Training]]
  * [[https://arxiv.org/pdf/2410.04612|Gao et al 2024 - Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF]]
  * [[https://arxiv.org/pdf/2505.22338|Wang et al 2025 - Text2Grad: Reinforcement Learning from Natural Language Feedback]]
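The pipeline shared by Stiennon et al 2020 and InstructGPT first fits a reward model on pairwise human preferences before any RL step. A minimal sketch of that Bradley-Terry-style loss is below; the function name and inputs are illustrative assumptions, not code from the papers.

<code python>
import torch.nn.functional as F

def reward_model_loss(r_chosen, r_rejected):
    # Pairwise (Bradley-Terry) preference loss used to fit reward
    # models in RLHF pipelines; names here are illustrative assumptions.
    # r_chosen / r_rejected: scalar rewards the model assigns to the
    # human-preferred and dispreferred responses for the same prompt.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
</code>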
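Rafailov et al's observation that the language model is "secretly a reward model" lets the separate reward model and RL step be replaced by one classification-style loss on preference pairs. A minimal sketch under assumed names (summed per-response log-probs, a frozen reference model) follows.

<code python>
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Sketch of the DPO objective (Rafailov et al 2023, eq. 7);
    # argument names are illustrative assumptions. Each input is the
    # summed log-prob of a whole response under the trainable policy
    # (logp_*) or the frozen reference/SFT model (ref_logp_*).
    chosen = beta * (logp_chosen - ref_logp_chosen)        # implicit reward of y_w
    rejected = beta * (logp_rejected - ref_logp_rejected)  # implicit reward of y_l
    # Maximize the log-odds that the preferred response wins.
    return -F.logsigmoid(chosen - rejected).mean()
</code>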
  
=== Crowdsourcing & Data Collection ===
  
===== Related Pages =====
  * [[HCI and NLP]]
  * [[Instruction-Tuning]]
  * [[Lifelong Learning]]