=== RLHF ===
  * [[https://arxiv.org/pdf/2307.04964|2023 - Secrets of RLHF in Large Language Models Part I: PPO]]
  * [[https://arxiv.org/pdf/2401.06080|Wang et al 2024 - Secrets of RLHF in Large Language Models Part II: Reward Modeling]]
  * [[https://arxiv.org/pdf/2404.08555|Chaudhari et al 2024 - RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs]] Really good explanation of PPO in section 6 (why each piece is necessary); see the sketch after this list
  * [[https://arxiv.org/pdf/2406.11191|Jiang et al 2024 - A Survey on Human Preference Learning for Large Language Models]]
  * [[https://rlhfbook.com/book.pdf|Lambert - Reinforcement Learning from Human Feedback]] Nathan's RLHF book. Very good [[https://rlhfbook.com/|website]]
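
Since several of the entries above center on PPO (Part I of the "Secrets of RLHF" series, and section 6 of Chaudhari et al), here is a minimal PyTorch sketch of the two pieces those papers dissect: the clipped surrogate loss and the per-token KL penalty against the frozen SFT reference model. Function names and the defaults for ''clip_eps'' and ''beta'' are illustrative assumptions, not taken from any cited paper.

<code python>
import torch

def ppo_clip_loss(logprobs_new, logprobs_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss, computed per token and averaged.

    logprobs_new: log pi_theta(a_t | s_t) under the current policy
    logprobs_old: log-probs from the rollout (behavior) policy, detached
    advantages:   advantage estimates A_t, e.g. from GAE
    """
    ratio = torch.exp(logprobs_new - logprobs_old)  # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic (elementwise minimum) objective; negate for gradient descent.
    return -torch.min(unclipped, clipped).mean()

def shaped_rewards(rm_score, logprobs_policy, logprobs_ref, beta=0.1):
    """Per-token KL penalty toward the SFT reference model, with the scalar
    reward-model score added at the final token of each sequence."""
    kl = logprobs_policy - logprobs_ref  # per-token KL estimate (log ratio)
    rewards = -beta * kl
    rewards[..., -1] += rm_score         # RM score only at sequence end
    return rewards
</code>

The clip keeps each policy update close to the rollout policy, while the KL term keeps the policy as a whole close to the SFT model; the papers above discuss why removing either piece destabilizes training.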
  * [[https://arxiv.org/pdf/2306.01693.pdf|Wu et al 2023 - Fine-Grained Human Feedback Gives Better Rewards for Language Model Training]] (sketch of the fine-grained reward idea below)
  * [[https://arxiv.org/pdf/2410.04612|Gao et al 2024 - Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF]]
  * [[https://arxiv.org/pdf/2505.22338|Wang et al 2025 - Text2Grad: Reinforcement Learning from Natural Language Feedback]]
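
The Wu et al paper above swaps the single sequence-level reward for segment-level scores from multiple reward models (e.g. one per error category), combined with per-type weights and credited at each segment's final token. A small sketch of that general idea; the function name and toy inputs here are hypothetical:

<code python>
import torch

def fine_grained_rewards(seq_len, spans, scores_per_type, weights):
    """Turn segment-level scores into a dense per-token reward vector.

    spans:           (start, end) token spans, one per text segment
    scores_per_type: one list of per-segment scores for each reward type
    weights:         scalar weight for each reward type
    """
    rewards = torch.zeros(seq_len)
    for scores, w in zip(scores_per_type, weights):
        for (_, end), s in zip(spans, scores):
            # Credit each segment's score at that segment's last token.
            rewards[end - 1] += w * s
    return rewards

# Toy usage: two segments, two reward types (say, relevance and factuality).
dense = fine_grained_rewards(
    seq_len=10,
    spans=[(0, 4), (4, 10)],
    scores_per_type=[[1.0, -0.5], [0.2, 0.8]],
    weights=[1.0, 0.5],
)
</code>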
  
=== Crowdsourcing & Data Collection ===