Human-In-The-Loop, RLHF and Interactive Methods
Overviews
See also Interactive NLP Workshop - References and Awesome RLHF
- Blog posts
General Papers
- Interactive AI Model Debugging and Correction (2022 Thesis) (talk)
- VAL: Interactive Task Learning with GPT Dialog Parsing (published at an HCI conference)
Classification
Semantic Parsing
Machine Translation
Evaluation Tasks
RLHF
RLHF: Reinforcement Learning from Human Feedback. Read literally, this term would cover any method of reinforcement learning that uses human feedback, but in practice RLHF usually refers to a specific recipe: fit a reward model to human preference data, then optimize a language model against it with RL. For a quick overview, see section 3 of Rafailov 2023.
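The first step of the recipe, fitting the reward model, is usually framed as a Bradley-Terry pairwise preference loss: given a chosen and a rejected response, minimize -log sigmoid(r_chosen - r_rejected). A minimal sketch (function name is illustrative, not from any of the papers above):

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood that the chosen
    response outranks the rejected one: -log sigmoid(r_c - r_r)."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At equal rewards the model is indifferent: loss = log 2 ~ 0.693.
# As the margin grows in favor of the chosen response, the loss shrinks.
print(preference_loss(0.0, 0.0))
print(preference_loss(2.0, 0.0))
```

In the real setting r_chosen and r_rejected come from a learned scalar reward head, and the loss is averaged over a dataset of human preference pairs.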
Overviews
- Quick overview: section 3 of Rafailov 2023 or section 1.1 of Chowdhury 2024
- Chaudhari et al 2024 - RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs Really good explanation of PPO in section 6 (why each piece is necessary)
- Lambert - Reinforcement Learning from Human Feedback Nathan Lambert's RLHF book; very good website
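The overviews above all describe the same optimization stage: the policy is trained (e.g. with PPO) to maximize the reward-model score minus a KL penalty that keeps it close to the reference (SFT) model. A minimal sketch of the per-sequence reward; the function name and the default beta are illustrative:

```python
def kl_penalized_reward(rm_score: float,
                        logprob_policy: float,
                        logprob_ref: float,
                        beta: float = 0.1) -> float:
    """Per-sequence RLHF reward: reward-model score minus a
    KL-style penalty, beta * (log pi(y|x) - log pi_ref(y|x)),
    discouraging the policy from drifting off the reference model."""
    kl_term = logprob_policy - logprob_ref
    return rm_score - beta * kl_term
```

If the policy assigns its sample a higher log-probability than the reference model does (it has drifted toward that output), the penalty reduces the effective reward; when the two models agree, the reward-model score passes through unchanged.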
Papers
- Christiano et al 2017 - Deep Reinforcement Learning from Human Preferences Introduced the setup of RLHF
- Ziegler et al 2019 - Fine-Tuning Language Models from Human Preferences Early RLHF paper, before it was called RLHF
- Askell et al 2021 - A General Language Assistant as a Laboratory for Alignment Introduced the acronym RLHF
- InstructGPT: Ouyang et al 2022 - Training language models to follow instructions with human feedback Essentially inverse reinforcement learning applied to LMs. Background papers:
Crowdsourcing & Data Collection
Conferences and Workshops
People
Related Pages
nlp/human-in-the-loop.txt · Last modified: 2025/05/31 07:43 by jmflanig