====== Human-In-The-Loop, RLHF and Interactive Methods ======

===== Overviews =====

See also [[https://sites.google.com/view/internlp2021/background-references?authuser=0|Interactive NLP Workshop - References]] and [[https://github.com/opendilab/awesome-RLHF|Awesome RLHF]].

  * [[https://arxiv.org/pdf/2103.04044.pdf|Wang et al 2021 - Putting Humans in the Natural Language Processing Loop: A Survey]]
  * Blog posts
    * [[https://huggingface.co/blog/rlhf|HuggingFace - Illustrating Reinforcement Learning from Human Feedback (RLHF)]]

===== General Papers =====

  * [[https://aclanthology.org/2022.acl-long.230.pdf|Ribeiro & Lundberg 2022 - Adaptive Testing and Debugging of NLP Models]]
  * Interactive AI Model Debugging and Correction (2022 thesis) ([[https://www.cs.cmu.edu/~sherryw/assets/talks/2022-interactive-ai-debugging.pdf|talk]])
  * InstructGPT: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback]]
  * [[https://arxiv.org/pdf/2310.13522|Yu et al 2023 - Teaching Language Models to Self-Improve through Interactive Demonstrations]]
  * **[[https://arxiv.org/pdf/2310.01627|VAL: Interactive Task Learning with GPT Dialog Parsing]]** Published at an HCI conference

==== Classification ====

  * [[https://www.aclweb.org/anthology/2020.emnlp-main.24.pdf|Lertvittayakumjorn et al 2020 - FIND: Human-in-the-Loop Debugging Deep Text Classifiers]]

==== Semantic Parsing ====

  * [[https://arxiv.org/pdf/1704.06956.pdf|Wang et al 2017 - Naturalizing a Programming Language via Interactive Learning]]
  * [[https://arxiv.org/pdf/2010.05190.pdf|Karamcheti et al 2020 - Learning Adaptive Language Interfaces through Decomposition]]
  * [[https://arxiv.org/pdf/2110.08345.pdf|Mo et al 2021 - Towards Transparent Interactive Semantic Parsing via Step-by-Step Correction]]

==== Machine Translation ====

  * [[https://arxiv.org/pdf/1807.11243.pdf|Peris & Casacuberta 2018 - Active Learning for Interactive Neural Machine Translation of Data Streams]]
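A minimal sketch of the escalation loop that many of the interactive methods above share: the model answers on its own when confident and routes uncertain cases to a human, whose labels can also be kept for later retraining. All names here (`human_in_the_loop`, `ask_human`, the toy model) are illustrative stand-ins, not an API from any of the papers.

```python
def human_in_the_loop(model, examples, ask_human, threshold=0.8):
    """Label `examples`, escalating low-confidence cases to a human.

    `model(x)` returns a (label, confidence) pair; `ask_human(x)`
    returns a trusted label. Both are placeholders for a real
    classifier and a real annotation interface.
    """
    labels, escalated = [], []
    for x in examples:
        label, confidence = model(x)
        if confidence < threshold:
            label = ask_human(x)          # human overrides the model
            escalated.append((x, label))  # keep for retraining later
        labels.append(label)
    return labels, escalated

# Toy usage: a "model" that is only confident about one phrase.
model = lambda x: ("positive", 0.95) if "great" in x else ("negative", 0.55)
labels, escalated = human_in_the_loop(
    model,
    ["great movie", "not sure about this one"],
    ask_human=lambda x: "neutral",
)
# labels == ["positive", "neutral"]; the second example was escalated
```

The interesting design choices in real systems are what counts as "uncertain" (the threshold here) and what happens with the escalated examples afterwards, e.g. active-learning-style retraining as in Peris & Casacuberta 2018.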
==== Evaluation Tasks ====

  * [[https://arxiv.org/pdf/2101.06561.pdf|Khashabi et al 2021 - GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation]]

===== RLHF =====

RLHF: Reinforcement Learning from Human Feedback. Read literally, the term would cover any reinforcement learning driven by human feedback, but in practice it refers to a specific recipe: fit a reward model to human preference comparisons, then optimize the language model against that reward model with RL (typically PPO).

=== Overviews ===

  * Quick overview: section 3 of [[https://arxiv.org/pdf/2305.18290.pdf|Rafailov 2023]] or section 1.1 of [[https://arxiv.org/pdf/2403.00409|Chowdhury 2024]]
  * [[https://lilianweng.github.io/posts/2018-04-08-policy-gradient/|Lil'Log 2018 - Policy Gradient Algorithms]]
  * [[https://arxiv.org/pdf/2312.14925.pdf|Kaufmann et al 2023 - A Survey of Reinforcement Learning from Human Feedback]]
  * [[https://arxiv.org/pdf/2307.04964|2023 - Secrets of RLHF in Large Language Models Part I: PPO]]
  * [[https://arxiv.org/pdf/2401.06080|Wang et al 2024 - Secrets of RLHF in Large Language Models Part II: Reward Modeling]]
  * [[https://arxiv.org/pdf/2404.08555|Chaudhari et al 2024 - RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs]] Section 6 gives a really good explanation of PPO and why each piece is necessary
  * [[https://arxiv.org/pdf/2406.11191|Jiang et al 2024 - A Survey on Human Preference Learning for Large Language Models]]
  * [[https://rlhfbook.com/book.pdf|Lambert - Reinforcement Learning from Human Feedback]] Nathan Lambert's RLHF book; the [[https://rlhfbook.com/|website]] version is very good

=== Papers ===

  * [[https://arxiv.org/pdf/1706.03741.pdf|Christiano et al 2017 - Deep Reinforcement Learning from Human Preferences]] Introduced the RLHF setup
  * [[https://arxiv.org/pdf/1909.08593.pdf|Ziegler et al 2019 - Fine-Tuning Language Models from Human Preferences]] Early RLHF paper, before it was called RLHF
  * [[https://arxiv.org/pdf/2009.01325.pdf|Stiennon et al 2020 - Learning to Summarize from Human Feedback]]
  * [[https://arxiv.org/pdf/2112.00861.pdf|Askell et al 2021 - A General Language Assistant as a Laboratory for Alignment]] Introduced the acronym RLHF
  * InstructGPT: [[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022 - Training Language Models to Follow Instructions with Human Feedback]] This is essentially inverse reinforcement learning (such as [[https://www.ri.cmu.edu/pub_files/2009/7/learch.pdf|this]]) applied to LMs. Background papers:
    * Used PPO: [[https://arxiv.org/pdf/1707.06347.pdf|Schulman et al 2017 - Proximal Policy Optimization Algorithms]]
    * [[https://arxiv.org/pdf/1909.08593|Ziegler et al 2019 - Fine-Tuning Language Models from Human Preferences]]
    * [[https://arxiv.org/pdf/2009.01325|Stiennon et al 2020 - Learning to Summarize from Human Feedback]]
  * [[https://arxiv.org/pdf/2204.05862.pdf|Bai et al 2022 - Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback]]
  * [[https://arxiv.org/pdf/2305.18290.pdf|Rafailov et al 2023 - Direct Preference Optimization: Your Language Model is Secretly a Reward Model]]
  * [[https://arxiv.org/pdf/2306.01693.pdf|Wu et al 2023 - Fine-Grained Human Feedback Gives Better Rewards for Language Model Training]]
  * [[https://arxiv.org/pdf/2410.04612|Gao et al 2024 - Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF]]
  * [[https://arxiv.org/pdf/2505.22338|Wang et al 2025 - Text2Grad: Reinforcement Learning from Natural Language Feedback]]

=== Crowdsourcing & Data Collection ===

  * [[https://www.surgehq.ai/|Surge AI]]

===== Conferences and Workshops =====

  * [[https://sites.google.com/view/internlp2021/home|InterNLP @ ACL 2021]]

===== People =====

  * [[https://scholar.google.com/citations?user=pouyVyUAAAAJ&hl=en|Percy Liang]]
  * [[https://scholar.google.com/citations?user=TPQVssUAAAAJ&hl=en|Sherry Tongshuang Wu]] ([[https://www.cs.cmu.edu/~sherryw/|homepage]])

===== Related Pages =====

  * [[HCI and NLP]]
  * [[Instruction-Tuning]]
  * [[Lifelong Learning]]
  * [[ml:Reinforcement Learning]]
  * [[ml:Self-Play]]