  * [[https://
  * [[https://
  * [[https://

===== NLP RL Papers =====
(Some of the papers above should be moved to this section)

  * **Applied to Text Games**
    * [[https://

===== Reinforcement Learning with Verifiable Rewards =====
DeepSeek-R1-Zero-style reinforcement learning is sometimes called **"Reinforcement Learning with Verifiable Rewards"** (RLVR): rather than training a learned reward model, the reward is computed by a program that automatically checks the model's output, e.g. exact-match on a math answer or passing unit tests.

See also [[nlp:Large Reasoning Models]]

  * [[https://
  * [[https://

===== Datasets =====

===== Inverse Reinforcement Learning (IRL) =====
In inverse reinforcement learning (IRL), the agent learns the reward function by observing example actions from optimal (or near-optimal) policies.
  * [[https://
    * Blog post: [[https://
  * **Non-NLP papers**
    * [[https://
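The IRL problem statement can be made concrete with a toy example. Everything below is made up for illustration (a 3-state deterministic chain MDP, 0/1 rewards, brute-force search); real IRL methods such as max-margin or max-entropy IRL replace the enumeration with optimization, but the question is the same: which reward makes the observed expert behavior optimal?

```python
import itertools

# Toy deterministic 3-state chain: action "R" moves right, "L" left (clipped).
def step(s: int, a: str) -> int:
    return min(s + 1, 2) if a == "R" else max(s - 1, 0)

def greedy_policy(reward, gamma=0.9, iters=100):
    """Value iteration on state values, then the greedy action per state.
    Ties break toward "L", so all-"R" must be strictly better to match."""
    V = [0.0, 0.0, 0.0]
    for _ in range(iters):
        V = [max(reward[step(s, a)] + gamma * V[step(s, a)] for a in "LR")
             for s in range(3)]
    return {s: max("LR", key=lambda a: reward[step(s, a)] + gamma * V[step(s, a)])
            for s in range(3)}

# The "expert demonstration": always move right (toward state 2).
expert = {0: "R", 1: "R", 2: "R"}

# IRL by brute force: keep the 0/1 reward vectors that make the expert greedy-optimal.
candidates = [r for r in itertools.product([0.0, 1.0], repeat=3)
              if greedy_policy(list(r)) == expert]
print(candidates)
```

Here only a reward concentrated on the rightmost state explains the expert's behavior, which is the core intuition: demonstrations constrain, but do not uniquely determine, the reward (the all-zero reward is excluded only because ties break against the expert).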
ml/reinforcement_learning.1737775587.txt.gz · Last modified: 2025/01/25 03:26 by jmflanig