ml:reinforcement_learning

Last modified: 2025/07/14 05:40 by jmflanig
    * [[https://aclanthology.org/2024.acl-long.510.pdf|Wang et al 2024 - Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations]]
    * [[https://aclanthology.org/2024.findings-emnlp.429.pdf|Wang et al 2024 - Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision]]
    * [[https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf|DeepSeek 2025 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]

===== NLP RL Papers =====
(Some of the papers above should be moved to this section.)

  * **Applied to Text Games**
    * [[https://arxiv.org/pdf/1506.08941|Narasimhan et al 2015 - Language Understanding for Text-based Games using Deep Reinforcement Learning]]

===== Reinforcement Learning with Verifiable Rewards =====
DeepSeek-R1-Zero-style reinforcement learning is sometimes called **"reinforcement learning with verifiable rewards" (RLVR)** (see for example [[https://arxiv.org/pdf/2505.21493|Zhou et al 2025]]) or **"RL with outcome supervision"**: the policy is rewarded only on whether its final answer checks out as correct (e.g. by exact match or a test suite), with no learned reward model and no step-level supervision.

See also [[nlp:Large Reasoning Models]].

    * [[https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf|DeepSeek 2025 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]
    * [[https://arxiv.org/pdf/2505.21493|Zhou et al 2025 - Reinforcing General Reasoning without Verifiers]]

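A verifiable reward of this kind can be sketched as a simple programmatic check against a known-correct answer. This is a minimal illustration, not DeepSeek-R1's actual parser: the ''Answer:'' tag format and the function name are assumptions for the example.

```python
import re

def verifiable_reward(model_output: str, gold_answer: str) -> float:
    """Outcome-only reward: 1.0 if the final answer matches, else 0.0.

    Assumes the model is prompted to end its output with a line like
    'Answer: <value>' -- this tag format is an illustrative assumption.
    """
    match = re.search(r"Answer:\s*(.+?)\s*$", model_output.strip())
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if match.group(1) == gold_answer.strip() else 0.0

# The resulting scalar is fed to a policy-gradient update (e.g. PPO/GRPO);
# no reward model needs to be trained, which is what makes the reward
# "verifiable" rather than learned.
```
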
  
===== Datasets =====
===== Inverse Reinforcement Learning (IRL) =====
In inverse reinforcement learning (IRL), the agent learns the reward function by observing example actions from optimal (expert) policies.
  * **Overviews**
    * [[https://arxiv.org/pdf/1806.06877|Arora & Doshi 2018 - A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress]]
    * Blog post: [[https://dkasenberg.github.io/inverse-reinforcement-learning-rescue/|Inverse Reinforcement Learning]] (nice diagrams)
  * **Non-NLP papers**
    * [[https://arxiv.org/pdf/1603.00448.pdf|Finn et al 2016 - Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization]]
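The feature-matching idea behind classic IRL (in the spirit of apprenticeship learning, not the specific methods cited above) can be sketched on a toy problem. Everything below — the 1-D gridworld, the one-hot features, the trajectories — is invented for illustration.

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.9):
    """Discounted feature counts, averaged over trajectories of states."""
    mus = [sum((gamma ** t) * phi(s) for t, s in enumerate(traj))
           for traj in trajectories]
    return np.mean(mus, axis=0)

# Hypothetical 1-D gridworld: state = position 0..4, feature = one-hot.
phi = lambda s: np.eye(5)[s]

expert_trajs = [[0, 1, 2, 3, 4], [0, 1, 2, 3, 4]]  # expert walks right
policy_trajs = [[0, 0, 1, 0, 1], [0, 1, 1, 2, 1]]  # current (bad) policy

mu_expert = feature_expectations(expert_trajs, phi)
mu_policy = feature_expectations(policy_trajs, phi)

# One reward-weight step: push the linear reward R(s) = w . phi(s) toward
# features the expert visits more often than the current policy does.
w = mu_expert - mu_policy
reward = lambda s: w @ phi(s)  # states favoured by the expert score higher
```

After this update, states only the expert reaches (like position 4) get a higher inferred reward than states the non-expert policy lingers in, which is the basic signal IRL methods iterate on.
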
  