ml:reinforcement_learning
Revisions compared: 2025/05/29 07:28 and 2025/07/14 05:40 (current), edited by jmflanig.
  * [[https://
===== NLP RL Papers =====
(Some of the papers above should be moved to this section)

  * **Applied to Text Games**
    * [[https://

===== Reinforcement Learning with Verifiable Rewards =====
DeepSeek-R1-Zero-style reinforcement learning is sometimes called

See also [[nlp:Large Reasoning Models]]
  * [[https://
  * [[https://
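The core loop behind verifiable-reward training is straightforward to sketch. Below is a minimal, hypothetical toy example (it is not taken from any of the linked papers): the "task" is a fixed arithmetic question, the verifiable reward is a programmatic check (1 if the sampled answer is correct, 0 otherwise), and the policy is updated with plain REINFORCE. The candidate set, learning rate, and step count are illustrative assumptions.

```python
import math
import random

# Hypothetical RLVR toy: the task is "2 + 3", the policy is a softmax
# over a few candidate answers, and the reward is a verifiable check
# rather than a learned reward model.

candidates = [4, 5, 6]       # answers the policy can emit
logits = [0.0, 0.0, 0.0]     # learnable parameters, one per candidate
lr = 0.5

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def verify(answer):
    # Verifiable reward: a programmatic correctness check.
    return 1.0 if answer == 2 + 3 else 0.0

random.seed(0)
for _ in range(500):
    probs = softmax(logits)
    i = random.choices(range(len(candidates)), weights=probs)[0]
    r = verify(candidates[i])
    # Exact expected reward under the current policy, used as a baseline.
    baseline = sum(p * verify(c) for p, c in zip(probs, candidates))
    adv = r - baseline
    # REINFORCE: grad of log pi(i) for a softmax is one_hot(i) - probs.
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]
        logits[j] += lr * adv * grad

# The policy should concentrate its mass on the verified answer.
best = candidates[max(range(len(candidates)), key=lambda j: logits[j])]
```

Real RLVR runs replace the toy verifier with checks such as unit tests or exact-match answer graders, and REINFORCE with variants like GRPO or PPO, but the reward signal is the same binary pass/fail idea.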
| + | |||
| ===== Datasets ===== | ===== Datasets ===== | ||
===== Inverse Reinforcement Learning (IRL) =====
In inverse reinforcement learning (IRL), the agent learns the reward function by observing example actions from an optimal (expert) policy.
  * [[https://
    * Blog post: [[https://
  * **Non-NLP papers**
    * [[https://
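The IRL idea can be illustrated with a small sketch. Below is a hypothetical MaxEnt-style IRL example on a 5-state chain (it is not taken from the linked papers): the expert always moves right toward state 4, and a per-state reward is updated so that the soft-optimal policy's state-visitation counts match the expert's. The chain size, horizon, and learning rate are illustrative assumptions.

```python
import math

# Hypothetical IRL toy: recover a reward on a 5-state chain from an
# expert that always moves right toward state 4.

N_STATES, HORIZON, LR = 5, 6, 0.1

def step(s, a):
    # a = -1 (left) or +1 (right); walls clamp at the ends.
    return min(max(s + a, 0), N_STATES - 1)

def soft_policy(reward):
    # Finite-horizon soft value iteration -> stochastic policy pi[t][s][a].
    V = [0.0] * N_STATES
    policies = []
    for _ in range(HORIZON):
        pi, newV = [], []
        for s in range(N_STATES):
            q = [reward[step(s, a)] + V[step(s, a)] for a in (-1, 1)]
            m = max(q)
            exps = [math.exp(x - m) for x in q]
            z = sum(exps)
            pi.append([e / z for e in exps])
            newV.append(m + math.log(z))
        policies.append(pi)
        V = newV
    return list(reversed(policies))  # policies[t] is used at time t

def visitation(policies, s0=0):
    # Expected state-visitation counts under the soft policy.
    d = [0.0] * N_STATES
    p = [0.0] * N_STATES
    p[s0] = 1.0
    for t in range(HORIZON):
        for s in range(N_STATES):
            d[s] += p[s]
        nxt = [0.0] * N_STATES
        for s in range(N_STATES):
            for ai, a in enumerate((-1, 1)):
                nxt[step(s, a)] += p[s] * policies[t][s][ai]
        p = nxt
    return d

# Expert demonstration: always move right from state 0.
expert = [0.0] * N_STATES
s = 0
for _ in range(HORIZON):
    expert[s] += 1.0
    s = step(s, +1)

reward = [0.0] * N_STATES
for _ in range(200):
    d = visitation(soft_policy(reward))
    # MaxEnt IRL gradient: expert visitations minus policy visitations.
    for s in range(N_STATES):
        reward[s] += LR * (expert[s] - d[s])

# The learned reward should be largest near the goal state (state 4).
```

The gradient here (expert minus model visitation counts) is the maximum-entropy IRL update; the sketch learns one reward value per state, where a full implementation would use reward features and stochastic dynamics.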
ml/reinforcement_learning.1748503736.txt.gz · Last modified: 2025/05/29 07:28 by jmflanig