(Some of the papers above should be moved to this section)
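DeepSeek-R1-Zero-style reinforcement learning is sometimes called “reinforcement learning (RL) on verifiable rewards” (see, for example, Zhou 2025) or “RL with outcome supervision”: the reward depends only on whether the final outcome can be checked automatically as correct, not on the intermediate reasoning steps.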
See also Large Reasoning Models
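As a rough illustration of what a verifiable, outcome-only reward can look like, here is a minimal sketch. It assumes completions end with a line of the form `Answer: ...` and that a reference answer is available; the function names and the answer format are hypothetical and are not taken from DeepSeek-R1-Zero or any specific paper.

```python
import re

def extract_final_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if absent.

    The 'Answer:' convention is an assumption made for this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion, reference):
    """Outcome-only reward: 1.0 if the final answer matches the reference, else 0.0.

    The reasoning text before the answer is ignored entirely.
    """
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0

# Three hypothetical completions for the same prompt, reference answer "42".
r_correct = verifiable_reward("Let me think... 6 * 7 = 42.\nAnswer: 42", "42")
r_wrong = verifiable_reward("I guess it is 41.\nAnswer: 41", "42")
r_missing = verifiable_reward("No final answer given", "42")
```

Exact string matching is the simplest possible checker. Real setups would likely need answer normalization, or unit tests for code, but the principle of scoring only the checkable outcome is the same.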
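In inverse reinforcement learning (IRL), the agent is not given a reward function. Instead, it infers one from demonstrations, that is, example actions (or state-action trajectories) assumed to come from an optimal or near-optimal policy.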
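The idea can be sketched in the simplest possible setting: a single state with a few discrete actions. The sketch below assumes the expert picks actions with probability proportional to exp(reward), a maximum-entropy-style assumption, and fits one reward value per action by gradient ascent on the log-likelihood of the demonstrations. The function names, the demonstration data, and the hyperparameters are all hypothetical; full IRL over multi-step trajectories is considerably more involved.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of reward values."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def fit_reward(demos, n_actions, lr=0.5, steps=500):
    """Fit per-action rewards so that softmax(reward) matches the expert.

    The log-likelihood gradient with respect to each reward is
    (empirical action frequency) - (model action probability).
    """
    counts = [0] * n_actions
    for a in demos:
        counts[a] += 1
    freq = [c / len(demos) for c in counts]

    theta = [0.0] * n_actions  # start from a flat (uninformative) reward
    for _ in range(steps):
        probs = softmax(theta)
        theta = [t + lr * (f - p) for t, f, p in zip(theta, freq, probs)]
    return theta

# Hypothetical expert demonstrations: action 2 is chosen most of the time.
demos = [2, 2, 2, 1, 2, 2, 0, 2]
theta = fit_reward(demos, n_actions=3)
best_action = max(range(3), key=lambda a: theta[a])
```

The recovered rewards are only identified up to an additive constant, since adding the same number to every entry of `theta` leaves the softmax policy unchanged. This is a small instance of the reward ambiguity that IRL methods in general have to deal with.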
Refer to this page for an up-to-date list of resources.