====== Reinforcement Learning ======

===== Overviews =====
  * Blogs and Tutorials
    * [[https://spinningup.openai.com/en/latest/spinningup/rl_intro.html|OpenAI Intro to RL]] Good intro to RL, with emphasis on deep learning methods
  * Books and Chapters
    * [[https://drive.google.com/file/d/1ntQDc8J4rZpgfZLQoyhX1JqDTT-RyNti/view?usp=sharing|Chapter 18 - Reinforcement Learning (UCSC only)]] from [[https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1492032646|Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow 2nd Ed]].  Good, concise introduction.
  * Lectures and Slides
    * [[http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture14.pdf|Lecture 14: Reinforcement Learning]]
  * Overview papers
    * [[https://arxiv.org/pdf/2005.01643.pdf|Levine et al 2020 - Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems]]

===== Papers =====
  * REINFORCE:
    * Ronald J. Williams. A class of gradient-estimating algorithms for reinforcement learning in neural networks.
    * Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.
    * See slide 10 [[https://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/slides/lec20.pdf|here]]
  * DAgger: [[https://arxiv.org/pdf/1011.0686.pdf|Ross et al 2010 - A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning]]
  * [[https://arxiv.org/pdf/1906.06062.pdf|Lorberbom et al 2019 - Direct Policy Gradients: Direct Optimization of Policies in Discrete Action Spaces]]
  * PPO: [[https://arxiv.org/pdf/1707.06347.pdf|Schulman et al 2017 - Proximal Policy Optimization Algorithms]] Used in RLHF ([[https://arxiv.org/pdf/1909.08593.pdf|Ziegler et al 2019]]) and InstructGPT ([[https://arxiv.org/pdf/2203.02155.pdf|Ouyang et al 2022]]).
  * [[https://arxiv.org/pdf/2409.12917|Kumar et al 2024 - Training Language Models to Self-Correct via Reinforcement Learning]] Applied to math and code.
  * Applied to games
    * [[https://arxiv.org/pdf/1912.06680.pdf|Berner et al 2019 - Dota 2 with Large Scale Deep Reinforcement Learning]] [[https://openai.com/blog/openai-five-defeats-dota-2-world-champions/|blog]] Uses a policy-gradient method called Proximal Policy Optimization ([[https://openai.com/blog/openai-baselines-ppo/|PPO]])
    * [[https://arxiv.org/pdf/1911.08265.pdf|Schrittwieser et al 2020 - Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model]] MuZero. Learns a model that predicts the reward, the action policy, and the value function. Without knowledge of the rules, MuZero matched the superhuman performance of AlphaZero.
    * [[https://arxiv.org/pdf/2206.15378.pdf|Perolat et al 2022 - Mastering the Game of Stratego with Model-Free Multiagent Reinforcement Learning]] [[https://www.marktechpost.com/2022/07/09/deepmind-ai-researchers-introduce-deepnash-an-autonomous-agent-trained-with-model-free-multiagent-reinforcement-learning-that-learns-to-play-the-game-of-stratego-at-expert-level/|blog]]
  * Applied to Reasoning Chains
    * [[https://aclanthology.org/2024.acl-long.510.pdf|Wang et al 2024 - Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations]]
    * [[https://aclanthology.org/2024.findings-emnlp.429.pdf|Wang et al 2024 - Multi-step Problem Solving Through a Verifier: An Empirical Analysis on Model-induced Process Supervision]]
    * [[https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf|DeepSeek 2025 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]
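
As a rough illustration of the REINFORCE update listed above, here is a minimal sketch on a multi-armed bandit with a softmax policy. The bandit setup, the deterministic per-arm rewards, the absence of a baseline, and the names ''softmax'' and ''reinforce_bandit'' are all assumptions made for this example; they are not taken from the Williams papers.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE (no baseline) on a bandit with fixed per-arm rewards.

    arm_rewards is a hypothetical list of deterministic payoffs, one per arm.
    Returns the final action probabilities of the softmax policy.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # one logit per arm
    for _ in range(steps):
        probs = softmax(theta)
        # Sample an action from the current policy.
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = arm_rewards[action]
        # For a softmax policy, d/d theta_k log pi(action) = 1[k == action] - pi(k).
        for k in range(len(theta)):
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_pi
    return softmax(theta)


if __name__ == "__main__":
    # Arm 1 pays 1.0 and arm 0 pays nothing, so the policy should favor arm 1.
    print(reinforce_bandit([0.0, 1.0]))
```

Each step scales the gradient of the log-probability of the sampled action by the reward it received, so actions that pay off become more likely. Practical implementations usually subtract a baseline from the reward to reduce variance.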

===== NLP RL Papers =====
(Some of the papers above should be moved to this section)

  * **Applied to Text Games**
    * [[https://arxiv.org/pdf/1506.08941|Narasimhan et al 2015 - Language Understanding for Text-based Games using Deep Reinforcement Learning]]


===== Reinforcement Learning with Verifiable Rewards =====
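DeepSeek-R1-Zero-style reinforcement learning, in which the reward comes from automatically checking the model's output rather than from a learned reward model, is sometimes called **"reinforcement learning (RL) on verifiable rewards"** (see for example [[https://arxiv.org/pdf/2505.21493|Zhou et al 2025]]) or **"RL with outcome supervision."**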

See also [[nlp:Large Reasoning Models]]

  * [[https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf|DeepSeek 2025 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]
  * [[https://arxiv.org/pdf/2505.21493|Zhou et al 2025 - Reinforcing General Reasoning without Verifiers]]
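
To make "verifiable reward" concrete, the following is a minimal sketch of a rule-based outcome reward for math-style problems. It assumes the model writes its final answer inside ''\boxed{...}'' and that exact string match against a reference is an acceptable check. Both conventions, and the function names ''extract_final_answer'' and ''verifiable_reward'', are assumptions for this example and not the reward code of any of the papers above.

```python
import re


def extract_final_answer(completion):
    """Return the contents of the last boxed span in the text, or None."""
    # Matches a literal backslash followed by boxed{...} with no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None


def verifiable_reward(completion, reference):
    """Outcome reward: 1.0 if the extracted answer equals the reference, else 0.0.

    Only the final answer is checked; the intermediate reasoning is not scored.
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer, no reward
    return 1.0 if answer == reference.strip() else 0.0


if __name__ == "__main__":
    print(verifiable_reward("2 + 2 = 4, so the answer is \\boxed{4}", "4"))
    print(verifiable_reward("I think it is \\boxed{5}", "4"))
```

A reward of this kind can be plugged into a policy-gradient method as the scalar return for each sampled completion. Real checkers typically normalize answers (for example, equivalent fractions) before comparing, which this sketch does not attempt.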


===== Datasets =====
  * NLE: [[https://arxiv.org/pdf/2006.13760.pdf|Küttler et al 2020 - The NetHack Learning Environment]]

===== Theory =====
  * [[https://sham.seas.harvard.edu/files/kakade/files/sham_thesis.pdf|Kakade 2003 - On the Sample Complexity of Reinforcement Learning]] PhD thesis.
  * [[https://arxiv.org/pdf/2112.13487.pdf|Foster et al 2021 - The Statistical Complexity of Interactive Decision Making]] Introduces Decision-Estimation Coefficient (DEC), analogous to VC dimension but for interactive decision making. Proves upper and lower bounds with a realizability assumption.

===== Inverse Reinforcement Learning (IRL) =====
In inverse reinforcement learning (IRL), the agent learns the reward function by observing example behavior (demonstrations) from a policy that is assumed to be optimal or near-optimal.
  * **Overviews**
    * [[https://arxiv.org/pdf/1806.06877|Arora & Doshi 2018 - A Survey of Inverse Reinforcement Learning: Challenges, Methods and Progress]]
    * Blog post: [[https://dkasenberg.github.io/inverse-reinforcement-learning-rescue/|Inverse Reinforcement Learning]] (nice diagrams)
  * **Non-NLP papers**
    * [[https://arxiv.org/pdf/1603.00448.pdf|Finn et al 2016 - Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization]]
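
The idea of recovering a reward from demonstrations can be sketched in a deliberately simplified setting. The example below assumes a one-step problem (no state transitions), one-hot action features, and an expert that picks actions with probability proportional to ''exp(reward)'', as in maximum-entropy IRL. Real IRL works on sequential decision problems, so this is only an illustration; the names ''softmax'' and ''fit_reward'' are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_reward(expert_actions, num_actions, steps=500, lr=0.5):
    """Fit per-action rewards w so that softmax(w) explains the expert's choices.

    Gradient ascent on the mean log-likelihood of the demonstrations. With
    one-hot features the gradient is (empirical action frequency) minus
    (action probability under the current reward).
    """
    counts = [0.0] * num_actions
    for a in expert_actions:
        counts[a] += 1.0
    empirical = [c / len(expert_actions) for c in counts]

    w = [0.0] * num_actions  # reward estimate, one entry per action
    for _ in range(steps):
        policy = softmax(w)
        for a in range(num_actions):
            w[a] += lr * (empirical[a] - policy[a])
    return w


if __name__ == "__main__":
    demos = [2, 2, 2, 1, 2, 0, 2, 1]  # hypothetical expert action choices
    print(fit_reward(demos, num_actions=3))
```

The recovered rewards are only meaningful up to an additive constant, since adding the same value to every entry leaves the softmax policy unchanged. This ambiguity is a small instance of the reward-identifiability issue discussed in the IRL literature.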

===== Resources =====

Refer to [[https://chrisliu298.io/posts/reinforcement-learning-resource.html|this page]] for an up-to-date list of resources.

  * General
    * [[https://github.com/dbobrenko/awesome-rl|awesome-rl]] by dbobrenko is a repository of RL-related resources grouped by RL sub-domain.
    * [[https://github.com/aikorea/awesome-rl|awesome-rl]] by aikorea is another repository of RL-related resources, grouped by resource type.
  * Books
    * [[http://incompleteideas.net/book/the-book.html|Reinforcement Learning: An Introduction]] by Richard Sutton and Andrew Barto is the classic reinforcement learning textbook.
  * Papers
    * [[https://spinningup.openai.com/en/latest/spinningup/keypapers.html|Key Papers in Deep RL]] by OpenAI is a list of must-read papers on deep RL algorithms, selected by OpenAI researchers.
    * [[https://arxiv.org/abs/1810.06339v1|Deep Reinforcement Learning]] by Yuxi Li is a comprehensive RL survey paper (2018). It can also serve as a tutorial for readers who want a general understanding of the field.
  * Courses
    * CS285 Deep Reinforcement Learning at UC Berkeley, taught by Professor Sergey Levine, is a deep RL course. It covers more recent topics and delves deeper into each of them, so it might be difficult for people who are new to RL. [[http://rail.eecs.berkeley.edu/deeprlcourse/|[Course website]]] [[https://www.youtube.com/playlist?list=PL_iWQOsE6TfURIIhCrlt-wj9ByIVpbfGc|[Playlist]]]
    * Introduction to Reinforcement Learning with David Silver is an introductory RL course suitable for beginners in RL. [[https://www.davidsilver.uk/teaching/|[Course website]]] [[https://www.youtube.com/playlist?list=PLqYmG7hTraZBiG_XpjnPrSNw-1XQaM_gB|[Playlist]]]
  * Blogs
    * [[https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html|A (Long) Peek into Reinforcement Learning]] by Lilian Weng is a good blog post for beginners in RL. It gives high-level intuition for most of the algorithms, which helps with further systematic study.
  * Tutorials
    * [[https://github.com/bentrevett/pytorch-rl|pytorch-rl]] by bentrevett is a practical introduction to RL using PyTorch.
    * [[https://spinningup.openai.com/en/latest/index.html|OpenAI Spinning Up]] by OpenAI might be the best educational resource to start with in deep RL. It covers key concepts in RL, kinds of RL algorithms, and a tutorial on the policy gradient algorithm. It also provides a resource list and algorithm documentation.
  * Frameworks
    * [[https://gym.openai.com/|OpenAI Gym]] by OpenAI is a toolkit for benchmarking RL algorithms.
  * Miscellaneous
    * [[https://rupalibhati.github.io/RL-profs/|Professors Working in Reinforcement Learning]] by Rupali Bhati is a list of professors who work in RL.

===== People =====
  * [[https://scholar.google.com/citations?user=htPVdRMAAAAJ&hl=en|Timothy Lillicrap]]

===== Related Pages =====
  * [[ml:theory:Multi-Armed Bandit]]
  * [[nlp:human-in-the-loop#RLHF]]
  * [[Self-Play]]