====== Large Reasoning Models ======

o1- or R1-style LLMs, often called "large reasoning models" (LRMs); see [[https://arxiv.org/pdf/2502.08235|Cuadron et al 2025]].

===== Overviews =====

  * [[https://arxiv.org/pdf/2501.09686|Xu et al 2025 - Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models]]
  * [[https://arxiv.org/pdf/2502.21321|Kumar et al 2025 - LLM Post-Training: A Deep Dive into Reasoning Large Language Models]]
  * [[https://arxiv.org/pdf/2503.16419|Sui et al 2025 - Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models]]

===== Papers =====

  * [[https://arxiv.org/pdf/2403.04642|Havrilla et al 2024 - Teaching Large Language Models to Reason with Reinforcement Learning]]
  * **OpenAI o1**
    * [[https://openai.com/index/learning-to-reason-with-llms/|Learning to Reason with LLMs]] Includes examples of full reasoning chains.
    * [[https://cdn.openai.com/o1-system-card-20241205.pdf|OpenAI o1 System Card]] ([[https://arxiv.org/pdf/2412.16720|arXiv]]) Section 2 repays careful reading: it reveals a good deal about the training process.
  * [[https://arxiv.org/pdf/2501.12948|DeepSeek-AI et al 2025 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]
    * See also [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
    * R1 replication on small datasets
      * [[https://hkust-nlp.notion.site/simplerl-reason#18439bdc1c6b8083ba31f9cc912cf7f0|Zheng et al 2025 - 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient]]
  * **General papers**
    * [[https://arxiv.org/pdf/2501.19393|Muennighoff et al 2025 - s1: Simple test-time scaling]]
    * [[https://arxiv.org/pdf/2502.03387|Ye et al 2025 - LIMO: Less is More for Reasoning]]
    * [[https://arxiv.org/pdf/2502.08235|Cuadron et al 2025 - The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks]]
    * [[https://arxiv.org/pdf/2502.12215|Zeng et al 2025 - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?]]
    * [[https://arxiv.org/pdf/2503.14337|Yang et al 2025 - PENCIL: Long Thoughts with Short Memory]]
    * [[https://arxiv.org/pdf/2504.04022|Essential AI 2025 - Rethinking Reflection in Pre-Training]]
    * [[https://arxiv.org/pdf/2504.09858|Ma et al 2025 - Reasoning Models Can Be Effective Without Thinking]]
    * [[https://arxiv.org/pdf/2504.12329|Yang et al 2025 - Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time]]
    * [[https://arxiv.org/pdf/2504.13837|Yue et al 2025 - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?]]
    * [[https://arxiv.org/pdf/2505.16552|Tan et al 2025 - Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains]]
  * **Concise Reasoning**
    * Using RL
      * [[https://arxiv.org/pdf/2505.21178|Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning]]
  * **Parallel and Collaborative Thinking**
    * [[https://arxiv.org/pdf/2504.06261|Rodionov et al 2025 - Hogwild! Inference: Parallel LLM Generation via Concurrent Attention]]
    * [[https://arxiv.org/pdf/2505.07787|Luo et al 2025 - Learning from Peers in Reasoning Models]]
    * [[https://arxiv.org/pdf/2509.04475|Wen et al 2025 - ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute]]
  * **Problems, Criticisms and Insights**
    * [[https://arxiv.org/pdf/2505.22756|Qin et al 2025 - Decomposing Elements of Problem Solving: What "Math" Does RL Teach?]] "RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills."
    * [[https://arxiv.org/pdf/2506.06941|Shojaee et al 2025 - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity]]
    * **[[https://arxiv.org/pdf/2507.10532|Wu et al 2025 - Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination]]** Very important paper. "By auditing the MATH-500 dataset and introducing a clean benchmark, we demonstrate that Qwen's successes with spurious reward were driven by memorization of benchmark problems rather than genuine reasoning skills."
  * **Models**
    * [[https://arxiv.org/pdf/2504.21318|Abdin et al 2025 - Phi-4-reasoning Technical Report]]
    * [[https://arxiv.org/pdf/2505.22375|Chen et al 2025 - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition]] Has a "fast" mode for routine queries and a deeper "slow" mode for complex inference.

===== Related Pages =====

  * [[Reasoning]]
  * [[Reasoning#Reasoning Chains|Reasoning - Reasoning Chains]]
  * [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
  * [[Test-Time Scaling]]