====== Large Reasoning Models ======

o1- or R1-style LLMs, often called "large reasoning models" (LRMs); see [[https://arxiv.org/pdf/2502.08235|Cuadron et al 2025]].

===== Overviews =====

  * [[https://arxiv.org/pdf/2501.09686|Xu et al 2025 - Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models]]
  * [[https://arxiv.org/pdf/2502.21321|Kumar et al 2025 - LLM Post-Training: A Deep Dive into Reasoning Large Language Models]]
  * [[https://arxiv.org/pdf/2503.16419|Sui et al 2025 - Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models]]

===== Papers =====

  * [[https://arxiv.org/pdf/2403.04642|Havrilla et al 2024 - Teaching Large Language Models to Reason with Reinforcement Learning]]
  * **OpenAI o1**
    * [[https://openai.com/index/learning-to-reason-with-llms/|Learning to Reason with LLMs]] Includes examples of full reasoning chains.
    * [[https://cdn.openai.com/o1-system-card-20241205.pdf|OpenAI o1 System Card]] ([[https://arxiv.org/pdf/2412.16720|arXiv]]) Section 2 repays careful reading: it reveals a good deal about the training process.
  * [[https://arxiv.org/pdf/2501.12948|DeepSeek-AI et al 2025 - DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning]]
    * See also [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
    * R1 replication on small datasets
      * [[https://hkust-nlp.notion.site/simplerl-reason#18439bdc1c6b8083ba31f9cc912cf7f0|Zheng et al 2025 - 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient]]
  * **General papers**
    * [[https://arxiv.org/pdf/2501.19393|Muennighoff et al 2025 - s1: Simple test-time scaling]]
    * [[https://arxiv.org/pdf/2502.03387|Ye et al 2025 - LIMO: Less is More for Reasoning]]
    * [[https://arxiv.org/pdf/2502.08235|Cuadron et al 2025 - The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks]]
    * [[https://arxiv.org/pdf/2502.12215|Zeng et al 2025 - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?]]
    * [[https://arxiv.org/pdf/2503.14337|Yang et al 2025 - PENCIL: Long Thoughts with Short Memory]]
    * [[https://arxiv.org/pdf/2504.04022|Essential AI 2025 - Rethinking Reflection in Pre-Training]]
    * [[https://arxiv.org/pdf/2504.09858|Ma et al 2025 - Reasoning Models Can Be Effective Without Thinking]]
    * [[https://arxiv.org/pdf/2504.12329|Yang et al 2025 - Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time]]
    * [[https://arxiv.org/pdf/2504.13837|Yue et al 2025 - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?]]
    * [[https://arxiv.org/pdf/2505.16552|Tan et al 2025 - Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains]]
  * **Concise Reasoning**
    * Using RL
      * [[https://arxiv.org/pdf/2505.21178|Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning]]
  * **Parallel and Collaborative Thinking**
    * [[https://arxiv.org/pdf/2504.06261|Rodionov et al 2025 - Hogwild! Inference: Parallel LLM Generation via Concurrent Attention]]
    * [[https://arxiv.org/pdf/2505.07787|Luo et al 2025 - Learning from Peers in Reasoning Models]]
    * [[https://arxiv.org/pdf/2509.04475|Wen et al 2025 - ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute]]
  * **Problems, Criticisms and Insights**
    * [[https://arxiv.org/pdf/2505.22756|Qin et al 2025 - Decomposing Elements of Problem Solving: What "Math" Does RL Teach?]] "RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills."
    * [[https://arxiv.org/pdf/2506.06941|Shojaee et al 2025 - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity]]
    * **[[https://arxiv.org/pdf/2507.10532|Wu et al 2025 - Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination]]** Very important paper. "By auditing the MATH-500 dataset and introducing a clean benchmark, we demonstrate that Qwen's successes with spurious reward were driven by memorization of benchmark problems rather than genuine reasoning skills."
  * **Models**
    * [[https://arxiv.org/pdf/2504.21318|Abdin et al 2025 - Phi-4-reasoning Technical Report]]
    * [[https://arxiv.org/pdf/2505.22375|Chen et al 2025 - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition]] Has a "fast" mode for routine queries and a deeper "slow" mode for complex inference.

===== Related Pages =====

  * [[Reasoning]]
  * [[Reasoning#Reasoning Chains|Reasoning - Reasoning Chains]]
  * [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
  * [[Test-Time Scaling]]