Differences

--- nlp:large_reasoning_models [2025/05/30 23:15] – [Papers] jmflanig
+++ nlp:large_reasoning_models [2025/10/10 09:05] (current) – [Papers] jmflanig
@@ Line 10: / Line 10: @@
 ===== Papers =====
   * [[https://arxiv.org/pdf/2403.04642|Havrilla et al 2024 - Teaching Large Language Models to Reason with Reinforcement Learning]]
-  * OpenAI o1
+  * **OpenAI o1**
     * [[https://openai.com/index/learning-to-reason-with-llms/|Learning to Reason with LLMs]] Has examples of the full reasoning chains.
     * [[https://cdn.openai.com/o1-system-card-20241205.pdf|OpenAI o1 System Card]] [[https://arxiv.org/pdf/2412.16720?|arXiv]] (There is a lot of information to be gleaned about the training process if you read section 2 carefully.)
@@ Line 17: / Line 17: @@
     * R1 replication on small datasets
       * [[https://hkust-nlp.notion.site/simplerl-reason#18439bdc1c6b8083ba31f9cc912cf7f0|Zheng et al 2025 - 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient]]
-  * General papers
+  * **General papers**
     * [[https://arxiv.org/pdf/2501.19393|Muennighoff et al 2025 - s1: Simple test-time scaling]]
     * [[https://arxiv.org/pdf/2502.03387|Ye et al 2025 - LIMO: Less is More for Reasoning]]
     * [[https://arxiv.org/pdf/2502.08235|2025 - The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks]]
+    * [[https://arxiv.org/pdf/2502.12215|Zeng et al 2025 - Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities?]]
     * [[https://arxiv.org/pdf/2503.14337|Yang et al 2025 - PENCIL: Long Thoughts with Short Memory]]
     * [[https://arxiv.org/pdf/2504.04022|Essential AI 2025 - Rethinking Reflection in Pre-Training]]
@@ Line 26: / Line 27: @@
     * [[https://arxiv.org/pdf/2504.12329|Yang et al 2025 - Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time]]
     * [[http://arxiv.org/pdf/2504.13837|Yue et al 2025 - Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?]]
-    * [[https://arxiv.org/pdf/2505.07787|Luo et al 2025 - Learning from Peers in Reasoning Models]]
     * [[https://arxiv.org/pdf/2505.16552|Tan et al 2025 - Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains]]
-  * Concise Reasoning
+  * **Concise Reasoning**
     * Using RL
       * [[https://arxiv.org/pdf/2505.21178|Song & Zheng 2025 - Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning]]
-  * Models
+  * **Parallel and Collaborative Thinking**
+    * [[https://arxiv.org/pdf/2504.06261|Rodionov et al 2025 - Hogwild! Inference: Parallel LLM Generation via Concurrent Attention]]
+    * [[https://arxiv.org/pdf/2505.07787|Luo et al 2025 - Learning from Peers in Reasoning Models]]
+    * [[https://arxiv.org/pdf/2509.04475|Wen et al 2025 - ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute]]
+  * **Problems, Criticisms and Insights**
+    * [[https://arxiv.org/pdf/2505.22756|Qin et al 2025 - Decomposing Elements of Problem Solving: What "Math" Does RL Teach?]] "RL-trained models struggle with fundamentally new problems, hitting a ‘coverage wall’ due to insufficient planning skills"
+    * [[https://arxiv.org/pdf/2506.06941|Shojaee et al 2025 - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity]]
+    * **[[https://arxiv.org/pdf/2507.10532|Wu et al 2025 - Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination]]** Very important paper. "By auditing the MATH-500 dataset and introducing a clean benchmark, we demonstrate that Qwen’s successes with spurious reward were driven by memorization of benchmark problems rather than genuine reasoning skills."
+  * **Models**
     * Phi-4-Reasoning: [[https://arxiv.org/pdf/2504.21318|Abdin et al 2025 - Phi-4-reasoning Technical Report]]
+    * [[https://arxiv.org/pdf/2505.22375|Chen et al 2025 - Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition]] Has a "fast" mode for routine queries and a deeper "slow" mode for complex inference.
 ===== Related Pages =====
-  * [[Reasoning Chains]]
+  * [[Reasoning]]
+  * [[Reasoning#Reasoning Chains|Reasoning - Reasoning Chains]]
   * [[ml:reinforcement_learning#Reinforcement Learning with Verifiable Rewards]]
   * [[Test-Time Scaling]]