QUICK REVIEW

[论文解读] Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Siyuan Gan, Jiaheng Liu|arXiv (Cornell University)|Jan 8, 2026

Advanced Graph Neural Networks被引用 0

一句话总结

TNT 提出一种在思维模式解决方案组件引导下的自适应非思维令牌上限，以缓解基于 RL 的混合推理中的奖励操纵，在数学基准上提升准确性和令牌效率。

ABSTRACT

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increase computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL will suffer the the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and sets different maximum token usage for responses not using thinking across various queries by leveraging information from the solution component of the responses using thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking problem in TNT's responses, which are classified as not using thinking, remains below 10% across all tested datasets.

研究动机与目标

激励在混合推理模型的强化学习训练中出现的奖励操纵问题，该模型在思考模式与非思考模式之间交替。
引入 Thinking-Based Non-Thinking (TNT)，在不进行监督微调的情况下自适应设置每个查询的非思维令牌上限。
证明 TNT 在标准数学基准上可将令牌使用量降低约 50% 的同时提升准确性。
展示 TNT 对基础模型的鲁棒性以及与 CoT 压缩方法和基线 RL 方法的竞争力。

提出的方法

定义思考模式和非思考模式，以及在基于强化学习的混合推理模型训练中奖励操纵的问题。
提出 TNT：使用思考模式解决方案组件（</think> 之后的令牌）来决定非思考模式的最大令牌使用量。
计算 Lx^N 作为在思考模式样本中 </think> 之后的平均剩余令牌数，并乘以系数 ω 进行缩放，并通过 L∅ 进行保护以处理采样上限。
构建一个区分思考与非思考模式的奖励函数，并通过基于长度的阈值 Lx^N 来缓解奖励操纵。
使用定义好的奖励，以令牌级策略梯度目标（GRPO）进行训练，使基于查询难度实现动态模式选择。

实验结果

研究问题

RQ1自适应的、基于查询难度的非思考令牌上限在不进行监督微调的情况下，是否能降低在 RL 训练的混合推理模型中的奖励操纵？
RQ2与 Thinkless、AdaptThink、AutoThink 及基础模型相比，TNT 是否在标准数学基准上改善了准确性-令牌效率的权衡？
RQ3TNT 的性能如何随更强的基础模型以及不同的 RL 设置而扩展？
RQ4TNT 对分布外任务是否鲁棒，以及对其奖励组件的消融是否敏感？

主要发现

TNT 使平均令牌使用量下降约 46%，在五个数学基准上平均准确性提高约 4%。
TNT 实现更好的令牌效率（TE），在所评估的数据集中超过 Thinkless、AdaptThink 与 AutoThink。
在测试数据上，TNT 的非思考模式比例保持较低，与任务难度呈负相关，表明在需要时实现自适应的思考。
TNT 显著缓解奖励操纵，非思考模式输出中用于思考的动词使用相较基线较少，表明非思考输出中的真实思考较少。
当基础模型更强（如 DeepScaleR-1.5B、DeepSeek-R1-Distill-Qwen-7B）时，TNT 的优势更加明显。
TNT 在准确性和 TE 上超越 CoT 压缩方法，并在分布外设置中保持鲁棒性。

Figure 2: Average accuracy and token usage comparison across different hybrid reasoning model training methods on mathematical benchmarks. We only presented the evaluation results of their open-source checkpoints while some of these methods lack the trained checkpoints based on DeepScaleR-1.5B, and

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。