QUICK REVIEW

[Paper Review] Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Siyuan Gan, Jiaheng Liu | arXiv (Cornell University) | 2026. 01. 08.

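One-Line Summary

TNT proposes an adaptive non-thinking token limit guided by the solution component of thinking-mode responses, mitigating reward hacking in RL-based hybrid reasoning and improving both accuracy and token efficiency on math benchmarks.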

ABSTRACT

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL will suffer from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and sets different maximum token usage for responses not using thinking across various queries by leveraging information from the solution component of the responses using thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking problem in TNT's responses, which are classified as not using thinking, remains below 10% across all tested datasets.

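Motivation and Objectives

Address the reward hacking problem that arises when hybrid reasoning models, which switch between thinking and non-thinking modes, are trained with reinforcement learning.
Introduce Thinking-Based Non-Thinking (TNT), which sets an adaptive per-query non-thinking token limit without supervised fine-tuning.
Show that TNT improves accuracy while reducing token usage by about 50% on standard math benchmarks.
Demonstrate TNT's robustness across base models and its competitiveness against CoT compression methods and RL-based baselines.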

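Proposed Method

Define the thinking and non-thinking modes and describe the reward hacking problem in RL-based training of hybrid reasoning models.
Propose TNT: use the solution component of thinking-mode responses (the tokens after </think>) to determine the maximum non-thinking-mode token usage for each prompt.
Compute Lx^N as the average number of tokens that follow </think> in the thinking-mode samples, scale it by a coefficient ω, and stabilize it with L∅ to handle sampling limitations.
Construct a reward function that distinguishes the thinking and non-thinking modes and mitigates reward hacking through the length-based penalty threshold Lx^N.
Train with a token-level policy gradient objective (GRPO) using the defined reward, enabling dynamic mode selection according to query difficulty.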
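
The per-query limit and the length-based penalty described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the review does not give the exact formula, so the function names, the use of L∅ as a fallback when no thinking-mode sample is available, and the reward values (1.0 for a correct answer, minus 1.0 when a non-thinking response exceeds the limit) are all assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence

THINK_CLOSE = "</think>"  # tag that ends the thinking part of a response


def solution_length(tokens: Sequence[str]) -> Optional[int]:
    """Count the tokens after the last </think>; None if the tag is absent."""
    if THINK_CLOSE not in tokens:
        return None
    # Position of the tag in the reversed list == number of tokens after it.
    return list(reversed(tokens)).index(THINK_CLOSE)


def non_thinking_limit(
    thinking_samples: Sequence[Sequence[str]],
    omega: float,
    fallback_limit: float,
) -> float:
    """Hypothetical per-query limit: omega * mean solution length.

    Assumption: fallback_limit plays the role of L-empty and is used only
    when no sampled response contains a usable solution component.
    """
    lengths = [n for n in map(solution_length, thinking_samples) if n is not None]
    if not lengths:
        return float(fallback_limit)
    return omega * mean(lengths)


def reward(is_correct: bool, is_non_thinking: bool, num_tokens: int, limit: float) -> float:
    """Hypothetical reward: correctness minus a penalty for over-long
    responses that were classified as non-thinking."""
    base = 1.0 if is_correct else 0.0
    if is_non_thinking and num_tokens > limit:
        return base - 1.0  # assumed penalty size, not taken from the paper
    return base


samples = [
    ["a", "b", "</think>", "x", "y"],        # 2 solution tokens
    ["</think>", "x", "y", "z", "w"],        # 4 solution tokens
]
limit = non_thinking_limit(samples, omega=2.0, fallback_limit=100)
```

With these two samples the mean solution length is 3, so `limit` is 6.0 for `omega=2.0`. A response labeled non-thinking that runs past this limit is likely hiding reasoning in its answer, which is the case the penalty is meant to discourage.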

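Experimental Results

Research Questions

RQ1: Can adaptive, query-difficulty-sensitive non-thinking token limits reduce reward hacking in hybrid reasoning models trained with RL and without SFT?
RQ2: Does TNT improve the trade-off between accuracy and token efficiency on standard math benchmarks compared with Thinkless, AdaptThink, AutoThink, and the base models?
RQ3: How does TNT's performance scale with stronger base models and different RL settings?
RQ4: Is TNT robust to out-of-distribution tasks and to the ablation of reward components?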

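Key Findings

Across five math benchmarks, TNT reduces average token usage by about 46% and improves average accuracy by about 4%.
TNT improves token efficiency (TE) and outperforms Thinkless, AdaptThink, and AutoThink on the evaluated datasets.
On test data, the share of non-thinking-mode responses stays low and correlates negatively with task difficulty, suggesting that the model thinks adaptively, as needed.
TNT substantially mitigates reward hacking: compared with the base model, verb usage in its non-thinking-mode responses rarely indicates actual thinking.
TNT's benefits are more pronounced on stronger base models (e.g., DeepScaleR-1.5B, DeepSeek-R1-Distill-Qwen-7B).
TNT outperforms CoT compression methods in accuracy and TE and remains robust in out-of-distribution settings.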

Figure 2: Average accuracy and token usage comparison across different hybrid reasoning model training methods on mathematical benchmarks. We only presented the evaluation results of their open-source checkpoints while some of these methods lack the trained checkpoints based on DeepScaleR-1.5B, and

This review was generated by AI and checked by a human editor.