QUICK REVIEW

[Paper Review] Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents

Zeping Li, Hongru Wang|arXiv (Cornell University)|Feb 2, 2026


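One-Sentence Summary

The paper proposes TEPO, which uses entropy reduction as a supervisory signal for tool use in LLM agents and designs sparse and dense rewards to reduce tool calls and to improve performance, respectively.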

ABSTRACT

Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance, making managing tool-use behavior challenging. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.

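Motivation and Objectives

Encourage better tool-use behavior in long-horizon LLM agent reasoning by linking tool calls to entropy reduction.
Study entropy dynamics as an intrinsic, model-agnostic signal for measuring tool-call quality across domains.
Propose two reward designs (sparse and dense) to optimize tool-use efficiency and/or performance.
Show that the entropy-reduction signal can guide RL-based tool use without task-specific handcrafted rules.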

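Proposed Method

Formalize tool-augmented generation as an iterative interaction between the agent and a tool executor.
Define delta entropy, $\Delta H_k = H(r_k) - H(r_{k-1})$, to quantify the change in uncertainty after a tool call.
Propose two reward schemes for TEPO: (1) a sparse outcome reward that scales the final task reward by the proportion of tool calls that reduce entropy; (2) a dense process reward granted each time a tool call lowers entropy.
Reformulate token-level GRPO so that rewards are attributed to generated tokens, and propagate tool-level advantages to the reasoning segment preceding each tool call.
Evaluate across multiple domains (mathematical reasoning, knowledge-intensive reasoning, deep information retrieval), using SFT followed by RL training with Qwen2.5 and Llama3.1 as base models.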
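The delta-entropy definition and the two reward schemes above can be illustrated with a minimal Python sketch. This summary does not give the paper's exact formulas, so several choices here are assumptions: that $H(r_k)$ is the mean token-level Shannon entropy over segment $r_k$, that the sparse reward is a plain product of the task reward and the entropy-reducing ratio, and that the dense reward is a fixed `bonus` per entropy-reducing call. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_entropy(token_dists: Sequence[Sequence[float]]) -> float:
    """Assumption: H(r_k) is the mean token entropy over segment r_k."""
    return sum(token_entropy(d) for d in token_dists) / len(token_dists)


def delta_entropies(segment_h: Sequence[float]) -> List[float]:
    """Delta H_k = H(r_k) - H(r_{k-1}) for k = 1..K, given H(r_0), ..., H(r_K)."""
    return [segment_h[k] - segment_h[k - 1] for k in range(1, len(segment_h))]


def sparse_reward(task_reward: float, deltas: Sequence[float]) -> float:
    """Hypothetical sparse form: scale the final task reward by the
    fraction of tool calls whose Delta H_k is negative."""
    if not deltas:
        # Assumption: with no tool calls, the task reward is left unchanged.
        return task_reward
    ratio = sum(d < 0 for d in deltas) / len(deltas)
    return task_reward * ratio


def dense_rewards(deltas: Sequence[float], bonus: float = 1.0) -> List[float]:
    """Hypothetical dense form: one reward per tool call, `bonus` when the
    call lowered entropy and 0.0 otherwise."""
    return [bonus if d < 0 else 0.0 for d in deltas]


# Illustrative entropies for r_0..r_3 (made-up numbers, not from the paper).
segment_h = [1.2, 0.8, 0.9, 0.4]
deltas = delta_entropies(segment_h)
trajectory_reward = sparse_reward(1.0, deltas)
per_call_rewards = dense_rewards(deltas)
```

In this toy trajectory the first and third tool calls lower entropy and the second raises it, so the sparse variant keeps two thirds of the task reward, while the dense variant rewards the first and third calls individually.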

Figure 1: Changes in entropy reflect shifts in uncertainty within the agent. High-quality tool calls help the model reduce uncertainty, as indicated by a decrease in entropy.

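Experimental Results

Research Questions

RQ1: Can entropy reduction serve as a lightweight, model-agnostic signal for measuring tool-call quality in long-horizon LLM reasoning?
RQ2: Do the two reward designs (sparse outcome reward vs. dense process reward) effectively improve tool-use efficiency and/or reasoning performance?
RQ3: How does TEPO scale across model sizes and domains, and how does entropy-based supervision compare with existing process-reward RL methods?
RQ4: In practical tool-augmented reasoning tasks, what is the relationship between entropy dynamics and high-quality tool calls?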

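Key Findings

Entropy-based pilot studies show that, across multiple domains and models, high-quality tool calls correlate with entropy reduction (negative $\Delta H_k$).
$\text{TEPO}_{\text{sparse}}$ reduces tool calls by 72.07% relative to the average of baselines while keeping final performance comparable, emphasizing efficiency gains.
$\text{TEPO}_{\text{dense}}$ improves reasoning performance by 22.27% on average over baselines by providing fine-grained entropy-reduction supervision.
Both TEPO variants outperform several baselines on reasoning and deep-retrieval tasks, demonstrating robustness across domains.
Entropy reduction is an effective supervisory signal that can guide tool use without task-specific handcrafted rules.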

Figure 2: The overall framework of $\text{TEPO}_{\text{sparse}}$ and $\text{TEPO}_{\text{dense}}$ . In the sparse reward design, the reward and advantage are calculated and then uniformly assigned to each token within the trajectory (same $A_{i,t}$ for all tokens). In contrast, the dense reward desi
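The advantage assignment described in the Figure 2 caption can be sketched as follows. The group normalization is the standard GRPO form (reward minus group mean, divided by group standard deviation). Using the population standard deviation, the `eps` stabilizer, and all function names are assumptions. The segment-wise function is only a hypothetical reading of the dense design, based on the method description that tool-level advantages go to the reasoning segment preceding each tool call.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: normalize each trajectory reward within its
    sampled group. Population std and eps are assumptions of this sketch."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage: float, num_tokens: int) -> List[float]:
    """Sparse design: every token of trajectory i gets the same A_{i,t}."""
    return [advantage] * num_tokens


def segment_token_advantages(
    segment_advantages: Sequence[float], segment_lengths: Sequence[int]
) -> List[float]:
    """Hypothetical dense counterpart: tokens of the reasoning segment that
    precedes tool call k all share that call's advantage."""
    token_adv: List[float] = []
    for adv, length in zip(segment_advantages, segment_lengths):
        token_adv.extend([adv] * length)
    return token_adv


# Four sampled trajectories for one prompt (illustrative rewards only).
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
sparse_tokens = broadcast_to_tokens(advs[0], num_tokens=3)
dense_tokens = segment_token_advantages([0.5, -0.2], [2, 3])
```

The only difference between the two assignments in this sketch is granularity: one scalar per trajectory versus one scalar per reasoning segment.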


This review was generated by AI and checked by a human editor.