QUICK REVIEW

[Paper Review] Rethinking the Role of Entropy in Optimizing Tool-Use Behaviors for Large Language Model Agents

Zeping Li, Hongru Wang|arXiv (Cornell University)|Feb 2, 2026
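Topic Modeling | Cited by 0
One-Sentence Summary

This paper proposes TEPO, which uses entropy reduction as a supervisory signal for tool use in large language model agents, and designs a sparse reward to reduce tool calls and a dense reward to improve performance.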

ABSTRACT

Tool-using agents based on Large Language Models (LLMs) excel in tasks such as mathematical reasoning and multi-hop question answering. However, in long trajectories, agents often trigger excessive and low-quality tool calls, increasing latency and degrading inference performance, making managing tool-use behavior challenging. In this work, we conduct entropy-based pilot experiments and observe a strong positive correlation between entropy reduction and high-quality tool calls. Building on this finding, we propose using entropy reduction as a supervisory signal and design two reward strategies to address the differing needs of optimizing tool-use behavior. Sparse outcome rewards provide coarse, trajectory-level guidance to improve efficiency, while dense process rewards offer fine-grained supervision to enhance performance. Experiments across diverse domains show that both reward designs improve tool-use behavior: the former reduces tool calls by 72.07% compared to the average of baselines, while the latter improves performance by 22.27%. These results position entropy reduction as a key mechanism for enhancing tool-use behavior, enabling agents to be more adaptive in real-world applications.

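Research Motivation and Goals

  • Encourage better tool-use behavior in long-horizon LLM agent reasoning by linking tool calls to entropy reduction.
  • Study entropy dynamics as an intrinsic, model-agnostic signal for measuring tool-call quality across domains.
  • Propose two reward designs (sparse and dense) to optimize tool-use efficiency and/or performance.
  • Show that an entropy-reduction signal can guide RL-based tool use without task-specific handcrafted rules.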

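Proposed Method

  • Formalize tool-augmented generation as an iterative interaction between the agent and a tool executor.
  • Define delta entropy, $\Delta H_k = H(r_k) - H(r_{k-1})$, to quantify the change in uncertainty after a tool call.
  • Propose two reward schemes for TEPO: (1) a sparse outcome reward, which modulates the final task reward by the proportion of tool calls that reduce entropy; (2) a dense process reward, which grants a reward each time a tool call reduces entropy.
  • Reformulate token-level GRPO so that rewards are attributed to generated tokens, and propagate tool-level advantages to the reasoning segment preceding each tool call.
  • Evaluate across multiple domains (mathematical reasoning, knowledge-intensive reasoning, deep information retrieval), applying SFT followed by RL training, with Qwen2.5 and Llama3.1 as base models.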
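The delta-entropy definition and the two reward schemes listed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the multiplicative form of the sparse reward, the fixed `bonus` value, and the handling of trajectories with no tool calls are all assumptions, since this review does not give the exact formulas.

```python
from typing import List


def delta_entropies(segment_entropies: List[float]) -> List[float]:
    """Return Delta H_k = H(r_k) - H(r_{k-1}) for k = 1..K.

    Assumption: segment_entropies[0] is the entropy before the first tool
    call and segment_entropies[k] is the entropy after tool call k.
    """
    return [
        later - earlier
        for earlier, later in zip(segment_entropies, segment_entropies[1:])
    ]


def sparse_outcome_reward(task_reward: float, deltas: List[float]) -> float:
    """Scale the final task reward by the share of entropy-reducing tool calls.

    Assumption: the modulation is a plain multiplication; the source only
    says the final reward is modulated by this proportion.
    """
    if not deltas:
        # Assumption: a trajectory without tool calls keeps its task reward.
        return task_reward
    ratio = sum(1 for d in deltas if d < 0) / len(deltas)
    return task_reward * ratio


def dense_process_rewards(deltas: List[float], bonus: float = 1.0) -> List[float]:
    """Give one reward per tool call: `bonus` if entropy dropped, else 0.

    Assumption: a constant bonus; the actual magnitude is not specified here.
    """
    return [bonus if d < 0 else 0.0 for d in deltas]


# Hypothetical entropies for one trajectory with three tool calls.
entropies = [2.0, 1.5, 1.75, 1.0]
deltas = delta_entropies(entropies)          # [-0.5, 0.25, -0.75]
trajectory_reward = sparse_outcome_reward(1.0, deltas)
per_call_rewards = dense_process_rewards(deltas)
```

The sparse variant yields a single scalar per trajectory, while the dense variant yields one value per tool call, which is what allows finer-grained credit assignment.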
Figure 1: Changes in entropy reflect shifts in uncertainty within the agent. High-quality tool calls help the model reduce uncertainty, as indicated by a decrease in entropy.
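To make the entropy quantity in Figure 1 concrete, the sketch below computes Shannon entropy for next-token distributions and averages it over a segment. Averaging token entropies to obtain a segment-level value is an assumption made for illustration; this review does not state how the paper aggregates entropy, and the function names and example distributions are hypothetical.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    # Zero-probability entries contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def segment_entropy(token_distributions: List[Sequence[float]]) -> float:
    """Mean token entropy over a segment (assumed aggregation rule)."""
    if not token_distributions:
        raise ValueError("segment must contain at least one token")
    total = sum(token_entropy(p) for p in token_distributions)
    return total / len(token_distributions)


# Hypothetical distributions over a 4-token vocabulary.
before_tool_call = [[0.25, 0.25, 0.25, 0.25]]   # maximally uncertain
after_tool_call = [[0.97, 0.01, 0.01, 0.01]]    # sharply peaked

delta_h = segment_entropy(after_tool_call) - segment_entropy(before_tool_call)
# delta_h is negative here: the tool call reduced uncertainty.
```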

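Experimental Results

Research Questions

  • RQ1: Can entropy reduction serve as a lightweight, model-agnostic signal for measuring tool-call quality in long-horizon LLM reasoning?
  • RQ2: Do the two reward designs (sparse outcome reward vs. dense process reward) effectively improve tool-use efficiency and/or reasoning performance?
  • RQ3: How does TEPO scale across model sizes and domains, and how does entropy-based supervision compare with existing process-reward RL methods?
  • RQ4: In practical tool-augmented reasoning tasks, what is the relationship between entropy dynamics and high-quality tool calls?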

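Key Findings

  • Entropy-based pilot studies show that, across multiple domains and models, high-quality tool calls correlate with entropy reduction (negative $\Delta H_k$).
  • $\text{TEPO}_{\text{sparse}}$ reduces tool calls by 72.07% relative to the average of baselines while keeping final performance comparable, emphasizing efficiency gains.
  • $\text{TEPO}_{\text{dense}}$ improves reasoning performance by 22.27% on average over baselines by providing fine-grained entropy-reduction supervision.
  • Both TEPO variants outperform several baselines on reasoning and deep-retrieval tasks, demonstrating cross-domain robustness.
  • Entropy reduction acts as an effective supervisory signal that can guide tool use without task-specific handcrafted rules.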
Figure 2: The overall framework of $\text{TEPO}_{\text{sparse}}$ and $\text{TEPO}_{\text{dense}}$ . In the sparse reward design, the reward and advantage are calculated and then uniformly assigned to each token within the trajectory (same $A_{i,t}$ for all tokens). In contrast, the dense reward desi
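The token-level advantage assignment described in the Figure 2 caption can be illustrated roughly as follows. The sparse case follows the caption directly (one advantage shared by every token in a trajectory). The dense case is only a guess at the mechanism, based on the method summary's statement that tool-level advantages are passed to the reasoning segment preceding each tool call. The group normalization with population standard deviation, the `default_adv` for tokens after the last tool call, and all function names are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_normalized(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: standardize rewards within a sampled group.

    Assumption: population standard deviation plus a small eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def sparse_token_advantages(trajectory_adv: float, num_tokens: int) -> List[float]:
    """Sparse design: every token in the trajectory shares one advantage."""
    return [trajectory_adv] * num_tokens


def dense_token_advantages(
    segment_lengths: List[int],
    tool_advs: List[float],
    default_adv: float = 0.0,
) -> List[float]:
    """Dense design (assumed): tokens of the reasoning segment that precedes
    tool call k receive the advantage of tool call k.

    segment_lengths[k] is the token count of segment k. Segments with no
    following tool call (e.g. the final answer) get `default_adv`.
    """
    if len(tool_advs) > len(segment_lengths):
        raise ValueError("more tool calls than reasoning segments")
    advantages: List[float] = []
    for k, length in enumerate(segment_lengths):
        adv = tool_advs[k] if k < len(tool_advs) else default_adv
        advantages.extend([adv] * length)
    return advantages


# Hypothetical group of two trajectories with outcome rewards 1.0 and 0.0.
group_advs = group_normalized([1.0, 0.0])
sparse = sparse_token_advantages(group_advs[0], num_tokens=3)

# Hypothetical trajectory: two tool calls, then a 2-token final answer.
dense = dense_token_advantages([2, 1, 2], tool_advs=[1.0, -1.0])
```

The contrast is that `sparse` carries a single trajectory-level value repeated per token, whereas `dense` varies along the trajectory according to which tool call each reasoning segment led to.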


This review was generated by AI and reviewed by human editors.