QUICK REVIEW

[Paper Review] What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Yan Ma, Weiyu Zhang | arXiv (Cornell University) | Feb 1, 2026

Robot Manipulation and Learning | Cited by 0

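One-Sentence Summary

This paper proposes MED, a framework that separates intrinsic capability gains from tool-induced effects in vision tool-use reinforcement learning, and finds that intrinsic learning dominates while tool-use RL mainly reduces tool-induced harm rather than producing mastery of the tool.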

ABSTRACT

Vision tool-use reinforcement learning (RL) can equip vision-language models with visual operators such as crop-and-zoom and achieves strong performance gains, yet it remains unclear whether these gains are driven by improvements in tool use or evolving intrinsic capabilities. We introduce MED (Measure-Explain-Diagnose), a coarse-to-fine framework that disentangles intrinsic capability changes from tool-induced effects, decomposes the tool-induced performance difference into gain and harm terms, and probes the mechanisms driving their evolution. Across checkpoint-level analyses on two VLMs with different tool priors and six benchmarks, we find that improvements are dominated by intrinsic learning, while tool-use RL mainly reduces tool-induced harm (e.g., fewer call-induced errors and weaker tool schema interference) and yields limited progress in tool-based correction of intrinsic failures. Overall, current vision tool-use RL learns to coexist safely with tools rather than master them.

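Motivation and Objectives

Assess whether the improvements from vision tool-use RL come from intrinsic capability growth or from tool-use dynamics.
Decompose tool-induced effects into gain and harm, and analyze their training dynamics.
Diagnose the underlying mechanisms that drive the evolution of tool use at different stages of tool familiarity.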

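Proposed Method

Train vision-language models with a crop-and-zoom tool using reinforcement learning, and evaluate tool-free and tool-available performance at every checkpoint.
Define the tool-induced performance gap G(t) = Acc_w(t) − Acc_wo(t), and decompose the end-to-end drift f_w(t) into an intrinsic drift f_wo(t) and a tool-induced drift Δ_tool(t).
Decompose G(t) into four terms (Call Gain, Schema Gain, Call Harm, Schema Harm), and further factor each term into Mass, Policy, and Quality components (Eq. (8)).
Measure, explain, and diagnose the training dynamics (MED) to attribute gain and harm to tool-calling behavior and to interaction with the tool schema.
Use two backbones with different tool priors (tool-naive Qwen2.5-VL and tool-native Qwen3-VL) and six benchmarks, analyzed at checkpoint granularity.
Conduct real-world analysis backed by sanity checks such as human-aligned evaluation, and run robustness checks on the failure set.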
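
As a rough illustration of the drift decomposition in the method above, the sketch below computes G(t), f_w(t), f_wo(t), and Δ_tool(t) from per-checkpoint accuracies. It assumes that each drift is measured relative to the first checkpoint and that Δ_tool(t) = G(t) − G(0). This summary does not spell out those definitions, so they are assumptions of the sketch, as are the function name `decompose_drift` and the accuracy values, which are made up for illustration.

```python
from typing import Dict, List


def decompose_drift(acc_w: List[float], acc_wo: List[float]) -> List[Dict[str, float]]:
    """Split end-to-end drift into intrinsic and tool-induced parts per checkpoint.

    acc_w[t]  : accuracy with the tool available at checkpoint t
    acc_wo[t] : accuracy without the tool at checkpoint t
    Checkpoint 0 is treated as the pre-RL baseline (an assumption of this sketch).
    """
    if len(acc_w) != len(acc_wo) or not acc_w:
        raise ValueError("need two non-empty accuracy lists of equal length")

    gap0 = acc_w[0] - acc_wo[0]  # G(0), the gap before RL training
    rows = []
    for a_w, a_wo in zip(acc_w, acc_wo):
        gap = a_w - a_wo            # G(t) = Acc_w(t) - Acc_wo(t)
        f_w = a_w - acc_w[0]        # end-to-end drift (assumed definition)
        f_wo = a_wo - acc_wo[0]     # intrinsic drift (assumed definition)
        delta_tool = gap - gap0     # tool-induced drift (assumed definition)
        rows.append({"G": gap, "f_w": f_w, "f_wo": f_wo, "delta_tool": delta_tool})
    return rows


# Made-up accuracies for three checkpoints, for illustration only.
acc_w = [0.50, 0.58, 0.64]
acc_wo = [0.48, 0.54, 0.59]
for t, row in enumerate(decompose_drift(acc_w, acc_wo)):
    # Under the assumed definitions, f_w = f_wo + delta_tool holds by construction.
    assert abs(row["f_w"] - (row["f_wo"] + row["delta_tool"])) < 1e-12
    print(t, {k: round(v, 3) for k, v in row.items()})
```

Under these assumed definitions the identity f_w(t) = f_wo(t) + Δ_tool(t) is purely algebraic, so the split says how much of the with-tool improvement would have appeared without the tool.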

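Experimental Results

Research Questions

RQ1: To what extent do the gains from tool-use RL come from intrinsic capability improvement rather than tool-induced effects?
RQ2: How do the intrinsic and tool-induced components evolve during training at different stages of tool familiarity?
RQ3: How do the mechanisms that drive gain and harm (Mass, Policy, Quality), and tool schema interference, evolve?
RQ4: Do vision tool-use policies truly master the tool, or do they merely coexist safely with it?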

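Key Findings

Intrinsic drift dominates the overall performance improvement; tool-induced drift accounts for only a minority of the learning progress (tool contribution ratio S_tool ≈ 0.22–0.30).
The two backbones show different tool-drift dynamics: the tool-naive model benefits from using the tool, whereas the tool-native model relies more on intrinsic improvement as tool utility levels off.
During training, total tool-induced harm decreases while total gain stagnates or declines, so the tool-induced gap G(t) plateaus.
Both Call Harm and Schema Harm decrease with training, and tool schema interference weakens, most visibly for the tool-native model.
Tool-use behavior remains conservative: tool-based correction improves little on hard-to-fix failures, which indicates that the policy learns safe coexistence rather than true mastery of the tool.
Human-aligned Call Gain is higher for the tool-native model (Qwen3-VL), which suggests that its interpretable gains are consistent with human reasoning; the tool-naive model shows some shortcut-like behavior.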
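
To make the gain and harm terms in these findings concrete, here is a minimal sketch of how paired per-item outcomes could be partitioned into the four terms. It assumes that "Call" terms cover items where the tool was actually invoked in the with-tool run and "Schema" terms cover items where the tool was offered but not invoked; that reading is inferred from the term names and is not confirmed by this summary. The `Outcome` record, the `gap_terms` function, and the toy data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Outcome:
    """Paired result for one benchmark item at one checkpoint (hypothetical record)."""
    correct_wo: bool  # correct when no tool is offered
    correct_w: bool   # correct when the tool is offered
    called: bool      # whether the tool was invoked in the with-tool run


def gap_terms(outcomes: Iterable[Outcome]) -> Dict[str, float]:
    """Return the four gain/harm terms (as fractions of all items) and their net gap G."""
    items = list(outcomes)
    if not items:
        raise ValueError("no outcomes")
    n = len(items)
    terms = {"call_gain": 0.0, "schema_gain": 0.0, "call_harm": 0.0, "schema_harm": 0.0}
    for o in items:
        if o.correct_w == o.correct_wo:
            continue  # same result with and without the tool: no contribution to G
        kind = "gain" if o.correct_w else "harm"     # wrong->right is gain, right->wrong is harm
        source = "call" if o.called else "schema"    # assumed Call/Schema split
        terms[f"{source}_{kind}"] += 1.0 / n
    terms["G"] = (terms["call_gain"] + terms["schema_gain"]
                  - terms["call_harm"] - terms["schema_harm"])
    return terms


# Toy data, for illustration only.
items = [
    Outcome(correct_wo=False, correct_w=True, called=True),    # call gain
    Outcome(correct_wo=False, correct_w=True, called=False),   # schema gain
    Outcome(correct_wo=True, correct_w=False, called=True),    # call harm
    Outcome(correct_wo=True, correct_w=True, called=True),     # unchanged
    Outcome(correct_wo=False, correct_w=False, called=False),  # unchanged
]
terms = gap_terms(items)
acc_w = sum(o.correct_w for o in items) / len(items)
acc_wo = sum(o.correct_wo for o in items) / len(items)
# The net of the four terms equals Acc_w - Acc_wo regardless of how Call/Schema is defined.
assert abs(terms["G"] - (acc_w - acc_wo)) < 1e-12
print(terms)
```

Because G is gain minus harm, a falling harm term combined with a flat or falling gain term can leave G roughly constant, which matches the plateau described in the findings.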

This review was generated by AI and reviewed by a human editor.