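[Paper Summary] Learning to Configure Agentic AI Systems
tldr: ARC learns a lightweight hierarchical reinforcement learning policy that configures workflows, tools, budgets, and prompts per query for LLM-based agents, outperforming static designs and other baselines while reducing compute.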
Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy and hard input queries. We formulate agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource & Configuration learner), which learns a light-weight hierarchical policy using reinforcement learning to dynamically tailor these configurations. Across multiple benchmarks spanning reasoning and tool-augmented question answering, the learned policy consistently outperforms strong hand-designed and other baselines, achieving up to 25% higher task accuracy while also reducing token and runtime costs. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.
Motivation and Objectives
- Motivate the need for query-adaptive agent configurations to avoid brittle, one-size-fits-all setups.
- Formulate agent configuration as a query-wise decision problem amenable to reinforcement learning.
- Develop ARC, a hierarchical RL framework with a structure policy and a prompt policy to optimize workflows, tools, budgets, and prompts without retraining the backbone model.
- Provide a hybrid training pipeline combining masked RL and supervised fine-tuning to stabilize learning with sparse rewards.
- Empirically validate ARC across reasoning and tool-use benchmarks, showing improved accuracy and efficiency over baselines.
Proposed Method
- Represent configuration as a two-level policy: a high-level structure policy selects workflows, tools, and budgets, and a low-level prompt policy composes instructions.
- Use a short episodic MDP where each episode configures and runs the agent system for a single query, with state derived from a fixed semantic-query embedding and simple query features.
- Train with PPO using a shaped reward that balances correctness and efficiency, including a tool-shaping term to align tool allocation with actual usage.
- Apply action masking to prune invalid configurations and reduce the effective action space.
- Perform post-training supervised fine-tuning (SFT) on elite trajectories to distill high-quality configurations.
- Provide a theoretical justification showing that SFT concentrates the policy on elite configurations and maintains a reward floor.
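
The structure-level sampling, action masking, shaped reward, and elite selection described in the list above can be sketched in plain Python. This is only an illustration under stated assumptions: the option lists, the validity rule in `tool_mask`, the reward weights, and the top-fraction rule in `select_elite` are all hypothetical and are not taken from the paper. In a real system the logits would come from the trained policy network rather than being passed in by hand.

```python
import math
import random

# Hypothetical action spaces; the paper's actual options are not given here.
WORKFLOWS = ["direct", "chain_of_thought", "plan_and_execute"]
TOOLS = ["none", "calculator", "search"]
BUDGETS = [256, 512, 1024]


def masked_softmax(logits, mask):
    """Softmax over logits; entries with mask[i] == False get probability 0."""
    valid = [l for l, m in zip(logits, mask) if m]
    if not valid:
        raise ValueError("mask removes every action")
    top = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def tool_mask(workflow):
    """Assumed validity rule: the 'direct' workflow cannot use any tool."""
    if workflow == "direct":
        return [t == "none" for t in TOOLS]
    return [True] * len(TOOLS)


def sample_structure(logits_wf, logits_tool, logits_budget, rng):
    """Sample (workflow, tool, budget) in sequence, masking invalid tools."""
    p_wf = masked_softmax(logits_wf, [True] * len(WORKFLOWS))
    wf = rng.choices(WORKFLOWS, weights=p_wf)[0]
    p_tool = masked_softmax(logits_tool, tool_mask(wf))
    tool = rng.choices(TOOLS, weights=p_tool)[0]
    p_budget = masked_softmax(logits_budget, [True] * len(BUDGETS))
    budget = rng.choices(BUDGETS, weights=p_budget)[0]
    return wf, tool, budget


def shaped_reward(correct, tokens_used, tools_allocated, tools_used,
                  token_scale=1024, cost_weight=0.2, tool_weight=0.1):
    """Assumed reward form: correctness minus token cost minus a penalty
    for each tool that was allocated but never used."""
    accuracy_term = 1.0 if correct else 0.0
    cost_term = cost_weight * (tokens_used / token_scale)
    unused = len(set(tools_allocated) - set(tools_used))
    return accuracy_term - cost_term - tool_weight * unused


def select_elite(trajectories, top_fraction=0.1):
    """Keep the highest-reward (config, reward) pairs as SFT targets."""
    ranked = sorted(trajectories, key=lambda t: t[1], reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


if __name__ == "__main__":
    rng = random.Random(0)
    config = sample_structure([0.0, 0.5, 0.2], [0.1, 0.3, 0.3], [0.0, 0.0, 0.0], rng)
    print("sampled configuration:", config)
    print("reward:", shaped_reward(True, 512, ["search", "calculator"], ["search"]))
```

Masking before the softmax, rather than penalizing invalid choices through the reward, means an invalid combination is never sampled at all, which is one way to read the claim that masking shrinks the effective action space.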

Experimental Results
Research Questions
- RQ1: Can a learned query-adaptive configuration outperform fixed architectures and heuristic optimization baselines across reasoning and tool-use tasks?
- RQ2: Does adaptive resource allocation reduce token usage and runtime while preserving or improving accuracy?
- RQ3: How well do learned configurations transfer across tasks and model capacities, and what factors influence transfer?
- RQ4: Does the combination of hierarchical RL with SFT provide stability and performance gains over non-hierarchical or single-objective methods?
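Key Findings
- ARC achieves up to 25% higher task accuracy than strong baselines across multiple benchmarks, while reducing token and runtime costs.
- The two-level policy (structure and prompt) outperforms a flat policy in sample efficiency and search complexity.
- Action masking reduces invalid configurations, making exploration more efficient.
- SFT refinement raises average reward by 5–35% across datasets and models, and comes with theoretical guarantees on concentrating the policy on elite configurations.
- ARC shows Pareto-optimal accuracy-cost trade-offs on the GSM8k, DROP, HotPotQA, and GAIA benchmarks, outperforming the base models and other optimizers.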
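
To make "Pareto-optimal accuracy-cost trade-off" concrete, the sketch below filters a set of configurations down to those that no other configuration beats on both accuracy and cost. The configuration names and numbers are made-up placeholders for illustration, not results from the paper, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(points):
    """Return names of configs not dominated on (higher accuracy, lower cost).

    Each point is (name, accuracy, cost). A config is dominated if another
    config is at least as good on both axes and strictly better on one.
    """
    front = []
    for name, acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in points
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Placeholder values only: (name, accuracy, average tokens per query).
    configs = [
        ("A", 0.70, 900),
        ("B", 0.82, 600),
        ("C", 0.82, 1200),
        ("D", 0.60, 300),
    ]
    print(pareto_front(configs))
```

In this toy data, B dominates both A and C, while D survives because nothing else is cheaper; a method is Pareto-optimal in this sense when its point stays on the front.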

This summary was generated by AI and reviewed by human editors.