QUICK REVIEW

[Paper Review] Learning to Configure Agentic AI Systems

Aditya Taparia, Som Sagar|arXiv (Cornell University)|Feb 12, 2026
AI-based Problem Solving and Planning | Cited by 0
One-Sentence Summary

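TL;DR: ARC learns a lightweight hierarchical reinforcement learning policy that configures workflows, tools, budgets, and prompts per query for LLM-based agents, outperforming static designs and other baselines while reducing compute.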

ABSTRACT

Configuring LLM-based agent systems involves choosing workflows, tools, token budgets, and prompts from a large combinatorial design space, and is typically handled today by fixed large templates or hand-tuned heuristics. This leads to brittle behavior and unnecessary compute, since the same cumbersome configuration is often applied to both easy and hard input queries. We formulate agent configuration as a query-wise decision problem and introduce ARC (Agentic Resource & Configuration learner), which learns a light-weight hierarchical policy using reinforcement learning to dynamically tailor these configurations. Across multiple benchmarks spanning reasoning and tool-augmented question answering, the learned policy consistently outperforms strong hand-designed and other baselines, achieving up to 25% higher task accuracy while also reducing token and runtime costs. These results demonstrate that learning per-query agent configurations is a powerful alternative to "one size fits all" designs.

Motivation and Objectives

  • Motivate the need for query-adaptive agent configurations to avoid brittle, one-size-fits-all setups.
  • Formulate agent configuration as a query-wise decision problem amenable to reinforcement learning.
  • Develop ARC, a hierarchical RL framework with a structure policy and a prompt policy to optimize workflows, tools, budgets, and prompts without retraining the backbone model.
  • Provide a hybrid training pipeline combining masked RL and supervised fine-tuning to stabilize learning with sparse rewards.
  • Empirically validate ARC across reasoning and tool-use benchmarks, showing improved accuracy and efficiency over baselines.

Proposed Method

  • Represent configuration as a two-level policy: a high-level structure policy selects workflows, tools, and budgets, and a low-level prompt policy composes instructions.
  • Use a short episodic MDP where each episode configures and runs the agent system for a single query, with state derived from a fixed semantic-query embedding and simple query features.
  • Train with PPO using a shaped reward that balances correctness and efficiency, including a tool-shaping term to align tool allocation with actual usage.
  • Apply action masking to prune invalid configurations and reduce the effective action space.
  • Perform post-training supervised fine-tuning (SFT) on elite trajectories to distill high-quality configurations.
  • Provide a theoretical justification showing that SFT concentrates the policy on elite configurations and maintains a reward floor.
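The two-level policy and the action masking in the list above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the option sets (`WORKFLOWS`, `TOOLS`, `BUDGETS`, `PROMPT_PARTS`), the validity rule in `tool_mask`, and the plain logit tables passed to `configure` are all assumptions. In ARC the logits would come from a learned network over the query embedding and query features.

```python
import math
import random

# Hypothetical option sets; the paper's actual choices are not listed in this review.
WORKFLOWS = ["direct", "chain_of_thought", "tool_loop"]
TOOLS = ["none", "calculator", "search"]
BUDGETS = [256, 512, 1024]
PROMPT_PARTS = ["be concise", "think step by step", "cite tool outputs"]


def masked_softmax(logits, mask):
    """Softmax over logits; entries whose mask is False get probability 0."""
    valid = [l for l, m in zip(logits, mask) if m]
    if not valid:
        raise ValueError("mask removes every action")
    top = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def tool_mask(workflow):
    """Assumed validity rule: only the tool_loop workflow may use a real tool."""
    if workflow == "tool_loop":
        return [t != "none" for t in TOOLS]
    return [t == "none" for t in TOOLS]


def sample(options, probs, rng):
    """Draw one option according to the given probabilities."""
    return rng.choices(options, weights=probs, k=1)[0]


def configure(structure_logits, prompt_logits, rng):
    """Sample the structure first, then a prompt conditioned on the workflow."""
    no_mask = [True, True, True]
    workflow = sample(
        WORKFLOWS, masked_softmax(structure_logits["workflow"], no_mask), rng
    )
    tool = sample(
        TOOLS, masked_softmax(structure_logits["tool"], tool_mask(workflow)), rng
    )
    budget = sample(
        BUDGETS, masked_softmax(structure_logits["budget"], no_mask), rng
    )
    # Low-level prompt policy: one logit row per workflow stands in for conditioning.
    row = prompt_logits[workflow]
    instruction = sample(
        PROMPT_PARTS, masked_softmax(row, [True] * len(PROMPT_PARTS)), rng
    )
    return {
        "workflow": workflow,
        "tool": tool,
        "budget": budget,
        "instruction": instruction,
    }
```

Because the tool distribution is masked after the workflow is drawn, invalid workflow and tool pairs are never sampled, so the policy does not spend exploration on them.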
Figure 1: (a) How our method learns to select a configuration from thousands of possibilities for a given input. (b) Improvement achieved by our method across multiple datasets. (Results are for the Qwen 2.5 7B Instruct model.)
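The shaped reward mentioned in the method list balances correctness against efficiency and includes a tool-shaping term. This review does not give the exact formula, so the sketch below assumes a simple linear combination. The function name, the weights `cost_weight` and `tool_weight`, and the definition of the tool penalty as the fraction of allocated tools that went unused are all hypothetical.

```python
def shaped_reward(correct, tokens_used, token_budget, tools_allocated, tools_used,
                  cost_weight=0.2, tool_weight=0.1):
    """Correctness minus a token-cost penalty minus an unused-tool penalty.

    All weights and the linear form are illustrative assumptions.
    """
    correctness = 1.0 if correct else 0.0

    # Token cost as a fraction of the budget, capped at 1.0.
    if token_budget > 0:
        token_cost = min(tokens_used / token_budget, 1.0)
    else:
        token_cost = 1.0

    # Tool shaping: penalize tools that were allocated but never called.
    if tools_allocated:
        unused = len(set(tools_allocated) - set(tools_used))
        tool_penalty = unused / len(set(tools_allocated))
    else:
        tool_penalty = 0.0

    return correctness - cost_weight * token_cost - tool_weight * tool_penalty
```

Under these assumptions, a correct answer that uses a tenth of its budget and calls every allocated tool scores close to 1, while allocating tools that are never called lowers the reward even when the answer is correct.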

Experimental Results

Research Questions

  • RQ1: Can a learned query-adaptive configuration outperform fixed architectures and heuristic optimization baselines across reasoning and tool-use tasks?
  • RQ2: Does adaptive resource allocation reduce token usage and runtime while preserving or improving accuracy?
  • RQ3: How well do learned configurations transfer across tasks and model capacities, and what factors influence transfer?
  • RQ4: Does the combination of hierarchical RL with SFT provide stability and performance gains over non-hierarchical or single-objective methods?

Key Findings

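  • ARC achieves up to 25% higher task accuracy than strong baselines across multiple benchmarks while reducing token and runtime costs.
  • The two-level policy (structure and prompt) outperforms a flat policy in sample efficiency and search complexity.
  • Action masking reduces invalid configurations and makes exploration more efficient.
  • SFT refinement raises average reward by 5–35% across datasets and models, and comes with theoretical guarantees on concentrating the policy on elite configurations.
  • ARC achieves Pareto-optimal accuracy-cost trade-offs on the GSM8k, DROP, HotPotQA, and GAIA benchmarks, outperforming the base model and other optimizers.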
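To make the Pareto-optimality claim concrete: a system is Pareto-optimal on accuracy and cost if no other system is at least as accurate and at least as cheap while being strictly better on one of the two. The sketch below shows that check. The function name and the tuple layout `(name, accuracy, cost)` are assumptions, and the numbers in the usage example are placeholders, not results from the paper.

```python
def pareto_front(points):
    """Return the (name, accuracy, cost) points not dominated by any other point.

    Higher accuracy is better; lower cost is better.
    """
    front = []
    for name, acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for _, a, c in points
        )
        if not dominated:
            front.append((name, acc, cost))
    return front


# Placeholder numbers for illustration only.
systems = [
    ("system_a", 0.80, 100),
    ("system_b", 0.70, 150),  # dominated by system_a: less accurate and costlier
    ("system_c", 0.90, 300),
    ("system_d", 0.60, 50),
]
front_names = [name for name, _, _ in pareto_front(systems)]
```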
Figure 2: Training pipeline. The structure policy selects workflows, tools, and budgets while the prompt policy composes instructions. During RL training, episodes are stored in a memory buffer. After RL converges, high-reward episodes are filtered and used for supervised fine-tuning (SFT), which c
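The buffer-then-filter step described in the Figure 2 caption can be sketched as below. The filtering criterion is not stated in this review, so keeping a top fraction of episodes by reward is an assumption, as are the `Episode` fields and the function names.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One stored RL episode (hypothetical fields)."""
    query: str
    config: dict
    reward: float


def select_elite(buffer, top_fraction=0.25):
    """Keep the highest-reward fraction of episodes, and at least one if any exist."""
    if not 0.0 < top_fraction <= 1.0:
        raise ValueError("top_fraction must be in (0, 1]")
    ranked = sorted(buffer, key=lambda e: e.reward, reverse=True)
    if not ranked:
        return []
    keep = max(1, int(len(ranked) * top_fraction))
    return ranked[:keep]


def to_sft_pairs(elite):
    """Turn elite episodes into (input, target) pairs for supervised fine-tuning."""
    return [(e.query, e.config) for e in elite]
```

The resulting pairs would serve as supervised targets for the policy, so that fine-tuning pulls it toward the configurations that earned high reward during RL.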

This review was generated by AI and reviewed by human editors.