QUICK REVIEW

[Paper Review] SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Yadi Cao, Sicheng Lai|arXiv (Cornell University)|Mar 11, 2026

Scientific Computing and Data Management | Cited by 0

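One-Sentence Summary

SimulCost evaluates cost-aware parameter tuning in physics simulations by comparing LLMs against exhaustive search and Bayesian optimization across 12 simulators, measuring both success rate and computational cost.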

ABSTRACT

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning of cost-sensitive parameters against a traditional scanning approach in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46–64% success rates in single-round mode, dropping to 35–54% under high accuracy requirements, rendering their initial guesses unreliable, especially for high accuracy tasks. Multi-round mode improves rates to 71–80%, but LLMs are 1.5–2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at https://github.com/Rose-STL-Lab/SimulCost-Bench.

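Motivation and Objectives

Highlight the need for cost-aware evaluation in LLM-assisted physics simulation.
Introduce SimulCost, the first benchmark to measure both success rate and tool-cost efficiency.
Provide a diverse, extensible toolkit covering 12 simulators with a reproducible cost-tracking framework.
Compare frontier LLMs against exhaustive scanning and Bayesian optimization.
Present ablations on knowledge transfer, in-context learning, and reasoning effort to guide deployment.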

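Proposed Method

Define cost as a FLOPs-based tool cost for each simulator (EPOCH uses wall-clock time instead).
Evaluate single-round (initial guess) and multi-round (trial-and-error) inference modes.
Curate 2,916 single-round tasks and 1,900 multi-round tasks across 12 solvers covering fluid dynamics, solid mechanics, and plasma physics.
Restrict tuning to a single parameter so that the scanning baseline and cost comparison are meaningful.
Provide an extensible toolkit (simulcost-tools) with a standardized API and Hydra-based configuration for easy reproduction and extension.
Use Bayesian optimization as a baseline for multi-round tuning, and ablate ICL and reasoning effort.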
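
The single-round and multi-round modes and the scanning baseline can be illustrated with a toy model. The sketch below is not the simulcost-tools API: `toy_error`, `toy_cost`, `tune`, and all parameter values are hypothetical, chosen only to show how cost accumulates over every simulation run while success depends on meeting an accuracy requirement.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class TuningResult:
    success: bool          # True if some trial met the accuracy requirement
    chosen: Optional[int]  # parameter value that met it, if any
    total_cost: float      # summed cost of every simulation that was run


def toy_error(n_cells: int) -> float:
    """Hypothetical second-order solver: error shrinks as 1 / n^2."""
    return 1.0 / n_cells**2


def toy_cost(n_cells: int) -> float:
    """Hypothetical analytic cost: work grows linearly with cell count."""
    return float(n_cells)


def tune(
    candidates: Iterable[int],
    tolerance: float,
    error_fn: Callable[[int], float] = toy_error,
    cost_fn: Callable[[int], float] = toy_cost,
) -> TuningResult:
    """Run candidates in order; stop at the first one within tolerance.

    Every run is charged, including the failed ones.
    """
    total = 0.0
    for n in candidates:
        total += cost_fn(n)
        if error_fn(n) <= tolerance:
            return TuningResult(True, n, total)
    return TuningResult(False, None, total)


tolerance = 1e-4  # in this toy model, met once n_cells >= 100

# Scanning baseline: sweep a fixed grid from coarse to fine.
scan = tune([16, 32, 64, 128, 256], tolerance)

# Single-round mode: one guess, no chance to adjust.
good_guess = tune([128], tolerance)
bad_guess = tune([64], tolerance)

# Multi-round mode: a sequence of adjusted guesses (hypothetical values).
multi_round = tune([40, 80, 160], tolerance)

print(scan, good_guess, bad_guess, multi_round, sep="\n")
```

In this toy setting a good single guess is cheaper than the scan, a bad one simply fails, and a multi-round sequence can succeed while still costing more than the scan, which is the kind of trade-off the benchmark is designed to expose.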

Figure 1: Overview of SimulCost. Our benchmark evaluates LLM agents on cost-sensitive parameter tuning across 12 physics simulators spanning fluid dynamics, solid mechanics, and plasma physics. Given a simulation task, tuning mode, and accuracy requirement, the LLM proposes tunable parameters in e…

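Experimental Results

Research Questions

RQ1: How well do LLMs perform at cost-aware parameter tuning across diverse simulators?
RQ2: How do single-round and multi-round tuning trade off accuracy requirements against computational cost?
RQ3: Do knowledge transfer, in-context learning, or reasoning effort significantly improve cost-efficient tuning?
RQ4: How does Bayesian optimization compare with LLM-based methods in this cost-aware setting?
RQ5: Can the toolkit generalize to new solvers and environments while keeping cost tracking reproducible?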

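Key Findings

Frontier LLMs achieve 46–64% success rates in single-round mode, dropping to 35–54% under high accuracy requirements.
Multi-round mode raises success rates to 71–80%, but LLMs are 1.5–2.5x slower than exhaustive scanning.
Common parameters are easier to tune than solver-specific ones, and correlations between parameters are low, suggesting limited transfer potential.
In-context learning improves single-round success rates by 15–25% but degrades multi-round exploration.
BO-GP matches LLMs in overall success rate but shows higher variance across solvers; under low accuracy requirements, LLMs have an advantage in cost efficiency.
Reasoning effort shows no significant overall improvement.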
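
To make figures such as "71–80% success" and "1.5–2.5x slower" concrete, the sketch below aggregates per-task records into a success rate and a mean cost ratio against the scanning baseline. This summary does not state the paper's exact efficiency formula, so treating "slower" as a ratio of agent cost to scan cost over solved tasks is an assumption, and the records are made-up illustrative values, not results from the paper.

```python
from statistics import mean
from typing import Optional

# Illustrative records only; field names and numbers are hypothetical.
records = [
    {"success": True, "agent_cost": 300.0, "scan_cost": 200.0},
    {"success": True, "agent_cost": 500.0, "scan_cost": 200.0},
    {"success": False, "agent_cost": 400.0, "scan_cost": 200.0},
    {"success": True, "agent_cost": 150.0, "scan_cost": 100.0},
]


def success_rate(rows: list) -> float:
    """Fraction of tasks where the accuracy requirement was met."""
    return sum(1 for r in rows if r["success"]) / len(rows)


def mean_cost_ratio(rows: list) -> Optional[float]:
    """Mean of agent_cost / scan_cost over solved tasks.

    A value above 1 means the agent spent more than the scan did.
    Returns None when no task was solved.
    """
    solved = [r for r in rows if r["success"]]
    if not solved:
        return None
    return mean(r["agent_cost"] / r["scan_cost"] for r in solved)


rate = success_rate(records)
ratio = mean_cost_ratio(records)
print(f"success rate: {rate:.0%}, mean cost ratio: {ratio:.2f}x")
```

Restricting the ratio to solved tasks is one possible design choice; averaging over all tasks would also penalize compute spent on failed attempts.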

This review was generated by AI and reviewed by human editors.