QUICK REVIEW

[Paper Review] SimulCost: A Cost-Aware Benchmark and Toolkit for Automating Physics Simulations with LLMs

Yadi Cao, Sicheng Lai | arXiv (Cornell University) | 2026. 03. 11.

Scientific Computing and Data Management | Citations: 0

One-Line Summary

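SimulCost benchmarks LLMs on cost-aware parameter tuning for physics simulations, comparing them against brute-force scanning and Bayesian optimization across 12 simulators and measuring both success rate and computational cost.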

ABSTRACT

Evaluating LLM agents for scientific tasks has focused on token costs while ignoring tool-use costs like simulation time and experimental resources. As a result, metrics like pass@k become impractical under realistic budget constraints. To address this gap, we introduce SimulCost, the first benchmark targeting cost-sensitive parameter tuning in physics simulations. SimulCost compares LLM tuning cost-sensitive parameters against traditional scanning approach in both accuracy and computational cost, spanning 2,916 single-round (initial guess) and 1,900 multi-round (adjustment by trial-and-error) tasks across 12 simulators from fluid dynamics, solid mechanics, and plasma physics. Each simulator's cost is analytically defined and platform-independent. Frontier LLMs achieve 46–64% success rates in single-round mode, dropping to 35–54% under high accuracy requirements, rendering their initial guesses unreliable especially for high accuracy tasks. Multi-round mode improves rates to 71–80%, but LLMs are 1.5–2.5x slower than traditional scanning, making them uneconomical choices. We also investigate parameter group correlations for knowledge transfer potential, and the impact of in-context examples and reasoning effort, providing practical implications for deployment and fine-tuning. We open-source SimulCost as a static benchmark and extensible toolkit to facilitate research on improving cost-aware agentic designs for physics simulations, and for expanding new simulation environments. Code and data are available at https://github.com/Rose-STL-Lab/SimulCost-Bench.

Motivation and Objectives

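Highlight the need for cost-aware evaluation in LLM-assisted physics simulation.
Introduce SimulCost as the first benchmark that measures task success and tool-cost efficiency together.
Provide a diverse, extensible toolkit with 12 simulators and a reproducible cost-tracking framework.
Compare frontier LLMs against brute-force scanning and Bayesian optimization.
Provide ablation analyses of knowledge transfer, in-context learning, and reasoning effort to guide deployment.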

Proposed Method

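Define cost as a FLOPs-based tool cost for each simulator, with the exception of Epoch, which uses wall-clock time.
Evaluate two inference modes: single-round (initial guess) and multi-round (trial and error).
Curate 2,916 single-round and 1,900 multi-round tasks across 12 solvers spanning fluid dynamics, solid mechanics, and plasma physics.
Isolate tuning to individual parameters, which enables meaningful scan baselines and cost comparisons.
Provide an extensible toolbox (simulcost-tools) with a standardized API and Hydra-based configuration to support reproduction and extension.
Include Bayesian optimization as a baseline for multi-round tuning, and run ablation analyses on in-context learning (ICL) and reasoning effort.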
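
The cost accounting and scan baseline listed above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the SimulCost implementation or the simulcost-tools API: `toy_solver`, its error model (error = 1/n^2), its analytic cost (cost = n, standing in for a FLOPs count), the doubling scan, and the hard-coded proposal list that stands in for an LLM agent are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# A solver maps a parameter value to (error, analytic_cost).
Solver = Callable[[int], Tuple[float, float]]


def toy_solver(n_cells: int) -> Tuple[float, float]:
    """Hypothetical solver, assumed for illustration only.

    Error decays as 1/n^2 and cost grows linearly with n, so the cost
    is defined analytically and does not depend on the hardware.
    """
    error = 1.0 / (n_cells ** 2)
    cost = float(n_cells)
    return error, cost


@dataclass
class CostTracker:
    """Accumulates the cost of every solver call, including failed tries."""
    total_cost: float = 0.0
    history: List[Tuple[int, float, float]] = field(default_factory=list)

    def run(self, solver: Solver, n_cells: int) -> float:
        error, cost = solver(n_cells)
        self.total_cost += cost
        self.history.append((n_cells, error, cost))
        return error


def scan_baseline(solver: Solver, tolerance: float,
                  start: int = 8, max_cells: int = 4096
                  ) -> Tuple[Optional[int], CostTracker]:
    """Brute-force scan: double the resolution until the tolerance is met."""
    tracker = CostTracker()
    n = start
    while n <= max_cells:
        if tracker.run(solver, n) <= tolerance:
            return n, tracker
        n *= 2
    return None, tracker


def evaluate_proposals(solver: Solver, proposals: List[int], tolerance: float
                       ) -> Tuple[Optional[int], CostTracker]:
    """Multi-round mode: try proposals in order and stop at the first success."""
    tracker = CostTracker()
    for n in proposals:
        if tracker.run(solver, n) <= tolerance:
            return n, tracker
    return None, tracker


if __name__ == "__main__":
    tol = 1e-4
    n_scan, scan = scan_baseline(toy_solver, tol)
    # The proposal list stands in for an agent's successive guesses.
    n_agent, agent = evaluate_proposals(toy_solver, [50, 400, 120], tol)
    print("scan :", n_scan, scan.total_cost)
    print("agent:", n_agent, agent.total_cost)
```

In this toy setting, an overshooting guess succeeds but pays for a more expensive run than the scan needed, which is the kind of trade-off a cost-aware metric is meant to expose and that pass@k alone would hide.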

Figure 1: Overview of SimulCost. Our benchmark evaluates LLM agents on cost-sensitive parameter tuning across 12 physics simulators spanning fluid dynamics, solid mechanics, and plasma physics. Given a simulation task, tuning mode, and accuracy requirement, the LLM proposes tunable parameters in e

Experimental Results

Research Questions

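RQ1: How well do LLMs perform cost-aware parameter tuning in physics simulations across diverse simulators?
RQ2: What is the trade-off between accuracy requirements and computational cost in single-round and multi-round tuning?
RQ3: Do knowledge transfer, in-context learning, or reasoning effort substantially improve cost-efficient tuning?
RQ4: How does Bayesian optimization compare with LLM-based approaches in this cost-aware setting?
RQ5: Can the benchmark generalize to new solvers and environments while preserving reproducible cost tracking?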

Key Results

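Frontier LLMs achieve 46–64% success in single-round mode, dropping to 35–54% under high accuracy requirements.
Multi-round mode raises success rates to 71–80%, but LLMs are 1.5–2.5x slower than brute-force scanning.
Common parameters are easier to tune than solver-specific parameters, and there is little correlation across parameters, which suggests limited transferability.
In-context learning improves single-round success by 15–25% but degrades multi-round exploration.
BO-GP matches the LLMs in aggregate success despite higher variance across solvers, while LLMs show a cost-efficiency advantage at low accuracy requirements.
Increasing reasoning effort yielded no meaningful improvement overall.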
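
To make the two reported quantities concrete, the sketch below computes a success rate and a cost ratio relative to a scan baseline from a few task records. The `TaskResult` fields, the choice to average the ratio over successful tasks only, and all numbers are assumptions made for illustration; they are not data from the paper, and the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """One tuning task (hypothetical record layout)."""
    success: bool      # did the final parameters meet the accuracy requirement?
    agent_cost: float  # total tool cost spent by the agent, all rounds included
    scan_cost: float   # total tool cost spent by the scan baseline


def success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks that met the accuracy requirement."""
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.success for r in results) / len(results)


def mean_cost_ratio(results: List[TaskResult]) -> float:
    """Mean agent_cost / scan_cost over successful tasks only (an assumption)."""
    ratios = [r.agent_cost / r.scan_cost for r in results if r.success]
    if not ratios:
        raise ValueError("no successful tasks to compare")
    return sum(ratios) / len(ratios)


# Made-up records, for illustration only.
results = [
    TaskResult(success=True, agent_cost=300.0, scan_cost=200.0),
    TaskResult(success=True, agent_cost=500.0, scan_cost=200.0),
    TaskResult(success=False, agent_cost=900.0, scan_cost=300.0),
    TaskResult(success=True, agent_cost=150.0, scan_cost=100.0),
]

if __name__ == "__main__":
    print(f"success rate: {success_rate(results):.2f}")
    print(f"mean cost ratio vs. scan: {mean_cost_ratio(results):.2f}")
```

A ratio above 1 means the agent spent more tool cost than the scan baseline on the tasks it solved, so a high success rate alone does not imply an economical tuner.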

This review was generated by AI and reviewed by a human editor.