QUICK REVIEW

[Paper Review] Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

Darioush Kevian, Usman Syed|arXiv (Cornell University)|Apr 4, 2024

Reservoir Engineering and Simulation Methods | Cited by 18

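One-Sentence Summary

This paper benchmarks GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra on ControlBench, a dataset of undergraduate control problems. Claude 3 Opus generally outperforms the other models, and all three face significant challenges in interpreting visual data.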

ABSTRACT

In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.

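Motivation and Objectives

Introduce ControlBench, a dataset of natural-language control problems that reflects the breadth and complexity of undergraduate control design.
Evaluate leading large language models (GPT-4, Claude 3 Opus, Gemini 1.0 Ultra) on ControlBench through human expert assessment.
Analyze accuracy, reasoning quality, and explanations, along with model-specific strengths and limitations.
Explore how self-correction and visual data (plots) affect model performance.
Provide a simplified variant, ControlBench-C, for fast evaluation by non-experts.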

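Proposed Method

Build ControlBench, a set of 147 undergraduate control problems covering stability, time response, Bode/Nyquist plots, loop shaping, and advanced topics.
Write the problems in LaTeX and provide detailed step-by-step solutions for reproducibility.
Evaluate the three LLMs in zero-shot and self-correction settings, with human experts scoring accuracy (ACC) and accuracy after self-correction (ACC-s).
Analyze error patterns and misreadings of visual data to identify bottlenecks and directions for improvement.
Present a reduced version, ControlBench-C, that allows fast automated evaluation.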
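The two scores can be illustrated with a minimal sketch. It assumes that ACC counts problems judged correct on the first (zero-shot) attempt and that ACC-s counts problems judged correct either on the first attempt or after the self-correction prompt. The `GradedAnswer` record layout and the sample records are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class GradedAnswer:
    """One expert-graded problem (hypothetical record layout)."""
    topic: str
    correct_first_try: bool         # expert judged the zero-shot answer correct
    correct_after_self_check: bool  # expert judged the self-corrected answer correct


def acc(records):
    """Zero-shot accuracy: fraction of problems solved on the first attempt."""
    return sum(r.correct_first_try for r in records) / len(records)


def acc_s(records):
    """Accuracy after self-correction: solved on the first attempt or after the retry."""
    solved = sum(r.correct_first_try or r.correct_after_self_check for r in records)
    return solved / len(records)


# Made-up grading records, for illustration only.
records = [
    GradedAnswer("Stability", True, True),
    GradedAnswer("Stability", False, True),
    GradedAnswer("Bode Analysis", False, False),
    GradedAnswer("Time response", True, True),
]

print(f"ACC   = {acc(records):.1%}")
print(f"ACC-s = {acc_s(records):.1%}")
```

Under this definition ACC-s can never fall below ACC, which is consistent with every row of the results table.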

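Experimental Results

Research Questions

RQ1: How do GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra perform on the undergraduate control problems in ControlBench?
RQ2: Across control topics, which model is strongest in accuracy and self-correction?
RQ3: What are the main failure modes when LLMs solve control problems, and how does visual data interpretation affect performance?
RQ4: Can a simplified multiple-choice version (ControlBench-C) reliably benchmark LLMs without requiring a control background?
RQ5: What do the results imply for bringing LLMs into control engineering education and workflows?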

Key Findings

Topic	GPT-4 ACC	GPT-4 ACC-s	Claude 3 Opus ACC	Claude 3 Opus ACC-s	Gemini 1.0 Ultra ACC	Gemini 1.0 Ultra ACC-s
Background	60.7% (17/28)	64.3% (18/28)	75.0% (21/28)	89.3% (25/28)	53.6% (15/28)	57.1% (16/28)
Stability	57.9% (11/19)	57.9% (11/19)	78.9% (15/19)	89.5% (17/19)	31.6% (6/19)	31.6% (6/19)
Time response	57.1% (12/21)	66.7% (14/21)	76.2% (16/21)	76.2% (16/21)	52.4% (11/21)	57.1% (12/21)
Block diagrams	40.0% (2/5)	40.0% (2/5)	40.0% (2/5)	60.0% (3/5)	0.0% (0/5)	0.0% (0/5)
Control System Design	29.2% (7/24)	29.2% (7/24)	33.3% (8/24)	62.5% (15/24)	25.0% (6/24)	37.5% (9/24)
Bode Analysis	6.7% (1/15)	6.7% (1/15)	13.3% (2/15)	13.3% (2/15)	6.7% (1/15)	6.7% (1/15)
Root-Locus Design	28.6% (2/7)	28.6% (2/7)	42.9% (3/7)	42.9% (3/7)	28.6% (2/7)	28.6% (2/7)
Nyquist Design	0.0% (0/5)	0.0% (0/5)	40.0% (2/5)	40.0% (2/5)	0.0% (0/5)	0.0% (0/5)
Gain/Phase Margins	66.7% (6/9)	66.7% (6/9)	66.7% (6/9)	66.7% (6/9)	33.3% (3/9)	33.3% (3/9)
System Sensitivity Measures	100.0% (3/3)	100.0% (3/3)	100.0% (3/3)	100.0% (3/3)	66.7% (2/3)	100.0% (3/3)
Loop-shaping	25.0% (1/4)	25.0% (1/4)	50.0% (2/4)	75.0% (3/4)	25.0% (1/4)	25.0% (1/4)
Advanced Topics	71.4% (5/7)	71.4% (5/7)	85.7% (6/7)	85.7% (6/7)	42.9% (3/7)	57.1% (4/7)
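The table reports only per-topic figures. As a sketch, the overall ACC and ACC-s for each model can be recovered by summing the per-topic counts. The counts below are copied from the table; the variable and function names are illustrative.

```python
# Per-topic problem counts, in the table's row order (Background ... Advanced Topics).
TOTALS = [28, 19, 21, 5, 24, 15, 7, 5, 9, 3, 4, 7]

# Problems judged correct per topic: zero-shot (ACC) and after self-correction (ACC-s).
CORRECT = {
    "GPT-4": {
        "ACC":   [17, 11, 12, 2, 7, 1, 2, 0, 6, 3, 1, 5],
        "ACC-s": [18, 11, 14, 2, 7, 1, 2, 0, 6, 3, 1, 5],
    },
    "Claude 3 Opus": {
        "ACC":   [21, 15, 16, 2, 8, 2, 3, 2, 6, 3, 2, 6],
        "ACC-s": [25, 17, 16, 3, 15, 2, 3, 2, 6, 3, 3, 6],
    },
    "Gemini 1.0 Ultra": {
        "ACC":   [15, 6, 11, 0, 6, 1, 2, 0, 3, 2, 1, 3],
        "ACC-s": [16, 6, 12, 0, 9, 1, 2, 0, 3, 3, 1, 4],
    },
}


def overall(model, metric):
    """Return (correct, total) summed over all topics for one model and metric."""
    return sum(CORRECT[model][metric]), sum(TOTALS)


for model in CORRECT:
    acc_n, total = overall(model, "ACC")
    accs_n, _ = overall(model, "ACC-s")
    print(f"{model}: ACC {acc_n}/{total} = {acc_n / total:.1%}, "
          f"ACC-s {accs_n}/{total} = {accs_n / total:.1%}, "
          f"gain from self-correction: {accs_n - acc_n} problems")
```

Summed this way, the counts give 67/147 (45.6%) and 70/147 (47.6%) for GPT-4, 86/147 (58.5%) and 101/147 (68.7%) for Claude 3 Opus, and 50/147 (34.0%) and 57/147 (38.8%) for Gemini 1.0 Ultra. The per-topic totals add up to the 147 problems in the benchmark.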

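Claude 3 Opus achieves the highest overall ACC and ACC-s and matches or beats the other two models in every topic, showing the strongest accuracy and self-correction ability.
GPT-4 and Claude 3 Opus both perform well on background mathematics, stability, and time-response problems, and Claude 3 Opus usually leads on tasks with visual components.
Gemini 1.0 Ultra trails the other two models overall and is less consistent across topics.
All models struggle to read graphical data such as Bode, Nyquist, and root-locus plots, which highlights a limitation in vision-language understanding.
Self-correction prompts raise accuracy for every model, though the gain varies: it is largest for Claude 3 Opus (for example, from 8/24 to 15/24 on Control System Design) and small for GPT-4.
ControlBench-C offers a faster but narrower assessment of LLM capability and may not capture the full range of reasoning ability.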
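To illustrate why a multiple-choice variant is fast to score, here is a minimal sketch of an automatic grader. The summary does not describe the actual ControlBench-C format or evaluation script, so everything here is an assumption: four options labeled A to D, responses that end with a line such as "Answer: B", and the hypothetical problem ids `q1` to `q3`.

```python
import re

# Assumed response convention: the model states its choice as "Answer: X" with X in A-D.
_CHOICE = re.compile(r"answer\s*:\s*\(?([A-D])\b", re.IGNORECASE)


def extract_choice(response):
    """Return the last stated option letter in upper case, or None if none is found."""
    matches = _CHOICE.findall(response)
    return matches[-1].upper() if matches else None


def grade(responses, answer_key):
    """Fraction of problems whose extracted choice equals the key.

    Missing or unparsable responses count as wrong.
    """
    correct = sum(
        extract_choice(responses.get(pid, "")) == key
        for pid, key in answer_key.items()
    )
    return correct / len(answer_key)


# Made-up answer key and responses, for illustration only.
answer_key = {"q1": "B", "q2": "D", "q3": "A"}
responses = {
    "q1": "All poles lie in the left half-plane. Answer: B",
    "q2": "Answer: (c)",
    "q3": "I am not sure.",
}

print(f"accuracy = {grade(responses, answer_key):.1%}")
```

A grader like this checks only the final choice, not the reasoning behind it, which is one reason a multiple-choice score is narrower than expert grading of full solutions.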


This review was generated by AI and reviewed by human editors.