QUICK REVIEW

[Paper Review] Finding Inductive Loop Invariants using Large Language Models

Adharsh Kamath, Aditya Senthilnathan|arXiv (Cornell University)|Nov 14, 2023

Software Engineering Research | Cited by 15

One-Sentence Summary

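This paper studies how large language models (LLMs) can generate inductive loop invariants for C programs, checking LLM output with symbolic verification tools to verify Hoare triples, and shows that the combination solves some benchmarks that a purely symbolic baseline cannot.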
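
For reference, the standard textbook conditions for an invariant $I$ to be inductive and sufficient to verify a Hoare triple over a single loop can be written as below. The symbols $\varphi$ (precondition), $\psi$ (postcondition), $b$ (loop guard), and $S$ (loop body) are notation assumed here for illustration and are not taken from the paper.

$$
\begin{aligned}
& \varphi \Rightarrow I && \text{(establishment: } I \text{ holds on loop entry)} \\
& \{\, I \land b \,\}\; S \;\{\, I \,\} && \text{(preservation: one iteration keeps } I \text{)} \\
& I \land \lnot b \Rightarrow \psi && \text{(sufficiency: } I \text{ implies the postcondition on exit)}
\end{aligned}
$$

Each condition mentions only the invariant and one piece of the program, which is why an inductive invariant can be checked locally.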

ABSTRACT

Loop invariants are fundamental to reasoning about programs with loops. They establish properties about a given loop's behavior. When they additionally are inductive, they become useful for the task of formal verification that seeks to establish strong mathematical guarantees about program's runtime behavior. The inductiveness ensures that the invariants can be checked locally without consulting the entire program, thus are indispensable artifacts in a formal proof of correctness. Finding inductive loop invariants is an undecidable problem, and despite a long history of research towards practical solutions, it remains far from a solved problem. This paper investigates the capabilities of the Large Language Models (LLMs) in offering a new solution towards this old, yet important problem. To that end, we first curate a dataset of verification problems on programs with loops. Next, we design a prompt for exploiting LLMs, obtaining inductive loop invariants, that are checked for correctness using sound symbolic tools. Finally, we explore the effectiveness of using an efficient combination of a symbolic tool and an LLM on our dataset and compare it against a purely symbolic baseline. Our results demonstrate that LLMs can help improve the state-of-the-art in automated program verification.

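Research Motivation and Goals

Curate a dataset of verification problems on C programs with loops, for studying how well LLMs generate invariants.
Develop a toolchain that prompts LLMs to generate loop invariants and checks them with a symbolic verifier.
Evaluate different LLMs, and a hybrid LLM-symbolic approach, against a symbolic baseline on this dataset.
Propose and evaluate a pipeline with repair capability that tolerates LLM inaccuracies while still guaranteeing inductiveness.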

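Proposed Method

Build a two-component toolchain consisting of an LLM (L) and an oracle, that is, a symbolic verifier (O).
Given a prompt M and a target program P, use L to generate candidate invariants I.
Annotate P with I to obtain A(P, I), and use O to check that the annotated invariants are inductive.
Apply Loopy: accumulate the candidates I across multiple completions, then either use Houdini to extract an inductive subset or use Repair to improve I.
Houdini prunes candidates that are non-inductive or syntactically invalid, searching for an inductive subset.
Repair iteratively prompts L with the oracle's error messages to fix the invariants, and rechecks each result with O.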
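
The Houdini pruning step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the toy loop, the candidate strings, and the helper names (`parse`, `holds`, `houdini`, `proves_post`) are hypothetical, and a brute-force check over a small bounded state space stands in for the sound symbolic oracle O.

```python
from itertools import product

# Toy program (an assumption for illustration, not a benchmark from the paper):
#   assume n >= 0; x = 0; y = 0;
#   while (x < n) { x = x + 1; y = y + 2; }
#   assert y == 2 * n;
def pre(s): return s["n"] >= 0 and s["x"] == 0 and s["y"] == 0
def guard(s): return s["x"] < s["n"]
def body(s): return {"n": s["n"], "x": s["x"] + 1, "y": s["y"] + 2}
def post(s): return s["y"] == 2 * s["n"]

# Bounded state space: a stand-in for the symbolic oracle O (assumption).
STATES = [{"n": n, "x": x, "y": y}
          for n, x, y in product(range(0, 6), range(-2, 9), range(-4, 18))]

def parse(candidates):
    """Compile candidate expressions, dropping syntactically invalid ones."""
    parsed = {}
    for text in candidates:
        try:
            parsed[text] = compile(text, "<candidate>", "eval")
        except SyntaxError:
            continue
    return parsed

def holds(code, state):
    """Evaluate a compiled candidate on one program state."""
    return bool(eval(code, {"__builtins__": {}}, state))

def houdini(candidates):
    """Return the largest subset of candidates that is inductive on STATES."""
    kept = parse(candidates)
    # Establishment: keep only candidates that hold on every initial state.
    kept = {t: c for t, c in kept.items()
            if all(holds(c, s) for s in STATES if pre(s))}
    changed = True
    while changed:
        changed = False
        # States that satisfy the loop guard and every remaining candidate.
        before = [s for s in STATES
                  if guard(s) and all(holds(c, s) for c in kept.values())]
        # Preservation: drop candidates broken by one loop iteration.
        for text, code in list(kept.items()):
            if not all(holds(code, body(s)) for s in before):
                del kept[text]
                changed = True
    return kept

def proves_post(kept):
    """Check that the kept invariants imply the postcondition on loop exit."""
    exits = [s for s in STATES
             if not guard(s) and all(holds(c, s) for c in kept.values())]
    return all(post(s) for s in exits)

if __name__ == "__main__":
    pool = ["y == 2*x", "x <= n", "x < n", "y >= 0", "x == 0", "y <= 4", "y == 2 *"]
    kept = houdini(pool)
    print(list(kept), proves_post(kept))
```

On this toy pool, the malformed candidate `y == 2 *` is dropped at parse time, `x < n` fails establishment, and `x == 0` and `y <= 4` fail preservation. The surviving set `y == 2*x`, `x <= n`, `y >= 0` is inductive and implies the postcondition. No single completion needs to be fully correct, since the inductive subset is assembled from the union of candidates.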

Figure 1. Success rate of Loopy with GPT-4 as the number of completions is varied. Dashed lines depict the performance of Loopy without Houdini and without Repair, for different prompts. Solid lines depict the performance of Loopy with Houdini, but without Repair.

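Experimental Results

Research Questions

RQ1: To what extent can LLMs generate a correct set of loop invariants for a C program?
RQ2: To what extent can LLMs generate elements of the correct set of loop invariants needed to verify a C program?
RQ3: How do different foundation models differ in their ability to find inductive invariants?
RQ4: Can LLMs use the oracle's error messages to repair incorrect invariants, and with what success rate?
RQ5: What program features cause LLMs to fail to generate correct invariants?
RQ6: How does Loopy's performance compare with a state-of-the-art symbolic verifier?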

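Key Findings

Loopy's success rate rises as the number of completions increases, with diminishing returns after about 8 completions.
Adding further guidance to the prompt (M2) raises the success rate by about 23% compared with the simple prompt (M1).
Houdini improves performance significantly by exploiting partial invariants generated by the LLM.
GPT-4 generally outperforms the other LLMs, but Houdini helps the other models catch up; using multiple LLMs may be beneficial.
Repair further extends verification coverage: in the reported setting it solves 398 benchmarks, compared with 383 when using Houdini alone.
Compared with the symbolic baseline Ultimate, Loopy solves some benchmarks that Ultimate cannot, although Ultimate solves more overall (430 versus 398 of the 469 benchmarks).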
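
The counts above can be converted into success rates with a few lines of Python. This is a small sketch that assumes the Houdini-only count of 383 refers to the same 469 benchmarks; the dictionary labels are chosen here for illustration.

```python
TOTAL = 469  # number of benchmarks reported in the comparison

# Solved counts as reported above; labels are illustrative.
solved = {
    "Loopy (Houdini only)": 383,
    "Loopy (Houdini + Repair)": 398,
    "Ultimate": 430,
}

# Success rate in percent, rounded to one decimal place.
rates = {name: round(100 * count / TOTAL, 1) for name, count in solved.items()}

if __name__ == "__main__":
    for name, rate in rates.items():
        print(f"{name}: {rate}%")
```

Under that assumption, Repair adds 15 solved benchmarks over Houdini alone, and Ultimate leads Loopy by 32 benchmarks overall.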

Figure 2. Performance of Loopy instantiated with different LLMs, with 15 completions and prompt $\mathcal{M}_{2}$.

This review was generated by AI and reviewed by human editors.