QUICK REVIEW

[Paper Digest] Interactive Code Generation via Test-Driven User-Intent Formalization

Shuvendu K. Lahiri, Sarah Fakhoury | arXiv (Cornell University) | Aug 11, 2022

Software Engineering Research | Cited by 24

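One-Sentence Summary

The paper proposes interactive test-driven code generation (ITDCG), implemented as TiCoder, which uses lightweight user feedback on generated tests to formalize intent and to prune and rank code suggestions, improving pass@k on MBPP and HumanEval.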

ABSTRACT

Large language models (LLMs) have shown great potential in automating significant aspects of coding by producing natural code from informal natural language (NL) intent. However, when interacting with LLMs, users have no guarantees that the code suggestions produced correctly satisfy the intent they provided. In fact, it is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics. In this paper, we propose the workflow of interactive test-driven code generation, which leverages lightweight user feedback to (a) formalize the user intent using generated tests that can be useful for debugging, and (b) produce an improved set of code suggestions by pruning and ranking candidate code suggestions. We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder. We perform an automated evaluation of TiCoder on the MBPP and HumanEval code generation benchmarks. Our results are promising with using the OpenAI Codex LLM: our best algorithm improves the pass@1 code generation accuracy (in absolute percentages) between 22.49% to 37.71% for MBPP and between 24.79% to 53.98% for HumanEval using between 1 to 5 simulated user queries.

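Research Motivation and Goals

Drive code generation by formalizing user intent, via generated tests, into a lightweight, executable specification
Prune and rank LLM-generated code suggestions using runtime feedback from the tests the user has reviewed
Provide a language-agnostic, abstract ITDCG algorithm and a concrete implementation (TiCoder)
Evaluate ITDCG on the Python benchmarks MBPP and HumanEval, with Codex as the underlying LLM
Show how interaction, prompting, mutation, and ranking each contribute relative to a plain LLM baseline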

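Proposed Method

Describe an abstract InteractiveTestDrivenCodeGen algorithm, parameterized by components for pruning, mutating, and ranking tests
Implement TiCoder using CodePrompt and TestPrompt to generate code and tests from the problem description
Expand the test set with syntactic and dynamic mutation (SyntacticMutateTests and DynMutateTests)
Rank tests (RankTests) to select the query that maximizes pruning, then prune or keep code candidates according to the user's response
Rank code suggestions by the number of tests they pass (RankCodes)
Evaluate with the pass@k@m and accept@m metrics, where m is the number of user queries
Simulate user responses for offline evaluation with an oracle (the reference implementation)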
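
The workflow above can be sketched as a short loop. This is a minimal illustration and not TiCoder's actual implementation: the function names (`rank_tests`, `rank_codes`, `interactive_loop`), the toy test format `(input, expected)`, and the "most even split" scoring rule are all assumptions chosen for clarity. Candidates are plain Python functions standing in for LLM suggestions, and an oracle stands in for the user.

```python
from typing import Callable, List, Tuple

Code = Callable[[int], int]   # a candidate implementation (stand-in for an LLM suggestion)
Test = Tuple[int, int]        # (input, expected output); hypothetical toy test format


def passes(code: Code, test: Test) -> bool:
    """Run one test against one candidate; any exception counts as a failure."""
    arg, expected = test
    try:
        return code(arg) == expected
    except Exception:
        return False


def rank_tests(codes: List[Code], tests: List[Test]) -> List[Test]:
    """Assumed discriminative ranking: prefer tests that split the candidates most evenly."""
    def score(test: Test) -> int:
        n_pass = sum(passes(c, test) for c in codes)
        return min(n_pass, len(codes) - n_pass)
    return sorted(tests, key=score, reverse=True)


def rank_codes(codes: List[Code], accepted: List[Test]) -> List[Code]:
    """Order candidates by how many user-accepted tests they pass."""
    return sorted(codes, key=lambda c: sum(passes(c, t) for t in accepted), reverse=True)


def interactive_loop(codes, tests, user_accepts, max_queries):
    """Ask up to max_queries questions, pruning candidates after each answer."""
    codes, tests, accepted = list(codes), list(tests), []
    for _ in range(max_queries):
        if not tests:
            break
        test = rank_tests(codes, tests)[0]
        tests.remove(test)
        if user_accepts(test):
            # The test matches the intent: keep only candidates that pass it.
            accepted.append(test)
            codes = [c for c in codes if passes(c, test)]
        else:
            # The test contradicts the intent: drop candidates that agree with it.
            codes = [c for c in codes if not passes(c, test)]
    return rank_codes(codes, accepted), accepted


# Toy example: the intent is "absolute value"; the oracle plays the user.
def identity(x): return x
def negate(x): return -x
def absolute(x): return abs(x)
def square(x): return x * x

candidates = [identity, negate, absolute, square]
generated_tests = [(3, 3), (-2, 2), (-2, -2), (0, 0)]
oracle = lambda test: abs(test[0]) == test[1]

ranked, accepted = interactive_loop(candidates, generated_tests, oracle, max_queries=2)
print([c.__name__ for c in ranked], accepted)
```

In this toy run, two queries are enough to isolate `absolute`, and the two accepted tests remain as an executable record of the intent.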

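Experimental Results

Research Questions

RQ1: How does the interactive workflow improve the accuracy of code suggestions (pass@k@m) compared with the baselines?
RQ2: How does the correctness of the generated code and tests improve as the number of simulated user queries increases?
RQ3: How do the design choices (prompting, mutation, ranking) affect the overall effectiveness of ITDCG?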
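
To make the two metrics concrete, here is a small sketch under an assumed reading: pass@k@m is taken as the fraction of problems whose top-k ranked suggestions, after m user queries, contain at least one correct program, and accept@m as the fraction of problems for which the user accepted at least one test within m queries. The function names and input formats are hypothetical, and the paper's formal definitions may differ in detail.

```python
from typing import List, Optional


def pass_at_k_at_m(ranked_correct: List[List[bool]], k: int) -> float:
    """ranked_correct[i][j] is True if the j-th ranked suggestion for problem i,
    taken after m user queries, is correct. Returns the share of problems
    with at least one correct suggestion in the top k."""
    if not ranked_correct:
        return 0.0
    hits = sum(any(flags[:k]) for flags in ranked_correct)
    return hits / len(ranked_correct)


def accept_at_m(first_accept_query: List[Optional[int]], m: int) -> float:
    """first_accept_query[i] is the 1-based query index at which the user first
    accepted a test for problem i, or None if no test was ever accepted."""
    if not first_accept_query:
        return 0.0
    hits = sum(q is not None and q <= m for q in first_accept_query)
    return hits / len(first_accept_query)


# Four toy problems, rankings observed after some fixed number of queries m.
rankings = [[True, False], [False, True], [False, False], [True, True]]
first_accepts = [1, 3, None, 2]

print(pass_at_k_at_m(rankings, k=1))   # 0.5
print(pass_at_k_at_m(rankings, k=2))   # 0.75
print(accept_at_m(first_accepts, m=2)) # 0.5
```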

Key Findings

With a single user query, TiCoder reaches a pass@1@1 of 70.73% on MBPP and 55.28% on HumanEval, exceeding Codex and several other baselines.
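With up to five user queries, the absolute pass@1 improvement over Codex rises to 37.71% on MBPP and 53.98% on HumanEval (per the abstract), and TiCoder can approach the performance of IdealRanking.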
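On both MBPP and HumanEval, TiCoder consistently outperforms Codex and the non-interactive baselines for every k in {2, 5, 10}.
In the oracle-based evaluation, TiCoder approaches IdealRanking as the number of user queries grows, indicating that test-driven disambiguation yields substantial gains.
The default TiCoder configuration (TestGenPrompt = pass, StaticMutateTests = single-assert, DynMutateTests = assert-rewrite-all, TestRanking = discriminative, CodeRanking = passing-tests) was chosen for its strong pass@1@1 performance.
On average, 1.7 user queries suffice to produce a unit test consistent with the user's intent, for 87.12% of MBPP instances and 95.73% of HumanEval instances.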

This digest was generated by AI and reviewed by a human editor.