QUICK REVIEW

[Paper Review] Extending the Frontier of ChatGPT: Code Generation and Debugging

Fardin Ahsan Sakib, Saadat Hasan Khan|arXiv (Cornell University)|Jul 17, 2023
Topic Modeling | Cited by 9
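One-Sentence Summary

The paper evaluates ChatGPT's ability to generate and debug code for LeetCode problems: the overall success rate is 71.875%, feedback yields only limited improvement, and the model performs better on structured problems.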

ABSTRACT

Large-scale language models (LLMs) have emerged as a groundbreaking innovation in the realm of question-answering and conversational agents. These models, leveraging different deep learning architectures such as Transformers, are trained on vast corpora to predict sentences based on given queries. Among these LLMs, ChatGPT, developed by OpenAI, has ushered in a new era by utilizing artificial intelligence (AI) to tackle diverse problem domains, ranging from composing essays and biographies to solving intricate mathematical integrals. The versatile applications enabled by ChatGPT offer immense value to users. However, assessing the performance of ChatGPT's output poses a challenge, particularly in scenarios where queries lack clear objective criteria for correctness. For instance, evaluating the quality of generated essays becomes arduous and relies heavily on manual labor, in stark contrast to evaluating solutions to well-defined, closed-ended questions such as mathematical problems. This research paper delves into the efficacy of ChatGPT in solving programming problems, examining both the correctness and the efficiency of its solution in terms of time and memory complexity. The research reveals a commendable overall success rate of 71.875%, denoting the proportion of problems for which ChatGPT was able to provide correct solutions that successfully satisfied all the test cases present in Leetcode. It exhibits strengths in structured problems and shows a linear correlation between its success rate and problem acceptance rates. However, it struggles to improve solutions based on feedback, pointing to potential shortcomings in debugging tasks. These findings provide a compact yet insightful glimpse into ChatGPT's capabilities and areas for improvement.

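Motivation and Objectives

  • Evaluate ChatGPT's ability to generate correct solutions to programming problems from natural-language descriptions.
  • Evaluate ChatGPT's debugging ability when it is given feedback from LeetCode.
  • Analyze performance across problem domains, difficulty levels, and acceptance rates to identify strengths and limitations.
  • For successful solutions, characterize the runtime and memory efficiency of the code ChatGPT generates.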

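Proposed Method

  • Build a curated LeetCode-based dataset spanning multiple domains (trees, divide and conquer, greedy, dynamic programming, and others).
  • Prompt ChatGPT with the problem description and the code skeleton to generate a solution, then evaluate it in the LeetCode IDE.
  • Record each result as a Passed Instance or as a failure (RTE/TLE/MLE), together with the LeetCode test results.
  • Give the LeetCode feedback to ChatGPT and resubmit the revised solution to assess debugging ability.
  • Analyze the results by domain, difficulty, and acceptance rate; separately evaluate the runtime and memory efficiency of successful solutions.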
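The generate, judge, and retry steps above can be sketched as a small loop. This is only an illustration of the workflow as summarized here, not the authors' tooling: `evaluate`, `Attempt`, and the `generate` and `judge` callables are hypothetical names, no real ChatGPT or LeetCode API is called, and a single feedback round is an assumption, since the summary does not state how many rounds were used.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# Outcome labels taken from the summary; all other names are hypothetical.
PASSED = "Passed Instance"
FAILURE_TYPES = ("RTE", "TLE", "MLE")  # failure types named in the summary


@dataclass
class Attempt:
    problem_id: str
    first_verdict: str   # verdict for the initial submission
    final_verdict: str   # verdict after the (optional) feedback round
    used_feedback: bool  # True if a revised solution was submitted


def evaluate(
    problem_id: str,
    description: str,
    generate: Callable[[str, Optional[str]], str],
    judge: Callable[[str], Tuple[str, str]],
) -> Attempt:
    """Run one problem through generate -> judge -> (feedback -> judge).

    generate(description, feedback) returns source code; feedback is None
    on the first call. judge(code) returns (verdict, feedback_text).
    Assumption: exactly one feedback round is attempted.
    """
    code = generate(description, None)
    verdict, feedback = judge(code)
    if verdict == PASSED:
        return Attempt(problem_id, verdict, verdict, used_feedback=False)

    # Feed the judge's report back to the model and resubmit once.
    revised = generate(description, feedback)
    final_verdict, _ = judge(revised)
    return Attempt(problem_id, verdict, final_verdict, used_feedback=True)
```

Passing the model and the judge in as callables keeps the sketch runnable with stubs, so the bookkeeping can be checked without access to either service.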

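Experimental Results

Research Questions

  • RQ1: What is ChatGPT's overall success rate at producing correct solutions to LeetCode problems?
  • RQ2: In this coding setting, how does ChatGPT perform across problem domains and difficulty levels?
  • RQ3: To what extent can ChatGPT improve its solutions by learning from LeetCode feedback (debugging performance)?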

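Key Findings

  • ChatGPT's overall success rate is 71.875% (92 of 128 problems passed).
  • Of the successful cases, 84 passed on the first attempt; 8 required debugging after feedback.
  • In retries with feedback, ChatGPT's debugging improved the solution in only 36.7% of cases, and the revised solutions passed 63% fewer test cases than before.
  • The model performs best on tree and divide-and-conquer problems and is weaker on greedy and dynamic programming problems.
  • Solving success correlates with a problem's acceptance rate: the success rate is higher on easy problems (up to 90%) and around 55% on hard problems.
  • Efficiency (runtime/memory) gains are inconsistent; higher acceptance rates are associated with better efficiency.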
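The headline counts above can be cross-checked with a few lines of arithmetic. The inputs (128, 84, 8) come from the findings; `summarize` is a hypothetical helper. The recovery rate it prints is a derived figure that is not stated in the summary, and it assumes every problem that failed on the first attempt was retried with feedback. It counts full passes only, so it is not the same quantity as the 36.7% improvement figure.

```python
TOTAL_PROBLEMS = 128      # from the findings
PASSED_FIRST_TRY = 84     # from the findings
PASSED_AFTER_FEEDBACK = 8 # from the findings


def summarize(total: int, first_try: int, after_feedback: int) -> dict:
    """Derive rates from the reported counts (all rates in percent)."""
    solved = first_try + after_feedback
    failed_first_try = total - first_try
    return {
        "solved": solved,
        "overall_rate_pct": 100 * solved / total,
        "first_try_rate_pct": 100 * first_try / total,
        "failed_first_try": failed_first_try,
        # Derived; assumes all first-try failures received a feedback retry.
        "recovered_rate_pct": 100 * after_feedback / failed_first_try,
    }


if __name__ == "__main__":
    stats = summarize(TOTAL_PROBLEMS, PASSED_FIRST_TRY, PASSED_AFTER_FEEDBACK)
    for name, value in stats.items():
        print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

The overall rate comes out to 92 / 128 = 71.875%, matching the reported figure.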


This review was generated by AI and reviewed by a human editor.