[Paper Digest] ChatGPT and Software Testing Education: Promises & Perils
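The study evaluates ChatGPT on 31 software testing questions from a textbook. ChatGPT could respond to 77.5% of the questions; of those, 55.6% of answers and 53.0% of explanations were correct or partially correct. Shared-context prompting gave marginally better results, while self-reported confidence proved a weak indicator of correctness.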
Over the past decade, predictive language modeling for code has proven to be a valuable tool for enabling new forms of automation for developers. More recently, we have seen the advent of general purpose "large language models", based on neural transformer architectures, that have been trained on massive datasets of human written text spanning code and natural language. However, despite the demonstrated representational power of such models, interacting with them has historically been constrained to specific task settings, limiting their general applicability. Many of these limitations were recently overcome with the introduction of ChatGPT, a language model created by OpenAI and trained to operate as a conversational agent, enabling it to answer questions and respond to a wide variety of commands from end users. The introduction of models, such as ChatGPT, has already spurred fervent discussion from educators, ranging from fear that students could use these AI tools to circumvent learning, to excitement about the new types of learning opportunities that they might unlock. However, given the nascent nature of these tools, we currently lack fundamental knowledge related to how well they perform in different educational settings, and the potential promise (or danger) that they might pose to traditional forms of instruction. As such, in this paper, we examine how well ChatGPT performs when tasked with answering common questions in a popular software testing curriculum. Our findings indicate that ChatGPT can provide correct or partially correct answers in 55.6% of cases, provide correct or partially correct explanations of answers in 53.0% of cases, and that prompting the tool in a shared question context leads to a marginally higher rate of correct responses. Based on these findings, we discuss the potential promises and perils related to the use of ChatGPT by students and instructors.
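Motivation and Goals
- Assess how well ChatGPT answers software testing questions from a popular textbook.
- Assess the quality of ChatGPT's explanations of its answers.
- Study how prompting strategy and conversational context affect performance.
- Test whether ChatGPT's self-reported confidence correlates with answer correctness.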
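Proposed Method
- Use a manually verified dataset of 31 questions (drawn from five chapters of Ammann & Offutt's textbook), with three ChatGPT responses per question.
- Compare separate-context prompting with shared-context prompting to assess the effect on correctness.
- Ask a confidence question after each response to study calibration.
- Have two or more researchers independently label the correctness of answers and explanations.
- Run each question three times to analyze the effects of non-determinism.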
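The three-runs-per-question design can be illustrated with a minimal sketch. It assumes a question counts as affected by non-determinism when its three runs do not all receive the same label; that reading, the label codes, the question ids, and the data below are hypothetical and are not taken from the paper.

```python
# Hypothetical labels: question id -> answer label for each of three runs.
# "C" = correct, "PC" = partially correct, "IC" = incorrect (illustrative data only).
runs = {
    "q1": ["C", "C", "C"],
    "q2": ["C", "IC", "C"],
    "q3": ["IC", "IC", "IC"],
    "q4": ["PC", "PC", "C"],
}


def is_unstable(labels):
    """Assumed criterion: the runs of a question disagree on the label."""
    return len(set(labels)) > 1


def unstable_share(runs_by_question):
    """Return the fraction of questions whose label varies, plus their ids."""
    unstable = [q for q, labels in runs_by_question.items() if is_unstable(labels)]
    return len(unstable) / len(runs_by_question), unstable


share, unstable = unstable_share(runs)
print(f"{share:.1%} of questions vary across runs: {unstable}")
```

The same function could be applied separately to answer labels and explanation labels to obtain the two non-determinism rates.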
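Experimental Results
Research Questions
- RQ1: How often does ChatGPT provide correct answers and explanations under different prompting strategies?
- RQ2: How often does ChatGPT produce answer-explanation pairs with differing levels of correctness?
- RQ3: How does ChatGPT's non-determinism affect the correctness of answers and explanations?
- RQ4: Is there a correlation between ChatGPT's self-reported confidence and actual correctness?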
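Key Findings
Answer-explanation pair counts per run (Iter = run number; A = answer, E = explanation; C = correct, PC = partially correct, IC = incorrect):
| Iter | AC-EC | AC-EPC | AC-EIC | APC-EC | APC-EPC | APC-EIC | AIC-EC | AIC-EPC | AIC-EIC |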
|---|---|---|---|---|---|---|---|---|---|
| 1 | 15 | 0 | 2 | 0 | 1 | 0 | 0 | 2 | 11 |
| 2 | 15 | 0 | 2 | 0 | 2 | 0 | 0 | 0 | 11 |
| 3 | 15 | 1 | 2 | 0 | 2 | 0 | 0 | 1 | 10 |
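The table can be summarized with a little arithmetic. The sketch below assumes the column names read as answer label followed by explanation label (A = answer, E = explanation; C, PC, IC = correct, partially correct, incorrect); the helper name `summarize` is hypothetical, and the counts are copied from the table as shown.

```python
COLUMNS = ["AC-EC", "AC-EPC", "AC-EIC",
           "APC-EC", "APC-EPC", "APC-EIC",
           "AIC-EC", "AIC-EPC", "AIC-EIC"]

# Counts per iteration, transcribed from the table above.
TABLE = {
    1: [15, 0, 2, 0, 1, 0, 0, 2, 11],
    2: [15, 0, 2, 0, 2, 0, 0, 0, 11],
    3: [15, 1, 2, 0, 2, 0, 0, 1, 10],
}


def summarize(counts):
    """Aggregate one row: total pairs, correct answers, and matching pairs."""
    row = dict(zip(COLUMNS, counts))
    return {
        "total": sum(counts),
        # Answer fully correct, regardless of explanation quality.
        "answer_correct": sum(v for k, v in row.items() if k.startswith("AC-")),
        # Answer and explanation received the same label.
        "matching": row["AC-EC"] + row["APC-EPC"] + row["AIC-EIC"],
    }


for iteration, counts in TABLE.items():
    print(iteration, summarize(counts))
```

In every run, most pairs fall into cells where the answer and explanation share a label (27, 28, and 27 pairs respectively). As transcribed here, the rows sum to 31, 30, and 31, so the iteration 2 row may be missing one entry.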
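- ChatGPT was able to respond to 77.5% of the questions; of those, 55.6% of its answers were correct or partially correct.
- Among the questions it answered, 53.0% of its explanations were correct or partially correct.
- Shared-context prompting yields a higher rate of correct answers than separate-context prompting (correct: 49.4% vs. 34.6%; partially correct: 6.2% vs. 7.4%).
- On average, shared-context prompting improves explanations as well as answers.
- Non-determinism affects answer correctness in 9.7% of the questions and explanation correctness in 6.5% of the questions.
- ChatGPT's self-reported confidence does not reliably align with correctness and says little about whether an answer is right.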
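One way to probe the confidence finding is to correlate self-reported confidence with a binary correctness label. The sketch below uses a plain Pearson correlation, which is an assumed choice; the summary does not state which statistic the authors used. The 0-100 confidence scale and the data points are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: confidence on an assumed 0-100 scale, 1 = correct answer.
confidence = [90, 95, 85, 90, 100, 80]
correct = [1, 0, 1, 0, 1, 0]

r = pearson(confidence, correct)
print(f"r = {r:.3f}")
```

A coefficient near zero, as in this made-up sample, would match the finding that high stated confidence is not a dependable sign of a correct answer.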
This digest was generated by AI and reviewed by human editors.