QUICK REVIEW

[Paper Review] ChatGPT: A Study on its Utility for Ubiquitous Software Engineering Tasks

Giriprasad Sridhara, H. G. Ranjani|arXiv (Cornell University)|May 26, 2023

Artificial Intelligence in Healthcare and Education | Cited by 12

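One-Line Summary

This paper evaluates ChatGPT on 15 software engineering tasks. It finds credible performance on many of them, and limitations on others when compared with human experts or state-of-the-art baselines.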

ABSTRACT

ChatGPT (Chat Generative Pre-trained Transformer) is a chatbot launched by OpenAI on November 30, 2022. OpenAI's GPT-3 family of large language models serve as the foundation for ChatGPT. ChatGPT is fine-tuned with both supervised and reinforcement learning techniques and has received widespread attention for its articulate responses across diverse domains of knowledge. In this study, we explore how ChatGPT can be used to help with common software engineering tasks. Many of the ubiquitous tasks covering the breadth of software engineering such as ambiguity resolution in software requirements, method name suggestion, test case prioritization, code review, log summarization can potentially be performed using ChatGPT. In this study, we explore fifteen common software engineering tasks using ChatGPT. We juxtapose and analyze ChatGPT's answers with the respective state of the art outputs (where available) and/or human expert ground truth. Our experiments suggest that for many tasks, ChatGPT does perform credibly and the response from it is detailed and often better than the human expert output or the state of the art output. However, for a few other tasks, ChatGPT in its present form provides incorrect answers and hence is not suited for such tasks.

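Research Motivation and Objectives

Explore how useful ChatGPT is for common software engineering tasks spanning development, quality assurance, and maintenance.
Compare ChatGPT's output with human expert ground truth and state-of-the-art tools, where available.
Identify the tasks on which ChatGPT performs well and those on which it gives incorrect or low-quality results.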

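Proposed Method

Interact with ChatGPT (the December 15, 2022 and January 9, 2023 versions), using up to 10 samples per task.
Use public datasets and existing state-of-the-art tools or human gold sets for comparison.
Evaluate accuracy as the proportion of outputs that match the ground truth or the final developer output.
Provide qualitative observations and example dialogues that illustrate strengths and weaknesses.
Analyze a diverse set of tasks, including code review, log summarization, and method name suggestion.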
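
The accuracy measure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the review does not say how a "match" is decided, so exact match after lowercasing and whitespace normalization is an assumption, and `normalize`, `accuracy`, and the sample data are hypothetical.

```python
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def accuracy(outputs: Sequence[str], references: Sequence[str]) -> float:
    """Return the fraction of outputs that equal their reference after normalization."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        raise ValueError("at least one sample is required")
    matches = sum(normalize(o) == normalize(r) for o, r in zip(outputs, references))
    return matches / len(outputs)


# Hypothetical method-name suggestions (assumed for illustration, not data from the paper).
suggested = ["getUserName", "parse_config", "closeConnection"]
expected = ["getUserName", "parseConfig", "closeConnection"]
print(f"accuracy = {accuracy(suggested, expected):.2f}")  # accuracy = 0.67
```

A strict string comparison like this would undercount outputs that are correct but worded differently, so tasks with free-form answers (summaries, commit messages) would likely need human judgment instead.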

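Experimental Results

Research Questions

RQ1: Can ChatGPT generate accurate method names and short code summaries compared with state-of-the-art tools and human experts?
RQ2: How well does ChatGPT perform log summarization, commit message generation, and duplicate bug report detection compared with baselines?
RQ3: Is ChatGPT reliable for merge conflict resolution, anaphora resolution, code review, type inference, and data-frame-driven code generation?
RQ4: What are ChatGPT's limitations in vulnerability detection, refactoring, and test oracle generation?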

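Key Findings

ChatGPT suggested the correct method name for 9 of 10 methods, often giving more informative names than the state of the art.
ChatGPT produced better log summaries than the state of the art for all 10 logs.
ChatGPT generated correct commit messages in 7 of 10 cases; the other 3 contained extraneous content.
ChatGPT showed strong anaphora resolution, correctly resolving the antecedent in all 10 requirements.
ChatGPT matched 6 of 10 ground-truth test oracles and produced plausible explanations of the assertions.
In code review, ChatGPT identified 4 of 10 vulnerabilities and struggled in some scenarios involving low-level C code.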
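
For a side-by-side view, the counts reported in these findings can be converted to success rates. The sketch below uses only the figures stated above; the task labels and the `reported` and `rates` names are chosen here for illustration. With 10 samples per task, each rate moves in steps of 10 percentage points, so the ranking is indicative at best.

```python
# Counts as stated in the findings above: (successes, samples) per task.
reported = {
    "method name suggestion": (9, 10),
    "log summarization": (10, 10),
    "commit message generation": (7, 10),
    "anaphora resolution": (10, 10),
    "test oracle generation": (6, 10),
    "vulnerability detection in code review": (4, 10),
}

# Success rate per task as a fraction between 0 and 1.
rates = {task: ok / total for task, (ok, total) in reported.items()}

# Print tasks from highest to lowest success rate.
for task, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{task:<40} {rate:.0%}")
```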

This review was generated by AI and checked by a human editor.