QUICK REVIEW

[Paper Review] Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation

Zhiqiang Yuan, Junwei Liu|arXiv (Cornell University)|Aug 2, 2023

Software Engineering Research|Cited by 18

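One-Sentence Summary

This study evaluates 10 open-source instruction-tuned LLMs on four code-related tasks (defect detection, clone detection, assertion generation, and code summarization) under zero-shot, few-shot, and fine-tuning settings, revealing strong zero-shot performance, notable few-shot variability, and cost implications.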

ABSTRACT

In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension and generation tasks. We have the following main findings. First, for the zero-shot setting, instructed LLMs are very competitive on code comprehension and generation tasks and sometimes even better than small SOTA models specifically fine-tuned on each downstream task. We also find that larger instructed LLMs are not always better on code-related tasks. Second, for the few-shot setting, we find that adding demonstration examples substantially helps instructed LLMs perform better on most code comprehension and generation tasks; however, the examples would sometimes induce unstable or even worse performance. Furthermore, we find widely-used BM25-based shot selection strategy significantly outperforms the basic random selection or fixed selection only on generation problems. Third, for the fine-tuning setting, we find that fine-tuning could further improve the model performance on downstream code comprehension and generation tasks compared to the zero-shot/one-shot performance. In addition, after being fine-tuned on the same downstream task dataset, instructed LLMs outperform both the small SOTA models and similar-scaled LLMs without instruction tuning. Based on our findings, we further present practical implications on model and usage recommendation, performance and cost trade-offs, and future direction.

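Motivation and Objectives

Evaluate the zero-shot generalization of instruction-tuned LLMs on code-related tasks.
Evaluate few-shot in-context learning and shot selection strategies on code tasks.
Examine the effect of fine-tuning on downstream code comprehension and generation tasks.
Provide practical guidance on model selection, cost-performance trade-offs, and future directions in code intelligence.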

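Proposed Method

Compare 10 open-source instructed LLMs (6B–16B parameters) across four code tasks using standardized prompts.
Use three settings: zero-shot, one-shot (with three shot selection strategies), and task-specific fine-tuning with LoRA.
Use task-specific prompts for defect detection, clone detection, assertion generation, and code summarization.
Measure performance with task-appropriate metrics (accuracy, F1, exact match), and use ChatGPT as a judge for code summarization.
Measure the memory and time costs of fine-tuning and inference.
Use train/validation/test dataset splits and a standardized prompt design for each model.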
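The three metrics named above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's evaluation code: the function names, the choice of label `1` as the positive class, and the whitespace normalization in `exact_match` are all assumptions, and the review does not say which metric is paired with which task.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that equal the reference label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(references: Sequence[str], predictions: Sequence[str]) -> float:
    """Share of generated strings identical to their reference.

    Assumption: runs of whitespace are collapsed before comparison.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0

    def normalize(text: str) -> str:
        return " ".join(text.split())

    hits = sum(normalize(r) == normalize(p) for r, p in zip(references, predictions))
    return hits / len(references)
```

Accuracy and F1 fit binary classification outputs, while exact match fits generated text such as assertions, where a prediction counts only if it reproduces the reference string.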

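Experimental Results

Research Questions

RQ1: How do instruction-tuned LLMs perform on code comprehension and generation tasks in the zero-shot setting?
RQ2: How do instruction-tuned LLMs perform in the few-shot setting, and what is the impact of the shot selection strategy?
RQ3: How does the performance of instruction-tuned LLMs change after fine-tuning on downstream tasks?
RQ4: How large are the memory and time costs of fine-tuning and inference?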

Key Findings

DD = defect detection, CD = clone detection, AG = assertion generation, CS = code summarization.

Model	DD (%)	CD (%)	AG (%)	CS (%)
CodeGen-6B	0.3	1.4	0.0	0.0
ChatGLM-6B	7.1	17.5	1.7	45.0
Vicuna-7B	54.0	13.2	10.1	48.0
Alpaca-7B	45.8	22.1	5.3	32.0
Dolly-7B	33.1	21.3	1.9	12.0
StableLM-7B	44.3	24.3	1.1	30.0
CodeAlpaca-7B	51.9	1.4	4.4	9.0
Dolly-12B	33.8	23.5	1.0	5.0
Vicuna-13B	49.8	14.1	12.0	63.0
WizardCoder-15B	54.4	23.8	19.4	71.0
Instruct-CodeGen-16B	47.8	14.2	8.4	9.0
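The review does not state which setting these scores come from, so the sketch below simply takes the table at face value and checks whether larger models always score higher. The helper names (`parse_table`, `size_in_billions`, `best_per_task`, `tasks_lost_to_smaller`) are hypothetical, and model size is read from the `-NB` suffix of each model name, which is an assumption about the naming scheme.

```python
import re

# Scores copied from the table above (columns: DD, CD, AG, CS, all in %).
TABLE = """
CodeGen-6B 0.3 1.4 0.0 0.0
ChatGLM-6B 7.1 17.5 1.7 45.0
Vicuna-7B 54.0 13.2 10.1 48.0
Alpaca-7B 45.8 22.1 5.3 32.0
Dolly-7B 33.1 21.3 1.9 12.0
StableLM-7B 44.3 24.3 1.1 30.0
CodeAlpaca-7B 51.9 1.4 4.4 9.0
Dolly-12B 33.8 23.5 1.0 5.0
Vicuna-13B 49.8 14.1 12.0 63.0
WizardCoder-15B 54.4 23.8 19.4 71.0
Instruct-CodeGen-16B 47.8 14.2 8.4 9.0
"""

TASKS = ("DD", "CD", "AG", "CS")


def parse_table(text):
    """Map each model name to a {task: score} dictionary."""
    rows = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        rows[name] = dict(zip(TASKS, (float(v) for v in values)))
    return rows


def size_in_billions(name):
    """Read the parameter count from a name suffix such as '-7B' (assumed convention)."""
    match = re.search(r"-(\d+)B$", name)
    if match is None:
        raise ValueError(f"no size suffix in {name!r}")
    return int(match.group(1))


def best_per_task(rows):
    """Return the top-scoring model for each task column."""
    return {task: max(rows, key=lambda name: rows[name][task]) for task in TASKS}


def tasks_lost_to_smaller(rows, name):
    """List the tasks on which some strictly smaller model outscores `name`."""
    size = size_in_billions(name)
    lost = []
    for task in TASKS:
        if any(
            size_in_billions(other) < size and rows[other][task] > rows[name][task]
            for other in rows
        ):
            lost.append(task)
    return lost


ROWS = parse_table(TABLE)
```

With these numbers, `best_per_task(ROWS)` picks WizardCoder-15B for DD, AG, and CS and StableLM-7B for CD, while `tasks_lost_to_smaller(ROWS, "Instruct-CodeGen-16B")` returns all four tasks: the largest model in the table is outscored by a smaller one in every column.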

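In the zero-shot setting, instructed LLMs are competitive with small SOTA models on several tasks and sometimes outperform them; a larger model size does not necessarily improve zero-shot performance.
In the few-shot setting, demonstration examples improve performance overall but can cause instability or degraded performance on long inputs; BM25-based shot selection helps on generation tasks but is not clearly better on classification tasks.
Fine-tuning with LoRA further improves task performance; after fine-tuning, instructed LLMs outperform both small SOTA models and similarly sized models without instruction tuning.
Among models of similar size, memory cost is not necessarily higher than that of small SOTA models, but time cost can increase substantially for both fine-tuning and inference.
The study offers practical guidance on model selection, shot strategies, and cost-performance trade-offs for code-related tasks.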
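The three shot selection strategies named in the abstract (fixed, random, and BM25-based) can be illustrated with a small self-contained sketch. It is not the paper's implementation: the tokenizer, the `k1` and `b` defaults, the `{"input", "output"}` layout of the example pool, and the names `BM25` and `select_shot` are all assumptions made for illustration. The scoring follows the common Okapi BM25 formula with a non-negative IDF variant, summed over the distinct query terms.

```python
import math
import random
import re
from collections import Counter
from typing import Dict, List, Sequence


def tokenize(code: str) -> List[str]:
    """Lowercase identifier and number tokens (an assumed, very rough tokenizer)."""
    return re.findall(r"[a-z_]\w*|\d+", code.lower())


class BM25:
    """Okapi BM25 over a small in-memory corpus of strings."""

    def __init__(self, corpus: Sequence[str], k1: float = 1.5, b: float = 0.75):
        if not corpus:
            raise ValueError("corpus must not be empty")
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(doc)) for doc in corpus]
        self.lengths = [sum(counts.values()) for counts in self.docs]
        self.avgdl = (sum(self.lengths) / len(self.lengths)) or 1.0
        n = len(self.docs)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for counts in self.docs for term in counts)
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0) for t, f in df.items()}

    def score(self, query: str, index: int) -> float:
        """BM25 score of document `index` for `query` (distinct query terms only)."""
        counts, length = self.docs[index], self.lengths[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avgdl)
        total = 0.0
        for term in set(tokenize(query)):
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total


def select_shot(query: str, pool: Sequence[Dict[str, str]],
                strategy: str = "bm25", seed: int = 0) -> Dict[str, str]:
    """Pick one demonstration example from `pool` for a one-shot prompt."""
    if not pool:
        raise ValueError("pool must not be empty")
    if strategy == "fixed":
        return pool[0]  # always the same example
    if strategy == "random":
        return random.Random(seed).choice(pool)
    if strategy == "bm25":
        bm25 = BM25([example["input"] for example in pool])
        best = max(range(len(pool)), key=lambda i: bm25.score(query, i))
        return pool[best]
    raise ValueError(f"unknown strategy: {strategy!r}")
```

The design difference is that the fixed and random strategies ignore the query, whereas BM25 retrieves the pool example whose tokens overlap most with it, which is one plausible reason it would matter more for generation tasks where the demonstration's output resembles the target.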

This review was generated by AI and checked by a human editor.