QUICK REVIEW

[Paper Review] AST-T5: Structure-Aware Pretraining for Code Generation and Understanding

Linyuan Gong, Mostafa Elhoushi | arXiv (Cornell University) | Jan 5, 2024

Software Engineering Research | Citations: 5

One-line summary

AST-T5 brings AST-aware pretraining to encoder-decoder LMs. It uses AST-aware segmentation and AST-aware subtree masking to improve code generation, transpilation, and understanding without any architectural changes.

ABSTRACT

Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.

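Motivation and objectives

Motivate treating code as structured data rather than a flat token sequence, with the goal of improving code-related tasks.
Develop a structure-aware pretraining paradigm that does not require complex program analysis.
Integrate seamlessly with existing encoder-decoder Transformers as a drop-in replacement.
Demonstrate improvements across generation, transpilation, and understanding benchmarks.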

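Proposed method

Parse code into ASTs with Tree-sitter to capture its structure.
Apply AST-aware segmentation to split code into chunks while minimizing breaks in the AST structure.
Apply AST-Aware Subtree Corruption so that the spans masked during span corruption align with AST subtrees.
Train with a single pretraining objective (span corruption) augmented with structure-aware masking.
Maintain compatibility with vanilla T5, with no additional heads or architectural changes.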
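
To make the subtree-masking step concrete, here is a minimal sketch in Python. It is not the authors' implementation: it swaps Tree-sitter for Python's built-in `ast` module so that it runs without extra dependencies, it picks subtrees at random under a character budget rather than following the paper's sampling procedure, and it borrows the `<extra_id_N>` sentinel format from T5. The names `subtree_spans` and `mask_subtrees` are hypothetical.

```python
import ast
import random


def subtree_spans(source: str) -> list[tuple[int, int]]:
    """Return sorted (start, end) offsets of every AST node that carries a position.

    Assumption: the source is ASCII, so the byte-based col_offset values
    reported by `ast` can be used directly as character offsets.
    """
    line_starts, offset = [], 0
    for line in source.split("\n"):
        line_starts.append(offset)
        offset += len(line) + 1  # +1 for the newline removed by split
    spans = set()
    for node in ast.walk(ast.parse(source)):
        if getattr(node, "lineno", None) is None:
            continue
        if getattr(node, "end_lineno", None) is None:
            continue
        start = line_starts[node.lineno - 1] + node.col_offset
        end = line_starts[node.end_lineno - 1] + node.end_col_offset
        spans.add((start, end))
    return sorted(spans)


def mask_subtrees(source: str, mask_ratio: float = 0.25, seed: int = 0) -> tuple[str, str]:
    """Replace randomly chosen, non-overlapping AST subtrees with sentinels.

    Returns (corrupted_input, target) in the style of T5 span corruption.
    """
    rng = random.Random(seed)
    budget = int(len(source) * mask_ratio)  # max number of masked characters
    candidates = [s for s in subtree_spans(source) if 0 < s[1] - s[0] <= budget]
    rng.shuffle(candidates)

    chosen, used = [], 0
    for start, end in candidates:
        size = end - start
        if used + size > budget:
            continue  # would exceed the masking budget
        if any(start < e and s < end for s, e in chosen):
            continue  # overlaps a subtree that is already masked
        chosen.append((start, end))
        used += size
    chosen.sort()

    pieces, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(chosen):
        sentinel = f"<extra_id_{i}>"
        pieces.append(source[cursor:start])
        pieces.append(sentinel)
        targets.append(sentinel + source[start:end])
        cursor = end
    pieces.append(source[cursor:])
    return "".join(pieces), "".join(targets)


SOURCE = "def add(a, b):\n    total = a + b\n    return total\n"
corrupted, target = mask_subtrees(SOURCE, mask_ratio=0.25, seed=0)
print(corrupted)
print(target)
```

Every masked span here is a complete syntactic unit (a name, an expression, a statement), which is the property that distinguishes subtree masking from masking arbitrary token runs. Because the output is an ordinary input/target text pair, this kind of objective needs no change to the model itself.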

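Experimental results

Research questions

RQ1: Does AST-aware segmentation preserve code structure better than greedy segmentation?
RQ2: Does AST-aware subtree masking improve generation and transpilation quality compared with standard span corruption?
RQ3: Can a single structure-aware pretraining objective achieve competitive performance on both code understanding and code generation tasks?
RQ4: How does AST-T5 perform against CodeT5 and CodeT5+ on HumanEval, MBPP, Bugs2Fix, and Java-C# transpilation?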

Key findings

Model	HumanEval	Concode	Bugs2Fix	Java-C#	Clone detection	Defect detection	Avg
T5	5.2	18.3	21.2/13.8	65.5/68.4	96.9	64.1	44.2
+ AST. Segmentation	7.2	20.2	22.5/15.1	66.3/69.3	98.3	65.9	45.7
+ AST-Aware Subtree Corrupt	9.6	22.1	23.3/16.5	67.3/72.2	98.6	66.0	47.0
+ Mask 25% (AST-T5)	14.0	22.9	23.8/16.1	68.9/72.3	98.6	65.8	47.9
+ Mask 50%	14.3	22.0	21.9/15.0	66.5/70.1	97.1	64.2	46.4
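
The ablation is easier to read as row-to-row changes. The short sketch below computes them from the Avg column, assuming each row builds on the row directly above it (so the last row replaces the 25% mask ratio with 50%). The helper name `stepwise_deltas` is hypothetical; the numbers are copied from the table.

```python
# Avg column copied from the ablation table above.
AVG = {
    "T5": 44.2,
    "+ AST. Segmentation": 45.7,
    "+ AST-Aware Subtree Corrupt": 47.0,
    "+ Mask 25% (AST-T5)": 47.9,
    "+ Mask 50%": 46.4,
}


def stepwise_deltas(scores: dict[str, float]) -> list[tuple[str, float]]:
    """Return (row name, change in score relative to the previous row)."""
    names = list(scores)
    return [
        (names[i], round(scores[names[i]] - scores[names[i - 1]], 1))
        for i in range(1, len(names))
    ]


for name, delta in stepwise_deltas(AVG):
    print(f"{name}: {delta:+.1f}")
# + AST. Segmentation: +1.5
# + AST-Aware Subtree Corrupt: +1.3
# + Mask 25% (AST-T5): +0.9
# + Mask 50%: -1.5
```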

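AST-Aware Segmentation produces more coherent training chunks than Greedy Segmentation and reduces breaks in the AST structure.
AST-Aware Subtree Corruption improves generation and transpilation metrics compared with vanilla span corruption.
Raising the mask ratio to 25% improves generation with only a limited effect on understanding; at 50% the returns level off (HumanEval moves from 14.0 to 14.3 while the average drops from 47.9 to 46.4).
AST-T5 delivers strong results for its size, outperforming comparable models such as CodeT5 and CodeT5+ on multiple tasks.
On transpilation and clone detection, AST-awareness gives a clear advantage over the baseline.
AST-T5 is competitive with larger models on HumanEval and MBPP, which points to good parameter efficiency.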
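
The contrast between greedy and AST-aware segmentation can be illustrated with a small dynamic program. This is a simplified sketch under stated assumptions, not the paper's algorithm: it uses Python's built-in `ast` module instead of Tree-sitter, it measures segment length in lines rather than tokens, and it defines the cost of a cut as the number of AST nodes that the cut splits. The paper's exact cost function and constraints may differ. The names `break_costs`, `greedy_cuts`, and `ast_aware_cuts` are hypothetical.

```python
import ast


def break_costs(source: str) -> list[int]:
    """cost[i] = number of AST nodes split by a cut just before 0-indexed line i."""
    n_lines = len(source.splitlines())
    cost = [0] * (n_lines + 1)
    for node in ast.walk(ast.parse(source)):
        start = getattr(node, "lineno", None)
        end = getattr(node, "end_lineno", None)
        if start is None or end is None:
            continue
        # The node covers 1-indexed lines start..end, so every cut strictly
        # inside that range (before 0-indexed lines start..end-1) splits it.
        for i in range(start, end):
            cost[i] += 1
    return cost


def greedy_cuts(n_lines: int, max_len: int) -> list[int]:
    """Cut every max_len lines, ignoring structure."""
    return list(range(max_len, n_lines, max_len))


def ast_aware_cuts(cost: list[int], max_len: int) -> list[int]:
    """Choose cuts that minimize total break cost with segments of <= max_len lines."""
    n = len(cost) - 1  # number of lines
    best = [float("inf")] * (n + 1)  # best[j]: min cost to segment lines[0:j]
    prev = [0] * (n + 1)             # prev[j]: start of the last segment
    best[0] = 0
    for j in range(1, n + 1):
        # Try longer last segments first, so ties favor fewer segments.
        for i in range(max(0, j - max_len), j):
            candidate = best[i] + cost[i]
            if candidate < best[j]:
                best[j], prev[j] = candidate, i
    cuts, j = [], n
    while j > 0:
        j = prev[j]
        if j > 0:
            cuts.append(j)
    return cuts[::-1]


SRC = (
    "def f(x):\n"
    "    y = x + 1\n"
    "    return y\n"
    "def g(x):\n"
    "    y = x * 2\n"
    "    return y\n"
)
COST = break_costs(SRC)
GREEDY = greedy_cuts(6, max_len=4)
AWARE = ast_aware_cuts(COST, max_len=4)
print(GREEDY, sum(COST[c] for c in GREEDY))  # [4] 1
print(AWARE, sum(COST[c] for c in AWARE))    # [3] 0
```

With a four-line limit, the greedy split cuts through the body of `g`, while the dynamic program places the cut on the boundary between the two functions and breaks no AST node.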

This review was generated by AI and checked by a human editor.