QUICK REVIEW

[Paper Review] A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models

David Delgado, Lola Burgueño|arXiv (Cornell University)|Mar 5, 2026

Model-Driven Software Engineering Techniques | Citations: 0

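One-Line Summary

The paper proposes a modular evaluation framework that assesses LLM-generated DSL code (OCL, Alloy) and GPL code (Python) produced from textual specifications, focusing on well-formedness and correctness, and reports comparative results across languages and models.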

ABSTRACT

Large language models (LLMs) can be used to support software development tasks, e.g., through code completion or code generation. However, their effectiveness drops significantly when considering less popular programming languages such as domain-specific languages (DSLs). In this paper, we propose a generic framework for evaluating the capabilities of LLMs generating DSL code from textual specifications. The generated code is assessed from the perspectives of well-formedness and correctness. This framework is applied to a particular type of DSL, constraint languages, focusing our experiments on OCL and Alloy and comparing their results to those achieved for Python, a popular general-purpose programming language. Experimental results show that, in general, LLMs have better performance for Python than for OCL and Alloy. LLMs with smaller context windows such as open-source LLMs may be unable to generate constraint-related code, as this requires managing both the constraint and the domain model where it is defined. Moreover, some improvements to the code generation process such as code repair (asking an LLM to fix incorrect code) or multiple attempts (generating several candidates for each coding task) can improve the quality of the generated code. Meanwhile, other decisions like the choice of a prompt template have less impact. All these dimensions can be systematically analyzed using our evaluation framework, making it possible to decide the most effective way to set up code generation for a particular type of task.

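Motivation and Objectives

Motivate and formalize the need to evaluate LLMs on constraint DSLs, for which little data is available.
Develop a modular, configurable framework that generates, parses, and validates DSL and GPL code from textual specifications.
Compare LLM performance on constraint DSLs (OCL, Alloy) and on Python across a range of models and prompts.
Provide mechanisms for prompt templates, code repair, multiple attempts, and systematic evaluation of well-formedness and correctness.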

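Proposed Method

Define the inputs used to build the prompt: the coding task, the domain description, and the domain model.
Introduce two augmentation dimensions: chain-of-thought (CoT) based iterative prompting and task-oriented prompting.
Provide multiple prompt templates and task delivery modes (batch, chained, isolated).
Extract the generated code from the LLM output and assess its well-formedness by running language parsers or tools.
Assess correctness and conformance to the specification with an automated LLM acting as a judge, and apply single-pass repair where needed.
Quantify success with accuracy and pass@k metrics, and report the results of each configuration.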
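
The pass@k metric mentioned in the last step can be made concrete with a short sketch. The review does not say which estimator the framework uses, so the code below assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of candidates generated for a task and c is the number that pass. The name `pass_at_k` and its parameters are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k sampled candidates passes.

    n: total candidates generated for one task (assumed notation)
    c: number of those candidates judged correct
    k: number of candidates drawn without replacement
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing candidates: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # One minus the probability that all k drawn candidates fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) pairs, one per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, `mean_pass_at_k([(10, 3), (10, 0)], k=1)` averages a task with three passing candidates out of ten and a task with none. Averaging per task keeps tasks with many passing candidates from masking tasks that never pass.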

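Experimental Results

Research Questions

RQ1: Do LLMs generate more correct and well-formed code for the constraint DSLs OCL and Alloy than for Python?
RQ2: How do different prompting strategies, augmentation techniques, and evaluation settings affect code quality?
RQ3: Can the modular framework systematically compare configurations and identify effective setups for DSL code generation?
RQ4: How do single-pass repair and multiple attempts contribute to improving the quality of generated code?
RQ5: How do context window size and model choice affect DSL code generation performance?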

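Key Findings

LLMs generally perform better on Python than on OCL and Alloy.
LLMs with small context windows may struggle to generate constraint-related code, which requires handling both the constraint and the domain model where it is defined.
Code repair (asking an LLM to fix incorrect code) and multiple attempts can improve the quality and correctness of the generated code.
In some settings, the choice of prompt template has less impact than augmentation and other factors.
The framework makes it possible to systematically analyze code generation decisions across languages and tasks.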
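
To illustrate how multiple attempts and single-pass repair might combine with a well-formedness check, here is a minimal sketch. It is an illustration under stated assumptions, not the framework's implementation: `generate` and `repair` are hypothetical callables standing in for LLM calls, and well-formedness is checked only for Python through the standard `ast` module, since OCL and Alloy would need their own parsers or tools.

```python
import ast
from typing import Callable, Optional


def is_well_formed_python(code: str) -> bool:
    """Return True if the code parses as Python (a syntax check only)."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def generate_with_repair(
    generate: Callable[[], str],
    repair: Callable[[str], str],
    attempts: int,
) -> Optional[str]:
    """Try up to `attempts` candidates, giving each one a single repair pass."""
    for _ in range(attempts):
        candidate = generate()
        if is_well_formed_python(candidate):
            return candidate
        # Single-pass repair: one fix request per failed candidate.
        repaired = repair(candidate)
        if is_well_formed_python(repaired):
            return repaired
    return None


# Stub "LLM" calls, for demonstration only (hypothetical behaviour).
def fake_generate() -> str:
    # Deliberately malformed: the colon and line break are missing.
    return "def is_adult(age) return age >= 18"


def fake_repair(code: str) -> str:
    return code.replace(") return", "):\n    return")


result = generate_with_repair(fake_generate, fake_repair, attempts=3)
```

This sketch covers well-formedness only. Correctness with respect to the specification would need a separate check, which the reviewed framework delegates to an LLM acting as a judge.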

This review was generated by AI and checked by a human editor.