[Paper Summary] A framework for assessing the capabilities of code generation of constraint domain-specific languages with large language models
The paper proposes a modular evaluation framework to assess LLM-generated DSL code (OCL, Alloy) and GPL code (Python) from textual specifications, focusing on well-formedness and correctness, and reports comparative results across languages and models.
Large language models (LLMs) can be used to support software development tasks, e.g., through code completion or code generation. However, their effectiveness drops significantly when considering less popular programming languages such as domain-specific languages (DSLs). In this paper, we propose a generic framework for evaluating the capabilities of LLMs generating DSL code from textual specifications. The generated code is assessed from the perspectives of well-formedness and correctness. This framework is applied to a particular type of DSL, constraint languages, focusing our experiments on OCL and Alloy and comparing their results to those achieved for Python, a popular general-purpose programming language. Experimental results show that, in general, LLMs have better performance for Python than for OCL and Alloy. LLMs with smaller context windows such as open-source LLMs may be unable to generate constraint-related code, as this requires managing both the constraint and the domain model where it is defined. Moreover, some improvements to the code generation process such as code repair (asking an LLM to fix incorrect code) or multiple attempts (generating several candidates for each coding task) can improve the quality of the generated code. Meanwhile, other decisions like the choice of a prompt template have less impact. All these dimensions can be systematically analyzed using our evaluation framework, making it possible to decide the most effective way to set up code generation for a particular type of task.
Research Motivation and Goals
- Motivate and formalize the need to evaluate LLMs on constraint DSLs, for which training data is scarce.
- Develop a modular, configurable framework to generate, parse, and validate DSL and GPL code from textual specifications.
- Compare LLM performance across constraint DSLs (OCL, Alloy) and Python using various models and prompts.
- Provide mechanisms for prompt templates, code repair, multiple attempts, and systematic evaluation of well-formedness and correctness.
Proposed Method
- Define inputs: code task, domain description, and domain model to build prompts.
- Introduce two augmentation dimensions: CoT-based iterative prompting and task-oriented prompting.
- Provide multiple prompt templates and task delivery modes (batch, chained, isolated).
- Extract generated code from LLM outputs and evaluate well-formedness with language parsers or tool execution.
- Assess correctness via automated LLM-as-a-judge and manual specification fulfilment, with single-pass repair when needed.
- Quantify success via accuracy and pass@k metrics and report on configuration outcomes.
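The pass@k metric in the last step can be illustrated with a short sketch. This summary does not say which formula the paper uses, so the code below assumes the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n candidates are generated per task and c of them are judged correct. The function name `pass_at_k` and its parameters are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled candidates, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the paper may define the metric differently.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fraction of size-k subsets that contain no correct candidate.
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 candidates for one task, 3 judged correct.
    for k in (1, 3, 5):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

Averaging this value over all coding tasks gives a single score per configuration, which is one way the effect of generating several candidates per task could be compared across models and languages.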
Experimental Results
Research Questions
- RQ1: How well do LLMs generate correct and well-formed code for constraint DSLs (OCL, Alloy) compared to Python?
- RQ2: What impact do different prompting strategies, augmentation techniques, and evaluation setups have on code quality?
- RQ3: Can a modular framework systematically compare configurations and identify effective setups for DSL code generation?
- RQ4: What role do single-pass repairs and multiple attempts play in improving generated code quality?
- RQ5: How do context window size and model choice influence DSL code generation performance?
Main Findings
- LLMs generally perform better for Python than for OCL and Alloy.
- LLMs with smaller context windows, such as open-source models, may be unable to generate constraint-related code, because the prompt must hold both the constraint and the domain model on which it is defined.
- Code repair (fixing errors) and multiple attempts improve code quality and correctness.
- Prompt template choice has less impact than augmentation and other factors in some settings.
- The framework enables systematic analysis of code generation decisions across languages and tasks.
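The interplay of code repair and multiple attempts noted in the findings can be sketched as a small loop. This is a minimal illustration, not the paper's implementation: `generate` and `repair` are hypothetical callables standing in for LLM calls, and the well-formedness check uses Python's own `ast.parse`, so it covers only the Python case (OCL or Alloy would need their own parser or tool). It checks well-formedness only, not correctness.

```python
import ast
import re
from typing import Callable, Optional

# Matches a fenced code block (three backticks, optional language tag).
FENCE = re.compile(r"`{3}[\w+-]*\n(.*?)`{3}", re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced block in an LLM response, else the raw text."""
    match = FENCE.search(response)
    return (match.group(1) if match else response).strip()


def parse_error(code: str) -> Optional[str]:
    """Return a syntax error message, or None if the code parses as Python."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"


def generate_with_repair(
    generate: Callable[[], str],          # hypothetical: asks the LLM for code
    repair: Callable[[str, str], str],    # hypothetical: asks the LLM to fix code
    attempts: int = 3,
) -> Optional[str]:
    """Try up to `attempts` candidates, giving each one a single repair pass."""
    for _ in range(attempts):
        code = extract_code(generate())
        error = parse_error(code)
        if error is None:
            return code
        # Single-pass repair: send the faulty code and the parser message back.
        repaired = extract_code(repair(code, error))
        if parse_error(repaired) is None:
            return repaired
    return None  # no well-formed candidate within the attempt budget
```

Keeping the two LLM calls as plain callables makes it easy to swap models or prompt templates while holding the checking logic fixed, which mirrors the kind of configuration comparison the framework is described as supporting.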
This summary was generated by AI and reviewed by human editors.