QUICK REVIEW

[Paper Review] Environment-Aware Code Generation: How far are We?

Tongtong Wu, Rongyi Chen | arXiv (Cornell University) | Jan 18, 2026

Software Engineering Research | Citations: 0

One-Line Summary

The paper defines Environment-Aware Code Generation (EACG) and introduces VersiBCB, a benchmark with executable, multi-library, and version-aware tasks; it evaluates three inference-time customization strategies (data, parameters, cache) and analyzes their impact on executability, compatibility, and composability.

ABSTRACT

Recent progress in large language models (LLMs) has improved code generation, but most evaluations still test isolated, small-scale code (e.g., a single function) under default or unspecified software environments. As a result, it is unclear whether LLMs can reliably generate executable code tailored to a user's specific environment. We present the first systematic study of Environment-Aware Code Generation (EACG), where generated code must be functionally correct and directly executable under arbitrary software configurations. To enable realistic evaluation, we introduce VersiBCB, a benchmark that is multi-package, execution-verified, and deprecation-aware, capturing complex and evolving environments that prior datasets often overlook. Using VersiBCB, we investigate three complementary adaptation axes: data, parameters, and cache, and develop representative strategies for each. Our results show that current LLMs struggle with environment-specific code generation, while our adaptations improve environment compatibility and executability. These findings highlight key challenges and opportunities for deploying LLMs in practical software engineering workflows.

Research Motivation and Objectives

Formalize Environment-Aware Code Generation (EACG) as generating code that is functionally correct and executable under a specified environment.
Create VersiBCB, a large-scale, execution-verified, multi-library benchmark reflecting real-world Python environments.
Evaluate three inference-time adaptation strategies (data-based RAG, parameter-based MoE, cache-based memory) for environment adaptability.
Assess executability, API compatibility, and generalization to unseen library/version configurations.
Provide insights and guidance for deploying LLMs in practical software engineering workflows.

Proposed Method

Define task formulation for EACG and Environment-Aware Code Migration (EACM) with environment specs L, V and functional requirement d.
Construct VersiBCB by augmenting BigCodeBench with environment-aware annotations and executing code in controlled environments to validate feasibility.
Evaluate state-of-the-art LLMs on EACG and EACM using Pass@k, including strict and lenient API usage regimes.
Propose three adaptation axes at inference time: data-based retrieval-augmented generation (RAG), parameter-based mixture-of-experts (MoE) with version-aware routing, and cache-based memory with environment-specific patterns.
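The Pass@k metric mentioned above is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass. The review does not state which estimator the paper uses, so the following is a minimal sketch under that assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one task from n samples, c of which passed.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k).
    The review does not confirm that the paper uses this exact form.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n_samples, n_passed)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this estimator, a task with one passing sample out of five contributes 0.2 to Pass@1 and 1.0 to Pass@5, which is one reason scores rise with k.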

Figure 1. Task definition covering both environment-aware code generation and code migration.
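To make the task definition concrete, the two tasks can be pictured as small records: EACG pairs a functional requirement d with a target environment (libraries L pinned to versions V), while EACM additionally carries existing code and the environment it was written for. This is only a sketch: the class and field names are hypothetical, the assumption that a migration task includes source code and a source environment is inferred from the word "migration", and the library versions shown are illustrative rather than taken from VersiBCB.

```python
from dataclasses import dataclass


@dataclass
class EACGTask:
    description: str              # functional requirement d
    environment: dict[str, str]   # library name (L) -> pinned version (V)


@dataclass
class EACMTask:
    description: str
    source_code: str                      # code known to work in the source env
    source_environment: dict[str, str]
    target_environment: dict[str, str]


def render_environment(environment: dict[str, str]) -> str:
    """Render an environment as sorted 'library==version' lines."""
    return "\n".join(f"{lib}=={ver}" for lib, ver in sorted(environment.items()))


def build_prompt(task: EACGTask) -> str:
    """Hypothetical prompt layout: environment first, then the requirement."""
    return (
        f"Target environment:\n{render_environment(task.environment)}\n\n"
        f"Task:\n{task.description}"
    )


# Illustrative values only; not drawn from the benchmark.
example = EACGTask(
    description="Plot a histogram of a numeric column.",
    environment={"pandas": "1.3.5", "numpy": "1.21.0"},
)
```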

Experimental Results

Research Questions

RQ1: Can LLMs generate code that executes correctly under a specified environment (executability)?
RQ2: Are generated APIs compatible with the given environment's API set (compatibility)?
RQ3: Do models generalize to unseen combinations of libraries and versions (composability)?
RQ4: How do environment-adaptation strategies compare in terms of strict API adherence and practical executability?
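The compatibility question in RQ2 can be illustrated with a static check that compares the calls in generated code against an environment's API set. The sketch below rests on several assumptions not confirmed by the review: that "strict" rejects deprecated APIs while "lenient" accepts them, that an API set is a flat collection of dotted names, and that only direct calls such as `lib.load(...)` are checked. The library `lib` and its functions are hypothetical.

```python
import ast


def _dotted_name(node: ast.AST):
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on the result of another call


def called_names(source: str) -> set[str]:
    """Collect the dotted names of all direct calls in the source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = _dotted_name(node.func)
            if name is not None:
                names.add(name)
    return names


def incompatible_calls(source, available, deprecated, roots, strict=True):
    """Return calls into tracked libraries that the environment does not allow.

    available:  API names that exist in the target environment
    deprecated: subset of `available` flagged as deprecated
    roots:      top-level names to track (e.g. {"lib"}); others are ignored
    strict:     assumed meaning: deprecated APIs also count as violations
    """
    used = {n for n in called_names(source) if n.split(".")[0] in roots}
    bad = used - set(available)
    if strict:
        bad |= used & set(deprecated)
    return bad


# Hypothetical library and API names, for illustration only.
SOURCE = "import lib\nx = lib.load('f')\ny = lib.old_load('f')\nz = lib.missing(x)\nprint(z)\n"
AVAILABLE = {"lib.load", "lib.old_load"}
DEPRECATED = {"lib.old_load"}
```

A static check of this kind only covers names it can resolve; method calls on objects and aliased imports would need type or import resolution, which is why execution-based verification is a useful complement.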

Key Findings

Model	Code Generation Pass@1	Code Generation Pass@3	Code Generation Pass@5	Code Migration Pass@1	Code Migration Pass@3	Code Migration Pass@5
DS-7B	0.00	0.00	0.00	2.99	6.59	8.38
CodeGemma-7B	0.60	0.90	2.69	13.47	34.13	49.70
CodeLlama-13B	0.30	0.90	1.79	18.26	36.23	49.10
StarCoder2-15B	0.00	0.00	0.30	5.99	15.87	21.26
LLaMA3-70B	18.51	24.78	27.76	57.19	60.78	61.98
GPT-4.1-mini	27.76	32.24	33.43	53.29	59.28	61.68
DeepSeek-v3	23.88	28.06	30.75	66.17	70.06	70.66

Existing LLMs struggle with environment-aware code generation; larger models perform better but still fall short of their results on environment-agnostic benchmarks.
MoE-based adaptation improves strict API consistency and partial correctness in generation tasks.
Memory-based adaptation yields strong gains in code migration by reusing environment-conditioned patterns; however, it can tolerate deprecated APIs.
RAG offers conservative adaptation with moderate gains and interpretable environment signals.
All strategies show performance drops in machine learning domains and unseen library/version combinations, highlighting persistent challenges in version-sensitive environments.
VersiBCB enables fine-grained evaluation across execution, compatibility, and cross-library evolution, revealing gaps not captured by standard benchmarks.
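One way to picture the memory-based finding is a store of usage patterns keyed by library and version, consulted before generation. The review does not describe the paper's cache design, so this is a hypothetical sketch: the class `EnvironmentMemory`, its exact-match lookup, and the library names are all assumptions made for illustration.

```python
from collections import defaultdict


class EnvironmentMemory:
    """Toy store of usage patterns keyed by (library, version)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], list[str]] = defaultdict(list)

    def add(self, library: str, version: str, pattern: str) -> None:
        """Record a pattern observed to work for this library version."""
        self._store[(library, version)].append(pattern)

    def lookup(self, environment: dict[str, str]) -> list[str]:
        """Return stored patterns that exactly match the target environment."""
        hits: list[str] = []
        for library, version in sorted(environment.items()):
            hits.extend(self._store.get((library, version), []))
        return hits


# Hypothetical library names and patterns.
memory = EnvironmentMemory()
memory.add("libA", "1.0", "use libA.load(path)")
memory.add("libA", "2.0", "use libA.read(path)")

seen = memory.lookup({"libA": "2.0"})
unseen = memory.lookup({"libB": "1.0"})
```

With exact-match keys, an unseen library or version returns nothing, which is consistent with the reported drop on unseen configurations. A store like this would also return a pattern that relies on a deprecated API unless entries are filtered separately.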

Figure 2. Overview of dataset construction via bidirectional environment traversal.


This review was generated by AI and checked by a human editor.