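[Paper Explained] TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation
TRACE is a benchmark that uses 1,000 tasks with stress tests to measure the execution efficiency of LLM-translated code across C++, Java, and Python, and evaluates 28 models on it. It shows that correctness does not imply efficiency and identifies common patterns of efficiency degradation.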
While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position TRACE as a principled foundation for efficiency-oriented evaluation.
Motivation and Objectives
- Motivate the need to evaluate execution efficiency in LLM-based code translation.
- Introduce TRACE as the first benchmark focused on efficiency for translated code.
- Assess a broad set of LLMs (28 models) across multiple programming languages.
- Characterize the prevalence and patterns of inefficiency in translated code.
- Provide a principled foundation for efficiency-oriented evaluation in code translation.
Proposed Method
- Define 1,000 efficiency-critical tasks in C++, Java, and Python.
- Augment tasks with stress tests to reveal efficiency degradations.
- Evaluate 28 representative LLMs on translated code for efficiency metrics.
- Analyze inefficiency by categories: algorithmic faults, language construct mismatches, and resource mismanagement.
- Compare correctness with time efficiency to assess correlation and gaps.
- Establish TRACE as a benchmark and evaluation framework for efficiency in code translation.
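The idea behind stress tests can be illustrated with a minimal sketch. This is not the TRACE harness or its metric, which this summary does not describe; the function names, input sizes, and the simple time-ratio measure below are all assumptions chosen for illustration. Two functionally equivalent implementations are timed on a small input and on a large one, and the gap is expected to become pronounced only on the large input.

```python
import time


def best_time(func, arg, repeats=3):
    """Return the best wall-clock time (seconds) of func(arg) over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(arg)
        best = min(best, time.perf_counter() - start)
    return best


def has_duplicate_reference(values):
    # Hypothetical efficient reference: a set gives O(1) average membership checks.
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False


def has_duplicate_translated(values):
    # Hypothetical "correct but slow" translation: a list makes every
    # membership check O(n), so the whole function is O(n^2).
    seen = []
    for v in values:
        if v in seen:
            return True
        seen.append(v)
    return False


def slowdown(reference, candidate, test_input):
    """Ratio of candidate time to reference time on one input (assumed measure)."""
    return best_time(candidate, test_input) / best_time(reference, test_input)


small_input = list(range(50))      # small-scale functional test
stress_input = list(range(5000))   # stress test: no duplicates, worst case

print("small :", slowdown(has_duplicate_reference, has_duplicate_translated, small_input))
print("stress:", slowdown(has_duplicate_reference, has_duplicate_translated, stress_input))
```

Both functions return the same answers, so a correctness-only test suite cannot tell them apart. The exact ratios depend on the machine, which is why the sketch prints them rather than asserting fixed values.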
Experimental Results
Research Questions
- RQ1: Does correctness of LLM-translated code reliably reflect its time efficiency?
- RQ2: How prevalent is efficiency degradation in correct translations, and what are its patterns?
- RQ3: What categories best explain observed inefficiencies (algorithmic faults, language construct mismatches, resource mismanagement) across languages?
- RQ4: Do prompt strategies at inference time meaningfully improve efficiency across models?
- RQ5: How do efficiency characteristics vary across C++, Java, and Python translations and across different LLMs?
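Key Findings
- Correctness is not a reliable proxy for time efficiency: the correctness leader, Claude-4-think, reaches only mid-level time efficiency and is outperformed by smaller open-source models such as Qwen2.5-Coder-14B-Instruct.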
- Among correct translations, 23.5% show pronounced inefficiency.
- These inefficient translations are distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%).
- Inference-time prompt strategies yield only modest improvements in efficiency.
- TRACE provides a principled foundation for efficiency-oriented evaluation of LLM-based code translation.
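To put the category figures in context, the short calculation below converts them into shares of all correct translations. It assumes the three category percentages partition the 23.5% of correct translations that are inefficient, which is consistent with the fact that they sum to 100%; the variable names are illustrative only.

```python
# Figures reported in the abstract.
inefficient_share = 23.5  # % of correct translations with pronounced inefficiency
category_share = {        # % of those inefficient translations, by category
    "algorithmic faults": 11.9,
    "language construct mismatches": 66.4,
    "resource mismanagement": 21.7,
}

# Assumption check: the categories cover the whole inefficient subset.
assert abs(sum(category_share.values()) - 100.0) < 1e-9

# Share of ALL correct translations that falls into each category.
share_of_correct = {
    name: inefficient_share * pct / 100.0
    for name, pct in category_share.items()
}

for name, pct in share_of_correct.items():
    print(f"{name}: {pct:.1f}% of correct translations")
# algorithmic faults: 2.8% of correct translations
# language construct mismatches: 15.6% of correct translations
# resource mismanagement: 5.1% of correct translations
```

Under this reading, roughly one in six correct translations is slowed down by a language construct mismatch alone.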
This summary was generated by AI and reviewed by human editors.