QUICK REVIEW

[Paper Digest] TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation

Zhihao Gong, Zeyu Sun | arXiv (Cornell University) | Mar 17, 2026

Natural Language Processing Techniques | Cited by 0

One-Sentence Summary

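TRACE is a benchmark that uses 1,000 tasks with stress tests to measure the execution efficiency of LLM-translated code across C++, Java, and Python, and evaluates 28 models on it. It shows that correctness does not imply efficiency and identifies common patterns of efficiency degradation.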

ABSTRACT

While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position TRACE as a principled foundation for efficiency-oriented evaluation.

Research Motivation and Goals

Motivate the need to evaluate execution efficiency in LLM-based code translation.
Introduce TRACE as the first benchmark focused on efficiency for translated code.
Assess a broad set of LLMs (28 models) across multiple programming languages.
Characterize the prevalence and patterns of inefficiency in translated code.
Provide a principled foundation for efficiency-oriented evaluation in code translation.

Proposed Method

Define 1,000 efficiency-critical tasks in C++, Java, and Python.
Augment tasks with stress tests to reveal efficiency degradations.
Evaluate 28 representative LLMs on translated code for efficiency metrics.
Analyze inefficiency by categories: algorithmic faults, language construct mismatches, and resource mismanagement.
Compare correctness with time efficiency to assess correlation and gaps.
Establish TRACE as a benchmark and evaluation framework for efficiency in code translation.
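
The idea that stress tests expose slowdowns which small-scale tests miss can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two functions, the input sizes, and the slowdown ratio are hypothetical and are not taken from TRACE, whose actual tasks and metrics are not described in this digest. The "translated" version returns the same answers as the reference but swaps a set for a list, one plausible reading of a language construct mismatch.

```python
import time


def has_duplicate_reference(values):
    """Hypothetical efficient version: set membership is O(1) on average."""
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False


def has_duplicate_translated(values):
    """Hypothetical translation: same result, but list membership is O(n)."""
    seen = []
    for v in values:
        if v in seen:
            return True
        seen.append(v)
    return False


def best_time(func, data, repeats=3):
    """Return the fastest wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(reference, candidate, data, repeats=3):
    """Ratio of candidate time to reference time on the same input."""
    # Efficiency is only compared when the outputs agree.
    assert reference(data) == candidate(data), "outputs differ"
    ref_t = best_time(reference, data, repeats)
    cand_t = best_time(candidate, data, repeats)
    return cand_t / max(ref_t, 1e-9)  # guard against a zero-length timing


if __name__ == "__main__":
    small = list(range(10))      # small-scale test input
    stress = list(range(3_000))  # no duplicates, so both loops run to the end
    for name, data in (("small", small), ("stress", stress)):
        ratio = slowdown(has_duplicate_reference, has_duplicate_translated, data)
        print(f"{name:>6}: translated/reference time ratio = {ratio:.1f}")
```

On the small input the two versions are hard to tell apart, while on the stress input the quadratic behaviour of the list-based version dominates. Exact ratios depend on the machine, so none are quoted here.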

Experimental Results

Research Questions

RQ1: Does correctness of LLM-translated code reliably reflect its time efficiency?
RQ2: How prevalent is efficiency degradation in correct translations, and what are its patterns?
RQ3: What categories best explain observed inefficiencies (algorithmic faults, language construct mismatches, resource mismanagement) across languages?
RQ4: Do prompt strategies at inference time meaningfully improve efficiency across models?
RQ5: How do efficiency characteristics vary across C++, Java, and Python translations and across different LLMs?

Key Findings

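Correctness is not a reliable proxy for time efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency and is outperformed by smaller open-source models such as Qwen2.5-Coder-14B-Instruct.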
Among correct translations, 23.5% show pronounced inefficiency.
The inefficient cases are distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%).
Inference-time prompt strategies yield only modest improvements in efficiency.
TRACE provides a principled foundation for efficiency-oriented evaluation of LLM-based code translation.
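
The percentages above can be combined into a rough estimate of how often each category appears among all correct translations. The short calculation below assumes that the three category figures are shares of the inefficient cases (they sum to 100%), which is an interpretation of the abstract rather than something the digest states outright. The function and variable names are illustrative.

```python
# Figures quoted in the findings above.
INEFFICIENT_SHARE = 23.5  # % of correct translations with pronounced inefficiency
CATEGORY_SHARES = {       # assumed to be % of the inefficient cases
    "algorithmic faults": 11.9,
    "language construct mismatches": 66.4,
    "resource mismanagement": 21.7,
}


def share_of_all_correct(inefficient_pct, category_pct):
    """Convert a share of inefficient cases into a share of all correct translations."""
    return inefficient_pct * category_pct / 100.0


def breakdown():
    """Return each category as a percentage of all correct translations."""
    return {
        name: share_of_all_correct(INEFFICIENT_SHARE, pct)
        for name, pct in CATEGORY_SHARES.items()
    }


if __name__ == "__main__":
    for name, pct in breakdown().items():
        print(f"{name}: {pct:.1f}% of correct translations")
    # prints 2.8%, 15.6%, and 5.1% respectively
```

Under that assumption, language construct mismatches alone would affect roughly one in six correct translations.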

This digest was generated by AI and reviewed by a human editor.