QUICK REVIEW

[Paper Review] TRACE: Evaluating Execution Efficiency of LLM-Based Code Translation

Zhihao Gong, Zeyu Sun|arXiv (Cornell University)|Mar 17, 2026
Natural Language Processing Techniques | Cited by 0
One-Sentence Summary

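TRACE is a benchmark of 1,000 tasks with stress tests that measures the execution efficiency of LLM-translated code across C++, Java, and Python, and uses it to evaluate 28 models. It shows that correctness does not imply efficiency and identifies common patterns of efficiency degradation.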

ABSTRACT

While Large Language Models (LLMs) have substantially improved the functional correctness of code translation, the critical dimension of execution efficiency remains overlooked. We present TRACE, the first benchmark to explicitly assess efficiency in LLM-translated code. TRACE includes 1,000 efficiency-critical tasks across C++, Java, and Python, each augmented with stress tests that reveal efficiency degradations often overlooked by small-scale tests. Using TRACE, we conduct an extensive evaluation of 28 representative LLMs and highlight several key insights: 1) Correctness is not a reliable proxy for efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency, outperformed by smaller open-source LLMs such as Qwen2.5-Coder-14B-Instruct. 2) Inefficiency is both prevalent and patterned: 23.5% of correct translations exhibit pronounced inefficiency, distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%). 3) Inference-time prompt strategies bring only modest improvements, suggesting that current LLMs lack intrinsic efficiency awareness. Together, our results establish efficiency as an essential dimension of code translation and position TRACE as a principled foundation for efficiency-oriented evaluation.

Research Motivation and Objectives

  • Motivate the need to evaluate execution efficiency in LLM-based code translation.
  • Introduce TRACE as the first benchmark focused on efficiency for translated code.
  • Assess a broad set of LLMs (28 models) across multiple programming languages.
  • Characterize the prevalence and patterns of inefficiency in translated code.
  • Provide a principled foundation for efficiency-oriented evaluation in code translation.

Proposed Method

  • Define 1,000 efficiency-critical tasks in C++, Java, and Python.
  • Augment tasks with stress tests to reveal efficiency degradations.
  • Evaluate 28 representative LLMs on translated code for efficiency metrics.
  • Analyze inefficiency by categories: algorithmic faults, language construct mismatches, and resource mismanagement.
  • Compare correctness with time efficiency to assess correlation and gaps.
  • Establish TRACE as a benchmark and evaluation framework for efficiency in code translation.
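
The role of the stress tests can be illustrated with a minimal sketch. It is not taken from TRACE itself: the two functions, the input sizes, and the timing helper are all hypothetical, chosen only to show how two translations that agree on a small test can diverge sharply on a large input.

```python
import time


def has_duplicate_list(values):
    # Hypothetical "translated" variant: each membership test scans a list,
    # so the whole loop is O(n^2) in the worst case.
    seen = []
    for v in values:
        if v in seen:
            return True
        seen.append(v)
    return False


def has_duplicate_set(values):
    # Hypothetical reference variant: set membership is O(1) on average,
    # so the whole loop is O(n).
    seen = set()
    for v in values:
        if v in seen:
            return True
        seen.add(v)
    return False


def best_time(fn, arg, repeats=3):
    # Best wall-clock time in seconds over a few runs, to damp timing noise.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


small_input = [3, 1, 4, 1, 5]      # small-scale test: both variants pass
stress_input = list(range(5000))   # no duplicates, so every element is checked

small_agree = has_duplicate_list(small_input) == has_duplicate_set(small_input)
stress_agree = has_duplicate_list(stress_input) == has_duplicate_set(stress_input)

slow = best_time(has_duplicate_list, stress_input)
fast = best_time(has_duplicate_set, stress_input)
slowdown = slow / max(fast, 1e-9)  # guard against a zero denominator

print(f"outputs agree: small={small_agree}, stress={stress_agree}")
print(f"slowdown on stress input: {slowdown:.1f}x")
```

Both variants are functionally correct, so a correctness-only evaluation would treat them as equivalent; only the large input exposes the gap. The exact slowdown depends on the machine and is not a TRACE measurement.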

Experimental Results

Research Questions

  • RQ1: Does correctness of LLM-translated code reliably reflect its time efficiency?
  • RQ2: How prevalent is efficiency degradation in correct translations, and what are its patterns?
  • RQ3: What categories best explain observed inefficiencies (algorithmic faults, language construct mismatches, resource mismanagement) across languages?
  • RQ4: Do prompt strategies at inference time meaningfully improve efficiency across models?
  • RQ5: How do efficiency characteristics vary across C++, Java, and Python translations and across different LLMs?

Key Findings

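  • Correctness is not a reliable proxy for time efficiency: the correctness leader Claude-4-think achieves only mid-level time efficiency and is outperformed by smaller open-source models such as Qwen2.5-Coder-14B-Instruct.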
  • Among correct translations, 23.5% show pronounced inefficiency.
  • These inefficient translations are distributed across algorithmic faults (11.9%), language construct mismatches (66.4%), and resource mismanagement (21.7%).
  • Inference-time prompt strategies yield only modest improvements in efficiency.
  • TRACE provides a principled foundation for efficiency-oriented evaluation of LLM-based code translation.
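
To put the category figures in perspective, the short calculation below rescales them to the full set of correct translations. It assumes that the three category percentages are shares of the inefficient subset (they sum to 100%), which is how the abstract reads but is not stated explicitly here. The variable names are illustrative.

```python
# Figures quoted in the abstract (percentages).
inefficient_share = 23.5  # share of correct translations with pronounced inefficiency
category_share = {        # assumed to be shares of the inefficient subset
    "algorithmic faults": 11.9,
    "language construct mismatches": 66.4,
    "resource mismanagement": 21.7,
}

# Sanity check: the category shares should cover the whole inefficient subset.
total = sum(category_share.values())

# Rescale each category to a share of all correct translations.
overall = {
    name: inefficient_share * pct / 100
    for name, pct in category_share.items()
}

for name, pct in overall.items():
    print(f"{name}: {pct:.1f}% of all correct translations")
```

Under that assumption, language construct mismatches alone would account for roughly 15.6% of all correct translations, against about 2.8% for algorithmic faults and 5.1% for resource mismanagement.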

This review was generated by AI and reviewed by human editors.