QUICK REVIEW

[Paper Review] ArchBench: Benchmarking Generative-AI for Software Architecture Tasks

Bassam Adnan, Aviral Gupta | arXiv (Cornell University) | Mar 18, 2026

Software Engineering Research | Cited by 0

One-Sentence Summary

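TL;DR: ArchBench provides a unified CLI and web-based platform for benchmarking GenAI/LLM capabilities on software architecture tasks, with a modular plugin architecture, standardized evaluation, and a community-driven leaderboard.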

ABSTRACT

Benchmarks for large language models (LLMs) have progressed from snippet-level function generation to repository-level issue resolution, yet they overwhelmingly target implementation correctness. Software architecture tasks remain under-specified and difficult to compare across models, despite their central role in maintaining and evolving complex systems. We present ArchBench, the first unified platform for benchmarking LLM capabilities on software architecture tasks. ArchBench provides a command-line tool with a standardized pipeline for dataset download, inference with trajectory logging, and automated evaluation, alongside a public web interface with an interactive leaderboard. The platform is built around a plugin architecture where each task is a self-contained module, making it straightforward for the community to contribute new architectural tasks and evaluation results. We use the term LLMs broadly to encompass generative AI (GenAI) solutions for software engineering, including both standalone models and LLM-based coding agents equipped with tools. Both the CLI tool and the web platform are openly available to support reproducible research and community-driven growth of architectural benchmarking.

Research Motivation and Goals

Address the lack of uniform benchmarks for GenAI in software architecture tasks.
Provide a centralized, extensible platform to aggregate architecture-focused tasks and evaluations.
Enable reproducible, standardized evaluation across models through a CLI and web leaderboard.

Proposed Method

Three-stage CLI pipeline: download, inference, evaluation with task-specific plugins.
Plugin-based architecture where each task module provides dataset loading, prompts, response parsing, and metrics.
Full trajectory logging of prompts, responses, token usage, and latency for reproducibility.
Uniform provider interface to dispatch prompts to LLMs and collect structured predictions.
Task-specific metrics spanning NLP similarity (ROUGE, BLEU, METEOR, BERTScore), structure/traceability metrics (precision/recall/F1), code metrics (CodeBLEU, test pass rates), and qualitative LLM-as-judge options.
Web-based leaderboard implemented in React to compare models across tasks.
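
The pipeline and plugin contract described above can be sketched roughly as follows. This is an illustrative sketch only, not ArchBench's actual code: the names `TaskPlugin`, `TrajectoryStep`, `RunResult`, `run_pipeline`, `EchoTask`, and `echo_provider` are all hypothetical, and the use of Python is itself an assumption. The provider is modeled as a plain callable returning the response text and a token count, and the toy task exists only to exercise the flow.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Protocol


class TaskPlugin(Protocol):
    """Hypothetical plugin contract: dataset loading, prompts, parsing, metrics."""

    def load_dataset(self) -> list[dict]: ...
    def build_prompt(self, example: dict) -> str: ...
    def parse_response(self, raw: str) -> str: ...
    def score(self, predictions: list[str], examples: list[dict]) -> dict[str, float]: ...


@dataclass
class TrajectoryStep:
    """One logged model call: prompt, raw response, token usage, latency."""
    prompt: str
    response: str
    tokens: int
    latency_s: float


@dataclass
class RunResult:
    metrics: dict[str, float]
    trajectory: list[TrajectoryStep] = field(default_factory=list)


def run_pipeline(task: TaskPlugin, provider: Callable[[str], tuple[str, int]]) -> RunResult:
    examples = task.load_dataset()              # stage 1: download / load
    trajectory, predictions = [], []
    for example in examples:                    # stage 2: inference with logging
        prompt = task.build_prompt(example)
        start = time.perf_counter()
        raw, tokens = provider(prompt)
        latency = time.perf_counter() - start
        trajectory.append(TrajectoryStep(prompt, raw, tokens, latency))
        predictions.append(task.parse_response(raw))
    metrics = task.score(predictions, examples)  # stage 3: evaluation
    return RunResult(metrics, trajectory)


class EchoTask:
    """Toy task used only to exercise the sketch; not an ArchBench task."""

    def load_dataset(self):
        return [
            {"input": "use a message queue", "reference": "use a message queue"},
            {"input": "adopt microservices", "reference": "adopt a monolith"},
        ]

    def build_prompt(self, example):
        return f"Repeat: {example['input']}"

    def parse_response(self, raw):
        return raw.strip()

    def score(self, predictions, examples):
        hits = sum(p == ex["reference"] for p, ex in zip(predictions, examples))
        return {"exact_match": hits / len(examples)}


def echo_provider(prompt: str) -> tuple[str, int]:
    """Stand-in for an LLM call: echoes the input and counts whitespace tokens."""
    text = prompt.removeprefix("Repeat: ")
    return f" {text} ", len(prompt.split())


result = run_pipeline(EchoTask(), echo_provider)
print(result.metrics)
print(len(result.trajectory), "logged steps")
```

Keeping the provider behind a single callable means the same loop can serve a standalone model or a tool-using agent, and the logged trajectory can be replayed or inspected after the run without re-querying the model.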

Figure 1: Annotated screenshot of the ArchBench web interface. Circled elements highlight the platform’s key sections: the leaderboard for comparing model performance across tasks, task descriptions with evaluation metrics, source papers for each dataset, and contribution guidelines for community su

Experimental Results

Research Questions

RQ1: How do architectural reasoning abilities vary across model families?
RQ2: Do performance patterns transfer across different architecture tasks or correlate between tasks?
RQ3: How do prompting strategies impact output quality in software-architecture-focused GenAI tasks?
RQ4: What is the value of trajectory logging for diagnosing failures in architectural reasoning?

Key Findings

ArchBench aggregates five architecture tasks with multiple model results per task.
Two tasks (ADR Generation and Traceability Link Recovery) have fully automated evaluation pipelines in the CLI.
New tasks can be added as plugins without modifying the core framework.
Results are accessible via a public leaderboard and can be contributed through pull requests.
The replication package and datasets are open-source and CC BY 4.0 licensed.
For some tasks, the platform supports end-to-end runs, from dataset download to scored reports, in a single command.
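
To make the "plugins without modifying the core framework" finding concrete, here is a minimal sketch of one way a registry could let a new task attach its scorer, paired with a set-based precision/recall/F1 computation of the kind used for Traceability Link Recovery. Everything here is an assumption for illustration: `TASK_REGISTRY`, `register_task`, `score_links`, and the example requirement and file identifiers are hypothetical and are not taken from ArchBench.

```python
from typing import Callable

# Hypothetical core-side registry: the core only looks tasks up by name.
TASK_REGISTRY: dict[str, Callable] = {}


def register_task(name: str):
    """Decorator that adds a scorer to the registry without touching core code."""
    def decorator(fn: Callable) -> Callable:
        if name in TASK_REGISTRY:
            raise ValueError(f"task already registered: {name}")
        TASK_REGISTRY[name] = fn
        return fn
    return decorator


Link = tuple[str, str]  # (requirement id, artifact id); format is illustrative


@register_task("traceability_link_recovery")
def score_links(predicted: set[Link], gold: set[Link]) -> dict[str, float]:
    """Set-based precision, recall, and F1 over predicted trace links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up example data, used only to exercise the scorer.
gold = {("REQ-1", "auth.py"), ("REQ-2", "db.py"), ("REQ-3", "api.py")}
predicted = {
    ("REQ-1", "auth.py"),
    ("REQ-2", "db.py"),
    ("REQ-2", "cache.py"),
    ("REQ-4", "ui.py"),
}

scores = TASK_REGISTRY["traceability_link_recovery"](predicted, gold)
print(scores)
```

With two of four predicted links correct and two of three gold links recovered, precision is 0.5, recall is 2/3, and F1 is 4/7.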

Figure 2: ArchBench platform architecture showing the three pipeline stages (Download, Inference, Evaluation) and the leaderboard web interface.

This review was generated by AI and reviewed by a human editor.