QUICK REVIEW

[Paper Review] ArchBench: Benchmarking Generative-AI for Software Architecture Tasks

Bassam Adnan, Aviral Gupta|arXiv (Cornell University)|Mar 18, 2026

Software Engineering Research | Citations: 0

One-Line Summary

TL;DR: ArchBench provides a unified CLI and web-based platform for benchmarking GenAI/LLM capabilities on software architecture tasks, with a modular plugin architecture, standardized evaluation, and a community-driven leaderboard.

ABSTRACT

Benchmarks for large language models (LLMs) have progressed from snippet-level function generation to repository-level issue resolution, yet they overwhelmingly target implementation correctness. Software architecture tasks remain under-specified and difficult to compare across models, despite their central role in maintaining and evolving complex systems. We present ArchBench, the first unified platform for benchmarking LLM capabilities on software architecture tasks. ArchBench provides a command-line tool with a standardized pipeline for dataset download, inference with trajectory logging, and automated evaluation, alongside a public web interface with an interactive leaderboard. The platform is built around a plugin architecture where each task is a self-contained module, making it straightforward for the community to contribute new architectural tasks and evaluation results. We use the term LLMs broadly to encompass generative AI (GenAI) solutions for software engineering, including both standalone models and LLM-based coding agents equipped with tools. Both the CLI tool and the web platform are openly available to support reproducible research and community-driven growth of architectural benchmarking.

Motivation and Objectives

Address the lack of uniform benchmarks for GenAI in software architecture tasks.
Provide a centralized, extensible platform to aggregate architecture-focused tasks and evaluations.
Enable reproducible, standardized evaluation across models through a CLI and web leaderboard.

Proposed Method

Three-stage CLI pipeline (download, inference, evaluation) driven by task-specific plugins.
Plugin-based architecture where each task module provides dataset loading, prompts, response parsing, and metrics.
Full trajectory logging of prompts, responses, token usage, and latency for reproducibility.
Uniform provider interface to dispatch prompts to LLMs and collect structured predictions.
Task-specific metrics spanning NLP similarity (ROUGE, BLEU, METEOR, BERTScore), structure/traceability metrics (precision/recall/F1), code metrics (CodeBLEU, test pass rates), and qualitative LLM-as-judge options.
Web-based leaderboard implemented in React to compare models across tasks.
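
The plugin and pipeline ideas listed above can be illustrated with a minimal sketch. This is not ArchBench's actual API: every name here (`TaskPlugin`, `TrajectoryRecord`, `run_pipeline`, `ToyStyleTask`, `echo_provider`) is hypothetical, the toy task and its two examples are invented for illustration, and the "provider" is a stand-in function rather than a real model call. The sketch only shows how a self-contained task module, a uniform provider callable, and per-example trajectory logging might fit together.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Protocol


class TaskPlugin(Protocol):
    """Hypothetical task interface: one self-contained module per task."""

    name: str

    def load_dataset(self) -> list[dict]: ...
    def build_prompt(self, example: dict) -> str: ...
    def parse_response(self, raw: str) -> str: ...
    def score(self, predictions: list[str], examples: list[dict]) -> dict[str, float]: ...


@dataclass
class TrajectoryRecord:
    """One logged model call: prompt, response, token usage, latency."""

    task: str
    prompt: str
    response: str
    total_tokens: int
    latency_s: float


# Assumed provider shape: takes a prompt, returns (raw response, tokens used).
Provider = Callable[[str], tuple[str, int]]


def run_pipeline(task: TaskPlugin, provider: Provider, log_path: str) -> dict[str, float]:
    """Run download -> inference -> evaluation, logging each call as a JSON line."""
    examples = task.load_dataset()  # stage 1: obtain the dataset
    predictions = []
    with open(log_path, "w", encoding="utf-8") as log:
        for example in examples:  # stage 2: inference with trajectory logging
            prompt = task.build_prompt(example)
            start = time.perf_counter()
            raw, tokens = provider(prompt)
            latency = time.perf_counter() - start
            record = TrajectoryRecord(task.name, prompt, raw, tokens, latency)
            log.write(json.dumps(asdict(record)) + "\n")
            predictions.append(task.parse_response(raw))
    return task.score(predictions, examples)  # stage 3: task-specific metrics


class ToyStyleTask:
    """Invented toy task: name the architectural style of a description."""

    name = "toy-style-classification"

    def load_dataset(self) -> list[dict]:
        return [
            {"description": "Services communicate through a message broker.",
             "label": "event-driven"},
            {"description": "UI, business logic, and data access are stacked.",
             "label": "layered"},
        ]

    def build_prompt(self, example: dict) -> str:
        return f"Name the architectural style: {example['description']}"

    def parse_response(self, raw: str) -> str:
        return raw.strip().lower()

    def score(self, predictions: list[str], examples: list[dict]) -> dict[str, float]:
        correct = sum(p == e["label"] for p, e in zip(predictions, examples))
        return {"accuracy": correct / len(examples)}


def echo_provider(prompt: str) -> tuple[str, int]:
    """Stand-in for a real model call; always answers 'Layered'."""
    return " Layered ", len(prompt.split())
```

Calling `run_pipeline(ToyStyleTask(), echo_provider, "trajectory.jsonl")` writes one JSON line per example and returns the task's score dictionary. Keeping prompt construction, parsing, and scoring inside the task class is what would let a new task be added without touching the pipeline function.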

Figure 1: Annotated screenshot of the ArchBench web interface. Circled elements highlight the platform’s key sections: the leaderboard for comparing model performance across tasks, task descriptions with evaluation metrics, source papers for each dataset, and contribution guidelines for community su

Experimental Results

Research Questions

RQ1: How do architectural reasoning abilities vary across model families?
RQ2: Do performance patterns transfer across different architecture tasks or correlate between tasks?
RQ3: How do prompting strategies impact output quality in software-architecture-focused GenAI tasks?
RQ4: What is the value of trajectory logging for diagnosing failures in architectural reasoning?

Key Findings

ArchBench aggregates five architecture tasks with multiple model results per task.
Two tasks (ADR Generation and Traceability Link Recovery) have fully automated evaluation pipelines in the CLI.
New tasks can be added as plugins without modifying the core framework.
Results are accessible via a public leaderboard and can be contributed through pull requests.
The replication package and datasets are open-source and CC BY 4.0 licensed.
For some tasks, the platform enables end-to-end runs, from dataset download to scored reports, in a single command.
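
To make the automated evaluation of a task such as Traceability Link Recovery more concrete, here is a minimal sketch of set-based precision, recall, and F1 over predicted links. It assumes each link is a `(source, target)` pair and that a prediction counts as correct only on an exact match; the function name `link_prf` and the sample identifiers are hypothetical, and the review does not state how ArchBench itself computes these metrics.

```python
def link_prf(predicted, gold):
    """Precision, recall, and F1 for predicted vs. gold trace links.

    Links are assumed to be hashable (source, target) pairs; duplicates
    are collapsed because both inputs are converted to sets.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Invented example data: three gold links, two predictions, one exact match.
gold_links = {("REQ-1", "Auth"), ("REQ-2", "Billing"), ("REQ-3", "Search")}
predicted_links = {("REQ-1", "Auth"), ("REQ-2", "Search")}
scores = link_prf(predicted_links, gold_links)
```

With one of two predictions correct and one of three gold links recovered, precision is 1/2, recall is 1/3, and F1 is their harmonic mean, 0.4.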

Figure 2: ArchBench platform architecture showing the three pipeline stages (Download, Inference, Evaluation) and the leaderboard web interface.

This review was generated by AI and checked by a human editor.