QUICK REVIEW

[Paper Review] Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Thibaud Gloaguen, Niels Mündler | arXiv (Cornell University) | 2026-02-12

Software Engineering Research | Citations: 0

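One-Line Summary

This paper systematically evaluates repository-level context files (AGENTS.md) and finds that developer-written context files yield only marginal performance gains, whereas automatically generated context files tend to lower performance and raise cost; context files also prompt more exploration and testing.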

ABSTRACT

A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent developers, there is currently no rigorous investigation into whether such context files are actually effective for real-world tasks. In this work, we study this question and evaluate coding agents' task completion performance in two complementary settings: established SWE-bench tasks from popular repositories, with LLM-generated context files following agent-developer recommendations, and a novel collection of issues from repositories containing developer-committed context files. Across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%. Behaviorally, both LLM-generated and developer-provided context files encourage broader exploration (e.g., more thorough testing and file traversal), and coding agents tend to respect their instructions. Ultimately, we conclude that unnecessary requirements from context files make tasks harder, and human-written context files should describe only minimal requirements.

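Motivation and Objectives

Evaluate whether repository-level context files improve autonomous coding task completion.
Build AGENTbench to benchmark the effect of context files on real-world tasks.
Compare developer-provided and automatically generated context files across multiple agents and prompts.
Examine the behavioral changes and cost implications of including context files.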

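Proposed Method

Construct AGENTbench from real GitHub PRs in repositories that contain developer-written context files.
Evaluate four coding agents on SWE-bench Lite and AGENTbench under three settings: no context file, LLM-generated context file, and human-provided context file.
Measure success rate, number of steps to resolution, and LLM inference cost.
Analyze agent traces to understand changes in exploration, testing, and reasoning.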
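The measurement step above can be sketched as a small aggregation over per-run records. This is only an illustrative sketch, not the paper's evaluation harness: the `Run` record layout, the setting labels ("none", "llm", "human"), the helper names, and the toy numbers are all assumptions made up for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One agent attempt on one task (hypothetical record layout)."""
    setting: str      # assumed labels: "none", "llm", or "human"
    resolved: bool    # whether the agent's patch solved the task
    steps: int        # number of agent steps taken
    cost_usd: float   # LLM inference cost of the run


def summarize(runs):
    """Aggregate success rate, mean steps, and mean cost per setting."""
    by_setting = defaultdict(list)
    for run in runs:
        by_setting[run.setting].append(run)
    return {
        setting: {
            "success_rate": mean(1.0 if r.resolved else 0.0 for r in group),
            "mean_steps": mean(r.steps for r in group),
            "mean_cost": mean(r.cost_usd for r in group),
        }
        for setting, group in by_setting.items()
    }


def relative_change(value, baseline):
    """Relative change of value against baseline (0.25 means +25%)."""
    return (value - baseline) / baseline


# Made-up toy data for illustration only; NOT results from the paper.
runs = [
    Run("none", True, 10, 1.00),
    Run("none", False, 14, 1.00),
    Run("llm", False, 16, 1.25),
    Run("llm", True, 12, 1.25),
]
summary = summarize(runs)
cost_delta = relative_change(
    summary["llm"]["mean_cost"], summary["none"]["mean_cost"]
)
print(summary["none"]["success_rate"], summary["llm"]["mean_steps"], cost_delta)
```

Comparing each context-file setting against the "none" baseline in this way is one plausible reading of how a relative cost increase would be computed; the paper may aggregate differently (for example, per agent or per model).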

Figure 1 : Overview of our evaluation pipeline. We begin with real-world repositories and tasks derived from past pull requests. For each repository state, we generate three settings: (1) If a developer-provided context file exists, we include it in the repository. In (2), we omit the co…

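Experimental Results

Research Questions

RQ1: Do repository-level context files increase the success rate of coding agents on real-world tasks?
RQ2: How do automatically generated context files, compared with developer-provided ones, affect agent behavior and cost?
RQ3: Do context files provide a meaningful repository overview that helps resolve tasks?
RQ4: How do context files affect agents' testing and exploration behavior?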

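Key Findings

Context files tend to reduce task success rates compared with providing no repository context.
LLM-generated context files reduce performance only slightly on average while increasing inference cost by over 20%.
Developer-provided context files yield a marginal average improvement of about 4% over using no context file.
Context files increase exploration, testing, and reasoning, which raises cost without a clear benefit as a repository overview.
When documentation is removed, LLM-generated context files can outperform developer-written ones, suggesting that many context file sections are redundant in typical repositories.
Agents generally follow the instructions in context files, but the files do not act as effective repository overviews.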

Figure 2 : Distribution of AGENTbench instances across 12 open-source GitHub repositories, each containing context files.

This review was generated by AI and reviewed by a human editor.