QUICK REVIEW

[Paper Review] Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations

A. Korn, Lea Zaruchas | arXiv (Cornell University) | 2026-01-05

Software Engineering Research | Citations: 0

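One-Line Summary

This paper empirically analyzes how prompts are reported in LLM-based SE research and proposes an evidence-based guideline that distinguishes essential, desirable, and exceptional reporting elements.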

ABSTRACT

Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a critical factor in system performance and behavior. Despite their growing role in SE research, prompt-related decisions are rarely documented in a systematic or transparent manner, hindering reproducibility and comparability across studies. To address this gap, we conducted a two-phase empirical study. First, we analyzed nearly 300 papers published at the top-3 SE conferences since 2022 to assess how prompt design, testing, and optimization are currently reported. Second, we surveyed 105 program committee members from these conferences to capture their expectations for prompt reporting in LLM-driven research. Based on the findings, we derived a structured guideline that distinguishes essential, desirable, and exceptional reporting elements. Our results reveal significant misalignment between current practices and reviewer expectations, particularly regarding version disclosure, prompt justification, and threats to validity. We present our guideline as a step toward improving transparency, reproducibility, and methodological rigor in LLM-based SE research.

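Research Motivation and Goals

Assess how current LLM-based SE research reports prompt design, testing, and optimization.
Capture reviewers' expectations for prompt reporting through a survey of program committee (PC) members of ICSE, FSE, and ASE.
Identify the gaps between current practice and community expectations, and propose a structured reporting guideline.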

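Proposed Method

Conducted a two-phase empirical study: a literature analysis of nearly 300 SE papers published since 2022, and a survey of 105 PC members.
Developed a coding/extraction schema, refined over iterative rounds to ensure consistency among the six authors.
Derived the guideline by comparing actual reporting practice with reviewer expectations.
Provided the data and code in a replication package.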

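Experimental Results

Research Questions

RQ1: How do researchers currently report prompts in SE research papers?
RQ2: What expectations do SE researchers have for creating, evaluating, and reporting prompts?
RQ3: How well do current practices align with these expectations?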

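Key Findings

Most papers name the LLM they used, but the exact version is often missing: only 16.43% state it.
69.93% report at least one configuration parameter, most commonly temperature and token limit.
75.17% describe their prompts in full or in part, 69.58% present prompts verbatim, and 58.74% justify how the prompts were constructed.
62.24% report the prompt engineering techniques used, most commonly shot-based prompting and chain-of-thought.
46.5% mention prompt tuning, 44.06% describe multiple prompt variants, and 23.43% discuss prompting as a threat to validity.
The empirically derived guideline exposes gaps between current practice and reviewer expectations, and underscores the need for greater transparency and reproducibility.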
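
The reporting elements measured in these findings can be kept as a simple per-paper checklist. The Python sketch below is illustrative only: the `PromptReport` class, its field names, and the `missing_elements` helper are hypothetical and do not reproduce the paper's guideline or its essential/desirable/exceptional tiers, which this review does not enumerate. It only flags which of the elements listed above a given write-up leaves empty.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PromptReport:
    """Hypothetical record of the prompt-related details one paper reports."""
    model_name: Optional[str] = None            # e.g. the LLM family used
    model_version: Optional[str] = None         # exact version or snapshot
    config: dict = field(default_factory=dict)  # e.g. temperature, token limit
    prompt_verbatim: Optional[str] = None       # the prompt text as used
    prompt_justification: Optional[str] = None  # why the prompt is built this way
    techniques: list = field(default_factory=list)       # e.g. chain-of-thought
    tuning_notes: Optional[str] = None          # how the prompt was tuned
    prompt_variants: list = field(default_factory=list)  # other prompts tried
    validity_threats: Optional[str] = None      # prompting as a threat to validity


def missing_elements(report: PromptReport) -> list:
    """Return the names of fields left empty or unset, in declaration order."""
    return [name for name, value in vars(report).items() if not value]


# Example: a write-up that names the model and shows the prompt, nothing more.
example = PromptReport(
    model_name="example-model",  # placeholder, not a real model identifier
    config={"temperature": 0.0, "max_tokens": 512},
    prompt_verbatim="Summarize the following function: ...",
)
print(missing_elements(example))
```

For the example record, the helper lists the model version, prompt justification, techniques, tuning notes, prompt variants, and validity threats as unreported, which mirrors the kinds of omissions the findings describe.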


This review was generated by AI and checked by a human editor.