QUICK REVIEW

[Paper Review] From Prompt Engineering to Prompt Science With Human in the Loop

Chirag Shah|arXiv (Cornell University)|2024. 01. 01.

Complex Systems and Decision Making | Citations: 7

One-Line Summary

This paper turns ad hoc prompt engineering into a verifiable, replicable prompt science for LLM-assisted research, using a four-phase human-in-the-loop method inspired by qualitative coding.

ABSTRACT

As LLMs make their way into many aspects of our lives, one place that warrants increased scrutiny with LLM usage is scientific research. Using LLMs for generating or analyzing data for research purposes is gaining popularity. But when such application is marred with ad-hoc decisions and engineering solutions, we need to be concerned about how it may affect that research, its findings, or any future works based on that research. We need a more scientific approach to using LLMs in our research. While there are several active efforts to support more systematic construction of prompts, they are often focused more on achieving desirable outcomes rather than producing replicable and generalizable knowledge with sufficient transparency, objectivity, or rigor. This article presents a new methodology inspired by codebook construction through qualitative methods to address that. Using humans in the loop and a multi-phase verification processes, this methodology lays a foundation for more systematic, objective, and trustworthy way of applying LLMs for analyzing data. Specifically, we show how a set of researchers can work through a rigorous process of labeling, deliberating, and documenting to remove subjectivity and bring transparency and replicability to prompt generation process. A set of experiments are presented to show how this methodology can be put in practice.

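Research Motivation and Goals

Calls for scientific rigor when LLMs are used in research and identifies the risks of ad hoc prompt engineering.
Introduces a systematic, transparent process for developing prompts and evaluating LLM outputs.
Applies multi-rater qualitative coding to build a replicable codebook for prompt construction.
Provides a multi-phase pipeline intended to make prompts and responses reliable, generalizable, and verifiable.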

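Proposed Method

Adopts the codebook-construction approach of qualitative coding to build prompts.
Implements a four-phase pipeline (setup, criteria establishment with ICR, iterative prompt development, and verification) with human-in-the-loop evaluation.
Requires at least two qualified researchers and computes inter-coder reliability (ICR), for example with Cohen's kappa or Krippendorff's alpha.
Iteratively revises the codebook (criteria) and the prompt in response to disagreements between raters, improving agreement and generalizability.
Optionally verifies the full pipeline on a subset of test data and computes ICR for the final evaluation.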
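
The ICR check in the steps above can be illustrated with a minimal sketch of Cohen's kappa for two raters. This is not the authors' implementation. The label values, the `needs_revision` helper, and the 0.7 threshold are assumptions made for illustration only; the review does not state which threshold the paper uses.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters who labeled the same items in the same order."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: derived from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout, so kappa is undefined;
        # this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def needs_revision(kappa: float, threshold: float = 0.7) -> bool:
    """Hypothetical stopping rule: revise the codebook while agreement is low.

    The 0.7 default is an assumed value, not one taken from the paper.
    """
    return kappa < threshold


# Illustrative labels from two researchers for ten LLM responses (made-up data).
rater_a = ["ok", "ok", "bad", "ok", "bad", "bad", "ok", "bad", "ok", "ok"]
rater_b = ["ok", "ok", "bad", "bad", "bad", "bad", "ok", "ok", "ok", "ok"]

kappa = cohen_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}, revise codebook: {needs_revision(kappa)}")
```

In this made-up example the raters agree on 8 of 10 items, but chance agreement is 0.52, so kappa is about 0.58 and the assumed rule would send the codebook back for another round of deliberation. Krippendorff's alpha, the other measure named above, handles more than two raters and missing labels, and would need a different computation.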

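Experimental Results

Research Questions

RQ1: How can prompt generation for LLMs be made verifiable, reliable, and replicable across datasets, models, and time?
RQ2: What role do human raters and criteria such as a codebook play in achieving objective, transparent prompt generation?
RQ3: Can a multi-phase process inspired by qualitative coding reduce subjectivity and bias in LLM-based data labeling or analysis?
RQ4: What are the costs and benefits of implementing prompt science compared with traditional prompt engineering?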

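Key Findings

A multi-phase prompt construction process with humans in the loop yields prompts that are more transparent, verifiable, and replicable.
Involving multiple researchers and measuring ICR formally reduces individual bias and improves the consistency of evaluation.
Documenting deliberations and decisions strengthens openness and replicability for future researchers.
Compared with ad hoc prompt engineering, the proposed approach costs more but improves quality and understanding.
The optional verification phase can provide further assurance of the pipeline's reliability across data samples.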

This review was generated by AI and checked by a human editor.