QUICK REVIEW

[Paper Review] From Prompt Engineering to Prompt Science With Human in the Loop

Chirag Shah|arXiv (Cornell University)|Jan 1, 2024

Complex Systems and Decision Making | Cited by 7

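One-line summary

This paper proposes a four-phase human-in-the-loop methodology inspired by qualitative coding that turns ad hoc prompt design into verifiable, replicable prompt science, thereby advancing LLM-assisted research.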

ABSTRACT

As LLMs make their way into many aspects of our lives, one place that warrants increased scrutiny with LLM usage is scientific research. Using LLMs for generating or analyzing data for research purposes is gaining popularity. But when such application is marred with ad-hoc decisions and engineering solutions, we need to be concerned about how it may affect that research, its findings, or any future works based on that research. We need a more scientific approach to using LLMs in our research. While there are several active efforts to support more systematic construction of prompts, they are often focused more on achieving desirable outcomes rather than producing replicable and generalizable knowledge with sufficient transparency, objectivity, or rigor. This article presents a new methodology inspired by codebook construction through qualitative methods to address that. Using humans in the loop and a multi-phase verification processes, this methodology lays a foundation for more systematic, objective, and trustworthy way of applying LLMs for analyzing data. Specifically, we show how a set of researchers can work through a rigorous process of labeling, deliberating, and documenting to remove subjectivity and bring transparency and replicability to prompt generation process. A set of experiments are presented to show how this methodology can be put in practice.

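Motivation and objectives

Calls attention to the need for scientific rigor when LLMs are used in research and identifies the risks of ad hoc prompt design.
Introduces a systematic, transparent process for developing prompts and evaluating LLM outputs.
Applies qualitative coding with multiple assessors to build a replicable codebook for prompt construction.
Provides a multi-phase pipeline intended to ensure the reliability, generalizability, and verifiability of prompts and responses.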

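Proposed method

Constructs prompts using the codebook-building approach from qualitative coding.
Implements a four-phase pipeline with human-in-the-loop evaluation: setup, establishing criteria with inter-coder reliability (ICR), iterative prompt development, and validation.
Requires at least two qualified researchers and computes ICR with a statistic such as Cohen's κ or Krippendorff's α.
Iteratively revises the codebook (criteria) and the prompt based on disagreements among assessors, improving agreement and generalizability.
Optionally validates the full pipeline on a subset of the test data and computes ICR for the final evaluation.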
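
The ICR step can be made concrete with a short example. The following is a minimal sketch of Cohen's κ for two coders, plus a helper that lists the items they disagree on so those items can be taken into deliberation. The function names, label values, and item IDs are illustrative assumptions and do not come from the paper; Krippendorff's α would need a different computation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items (nominal labels)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both coders chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one and the same label throughout
    return (observed - expected) / (1 - expected)


def disagreements(items, labels_a, labels_b):
    """Return (item, label_a, label_b) for every item the coders labeled differently."""
    return [(i, a, b) for i, a, b in zip(items, labels_a, labels_b) if a != b]


# Hypothetical labels from two researchers on eight LLM responses.
items = [f"resp-{k}" for k in range(8)]
coder_a = ["rel", "rel", "irr", "rel", "irr", "irr", "rel", "irr"]
coder_b = ["rel", "irr", "irr", "rel", "irr", "rel", "rel", "irr"]

print(cohen_kappa(coder_a, coder_b))  # 0.5
for row in disagreements(items, coder_a, coder_b):
    print(row)  # items to discuss before revising the codebook or prompt
```

In an iterative setup, the coders would revise the codebook or prompt after discussing the listed items, relabel, and recompute κ until it reaches whatever threshold the team has agreed on in advance.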

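Experimental results

Research questions

RQ1: How can prompt generation for LLMs be made verifiable and reliable across datasets, models, and time?
RQ2: What role do human assessors and codebook-like criteria play in achieving objective, transparent prompt generation?
RQ3: Can a multi-phase process inspired by qualitative coding reduce subjectivity and bias in LLM-driven data labeling and analysis?
RQ4: What are the costs and benefits of practicing prompt science compared with conventional prompt engineering?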

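Key findings

A multi-phase prompt construction process with humans in the loop yields prompts that are more transparent, verifiable, and replicable.
Involving multiple researchers and measuring ICR formally reduces individual bias and improves the consistency of evaluation.
Documenting deliberations and decisions increases transparency and replicability for future researchers.
Compared with ad hoc prompt design, the proposed approach improves quality and understanding but costs more.
The optional validation phase can provide further assurance of the pipeline's reliability across data samples.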
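
One way to picture the optional validation phase is as a reproducible spot check. The sketch below draws a fixed-seed subset of item IDs and measures how often LLM labels match human labels on it. All names and data are hypothetical, and raw percent agreement is used only for brevity; a chance-corrected statistic such as Cohen's κ would be the stricter choice.

```python
import random


def validation_sample(item_ids, k, seed=0):
    """Draw k item IDs without replacement; a fixed seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(item_ids), k))


def agreement_on_sample(sample_ids, human_labels, llm_labels):
    """Fraction of sampled items where the LLM label equals the human label."""
    if not sample_ids:
        raise ValueError("sample is empty")
    matches = sum(human_labels[i] == llm_labels[i] for i in sample_ids)
    return matches / len(sample_ids)


# Hypothetical data: 20 items with human labels and LLM labels.
ids = list(range(20))
human = {i: "rel" if i % 3 else "irr" for i in ids}
llm = {i: "rel" if i % 4 else "irr" for i in ids}

sample = validation_sample(ids, k=8, seed=42)
print(sample)
print(agreement_on_sample(sample, human, llm))
```

Recording the seed and the sampled IDs alongside the prompt and codebook lets a later team rerun the same check on the same items.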


This review was generated by AI and checked by a human editor.