QUICK REVIEW

[Paper Review] From Prompt Engineering to Prompt Science With Human in the Loop

Chirag Shah|arXiv (Cornell University)|Jan 1, 2024

Complex Systems and Decision Making | Cited by 7

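One-Sentence Summary

This paper proposes a four-phase, human-in-the-loop methodology, inspired by qualitative coding, that turns ad-hoc prompt engineering into verifiable, replicable prompt science for LLM-assisted research.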

ABSTRACT

As LLMs make their way into many aspects of our lives, one place that warrants increased scrutiny with LLM usage is scientific research. Using LLMs for generating or analyzing data for research purposes is gaining popularity. But when such application is marred with ad-hoc decisions and engineering solutions, we need to be concerned about how it may affect that research, its findings, or any future works based on that research. We need a more scientific approach to using LLMs in our research. While there are several active efforts to support more systematic construction of prompts, they are often focused more on achieving desirable outcomes rather than producing replicable and generalizable knowledge with sufficient transparency, objectivity, or rigor. This article presents a new methodology inspired by codebook construction through qualitative methods to address that. Using humans in the loop and a multi-phase verification processes, this methodology lays a foundation for more systematic, objective, and trustworthy way of applying LLMs for analyzing data. Specifically, we show how a set of researchers can work through a rigorous process of labeling, deliberating, and documenting to remove subjectivity and bring transparency and replicability to prompt generation process. A set of experiments are presented to show how this methodology can be put in practice.

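Research Motivation and Goals

Make the case for scientific rigor when LLMs are used in research, and identify the risks of ad-hoc prompt engineering.
Introduce a systematic, transparent process for formulating prompts and evaluating LLM outputs.
Apply qualitative coding methods with multiple assessors to create a replicable codebook for prompt construction.
Provide a multi-phase process that ensures prompts and responses are reliable, generalizable, and verifiable.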

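Proposed Method

Adapt codebook construction from qualitative coding to the construction of prompts.
Implement a four-phase pipeline (setup, establishing criteria with ICR, iterative prompt development, validation) with human-in-the-loop assessment.
Require at least two qualified researchers, and compute inter-coder reliability (ICR) with a measure such as Cohen's kappa or Krippendorff's alpha.
Iteratively revise the codebook (criteria) and the prompt based on disagreements between assessors, to improve agreement and generalizability.
Optionally validate the whole pipeline on a subset of test data and compute ICR for the final assessment.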
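
The ICR step above can be illustrated with a minimal sketch of Cohen's kappa for two assessors. This is not code from the paper: the function names, the example labels, and the 0.7 agreement threshold are all assumptions made for illustration. Kappa is computed as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each assessor's label frequencies.

```python
from collections import Counter

# Hypothetical cut-off for "acceptable" agreement; the summary does not
# state a threshold, so this value is an assumption for illustration only.
ICR_THRESHOLD = 0.7


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two assessors who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both assessors must label the same non-empty item set")
    n = len(labels_a)
    # p_o: fraction of items on which the two assessors agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each assessor's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both assessors used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def needs_revision(labels_a, labels_b, threshold=ICR_THRESHOLD):
    """True if agreement is below the (assumed) threshold, i.e. the
    criteria or prompt should go through another round of deliberation."""
    return cohens_kappa(labels_a, labels_b) < threshold


# Made-up labels from two assessors on eight LLM responses.
assessor_1 = ["ok", "ok", "bad", "ok", "bad", "bad", "ok", "bad"]
assessor_2 = ["ok", "bad", "bad", "ok", "bad", "ok", "ok", "bad"]
print(cohens_kappa(assessor_1, assessor_2))    # 0.5
print(needs_revision(assessor_1, assessor_2))  # True
```

In this made-up example the assessors agree on 6 of 8 items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5, so under the assumed threshold the criteria or prompt would go back for another round of revision.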

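Experimental Results

Research Questions

RQ1: How can prompt generation for LLMs be made verifiable and replicable across datasets, models, and time?
RQ2: What role do human assessors and codebook-like criteria play in achieving objective, transparent prompt generation?
RQ3: Can a multi-phase process inspired by qualitative coding reduce subjectivity and bias in LLM-driven data labeling or analysis?
RQ4: What are the costs and benefits of practicing prompt science compared with traditional prompt engineering?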

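Key Findings

A multi-phase prompt construction process with humans in the loop makes prompts more transparent, verifiable, and replicable.
Involving multiple researchers and using a formal ICR measure reduces individual bias and improves consistency of assessment.
Documenting the deliberations and decisions improves openness and replicability for future researchers.
Compared with ad-hoc prompt engineering, the proposed method improves quality and understanding, although at a higher cost.
The optional validation phase further assures the reliability of the whole pipeline on a sample of the data.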

This review was generated by AI and reviewed by human editors.