QUICK REVIEW

[Paper Review] Scalable Delphi: Large Language Models for Structured Risk Estimation

Tobias Lorenz, Mario Fritz|arXiv (Cornell University)|Feb 9, 2026

Artificial Intelligence in Healthcare and Education|Cited by 0

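One-Sentence Summary

This paper proposes Scalable Delphi, which uses diverse LLM personas for iterative, structured elicitation with rationale sharing. On cybersecurity benchmarks it stays well calibrated and aligned with human experts while greatly reducing elicitation time.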

ABSTRACT

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, adapting the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels - in one comparison, closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.

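Motivation and Objectives

Establish the need for scalable structured risk estimation in settings where the traditional Delphi method is too slow.
Adapt the Delphi protocol to LLM agents with diverse personas, supporting independent rounds and feedback.
Develop an evaluation framework focused on calibration, sensitivity to evidence, and alignment with human judgment on latent quantities.
Demonstrate the method on AI-augmented cybersecurity, evaluating against benchmark ground-truth proxies and human baselines.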

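Proposed Method

Instantiate a panel of k LLM expert agents, each with a distinct persona.
Run multiple elicitation rounds in which agents give estimates conditioned on the evidence and on feedback from earlier rounds.
After the final round, aggregate the final estimate as the mean across panel members.
Prompts separate system components (persona, procedure) from user components (task, evidence) so they can be reused across quantities.
Because the target quantities cannot be observed directly, evaluate calibration with a leave-one-out scheme against verifiable proxies, and assess sensitivity to evidence.
Compare LLM estimates with independent human expert panels and analyze reasoning quality.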
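
The panel protocol above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Estimator` callable is a hypothetical stand-in for an LLM call, `stub_estimator` returns fixed numbers so the example runs offline, and the persona names, prompt wording, and default of two rounds are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM call:
# (system_prompt, user_prompt) -> (probability in [0, 1], rationale text).
Estimator = Callable[[str, str], Tuple[float, str]]


@dataclass
class Persona:
    name: str
    background: str


def system_prompt(persona: Persona, procedure: str) -> str:
    """System component: persona and procedure, reusable across quantities."""
    return f"You are {persona.name}, {persona.background}.\n{procedure}"


def user_prompt(task: str, evidence: List[str],
                feedback: List[Tuple[float, str]]) -> str:
    """User component: task, evidence, and feedback from the previous round."""
    parts = [f"Task: {task}", "Evidence:"] + [f"- {e}" for e in evidence]
    if feedback:
        parts.append("Feedback from previous round:")
        parts += [f"- estimate {p:.2f}: {why}" for p, why in feedback]
    return "\n".join(parts)


def run_panel(personas: List[Persona], estimator: Estimator, task: str,
              evidence: List[str], rounds: int = 2,
              procedure: str = "Give a probability and a short rationale."):
    """Run the rounds and return (mean of final-round estimates, history)."""
    if rounds < 1 or not personas:
        raise ValueError("need at least one round and one persona")
    feedback: List[Tuple[float, str]] = []
    history: List[List[float]] = []
    for _ in range(rounds):
        # Within a round every agent sees only the previous round's
        # estimates and rationales, so answers stay independent.
        answers = [
            estimator(system_prompt(p, procedure),
                      user_prompt(task, evidence, feedback))
            for p in personas
        ]
        history.append([estimate for estimate, _ in answers])
        feedback = answers
    return mean(history[-1]), history


def stub_estimator(system: str, user: str) -> Tuple[float, str]:
    """Offline placeholder (assumption): fixed answers instead of a model."""
    base = 0.3 if "red-team" in system else 0.5
    if "Feedback from previous round" in user:
        base = (base + 0.4) / 2  # move halfway toward 0.4 after feedback
    return base, "stub rationale"


panel = [Persona("Alex", "a red-team operator"),
         Persona("Sam", "a SOC analyst")]
final, history = run_panel(panel, stub_estimator,
                           "Probability that the task is solved?",
                           ["benchmark score: 0.4"])
print(final, history)
```

Swapping `stub_estimator` for a function that calls a real model would leave `run_panel` unchanged, which is the point of keeping the system and user components separate.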

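Experimental Results

Research Questions

RQ1: Can LLM-based panels produce calibrated probability estimates for latent risk quantities?
RQ2: When evidence is added or removed, do LLM estimates respond sensibly and stay consistent with human expert judgment?
RQ3: How do LLM-based elicitation results compare with benchmark ground truth and with human expert panels in cybersecurity risk?
RQ4: Can multi-persona LLM ensembles provide realistic uncertainty and variance comparable to those of human experts?

Key Findings

LLM panels correlate strongly with benchmark ground truth (Pearson r between 0.87 and 0.95).
Estimates improve systematically as evidence is added, indicating high sensitivity to evidence.
LLM estimates align with human expert panels; in one comparison the LLM panel is closer to a human panel than the two human panels are to each other (MAD 5.0 percentage points vs. 16.6 percentage points).
Both frontier models (GPT-5.1 and Claude Opus 4.1) outperform simple heuristics on the benchmarks.
Elicitation time drops from months to minutes, while still yielding scalable, auditable structured judgments.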
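
For readers who want to reproduce the two summary statistics quoted above on their own data, here is a small sketch. It assumes that MAD means the mean absolute difference between two panels' probability estimates, expressed in percentage points; the review does not spell out the abbreviation, so treat that reading as an assumption. The input numbers are made up for illustration and are not taken from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def mean_abs_diff_pp(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference of probabilities, in percentage points.

    Assumed reading of "MAD"; inputs are probabilities in [0, 1].
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    return 100 * sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Illustrative values only (not results from the paper).
llm_panel = [0.10, 0.40, 0.70, 0.90]
benchmark = [0.20, 0.30, 0.80, 0.90]
human_panel = [0.15, 0.35, 0.60, 0.85]

print("r vs benchmark:", pearson_r(llm_panel, benchmark))
print("MAD vs human panel (pp):", mean_abs_diff_pp(llm_panel, human_panel))
```

Correlation rewards getting the ordering and spread of estimates right, while the mean absolute difference measures how far apart two panels are on the probability scale, so the two numbers answer different questions.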


This review was generated by AI and checked by a human editor.