QUICK REVIEW

[논문 리뷰] Scalable Delphi: Large Language Models for Structured Risk Estimation

Tobias Lorenz, Mario Fritz|arXiv (Cornell University)|2026. 02. 09.

Artificial Intelligence in Healthcare and Education인용 수 0

한 줄 요약

본 논문은 Scalable Delphi를 도입하여 다양한 LLM 페르소나를 활용해 반복적이고 합리적 근거를 공유하는 구조화된 도출을 수행하고, 사이버 보안 벤치마크 전반에서 인간 전문가와의 강한 보정 및 정렬을 보이며 도출 시간을 대폭 줄이는 것을 보여준다.

ABSTRACT

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of coordination and specialist time, placing rigorous risk assessment out of reach for most applications. We investigate whether Large Language Models (LLMs) can serve as scalable proxies for structured expert elicitation. We propose Scalable Delphi, adapting the classical protocol for LLMs with diverse expert personas, iterative refinement, and rationale sharing. Because target quantities are typically unobservable, we develop an evaluation framework based on necessary conditions: calibration against verifiable proxies, sensitivity to evidence, and alignment with human expert judgment. We evaluate in the domain of AI-augmented cybersecurity risk, using three capability benchmarks and independent human elicitation studies. LLM panels achieve strong correlations with benchmark ground truth (Pearson r=0.87-0.95), improve systematically as evidence is added, and align with human expert panels - in one comparison, closer to a human panel than the two human panels are to each other. This demonstrates that LLM-based elicitation can extend structured expert judgment to settings where traditional methods are infeasible, reducing elicitation time from months to minutes.

연구 동기 및 목표

전통적인 Delphi가 너무 느린 경우에 대해 확장 가능하고 구조화된 위험 추정의 필요성을 제시한다.
다양한 페르소나를 가진 k명의 LLM 에이전트에 Delphi 프로토콜을 독립 라운드와 피드백을 가능하도록 적응한다.
잠재량에 대해 보정, 증거 민감도, 인간 정합성에 초점을 맞춘 평가 프레임워크를 개발한다.
벤치마크된 실제값 프록시와 인간 기준치를 활용하여 AI 보강 사이버 보안에서 이 접근법을 시연한다.

제안 방법

서로 다른 페르소나를 가진 k명의 LLM 전문가 에이전트 패널을 구성한다.
에이전트가 증거와 이전 피드백을 조건으로 추정치를 제공하는 다중 라운드 도출을 사용한다.
최종 라운드 후 패널 구성원 간의 평균으로 최종 추정치를 집계한다.
프롬프트는 시스템(페르소나, 프로세스) 및 사용자(과제, 증거) 구성 요소를 분리하여 다양한 수량에 대해 재사용 가능하도록 한다.
검증 가능한 프록시와의 leave-one-out 예측으로 보정을 평가하고 증거 민감도를 평가한다.
LLM 추정을 독립적인 인간 전문가 패널과 비교하고 추론 품질을 분석한다.]
research_questions_ordered_in_list_to_be_kept

실험 결과

연구 질문

RQ1LLM 기반 패널이 잠재 위험 수량에 대해 보정된 확률 추정치를 생성할 수 있는가?
RQ2LLM 추정이 추가되거나 제거된 증거에 의미있게 반응하고 인간 전문가 판단과 정합하는가?
RQ3사이버 보안 위험에서 LLM 기반 도출 결과가 실제 벤치마크와 인간 전문가 패널과 어떻게 비교되는가?
RQ4다중 페르소나 LLM 앙상블이 인간 전문가와 유사한 현실적인 불확실성 및 변동성을 제공하는가?

주요 결과

LLM 패널은 실제 벤치마크와 강한 상관관계를 보인다(Pearson r은 0.87에서 0.95 사이).
증거가 추가될수록 추정치가 체계적으로 개선된다(높은 증거 민감도).
LLM 추정은 인간 전문가 패널과 정합하며, 한 경우 두 인간 패널 간의 거리보다 한 인간 패널에 더 가깝다(MAD 5.0 pp 대 16.6 pp).
두 가지 최전선 모델(GPT-5.1 및 Claude Opus 4.1)은 벤치마크 전반에서 단순 휴리스틱을 능가한다.
도출 시간은 수개월에서 분으로 단축되면서도 확장 가능하고 감사 가능한 구조적 판단을 제공한다.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.