QUICK REVIEW

[Paper Review] Multi-Perspective LLM Annotations for Valid Analyses in Subjective Tasks

Navya Mehrotra, Adam Visokay|arXiv (Cornell University)|Mar 22, 2026

Topic Modeling | Citations: 0

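One-Sentence Summary

This paper proposes Perspective-Driven Inference (PDI), an adaptive sampling framework that combines LLM annotations with a small amount of human input to estimate group-specific annotation means in subjective tasks. It preserves disagreement across demographic groups and improves estimates for perspectives that are hard to model.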

ABSTRACT

Large language models are increasingly used to annotate texts, but their outputs reflect some human perspectives better than others. Existing methods for correcting LLM annotation error assume a single ground truth. However, this assumption fails in subjective tasks where disagreement across demographic groups is meaningful. Here we introduce Perspective-Driven Inference, a method that treats the distribution of annotations across groups as the quantity of interest, and estimates it using a small human annotation budget. We contribute an adaptive sampling strategy that concentrates human annotation effort on groups where LLM proxies are least accurate. We evaluate on politeness and offensiveness rating tasks, showing targeted improvements for harder-to-model demographic groups relative to uniform sampling baselines, while maintaining coverage.

Motivation and Objectives

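Motivate the need to preserve annotator disagreement in subjective tasks rather than collapsing it into a single ground truth.
Formalize multi-perspective corpus estimation as the problem of estimating a vector of group-specific means.
Develop an adaptive sampling strategy that concentrates human annotation on groups where LLM proxies are least accurate.
Propose an inverse probability weighting (IPW) based estimator with bootstrap confidence intervals to obtain valid group-level estimates.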
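
The group-mean estimand and the IPW correction named in the objectives above can be written out as follows. This is a sketch in one standard form from the active-inference literature and is not necessarily the paper's exact estimator. The symbols f_i (LLM annotation of text i), xi_i (indicator that a human annotation was collected for text i), and n_g (number of texts in group g) are assumed notation that this review does not define.

```latex
% Estimand: one mean per demographic group g
% (H_i is the human annotation, d_i the annotator's group)
\theta^*_g = \mathbb{E}\left[ H_i \mid d_i = g \right],
\qquad g \in \{g_1, \dots, g_K\}

% Assumed IPW-corrected estimator: LLM label plus a reweighted human correction
\hat{\theta}_g = \frac{1}{n_g} \sum_{i \,:\, d_i = g}
  \left[ f_i + \frac{\xi_i}{\pi_i} \left( H_i - f_i \right) \right],
\qquad \xi_i \sim \mathrm{Bernoulli}(\pi_i)

% Why the weighting removes the LLM bias, provided \pi_i > 0 and
% \pi_i is fixed before H_i is observed:
\mathbb{E}\left[ \frac{\xi_i}{\pi_i} \,\middle|\, \pi_i \right] = 1
\;\Longrightarrow\;
\mathbb{E}\left[ f_i + \frac{\xi_i}{\pi_i} \left( H_i - f_i \right) \,\middle|\, d_i = g \right]
  = \theta^*_g
```

Because the correction term is divided by pi_i, items with a low sampling probability receive large weights, which is why sampling probabilities are usually kept bounded away from zero.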

Proposed Method

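Define the problem setting: each text T_i is annotated by an annotator from demographic group d_i, one of K groups, and the goal is to estimate theta* = (theta*_g1, ..., theta*_gK).
Use LLM annotations as cheap proxies and learn a predicted error hat{err}_i(d_i) from demographic features to guide adaptive sampling of human annotations; a burn-in phase precedes the batch updates.
Collect human annotations H_i with probability pi_i, which is normalized within each batch, while updating hat{err}_i with the data accumulated so far.
Estimate theta*_gk with a consistent inverse probability weighting (IPW) estimator and obtain confidence intervals by bootstrap (Zrnic & Candès, 2024).
Compare against LLM-only baselines (zero-shot/few-shot and persona prompting) and PPI (uniform sampling), verifying coverage and evaluating differences across demographic groups (change in mean absolute error).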
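
The estimation step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the estimator form (LLM label plus an inverse-probability-weighted human correction), the plain percentile bootstrap (swapped in here for simplicity; the paper's bootstrap procedure may differ), the function names `ipw_group_means` and `bootstrap_ci`, and all toy data values are hypothetical.

```python
import numpy as np

def ipw_group_means(llm, human, sampled, pi, groups):
    """Per-group mean of: LLM label + inverse-probability-weighted human correction."""
    # Unsampled items keep only their LLM label (their human value may be NaN).
    correction = np.where(sampled, (human - llm) / pi, 0.0)
    pseudo = llm + correction
    return {g: float(pseudo[groups == g].mean()) for g in np.unique(groups).tolist()}

def bootstrap_ci(llm, human, sampled, pi, groups, n_boot=500, alpha=0.10, seed=0):
    """Percentile bootstrap over texts; returns {group: (lower, upper)}."""
    rng = np.random.default_rng(seed)
    n = len(llm)
    draws = {g: [] for g in np.unique(groups).tolist()}
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample texts with replacement
        est = ipw_group_means(llm[idx], human[idx], sampled[idx], pi[idx], groups[idx])
        for g, value in est.items():
            draws[g].append(value)
    return {g: (float(np.quantile(v, alpha / 2)), float(np.quantile(v, 1 - alpha / 2)))
            for g, v in draws.items()}

# Toy data: every number below is invented for illustration, not taken from the paper.
rng = np.random.default_rng(42)
n = 4000
groups = np.where(rng.random(n) < 0.5, "under_50", "50_plus")
is_older = groups == "50_plus"
human_full = np.where(is_older, 2.0, 3.0) + rng.normal(0.0, 1.0, n)
# The simulated LLM proxy is more biased for the 50_plus group.
llm = human_full + np.where(is_older, 0.8, 0.1) + rng.normal(0.0, 0.5, n)
pi = np.where(is_older, 0.4, 0.2)              # the harder group is sampled more often
sampled = rng.random(n) < pi
human = np.where(sampled, human_full, np.nan)  # human labels exist only where sampled

estimates = ipw_group_means(llm, human, sampled, pi, groups)
intervals = bootstrap_ci(llm, human, sampled, pi, groups)
```

In this toy setup the raw LLM mean for the 50_plus group is shifted by the simulated bias, while the corrected estimate tracks the mean of the full human labels even though only a fraction of them are observed.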

Figure 1: Overview of the Perspective-Driven Inference. Starting from a corpus of $n$ texts, we collect LLM annotations, initialize human annotation via uniform sampling, and then enter an adaptive loop that predicts LLM error from demographic features, sampling human annotations across groups. The

Experimental Results

Research Questions

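RQ1: Can a vector of group-specific annotation means be estimated in subjective tasks while preserving demographic perspectives?
RQ2: Does adaptive, error-driven allocation of human annotations improve accuracy over uniform sampling and LLM-only baselines while maintaining coverage for hard-to-model groups?
RQ3: How does Perspective-Driven Inference perform on politeness and offensiveness rating tasks and on synthetic data?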

Key Findings

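On the politeness task, PDI maintains coverage of at least 90% across age groups, and the improvement in delta is largest for the 50+ group (11.23% for PDI versus 16.31% for the best LLM-only method).
PDI achieves the lowest mean delta for politeness across age groups, with a significant improvement for the 50+ group in particular (11.23% versus 13.63% for PPI).
On offensiveness, LLM-only methods are worse in both coverage and delta, whereas PDI and PPI maintain 95.0% coverage across all age groups, with a delta of 5.24% for the 50+ group (above 24% for the LLM-only variants).
PDI allocates more human annotations to harder groups such as 50+: a 33% increase over uniform sampling for politeness and a 19% increase for the 50+ group on offensiveness.
In synthetic data experiments, adaptive sampling is advantageous when the human budget exceeds 20% of the total and the LLM's between-group differences are large, with larger gains for highly skewed groups and hard-to-model groups.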
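
The allocation effect reported above, where harder groups receive more human annotations than under uniform sampling, follows from making sampling probabilities grow with predicted LLM error. The sketch below shows one simple way to do this for a single batch. It is an assumption for illustration only: the proportional rule, the clipping floor `pi_min`, the function name `batch_sampling_probs`, and the toy error values are hypothetical and are not taken from the paper.

```python
import numpy as np

def batch_sampling_probs(pred_err, budget, pi_min=0.05):
    """Turn predicted LLM errors into per-item sampling probabilities for one batch."""
    pred_err = np.asarray(pred_err, dtype=float)
    # Proportional rule: before clipping, the probabilities sum to the batch budget.
    raw = budget * pred_err / pred_err.sum()
    # Clipping keeps inverse-probability weights bounded; it can shift the
    # expected number of annotations slightly away from the budget.
    return np.clip(raw, pi_min, 1.0)

# Toy batch of 100 texts, 50 per group, with invented predicted errors.
groups = np.array(["under_50"] * 50 + ["50_plus"] * 50)
pred_err = np.where(groups == "50_plus", 0.3, 0.1)
probs = batch_sampling_probs(pred_err, budget=20)

# Expected number of human annotations per group under the adaptive rule.
expected_per_group = {g: float(probs[groups == g].sum()) for g in ("under_50", "50_plus")}
# Under uniform sampling each group of 50 would expect budget * 50 / 100 annotations.
uniform_per_group = 20 * 50 / 100
```

In this toy batch the group with the higher predicted error expects about 15 of the 20 annotations, compared with 10 under uniform sampling. The size of the shift depends entirely on the invented error values and says nothing about the magnitudes reported in the paper.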

Figure 2: Annotation distributions vary across demographic groups. Human ratings for politeness (top) and offensiveness (bottom) broken down by annotator demographics. Variation across groups motivates estimating a vector of group-specific means rather than a single aggregate.


This review was generated by AI and checked by a human editor.