QUICK REVIEW

[Paper Review] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Singh Negi, MA Waskow|arXiv (Cornell University)|Jan 23, 2026

Sentiment Analysis and Opinion Mining | Citations: 0

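One-Line Summary

This paper investigates the use of large language models (LLMs) as automatic annotators for the fine-grained sentiment tasks ASTE and ACOS, and introduces a declarative DSPy-based pipeline together with an LLM-based adjudication method that merges multiple annotations into final labels.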

ABSTRACT

Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.

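Motivation and Objectives

Reduce the cost and human effort of building fine-grained opinion datasets for ASTE and ACOS through automatic annotation with LLMs.
Mitigate the variability of manual prompt design by using a declarative pipeline (DSPy) that optimizes prompts from a limited number of annotated examples.
Propose and evaluate an LLM-based adjudication method that consolidates the labels of multiple annotators into a final annotation.
Evaluate how LLMs of different sizes perform as annotators and adjudicators on domain datasets (laptop, restaurant).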

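Proposed Method

Use a declarative annotation pipeline (DSPy) that generates optimized prompts from a small annotated dev set.
Evaluate multiple LLM annotators (three model sizes) on the ASTE and ACOS tasks without fine-tuning.
Generate multiple annotations per input and apply an adjudication step in which an LLM merges them into a final annotation (inspired by ensembling and stacking).
Report precision, recall, and F1 against human annotations, plus Krippendorff's α as the measure of inter-annotator agreement (IAA).
Analyze per-element agreement and error patterns to understand task-specific challenges (e.g., implicit aspects in ACOS).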
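The comparison against human annotations in terms of precision, recall, and F1 can be illustrated with a minimal sketch. It assumes exact-match scoring with micro-averaging over sentences, which this review does not confirm as the paper's exact protocol. The tuple layout `(aspect, opinion, sentiment)`, the function name `exact_match_prf`, and the sample data are all hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: an ASTE triplet is (aspect, opinion, sentiment).
# An ACOS quadruple would carry one extra field (the aspect category).
Annotation = Tuple[str, ...]


def exact_match_prf(
    gold: Sequence[Sequence[Annotation]],
    pred: Sequence[Sequence[Annotation]],
) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall, and F1.

    A predicted tuple counts as correct only if every field matches a
    gold tuple of the same sentence (assumed scoring rule).
    """
    true_pos = n_pred = n_gold = 0
    for gold_items, pred_items in zip(gold, pred):
        gold_set, pred_set = set(gold_items), set(pred_items)
        true_pos += len(gold_set & pred_set)
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: two sentences, human labels vs. one LLM annotator.
gold: List[List[Annotation]] = [
    [("battery life", "great", "POS")],
    [("screen", "dim", "NEG"), ("keyboard", "comfortable", "POS")],
]
pred: List[List[Annotation]] = [
    [("battery life", "great", "POS")],
    [("screen", "dim", "NEG"), ("keyboard", "nice", "POS")],
]

p, r, f = exact_match_prf(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```

Under exact matching, the third prediction is counted as wrong even though only the opinion span differs, which is one reason span extraction scores lower than polarity prediction in this kind of evaluation.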

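Experimental Results

Research Questions

RQ1: Can LLMs serve as reliable automatic annotators for the ASTE and ACOS tasks without fine-tuning?
RQ2: Does the LLM-based adjudication step improve agreement with human annotations over individual annotators?
RQ3: How does model size affect annotation quality and IAA in the ASTE and ACOS settings?
RQ4: What are the main error modes when annotating fine-grained opinions for ASTE and ACOS?
RQ5: How does the domain (laptop vs. restaurant) affect annotation difficulty and IAA for ACOS?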

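Key Findings

LLM annotators with more parameters generally align better with human annotations on both the ASTE and ACOS tasks.
The adjudication step improves agreement for some model sizes and datasets, behaving much like an ensemble method.
ACOS quadruples are harder than ASTE triplets, and the domain difference (laptop vs. restaurant) affects exact-match F1 scores.
The IAA analysis shows Krippendorff's α increasing with model size, suggesting more reliable agreement among larger models.
Sentiment polarity predictions tend to align best with human annotations, whereas extracting exact targets and spans is more challenging.
In some settings the ACOS results diverge substantially from human annotations, reflecting the gap in task difficulty.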
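For readers unfamiliar with Krippendorff's α, the sketch below computes the nominal-data variant, α = 1 − D_o / D_e, from a coincidence matrix. It assumes each unit is a single categorical decision (for example, the sentiment polarity assigned to one aspect) and that `None` marks a missing label. How the paper maps span annotations to units is not stated in this review, so the function name and data here are illustrative only.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> float:
    """Krippendorff's alpha for nominal labels.

    `units[i]` holds the labels all annotators gave to unit i; `None`
    means the annotator did not label that unit.
    """
    coincidences: Counter = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two labels are not pairable
        # Every ordered pair of labels from different annotators
        # contributes 1 / (m - 1) to the coincidence matrix.
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    # Observed and expected disagreement, both scaled by n.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    cross = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if n <= 1 or cross == 0:
        return math.nan  # undefined when only one category is ever used
    expected = cross / (n - 1)
    return 1.0 - observed / expected


# Made-up example: three annotators label the polarity of four aspects.
ratings = [
    ["POS", "POS", "POS"],
    ["NEG", "NEG", "NEG"],
    ["POS", "POS", "NEG"],
    ["NEU", "NEU", "NEU"],
]
alpha = krippendorff_alpha_nominal(ratings)
print(round(alpha, 3))  # 0.766
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so the single disagreement in the third unit is what pulls α below 1 here.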

Figure 2 : LLM-Based Annotation Pipeline using DSPy (Left) with LLM-as-adjudicator (Right)


This review was generated by AI and checked by a human editor.