QUICK REVIEW

[Paper Review] Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Gaurav Singh Negi, MA Waskow|arXiv (Cornell University)|Jan 23, 2026
Sentiment Analysis and Opinion Mining|0 citations
TL;DR

The paper investigates using large language models (LLMs) as automatic annotators for ASTE and ACOS fine-grained opinion tasks, and introduces a declarative DSPy-based pipeline plus an LLM-based adjudication method to combine multiple annotations into final labels.

ABSTRACT

Fine-grained opinion analysis of text provides a detailed understanding of expressed sentiments, including the addressed entity. Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications. We explore the feasibility of LLMs as automatic annotators for fine-grained opinion analysis, addressing the shortage of domain-specific labelled datasets. In this work, we use a declarative annotation pipeline. This approach reduces the variability of manual prompt engineering when using LLMs to identify fine-grained opinion spans in text. We also present a novel methodology for an LLM to adjudicate multiple labels and produce final annotations. After trialling the pipeline with models of different sizes for the Aspect Sentiment Triplet Extraction (ASTE) and Aspect-Category-Opinion-Sentiment (ACOS) analysis tasks, we show that LLMs can serve as automatic annotators and adjudicators, achieving high Inter-Annotator Agreement across individual LLM-based annotators. This reduces the cost and human effort needed to create these fine-grained opinion-annotated datasets.

Motivation & Objective

  • Reduce the cost and human effort required to create fine-grained opinion datasets (ASTE and ACOS) by using LLMs as automatic annotators.
  • Mitigate variability in prompt engineering by employing a declarative pipeline (DSPy) to optimize prompts from limited annotated examples.
  • Propose and evaluate an LLM-based adjudication method to resolve inter-annotator disagreements and produce final annotations.
  • Assess how LLMs of different sizes perform as annotators and adjudicators across domain datasets (laptop, restaurant).

Proposed method

  • Use a declarative annotation pipeline (DSPy) to generate optimized prompts from a small annotated Dev set.
  • Evaluate multiple LLM annotators (three model sizes) on ASTE and ACOS tasks without fine-tuning.
  • Generate multiple annotations per input and apply an adjudication step where an LLM aggregates these into final annotations (ensemble/stacking-inspired).
  • Report precision, recall, and F1 against human annotations, and Krippendorff’s alpha for inter-annotator agreement (IAA).
  • Analyze element-wise alignment and error patterns to understand task-specific challenges (e.g., ACOS’ implicit aspects).
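
The precision, recall, and F1 comparison against human annotations can be sketched as an exact-match score over tuples. This is a minimal illustration, assuming micro-averaging over sentences and exact string equality on every tuple element; the review does not state which averaging or matching rule the paper uses, and the function name `exact_match_prf` and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Annotation = Tuple[str, ...]  # e.g. (aspect, opinion, sentiment)


def exact_match_prf(
    predicted: Sequence[Sequence[Annotation]],
    gold: Sequence[Sequence[Annotation]],
) -> Tuple[float, float, float]:
    """Micro-averaged exact-match precision, recall and F1.

    `predicted[i]` and `gold[i]` hold the tuples for sentence i.
    A predicted tuple counts as correct only if every element matches
    a gold tuple exactly (an assumption, not the paper's stated rule).
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        tp += len(pred_set & ref_set)   # tuples found in both
        fp += len(pred_set - ref_set)   # predicted but not in gold
        fn += len(ref_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical two-sentence example.
gold: List[List[Annotation]] = [
    [("battery life", "great", "positive")],
    [("screen", "dim", "negative"), ("keyboard", "comfortable", "positive")],
]
predicted: List[List[Annotation]] = [
    [("battery life", "great", "positive")],
    [("screen", "dim", "negative"), ("keyboard", "nice", "positive")],
]
precision, recall, f1 = exact_match_prf(predicted, gold)
```

In this example one opinion span differs ("nice" vs. "comfortable"), so the whole tuple is a miss even though the aspect and sentiment agree. This strictness is why exact-match scores are sensitive to span boundaries.
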
(a) ACOS & ASTE Specifications
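
To make the two annotation targets concrete, the sketch below models an ASTE triplet and an ACOS quadruple as small immutable records. The class names, the use of `None` for implicit elements, and the category strings are illustrative assumptions, not the paper's data format. The review only mentions implicit aspects; allowing an implicit opinion as well is an assumption made here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AsteTriplet:
    """ASTE target: (aspect term, opinion term, sentiment polarity)."""
    aspect: str
    opinion: str
    sentiment: str  # e.g. "positive", "negative", "neutral"


@dataclass(frozen=True)
class AcosQuad:
    """ACOS target: (aspect, category, opinion, sentiment)."""
    aspect: Optional[str]   # None marks an implicit aspect (assumed encoding)
    category: str           # illustrative label such as "LAPTOP#PRICE"
    opinion: Optional[str]  # None marks an implicit opinion (assumed encoding)
    sentiment: str

    def has_implicit_element(self) -> bool:
        """True when the aspect or the opinion is not an explicit span."""
        return self.aspect is None or self.opinion is None


# "The battery life is great."
triplet = AsteTriplet("battery life", "great", "positive")
explicit_quad = AcosQuad("battery life", "BATTERY#GENERAL", "great", "positive")

# "Totally overpriced." (no aspect term appears in the text)
implicit_quad = AcosQuad(None, "LAPTOP#PRICE", "overpriced", "negative")
```

Frozen dataclasses are hashable, so annotations from different annotators can be placed in sets and compared directly.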

Experimental results

Research questions

  • RQ1: Can LLMs serve as reliable automatic annotators for ASTE and ACOS tasks without fine-tuning?
  • RQ2: Does an LLM-based adjudication step improve alignment with human annotations over individual annotators?
  • RQ3: How does model size influence annotation quality and IAA in ASTE and ACOS settings?
  • RQ4: What are the main error modes when annotating fine-grained opinions with ASTE and ACOS?
  • RQ5: How does domain (laptop vs. restaurant) affect annotation difficulty and IAA in ACOS?

Key findings

  • LLM annotators with larger parameter counts generally align better with human annotations on ASTE and ACOS tasks.
  • The adjudication step improves alignment for some model sizes and datasets, acting like an ensemble method.
  • ACOS quadruples are more challenging than ASTE triplets, with domain differences (laptop vs. restaurant) impacting exact-match F1 scores.
  • IAA analyses show that Krippendorff’s alpha increases with model size, indicating that larger models agree with one another more reliably.
  • Sentiment polarity predictions tend to align best with human annotations, while extracting exact aspect targets and opinion spans is harder.
  • ACOS results show greater deviation from human annotations than ASTE in some configurations, indicating task difficulty differences.
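
Krippendorff's alpha, the agreement measure reported above, can be computed for nominal labels from a coincidence matrix. The sketch below is a minimal implementation under stated assumptions: each unit is one item labelled by several annotators (for instance, the sentiment polarity assigned to an already aligned aspect), labels are nominal, and `None` marks a missing label. How the paper maps span-level annotations onto units is not described in this review.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Optional, Sequence


def krippendorff_alpha_nominal(
    units: Sequence[Sequence[Optional[Hashable]]],
) -> float:
    """Krippendorff's alpha for nominal data.

    `units[i]` lists the labels given to unit i, one per annotator;
    None means that annotator did not label the unit.
    """
    # Coincidence matrix: each ordered pair of labels from different
    # annotators within a unit contributes 1 / (m - 1).
    coincidences: Counter = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two labels are not pairable
        for a, b in permutations(range(m), 2):
            coincidences[(values[a], values[b])] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one category was ever used
    return 1.0 - observed / expected


# Two annotators, ten binary units (a standard textbook-style example).
annotator_a = [0, 1, 0, 0, 0, 0, 0, 0, 1, 0]
annotator_b = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
alpha = krippendorff_alpha_nominal(list(zip(annotator_a, annotator_b)))
```

Alpha is 1 for perfect agreement and close to 0 when agreement is at chance level, which is why it is preferred over raw percentage agreement when some labels are much more frequent than others.
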
Figure 2: LLM-Based Annotation Pipeline using DSPy (Left) with LLM-as-adjudicator (Right)
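
The shape of the adjudication step can be sketched as: pool the tuples proposed by several annotators, then hand the pooled candidates to a judge that decides which ones to keep. In the paper the judge is an LLM; the sketch below substitutes a simple majority vote as a stand-in so that the example runs without any model. The names `pool_candidates`, `majority_judge`, and `adjudicate` are hypothetical, and majority voting is not the paper's adjudication method.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Triplet = Tuple[str, str, str]  # (aspect, opinion, sentiment)
Judge = Callable[[Counter, int], List[Triplet]]


def pool_candidates(annotator_outputs: Sequence[Sequence[Triplet]]) -> Counter:
    """Count how many annotators proposed each distinct triplet."""
    votes: Counter = Counter()
    for output in annotator_outputs:
        votes.update(set(output))  # one vote per annotator per triplet
    return votes


def majority_judge(votes: Counter, n_annotators: int) -> List[Triplet]:
    """Stand-in judge: keep triplets proposed by more than half the annotators.

    An LLM adjudicator would instead read the sentence and the pooled
    candidates and return the final annotation.
    """
    return sorted(t for t, count in votes.items() if count > n_annotators / 2)


def adjudicate(
    annotator_outputs: Sequence[Sequence[Triplet]],
    judge: Judge = majority_judge,
) -> List[Triplet]:
    """Pool all annotators' outputs and let the judge pick the final set."""
    votes = pool_candidates(annotator_outputs)
    return judge(votes, len(annotator_outputs))


# Three hypothetical annotators labelling "The pizza was delicious but
# the service was slow."
annotator_1 = [("pizza", "delicious", "positive"), ("service", "slow", "negative")]
annotator_2 = [("pizza", "delicious", "positive"), ("service", "slow", "negative")]
annotator_3 = [("pizza", "delicious", "positive"), ("waiter", "slow", "negative")]
final = adjudicate([annotator_1, annotator_2, annotator_3])
```

Because the judge is passed in as a callable, the majority vote can be replaced by a function that calls an LLM without changing the pooling code.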

This review was created by AI and reviewed by human editors.