QUICK REVIEW

[Paper Review] Optimizing LLM Annotation of Classroom Discourse through Multi-Agent Orchestration

Bakhtawar Ahtisham, Kirk Vanacore|arXiv (Cornell University)|Mar 8, 2026

Online Learning and Analytics|Citations: 0

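One-sentence summary

The paper proposes a hierarchical, multi-agent orchestration framework for LLM-based annotation of classroom discourse, improves reliability through staged verification and adjudication, and analyzes the cost tradeoff between reasoning and non-reasoning models.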

ABSTRACT

Large language models (LLMs) are increasingly positioned as scalable tools for annotating educational data, including classroom discourse, interaction logs, and qualitative learning artifacts. Their ability to rapidly summarize instructional interactions and assign rubric-aligned labels has fueled optimism about reducing the cost and time associated with expert human annotation. However, growing evidence suggests that single-pass LLM outputs remain unreliable for high-stakes educational constructs that require contextual, pedagogical, or normative judgment, such as instructional intent or discourse moves. This tension between scale and validity sits at the core of contemporary education data science. In this work, we present and empirically evaluate a hierarchical, cost-aware orchestration framework for LLM-based annotation that improves reliability while explicitly modeling computational tradeoffs. Rather than treating annotation as a one-shot prediction problem, we conceptualize it as a multi-stage epistemic process comprising (1) an unverified single-pass annotation stage, in which models independently assign labels based on the rubric; (2) a self-verification stage, in which each model audits its own output against rubric definitions and revises its label if inconsistencies are detected; and (3) a disagreement-centric adjudication stage, in which an independent adjudicator model examines the verified labels and justifications and determines a final label in accordance with the rubric. This structure mirrors established human annotation workflows in educational research, where initial coding is followed by self-checking and expert resolution of disagreements.

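Motivation and objectives

Motivates the need for reliable, large-scale annotation of classroom discourse that goes beyond single-pass LLM outputs.
Proposes a hierarchical, cost-aware annotation framework with multi-stage verification and adjudication.
Empirically evaluates annotation strategies across multiple LLM families on a K–12 mathematics discourse dataset.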

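Proposed method

Defines a three-stage annotation process: (1) unverified single-pass labeling by an LLM, (2) self-verification against the rubric, with revision where inconsistencies are found, and (3) adjudication by an independent model that sets the final label.
Evaluates two pipelines: (a) single-pass annotation, verification, and adjudication with standard (non-reasoning) LLMs; (b) the same stages with reasoning-enabled LLMs.
Evaluates six strategies across three model families (GPT, Claude, Gemini), tracking prompt tokens, completion tokens, and total cost.
Applies Talk Moves coding (7 categories) to 63 transcripts, using 800 target teacher utterances and 467 dialogue segments while preserving the category distribution.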
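
The three stages above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Model` callable interface, the prompt wording, the `make_stub` helper, and the choice to skip the adjudicator call when all verified labels already agree are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Annotation:
    label: str
    justification: str
    tokens: int  # cumulative prompt + completion tokens spent so far


# Hypothetical model interface (an assumption of this sketch): takes a prompt
# string and returns an Annotation whose `tokens` is the cost of that one call.
Model = Callable[[str], Annotation]


def single_pass(model: Model, utterance: str) -> Annotation:
    """Stage 1: unverified single-pass labeling."""
    return model(f"Assign one Talk Moves label to: {utterance}")


def self_verify(model: Model, utterance: str, first: Annotation) -> Annotation:
    """Stage 2: the same model audits its own label and may revise it."""
    prompt = (
        f"Utterance: {utterance}\nYour label: {first.label}\n"
        f"Your justification: {first.justification}\n"
        "Check the label against the rubric and revise it if inconsistent."
    )
    checked = model(prompt)
    return Annotation(checked.label, checked.justification,
                      first.tokens + checked.tokens)


def adjudicate(adjudicator: Model, utterance: str,
               verified: List[Annotation]) -> Annotation:
    """Stage 3: an independent model sets the final label."""
    spent = sum(a.tokens for a in verified)
    # Assumption: when every verified label agrees, no adjudicator call is made.
    if len({a.label for a in verified}) == 1:
        return Annotation(verified[0].label, verified[0].justification, spent)
    evidence = "\n".join(f"- {a.label}: {a.justification}" for a in verified)
    ruling = adjudicator(
        f"Utterance: {utterance}\nCandidate labels:\n{evidence}\n"
        "Choose the final label according to the rubric."
    )
    return Annotation(ruling.label, ruling.justification, spent + ruling.tokens)


def run_pipeline(annotators: List[Model], adjudicator: Model,
                 utterance: str) -> Annotation:
    if not annotators:
        raise ValueError("at least one annotator model is required")
    verified = [self_verify(m, utterance, single_pass(m, utterance))
                for m in annotators]
    return adjudicate(adjudicator, utterance, verified)


def make_stub(label: str, tokens: int = 10) -> Model:
    """Stand-in for an LLM call so the sketch runs offline (hypothetical)."""
    def model(prompt: str) -> Annotation:
        return Annotation(label, f"stub justification for {label}", tokens)
    return model
```

Swapping the stubs for real API calls would leave the control flow unchanged. The returned `tokens` field carries the cumulative cost of every stage, which is the kind of per-stage accounting the cost analysis relies on.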

Figure 1: TalkMoves Per-Category Performance Across Orchestration Strategies. Average F1 scores for seven TalkMoves categories under different annotation orchestration strategies. Bars show mean performance averaged across three LLMs (GPT, Claude, Gemini), while markers indicate individual model sco

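Experimental results

Research questions

RQ1: Does hierarchical orchestration improve the reliability of Talk Moves annotation compared with single-pass annotation?
RQ2: How do the verification and adjudication stages affect the performance and cost of reasoning and non-reasoning models?
RQ3: For high-stakes educational annotation, is orchestration more cost-efficient than simply using reasoning-enabled models?
RQ4: How large are the relative gains for each Talk Moves category under different orchestration strategies?

Key findings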

Category	Model	Non-Reasoning Annotated	Non-Reasoning Verified	Non-Reasoning Adjudicated	Reasoning Annotated	Reasoning Verified	Reasoning Adjudicated
Keep Together	Gemini	0.22	0.44	0.56	0.33	0.44	0.57
Keep Together	GPT	0.20	0.41	0.52	0.30	0.40	0.53
Keep Together	Claude	0.19	0.42	0.53	0.31	0.41	0.54
Revoicing	Gemini	0.21	0.32	0.34	0.31	0.32	0.35
Revoicing	GPT	0.21	0.29	0.30	0.28	0.28	0.31
Revoicing	Claude	0.20	0.30	0.31	0.29	0.29	0.32
Press Reason	Gemini	0.42	0.46	0.48	0.45	0.46	0.49
Press Reason	GPT	0.38	0.43	0.44	0.42	0.42	0.45
Press Reason	Claude	0.40	0.44	0.46	0.43	0.44	0.47
Relate	Gemini	0.52	0.56	0.58	0.55	0.56	0.59
Relate	GPT	0.48	0.53	0.54	0.52	0.52	0.55
Relate	Claude	0.50	0.55	0.56	0.54	0.54	0.57
Press Accuracy	Gemini	0.54	0.60	0.62	0.59	0.60	0.63
Press Accuracy	GPT	0.50	0.56	0.57	0.55	0.55	0.58
Press Accuracy	Claude	0.52	0.58	0.60	0.57	0.58	0.61
None	Gemini	0.58	0.63	0.65	0.62	0.63	0.67
None	GPT	0.54	0.60	0.61	0.58	0.59	0.62
None	Claude	0.56	0.61	0.63	0.60	0.61	0.64
Restating	Gemini	0.60	0.64	0.66	0.63	0.64	0.67
Restating	GPT	0.56	0.61	0.62	0.60	0.60	0.63
Restating	Claude	0.58	0.62	0.64	0.61	0.62	0.65
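
One way to read the table is to average each stage over the three models and take the difference between stages within a category. The sketch below does this with the F1 values transcribed from the table; the shortened stage names (`NR-Annotated`, `R-Adjudicated`, and so on) and the helper functions are labels introduced here, not the paper's.

```python
from collections import defaultdict
from statistics import mean

# Shortened column names (an assumption of this sketch): NR = non-reasoning,
# R = reasoning, in the same order as the table columns.
STAGES = ["NR-Annotated", "NR-Verified", "NR-Adjudicated",
          "R-Annotated", "R-Verified", "R-Adjudicated"]

# F1 scores transcribed from the table above.
ROWS = """\
Keep Together,Gemini,0.22,0.44,0.56,0.33,0.44,0.57
Keep Together,GPT,0.20,0.41,0.52,0.30,0.40,0.53
Keep Together,Claude,0.19,0.42,0.53,0.31,0.41,0.54
Revoicing,Gemini,0.21,0.32,0.34,0.31,0.32,0.35
Revoicing,GPT,0.21,0.29,0.30,0.28,0.28,0.31
Revoicing,Claude,0.20,0.30,0.31,0.29,0.29,0.32
Press Reason,Gemini,0.42,0.46,0.48,0.45,0.46,0.49
Press Reason,GPT,0.38,0.43,0.44,0.42,0.42,0.45
Press Reason,Claude,0.40,0.44,0.46,0.43,0.44,0.47
Relate,Gemini,0.52,0.56,0.58,0.55,0.56,0.59
Relate,GPT,0.48,0.53,0.54,0.52,0.52,0.55
Relate,Claude,0.50,0.55,0.56,0.54,0.54,0.57
Press Accuracy,Gemini,0.54,0.60,0.62,0.59,0.60,0.63
Press Accuracy,GPT,0.50,0.56,0.57,0.55,0.55,0.58
Press Accuracy,Claude,0.52,0.58,0.60,0.57,0.58,0.61
None,Gemini,0.58,0.63,0.65,0.62,0.63,0.67
None,GPT,0.54,0.60,0.61,0.58,0.59,0.62
None,Claude,0.56,0.61,0.63,0.60,0.61,0.64
Restating,Gemini,0.60,0.64,0.66,0.63,0.64,0.67
Restating,GPT,0.56,0.61,0.62,0.60,0.60,0.63
Restating,Claude,0.58,0.62,0.64,0.61,0.62,0.65
"""


def parse(rows: str) -> dict:
    """Map (category, model) to the list of six stage scores."""
    table = {}
    for line in rows.strip().splitlines():
        category, model, *scores = line.split(",")
        table[(category, model)] = [float(s) for s in scores]
    return table


def category_means(table: dict) -> dict:
    """Average each stage over the models, per category."""
    grouped = defaultdict(list)
    for (category, _model), scores in table.items():
        grouped[category].append(scores)
    return {cat: [mean(col) for col in zip(*score_rows)]
            for cat, score_rows in grouped.items()}


def gain(means: dict, category: str, start: str, end: str) -> float:
    """F1 difference between two stages for one category."""
    row = means[category]
    return row[STAGES.index(end)] - row[STAGES.index(start)]


TABLE = parse(ROWS)
MEANS = category_means(TABLE)

if __name__ == "__main__":
    for cat in MEANS:
        delta = gain(MEANS, cat, "NR-Annotated", "NR-Adjudicated")
        print(f"{cat:15s} non-reasoning gain: {delta:+.3f}")
```

Averaging over models mirrors the bars described in the Figure 1 caption. A per-model breakdown would use `TABLE` directly instead of `MEANS`.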

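Hierarchical orchestration improves consistently on single-pass annotation across all Talk Moves categories, including, in particular, those that require interpretive or normative judgment.
Reasoning-based adjudication achieves the highest absolute performance, but non-reasoning adjudication reaches comparable performance at substantially lower token cost (about 20–25% fewer tokens).
Verification-first pipelines tend to strike a better balance between cost and reliability than relying on model reasoning capability alone.
Non-reasoning models combined with adjudication match or exceed reasoning models with verification across the categories in the table above, indicating that the value of orchestration cannot be gauged from model complexity alone.
Orchestration reframes annotation as a measurement problem, enabling transparent tradeoffs among accuracy, cost, and uncertainty.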
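
The cost tradeoff in the findings above can be illustrated with simple arithmetic. The sketch below takes one cell pair from the table (Gemini, Keep Together) and applies the review's overall "about 20–25% fewer tokens" figure to it. Applying that aggregate figure to a single cell is an assumption, and the "F1 per relative token" ratio is an illustrative quantity defined here, not a metric reported in the paper.

```python
def f1_retained(f1_cheaper: float, f1_reference: float) -> float:
    """Fraction of the reference pipeline's F1 kept by the cheaper pipeline."""
    if f1_reference <= 0:
        raise ValueError("reference F1 must be positive")
    return f1_cheaper / f1_reference


def f1_per_relative_token(f1: float, relative_tokens: float) -> float:
    """F1 divided by token usage, with the reference pipeline's usage = 1.0."""
    if relative_tokens <= 0:
        raise ValueError("relative token usage must be positive")
    return f1 / relative_tokens


# From the table: Gemini, Keep Together.
NON_REASONING_ADJUDICATED_F1 = 0.56
REASONING_ADJUDICATED_F1 = 0.57

# Assumption: the review's overall 20-25% token saving applies to this cell.
TOKEN_SAVINGS = (0.20, 0.25)

if __name__ == "__main__":
    kept = f1_retained(NON_REASONING_ADJUDICATED_F1, REASONING_ADJUDICATED_F1)
    print(f"F1 retained by non-reasoning adjudication: {kept:.1%}")
    reference = f1_per_relative_token(REASONING_ADJUDICATED_F1, 1.0)
    print(f"reasoning adjudication, F1 per relative token: {reference:.3f}")
    for saving in TOKEN_SAVINGS:
        ratio = f1_per_relative_token(NON_REASONING_ADJUDICATED_F1, 1.0 - saving)
        print(f"non-reasoning at {saving:.0%} fewer tokens: {ratio:.3f}")
```

Under these assumptions the non-reasoning pipeline keeps most of the F1 while spending fewer tokens. Any real comparison would need the measured token counts behind Figure 2.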

Figure 2: TalkMoves Token Cost vs. Average Performance for Reasoning and Non-Reasoning Pipelines. Average F1 score as a function of total token usage for TalkMoves annotation pipelines. Points trace the progression from single-pass annotation to verification and adjudication for non-reasoning and re

This review was generated by AI and checked by a human editor.