QUICK REVIEW

[Paper Review] Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan

Siyu Liang, Talant Mawkanuli|arXiv (Cornell University)|Mar 1, 2026

Natural Language Processing Techniques | Citations: 0

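One-Line Summary

Summary: This paper presents a two-stage hybrid pipeline that combines BiLSTM-CRF gloss prediction with LLM post-correction, using retrieval-augmented prompting to improve IGT glossing for Jungar Tuvan. It offers design principles for integrating structured models with LLM reasoning in low-resource, morphologically rich languages.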

ABSTRACT

Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.

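Motivation and Objectives

Address the bottleneck of creating interlinear glossed text (IGT) for low-resource, morphologically rich languages.
Propose a hybrid architecture that combines a structured predictor with LLM post-correction to improve glossing accuracy.
Systematically evaluate design choices for retrieval strategy, morpheme dictionaries, and few-shot scaling in fieldwork contexts.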

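Proposed Method

A two-stage architecture in which a BiLSTM-CRF produces the initial gloss predictions.
LLM post-correction with retrieval-augmented prompting refines the BiLSTM-CRF output.
Evaluation with four LLMs (deepseek-v3.2-exp, qwen3-max, gpt-4o-mini, gemma-3-27b-it) using greedy decoding.
Experiments on retrieval versus random example selection, n-shot scaling, including or excluding the morpheme dictionary, and hybrid correction.
Gold morpheme boundary segmentation is assumed, and token-level accuracy is the evaluation metric.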
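The token-level accuracy metric named above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes a "token" is one gloss label per gold-segmented morpheme, and the function name `token_accuracy` and the placeholder labels (`STEM1`, `PL`, and so on) are hypothetical rather than taken from the Jungar Tuvan data.

```python
def token_accuracy(predicted, gold):
    """Fraction of predicted gloss labels that exactly match the gold labels.

    Both arguments are lists of sentences; each sentence is a list of gloss
    labels, one per gold-segmented morpheme (an assumption of this sketch).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must contain the same sentences")
    correct = 0
    total = 0
    for pred_sent, gold_sent in zip(predicted, gold):
        # With gold segmentation, both sides have one label per morpheme.
        if len(pred_sent) != len(gold_sent):
            raise ValueError("gold segmentation implies equal label counts")
        for pred_label, gold_label in zip(pred_sent, gold_sent):
            correct += int(pred_label == gold_label)
            total += 1
    return correct / total if total else 0.0


# Placeholder labels for illustration only, not real corpus data.
gold = [["STEM1", "PL", "ACC"], ["STEM2", "PST"]]
predicted = [["STEM1", "PL", "DAT"], ["STEM2", "PST"]]
accuracy = token_accuracy(predicted, gold)
print(accuracy)  # 4 of 5 labels match
```

Because segmentation is given, the metric reduces to a label-by-label comparison, with no alignment step between predicted and gold morphemes.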

Figure 1: Hybrid pipeline combining BiLSTM-CRF structured prediction with LLM post-correction using retrieval-augmented prompting.
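The post-correction stage in the figure might be sketched roughly as below. The review does not say how examples are retrieved or how the prompt is worded, so everything here is an assumption: character-trigram Jaccard overlap stands in for the retrieval method, the prompt template is invented, and the strings in `pool` are placeholders rather than real Jungar Tuvan. No LLM is called; the sketch only builds the prompt that would be sent.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a padded string."""
    padded = f" {text} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Jaccard overlap of character trigrams (assumed retrieval score)."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def retrieve(query, pool, k):
    """Return the k glossed examples most similar to the query sentence."""
    ranked = sorted(pool, key=lambda ex: similarity(query, ex["segmented"]),
                    reverse=True)
    return ranked[:k]


def build_correction_prompt(segmented, draft_gloss, examples):
    """Assemble a few-shot prompt asking an LLM to fix the draft gloss."""
    lines = ["Correct the draft gloss. Give one gloss per morpheme.", ""]
    for ex in examples:
        lines += [f"Text: {ex['segmented']}", f"Gloss: {ex['gloss']}", ""]
    lines += [f"Text: {segmented}",
              f"Draft gloss: {draft_gloss}",
              "Corrected gloss:"]
    return "\n".join(lines)


# Placeholder annotated pool (hypothetical strings, not corpus data).
pool = [
    {"segmented": "foo-lar bar-di", "gloss": "STEM1-PL STEM2-PST"},
    {"segmented": "baz-ni qux", "gloss": "STEM3-ACC STEM4"},
    {"segmented": "foo-ni bar", "gloss": "STEM1-ACC STEM2"},
]
query = "foo-lar qux"
draft = "STEM1-PL STEM9"  # stands in for the BiLSTM-CRF first-pass output
examples = retrieve(query, pool, k=2)
prompt = build_correction_prompt(query, draft, examples)
print(prompt)
```

Swapping `retrieve` for a uniform random sample from `pool` gives the random-selection condition that the experiments compare against.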

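Experimental Results

Research Questions

RQ1 Does retrieval-augmented prompting improve glossing accuracy over random example selection?
RQ2 Does few-shot (n-shot) scaling affect LLM-based glossing performance in a low-resource language?
RQ3 Do morpheme dictionaries help or hinder LLM-based glossing in this setting?
RQ4 Does the hybrid BiLSTM-CRF + LLM correction pipeline outperform a pure BiLSTM or a pure LLM approach?
RQ5 What design principles should guide retrieval strategy and the use of dictionary resources in fieldwork IGT tasks?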

Key Findings

Model	Random	RAG
deepseek-v3.2-exp	0.118	0.506
qwen3-max	0.062	0.381
gpt-4o-mini	0.103	0.396
gemma-3-27b-it	0.068	0.344

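Retrieval-augmented prompting yields substantial gains over random example selection for every model (for example, deepseek-v3.2-exp rises from 0.118 to 0.506).
Performance scales approximately logarithmically with the number of in-context examples and plateaus at around 10–15 examples.
Morpheme dictionaries generally hurt performance; neither partial nor full dictionaries reliably improve glossing accuracy.
The hybrid BiLSTM-CRF + LLM correction improves consistently over pure generation, most notably in few-shot settings.
Lexical morphemes gain the most from hybrid correction, whereas grammatical morphemes are already handled well by the BiLSTM baseline.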
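The size of the retrieval gain can be read off the Random vs. RAG table with a few lines of arithmetic. The scores below are copied from that table and the 0.474 baseline from the Figure 2 caption; the comparison against the baseline assumes both were measured on the same test set with the same metric, which this review does not state explicitly. The helper name `rag_gains` is hypothetical.

```python
# (random, rag) scores copied from the Random vs. RAG table.
scores = {
    "deepseek-v3.2-exp": (0.118, 0.506),
    "qwen3-max": (0.062, 0.381),
    "gpt-4o-mini": (0.103, 0.396),
    "gemma-3-27b-it": (0.068, 0.344),
}
BILSTM_BASELINE = 0.474  # value quoted in the Figure 2 caption


def rag_gains(table):
    """Absolute gain of RAG over random example selection, per model."""
    return {model: round(rag - rnd, 3) for model, (rnd, rag) in table.items()}


gains = rag_gains(scores)
# Models whose RAG-only score exceeds the BiLSTM baseline (assumes the
# numbers are directly comparable, see the note above).
above_baseline = [m for m, (_, rag) in scores.items() if rag > BILSTM_BASELINE]

for model, gain in gains.items():
    print(f"{model}: +{gain:.3f}")
print("RAG alone above BiLSTM baseline:", above_baseline)
```

Under that assumption, only one of the four models beats the BiLSTM baseline through generation alone, which is consistent with the motivation for using the LLM as a post-corrector rather than as the sole glosser.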

Figure 2: Experiment 2: n-shot scaling curves for RAG LLM generation. Performance scales approximately logarithmically with example count, plateauing around n=10–15 for most models. The BiLSTM baseline (0.474) is provided in the text for reference.
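One way to check a claim of approximately logarithmic scaling is to fit accuracy as a + b·ln(n) by least squares. The sketch below does this in plain Python. The data points are synthetic, generated from hypothetical coefficients (`TRUE_A`, `TRUE_B`) purely to show that the fit recovers them; they are not results from the paper.

```python
import math


def fit_log_curve(ns, accs):
    """Least-squares fit of acc = a + b * ln(n); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accs) / len(accs)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic, hypothetical values for illustration only (not paper results).
TRUE_A, TRUE_B = 0.20, 0.10
shots = [1, 2, 5, 10, 15]
synthetic_acc = [TRUE_A + TRUE_B * math.log(n) for n in shots]

a, b = fit_log_curve(shots, synthetic_acc)
# Under a log curve, every doubling of n adds the same amount, b * ln(2).
gain_per_doubling = b * math.log(2)
print(f"a={a:.3f}, b={b:.3f}, gain per doubling={gain_per_doubling:.3f}")
```

Under such a curve each doubling of the example count adds a fixed increment, so the benefit per additional example shrinks as n grows. That matches the flattening the caption describes around n=10–15.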


This review was generated by AI and checked by a human editor.