QUICK REVIEW

[Paper Review] Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

Mikel Zubillaga, Oscar Sainz|arXiv (Cornell University)|Jan 26, 2026

Topic Modeling | Citations: 0

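One-sentence summary

This paper introduces ThinkTwice, a framework that samples multiple candidate document-level information extraction outputs from an LLM and selects the best one. It reports state-of-the-art zero-shot and supervised results, particularly with reasoning-oriented models.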

ABSTRACT

Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the state-of-the-art.

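Research motivation and objectives

Motivate and quantify the output variability of decoder-only LLMs on document-level IE when prompted with annotation guidelines.
Propose ThinkTwice, which generates multiple candidate templates for each document and selects the best one.
Develop an unsupervised selector (F1 Voting) and a supervised, reward-based selector.
Create silver training data with rejection sampling to address the lack of gold-standard reasoning traces.
Demonstrate improvements over greedy decoding in zero-shot and supervised settings, and compare against the previous state of the art.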

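Proposed method

Prompt the LLM with the annotation guidelines to generate N candidate templates for a document.
Constrain the decoding of each candidate to a predefined JSON schema.
Apply a selector S to choose the best candidate among the templates T_i (unsupervised or supervised).
Unsupervised selector: F1 Voting scores each candidate by its average F1-based similarity to the other candidates and picks the top-scoring one.
Supervised selector: train a reward model on silver data (generated as reasoning-template pairs) and use it to rank the candidates.
Fine-tune the reasoning LLM via rejection sampling so that it produces high-quality silver reasoning traces for supervision.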
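The sampling and unsupervised selection steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the representation of a template as a set of (slot, value) pairs, the names `pairwise_f1`, `f1_voting`, `think_twice`, and `sample_fn`, and the treatment of two empty templates as perfect agreement are all assumptions made for the example. The paper's actual similarity measure may match slots and entities differently.

```python
from typing import Callable, FrozenSet, List, Tuple

# Hypothetical representation: a template is a set of (slot, value) pairs.
Template = FrozenSet[Tuple[str, str]]


def pairwise_f1(a: Template, b: Template) -> float:
    """F1 overlap between two templates (symmetric in a and b)."""
    if not a and not b:
        return 1.0  # assumption: two empty templates count as full agreement
    return 2 * len(a & b) / (len(a) + len(b))


def f1_voting(candidates: List[Template]) -> int:
    """Return the index of the candidate with the highest mean F1 to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return 0
    scores = []
    for i, cand in enumerate(candidates):
        sims = [pairwise_f1(cand, other)
                for j, other in enumerate(candidates) if j != i]
        scores.append(sum(sims) / len(sims))
    # max() keeps the first index when several candidates tie.
    return max(range(len(candidates)), key=lambda i: scores[i])


def think_twice(
    document: str,
    sample_fn: Callable[[str], Template],  # hypothetical: one sampled LLM output
    selector: Callable[[List[Template]], int] = f1_voting,
    n: int = 8,
) -> Template:
    """Sample n candidate templates and return the one the selector picks."""
    candidates = [sample_fn(document) for _ in range(n)]
    return candidates[selector(candidates)]


# Toy example with four hand-written candidates.
c0 = frozenset({("PerpInd", "guerrillas"), ("Target", "embassy")})
c1 = frozenset({("PerpInd", "guerrillas"), ("Target", "embassy"), ("Weapon", "bomb")})
c2 = frozenset({("PerpInd", "guerrillas"), ("Weapon", "bomb")})
c3 = frozenset({("PerpInd", "soldiers")})
print(f1_voting([c0, c1, c2, c3]))  # index 1: c1 overlaps most with the rest
```

A supervised selector would fit the same `selector` slot by returning the index of the candidate with the highest reward-model score instead of the highest agreement score.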

Figure 1 : Results on MUC-4 showing better greedy results and a more effective set of samples for Qwen3 32B when thinking. Maximum reports the results of oracle selection among generated samples.

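Experimental results

Research questions

RQ1: Does sampling multiple outputs from a decoder-only LLM outperform greedy decoding for document-level IE?
RQ2: In DocIE, do reasoning models benefit more from sampling than non-reasoning models?
RQ3: How effective are the unsupervised (F1 Voting) and supervised (reward model) selectors at choosing high-quality templates?
RQ4: Does rejection sampling produce silver reasoning traces that are useful for training the supervised selector?
RQ5: How well does ThinkTwice generalize to cross-lingual document-level IE?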

Key findings

Model	Selector	MUC	MultiMUC	BETTER	AVG
ChatGPT 3.5 †	✗	22.41	12.93	-	-
Greedy Llama R1	✗	18.68	11.46	14.78	14.97
ThinkTwice Llama R1	Majority	21.96	12.78	3.12	12.62
ThinkTwice Llama R1	F1 Voting	21.23	13.22	17.10	17.18
ThinkTwice Llama R1	(oracle)	42.32	29.66	34.08	35.35
Greedy Qwen 3	✗	22.99	12.98	16.12	17.36
ThinkTwice Qwen 3	Majority	26.18	14.83	17.38	19.46
ThinkTwice Qwen 3	F1 Voting	24.82	15.04	20.02	19.96
ThinkTwice Qwen 3	(oracle)	46.48	33.08	36.74	38.76
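As a quick reading aid for the table, the short script below recomputes the per-model average from the three dataset columns and reports how much F1 Voting gains over greedy decoding and how far it remains from oracle selection. The scores are copied from the rows above. The dictionary layout and the function names are illustrative choices, and averages recomputed this way can differ from the AVG column in the last digit because of rounding.

```python
# Scores copied from the table: (MUC, MultiMUC, BETTER).
ROWS = {
    ("Llama R1", "greedy"): (18.68, 11.46, 14.78),
    ("Llama R1", "f1_voting"): (21.23, 13.22, 17.10),
    ("Llama R1", "oracle"): (42.32, 29.66, 34.08),
    ("Qwen 3", "greedy"): (22.99, 12.98, 16.12),
    ("Qwen 3", "f1_voting"): (24.82, 15.04, 20.02),
    ("Qwen 3", "oracle"): (46.48, 33.08, 36.74),
}


def avg(scores):
    """Unweighted mean over the three datasets."""
    return sum(scores) / len(scores)


def gain_and_gap(model):
    """Return (F1 Voting gain over greedy, remaining gap to oracle) in average points."""
    greedy = avg(ROWS[(model, "greedy")])
    voted = avg(ROWS[(model, "f1_voting")])
    oracle = avg(ROWS[(model, "oracle")])
    return voted - greedy, oracle - voted


for model in ("Llama R1", "Qwen 3"):
    gain, gap = gain_and_gap(model)
    print(f"{model}: gain over greedy = {gain:+.2f}, gap to oracle = {gap:.2f}")
```

On these numbers, F1 Voting gains roughly 2.2 average points for Llama R1 and roughly 2.6 for Qwen 3, while about 18 to 19 points separate it from the oracle, which is the headroom a better selector could target.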

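Reasoning models consistently outperform standard LLMs on DocIE tasks in the zero-shot setting.
Sampling with ThinkTwice and F1 Voting outperforms the greedy baselines and achieves state-of-the-art zero-shot results.
Supervised selection with a reward model yields substantial gains, moves closer to oracle performance, and sets a new state of the art in the monolingual setting.
ThinkTwice trained on English generalizes effectively to multiple languages with the reward selector, often matching or exceeding baselines trained on the target language.
Rejection sampling helps produce high-quality silver reasoning traces for training the selector, although the results still fall short of full oracle performance.
The oracle (best possible selection) results indicate that better selectors leave room for further improvement.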

Figure 2 : ThinkTwice architecture, with the inference process at the bottom. The supervised option includes two steps: (1) The iterative procedure to generate the silver dataset with trajectories and to fine-tune the reasoning model; (2) Training the selector wit
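The silver-data step in the first stage can be illustrated with a generic rejection-sampling loop: draw several (reasoning trace, template) pairs for a labeled document and keep only those whose template is close enough to the gold template. This is a sketch under stated assumptions, not the paper's procedure. The acceptance rule (F1 against gold at or above `threshold`), the default values, and the names `rejection_sample`, `sample_fn`, and `template_f1` are hypothetical, and the iterative fine-tuning of the reasoning model on the accepted pairs is left out.

```python
from typing import Callable, FrozenSet, List, Tuple

Template = FrozenSet[Tuple[str, str]]  # hypothetical: (slot, value) pairs
Sample = Tuple[str, Template]          # (reasoning trace, output template)


def template_f1(pred: Template, gold: Template) -> float:
    """F1 overlap between a predicted template and the gold template."""
    if not pred and not gold:
        return 1.0  # assumption: both empty counts as a perfect match
    return 2 * len(pred & gold) / (len(pred) + len(gold))


def rejection_sample(
    document: str,
    gold: Template,
    sample_fn: Callable[[str], Sample],  # hypothetical: one sampled LLM output
    num_samples: int = 8,
    threshold: float = 0.8,              # assumed acceptance threshold
) -> List[Sample]:
    """Keep the sampled (reasoning, template) pairs whose template is near gold."""
    accepted = []
    for _ in range(num_samples):
        reasoning, template = sample_fn(document)
        if template_f1(template, gold) >= threshold:
            accepted.append((reasoning, template))
    return accepted
```

The accepted pairs play the role of silver training examples: each one couples a reasoning trace with a template that the labeled data confirms, even though the trace itself was never annotated by hand.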

This review was generated by AI and checked by a human editor.