QUICK REVIEW

[Paper Digest] Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

Mikel Zubillaga, Oscar Sainz|arXiv (Cornell University)|Jan 26, 2026

Topic: Topic Modeling | Cited by: 0

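One-sentence summary

The paper proposes ThinkTwice, a sampling-and-selection framework that generates multiple candidate document-level information extraction outputs from an LLM and selects the best one, achieving state-of-the-art zero-shot and supervised results, especially with reasoning-oriented models.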

ABSTRACT

Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the state-of-the-art.

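Motivation and objectives

Motivate and quantify the output variability of decoder-only LLMs on document-level information extraction under guideline-based prompting.
Propose ThinkTwice, which generates multiple candidate templates for each document and selects the best one.
Develop unsupervised (F1 voting) and supervised (reward-based) selectors.
Address the lack of gold-standard reasoning trajectories by using rejection sampling to generate silver training data.
Demonstrate gains over greedy decoding and the prior state of the art in zero-shot, supervised, and cross-lingual settings.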

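Proposed method

Prompt the LLM, under the annotation guidelines, to generate N candidate templates for a document.
Constrain decoding so that every candidate follows a predefined JSON schema.
Apply a selector S (unsupervised or supervised) to choose the best candidate among the templates T_i.
Unsupervised selector: F1 voting scores each candidate by its mean F1-based similarity to the other candidates and selects the highest-scoring one.
Supervised selector: train a reward model on silver data (generated reasoning–template pairs) to rank the candidates.
Train reasoning LLMs via rejection sampling so that they produce high-quality silver reasoning traces for supervision.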
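
The F1 voting step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each template can be flattened into a set of (slot, value) pairs and that similarity is plain set-overlap F1. The `Template` type and the function names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Assumption: a template is flattened into a set of (slot, value) pairs.
Template = FrozenSet[Tuple[str, str]]


def f1(a: Template, b: Template) -> float:
    """Set-overlap F1 between two templates (symmetric in a and b)."""
    if not a and not b:
        return 1.0  # two empty templates agree completely
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def f1_voting(candidates: List[Template]) -> int:
    """Return the index of the candidate with the highest mean F1 to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return 0
    scores = []
    for i, cand in enumerate(candidates):
        sims = [f1(cand, other) for j, other in enumerate(candidates) if j != i]
        scores.append(sum(sims) / len(sims))
    # Ties are broken in favour of the earliest candidate.
    return max(range(len(candidates)), key=scores.__getitem__)
```

The selector needs no labels: a candidate that most other samples agree with receives a high score, while an outlier is pushed down.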

Figure 1: Results on MUC-4 showing better greedy results and a more effective set of samples for Qwen3 32B when thinking. "Maximum" reports the results of oracle selection among the generated samples.
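
The "Maximum" (oracle) number in the caption is an upper bound for any selector: it picks, with access to the gold template, the sampled candidate that scores best. A minimal sketch under the assumption that templates are sets of (slot, value) pairs scored with set-overlap F1 (the actual evaluation metric is not specified here; all names are hypothetical):

```python
from typing import FrozenSet, List, Tuple

# Assumption: a template is a set of (slot, value) pairs.
Template = FrozenSet[Tuple[str, str]]


def f1(pred: Template, gold: Template) -> float:
    """Set-overlap F1 of a predicted template against the gold template."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def oracle_score(candidates: List[Template], gold: Template) -> float:
    """Best score reachable by a perfect selector over the sampled candidates."""
    return max(f1(c, gold) for c in candidates)


def headroom(candidates: List[Template], gold: Template) -> float:
    """Gap between the oracle and the first candidate (stand-in for greedy)."""
    return oracle_score(candidates, gold) - f1(candidates[0], gold)
```

A large gap between the oracle and the greedy output signals that better candidates exist among the samples and that selection is worth the extra compute.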

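Experimental results

Research questions

RQ1: Can sampling multiple outputs from decoder-only LLMs outperform greedy decoding on document-level IE?
RQ2: Do reasoning models benefit more from sampling than non-reasoning models in DocIE?
RQ3: How effective are the unsupervised (F1 voting) and supervised (reward model) selectors at picking high-quality templates?
RQ4: Can rejection sampling produce silver reasoning traces that are useful for training a supervised selector?
RQ5: How well does ThinkTwice generalize to cross-lingual document-level IE?

Main findings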

Model	Selector	MUC	MultiMUC	BETTER	AVG
ChatGPT 3.5 †	✗	22.41	12.93	-	-
Greedy Llama R1	✗	18.68	11.46	14.78	14.97
ThinkTwice Llama R1	Majority	21.96	12.78	3.12	12.62
ThinkTwice Llama R1	F1 Voting	21.23	13.22	17.10	17.18
ThinkTwice Llama R1	(oracle)	42.32	29.66	34.08	35.35
Greedy Qwen 3	✗	22.99	12.98	16.12	17.36
ThinkTwice Qwen 3	Majority	26.18	14.83	17.38	19.46
ThinkTwice Qwen 3	F1 Voting	24.82	15.04	20.02	19.96
ThinkTwice Qwen 3	(oracle)	46.48	33.08	36.74	38.76

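Reasoning models consistently outperform standard LLMs on DocIE in the zero-shot setting.
Sampling with ThinkTwice combined with F1 voting surpasses the greedy baselines and reaches state-of-the-art zero-shot results.
Supervised selection (reward model) yields substantial gains, approaches oracle performance, and sets a new state of the art in both monolingual and cross-lingual settings.
Cross-lingual transfer: ThinkTwice trained on English with the reward selector generalizes well to multiple languages, often matching or exceeding target-language baselines.
Rejection sampling makes it possible to generate high-quality silver reasoning traces for training the selector, although the result still falls short of full oracle performance.
The oracle (best possible selection) results show that better selectors could still bring further gains.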
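
The rejection-sampling idea behind the silver data can be sketched as below. This is an illustration under stated assumptions, not the paper's procedure: `sample_fn` is a hypothetical stand-in for a call to the reasoning LLM, templates are sets of (slot, value) pairs, and the acceptance threshold `min_f1` and sample budget `num_samples` are invented for the example.

```python
from typing import Callable, FrozenSet, Optional, Tuple

Template = FrozenSet[Tuple[str, str]]   # assumption: set of (slot, value) pairs
Sample = Tuple[str, Template]           # (reasoning trace, output template)


def f1(pred: Template, gold: Template) -> float:
    """Set-overlap F1 of a predicted template against the gold template."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(
    document: str,
    gold: Template,
    sample_fn: Callable[[str], Sample],  # hypothetical LLM sampling call
    num_samples: int = 8,                # hypothetical budget
    min_f1: float = 0.8,                 # hypothetical acceptance threshold
) -> Optional[Sample]:
    """Keep the best-scoring (trace, template) pair, or reject the document."""
    best: Optional[Sample] = None
    best_score = -1.0
    for _ in range(num_samples):
        trace, template = sample_fn(document)
        score = f1(template, gold)
        if score > best_score:
            best, best_score = (trace, template), score
    return best if best_score >= min_f1 else None
```

Accepted pairs attach a reasoning trace to a template known to be close to the gold annotation, which is what makes them usable as silver supervision when no gold traces exist.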

Figure 2: ThinkTwice architecture, with the inference process at the bottom. The supervised option includes two steps: (1) the iterative procedure to generate the silver dataset with trajectories and to fine-tune the reasoning model; (2) training the selector with …

This digest was generated by AI and reviewed by a human editor.