QUICK REVIEW

[Paper Review] Do not be greedy, Think Twice: Sampling and Selection for Document-level Information Extraction

Mikel Zubillaga, Oscar Sainz|arXiv (Cornell University)|2026. 01. 26.

Topic Modeling | Citations: 0

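One-Line Summary

This paper introduces ThinkTwice, a framework that samples multiple candidate document-level information extraction outputs from an LLM and selects the best one. It achieves state-of-the-art results in both zero-shot and supervised settings, particularly with reasoning-oriented models.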

ABSTRACT

Document-level Information Extraction (DocIE) aims to produce an output template with the entities and relations of interest occurring in the given document. Standard practices include prompting decoder-only LLMs using greedy decoding to avoid output variability. Rather than treating this variability as a limitation, we show that sampling can produce substantially better solutions than greedy decoding, especially when using reasoning models. We thus propose ThinkTwice, a sampling and selection framework in which the LLM generates multiple candidate templates for a given document, and a selection module chooses the most suitable one. We introduce both an unsupervised method that exploits agreement across generated outputs, and a supervised selection method using reward models trained on labeled DocIE data. To address the scarcity of golden reasoning trajectories for DocIE, we propose a rejection-sampling-based method to generate silver training data that pairs output templates with reasoning traces. Our experiments show the validity of unsupervised and supervised ThinkTwice, consistently outperforming greedy baselines and the state-of-the-art.

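Motivation and Objectives

Motivate and quantify the output variability of decoder-only LLMs on document-level IE under a given set of prompting guidelines.
Propose ThinkTwice, which generates multiple candidate templates per document and selects the best one.
Develop unsupervised (F1 Voting) and supervised (reward-based) selectors.
Address the scarcity of gold-standard reasoning traces by using rejection sampling to build silver training data that contains silver (not gold-standard) reasoning traces.
Demonstrate gains over greedy decoding and the prior state of the art in zero-shot, supervised, and multilingual generalization settings.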

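Proposed Method

Prompt the LLM to generate N candidate templates for a document under the annotation guidelines.
Constrain decoding so that each candidate follows a predefined JSON schema.
Apply a selector S (unsupervised or supervised) to choose the best candidate among the templates T_i.
Unsupervised selector: F1 Voting scores each candidate by its average F1-based similarity across the candidates and selects the top-scoring one.
Supervised selector: a reward model trained on silver data (generated reasoning-template pairs) ranks the candidates.
Fine-tune the reasoning LLM via rejection sampling to generate high-quality silver reasoning traces for supervised training.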
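
The unsupervised selection step above can be illustrated with a minimal sketch. It assumes each candidate template has been flattened into a set of (slot, value) pairs and that similarity is a plain set-overlap F1; the paper's actual F1 metric and template alignment may differ. The names `Template`, `pair_f1`, and `f1_voting`, as well as the example slots, are hypothetical and used only for illustration.

```python
from typing import FrozenSet, List, Tuple

# Assumption: a template is flattened into a set of (slot, value) pairs.
Template = FrozenSet[Tuple[str, str]]


def pair_f1(a: Template, b: Template) -> float:
    """Set-overlap F1 between two flattened templates."""
    if not a and not b:
        return 1.0  # two empty templates agree completely
    overlap = len(a & b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)
    recall = overlap / len(b)
    return 2 * precision * recall / (precision + recall)


def f1_voting(candidates: List[Template]) -> int:
    """Return the index of the candidate with the highest mean F1 to the others."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    if len(candidates) == 1:
        return 0
    scores = []
    for i, cand in enumerate(candidates):
        sims = [pair_f1(cand, other) for j, other in enumerate(candidates) if j != i]
        scores.append(sum(sims) / len(sims))
    # Ties are broken in favor of the earliest candidate.
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy candidates for one document (illustrative values only).
candidates = [
    frozenset({("perpetrator", "FMLN"), ("target", "embassy"), ("weapon", "bomb")}),
    frozenset({("perpetrator", "FMLN"), ("target", "embassy")}),
    frozenset({("perpetrator", "FMLN"), ("target", "embassy"), ("weapon", "rifle")}),
    frozenset({("perpetrator", "unknown")}),
]
best_index = f1_voting(candidates)
```

In this toy case the second candidate wins because it contains only the pairs that most other samples agree on, which is the intuition behind agreement-based selection.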

Figure 1 : Results on MUC-4 showing better greedy results and a more effective set of samples for Qwen3 32B when thinking. Maximum reports the results of oracle selection among generated samples.

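Experimental Results

Research Questions

RQ1: Does sampling multiple outputs from decoder-only LLMs outperform greedy decoding on document-level IE?
RQ2: Do reasoning models benefit more from sampling than non-reasoning models on document-level IE?
RQ3: How effective are the unsupervised (F1 Voting) and supervised (reward model) selectors at choosing high-quality templates?
RQ4: Can rejection sampling produce silver reasoning traces that are useful for training the supervised selector?
RQ5: How well does ThinkTwice generalize to multilingual document-level IE?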

Main Results

Model	Selector	MUC	MultiMUC	BETTER	Average
ChatGPT 3.5 †	✗	22.41	12.93	-	-
Greedy Llama R1	✗	18.68	11.46	14.78	14.97
ThinkTwice Llama R1	Majority	21.96	12.78	3.12	12.62
ThinkTwice Llama R1	F1 Voting	21.23	13.22	17.10	17.18
ThinkTwice Llama R1	(oracle)	42.32	29.66	34.08	35.35
Greedy Qwen 3	✗	22.99	12.98	16.12	17.36
ThinkTwice Qwen 3	Majority	26.18	14.83	17.38	19.46
ThinkTwice Qwen 3	F1 Voting	24.82	15.04	20.02	19.96
ThinkTwice Qwen 3	(oracle)	46.48	33.08	36.74	38.76
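
To make the table easier to read, the short sketch below computes, from the Average column only, how much F1 Voting gains over greedy decoding and how far it remains from oracle selection. The numbers are copied from the table; the dictionary layout and the names `averages` and `summary` are illustrative assumptions.

```python
# Average-column values copied from the table above.
averages = {
    "Llama R1": {"greedy": 14.97, "f1_voting": 17.18, "oracle": 35.35},
    "Qwen 3": {"greedy": 17.36, "f1_voting": 19.96, "oracle": 38.76},
}

summary = {}
for model, row in averages.items():
    summary[model] = {
        # Improvement of F1 Voting over the greedy baseline.
        "gain_over_greedy": row["f1_voting"] - row["greedy"],
        # Remaining gap between F1 Voting and oracle selection.
        "gap_to_oracle": row["oracle"] - row["f1_voting"],
    }

for model, stats in summary.items():
    print(
        f"{model}: +{stats['gain_over_greedy']:.2f} over greedy, "
        f"{stats['gap_to_oracle']:.2f} below oracle"
    )
```

On these averages, F1 Voting adds roughly 2 to 3 points over greedy decoding, while the oracle sits about 18 points higher, so most of the value in the sampled candidates is still left unselected.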

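Reasoning models consistently outperform standard LLMs on document-level IE tasks in the zero-shot setting.
Sampling with ThinkTwice and F1 Voting outperforms the greedy baseline and achieves state-of-the-art zero-shot results.
With supervised selection, the reward model yields substantial gains, moves closer to oracle performance, and sets a new state of the art in both monolingual and multilingual settings.
Multilingual transfer: ThinkTwice with a reward selector trained on English generalizes effectively to other languages, often matching or exceeding target-language baselines.
Rejection sampling makes it possible to generate high-quality silver reasoning traces for training the selector, although full oracle performance is not yet reached.
The oracle (best possible selection) results indicate room for further improvement with better selectors.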
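
The rejection-sampling idea mentioned in the findings above can be pictured with a minimal sketch: sample several (reasoning trace, template) pairs for a labeled document and keep only those whose template scores well against the gold template. The acceptance rule (a set-overlap F1 threshold), the flattened template representation, and the names `template_f1`, `rejection_sample`, and `stub_sampler` are all assumptions for illustration; the paper's actual criterion and iterative fine-tuning loop are not reproduced here.

```python
from typing import Callable, FrozenSet, List, Tuple

# Assumption: templates are flattened into sets of (slot, value) pairs.
Template = FrozenSet[Tuple[str, str]]
Sample = Tuple[str, Template]  # (reasoning trace, predicted template)


def template_f1(pred: Template, gold: Template) -> float:
    """Set-overlap F1 of a predicted template against the gold template."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(
    sampler: Callable[[str, int], List[Sample]],
    document: str,
    gold: Template,
    n: int = 8,
    threshold: float = 1.0,
) -> List[Sample]:
    """Keep sampled (trace, template) pairs whose template reaches the threshold."""
    kept = []
    for trace, template in sampler(document, n):
        if template_f1(template, gold) >= threshold:
            kept.append((trace, template))
    return kept


def stub_sampler(document: str, n: int) -> List[Sample]:
    """Hypothetical stand-in for sampling from a reasoning LLM."""
    fixed = [
        ("trace A", frozenset({("perpetrator", "FMLN"), ("target", "embassy")})),
        ("trace B", frozenset({("perpetrator", "FMLN")})),
        ("trace C", frozenset({("perpetrator", "army")})),
    ]
    return fixed[:n]


gold = frozenset({("perpetrator", "FMLN"), ("target", "embassy")})
silver_strict = rejection_sample(stub_sampler, "some document text", gold)
silver_loose = rejection_sample(stub_sampler, "some document text", gold, threshold=0.6)
```

The accepted pairs would then serve as silver training examples that pair a template with the reasoning that produced it; a lower threshold trades label quality for more data.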

Figure 2 : ThinkTwice architecture, with the inference process at the bottom. The supervised option includes two steps: ① The iterative procedure to generate the silver dataset with trajectories and to fine-tune the reasoning model; ② Training the selector wit

This review was generated by AI and reviewed by a human editor.