QUICK REVIEW

[Paper Review] More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Song Tae-Eun|arXiv (Cornell University)|2026. 03. 17.

Topic Modeling | Citations: 0

One-Line Summary

The study shows that multi-turn Dynamic Cross-Context Review (D-CCR) underperforms single-pass Cross-Context Review (CCR) because of false positive pressure and Review Target Drift; independent parallel reviews are preferable.

ABSTRACT

Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, $p < 0.001$, $d = -0.59$). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure -- reviewers in later rounds fabricate findings when the artifact's real errors have been exhausted, and (2) Review Target Drift -- reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from information amount -- within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.

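Motivation and Goals

Investigate whether adding multi-turn interaction to Cross-Context Review (CCR) helps verify artifacts that contain injected errors.
Test whether, under context separation, including author responses or earlier questions in later review rounds provides any benefit.
Identify the mechanisms (false positives, drift) that degrade multi-turn CCR performance.
Identify the best review strategy under context separation and offer practical guidance on how to spend a verification budget.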

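Proposed Method

Compare four D-CCR variants against the single-pass CCR baseline, using 30 artifacts and 150 injected errors.
Use Claude Opus 4.6 in independent sessions to preserve context separation between review rounds.
Conditions evaluated: CCR-1 (artifact only), D-CCR-2a (artifact + questions), D-CCR-2b (artifact + Q&A), D-CCR-2c (artifact only, a fresh second review).
The scoring function matches reviewer findings to ground-truth errors by combining line proximity, keyword overlap with Korean normalization, and fuzzy substring matching (threshold 1.0–3.0).
Compute per-artifact F1, precision, and recall, then compare conditions with paired t-tests and Wilcoxon tests under Bonferroni correction.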
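
The matching and per-artifact scoring steps above can be sketched as follows. This is a minimal illustration, not the paper's scoring function: the equal weighting of the three signals, the `line_window` of 3, the default threshold of 2.0 (picked from inside the reported 1.0–3.0 range), the greedy matching order, and the choice to exclude duplicates from precision are all assumptions, and `tokenize` is a placeholder that does no Korean normalization.

```python
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase and split on whitespace (placeholder for real normalization)."""
    return set(text.lower().split())


def match_score(finding, error, line_window=3):
    """Sum three signals, each in [0, 1]; the equal weights are an assumption."""
    score = 0.0
    # 1) Line proximity: the finding points near the injected error.
    if abs(finding["line"] - error["line"]) <= line_window:
        score += 1.0
    # 2) Keyword overlap: share of the error's words present in the finding.
    err_words = tokenize(error["description"])
    if err_words:
        score += len(err_words & tokenize(finding["text"])) / len(err_words)
    # 3) Fuzzy substring: longest common block relative to the error text.
    a, b = finding["text"].lower(), error["description"].lower()
    if b:
        block = SequenceMatcher(None, a, b, autojunk=False).find_longest_match(
            0, len(a), 0, len(b)
        )
        score += block.size / len(b)
    return score


def score_artifact(findings, errors, threshold=2.0):
    """Greedily label each finding as TP, duplicate, or FP, then compute metrics."""
    matched = set()
    tp = fp = dup = 0
    for finding in findings:
        scores = [match_score(finding, error) for error in errors]
        best = max(range(len(errors)), key=scores.__getitem__, default=None)
        if best is None or scores[best] < threshold:
            fp += 1          # matches no injected error well enough
        elif best in matched:
            dup += 1         # repeats an error that was already found
        else:
            matched.add(best)
            tp += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / len(errors) if errors else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "dup": dup,
            "precision": precision, "recall": recall, "f1": f1}


# Toy inputs, invented for illustration only.
errors = [
    {"line": 10, "description": "off-by-one in loop bound"},
    {"line": 40, "description": "missing null check"},
]
findings = [
    {"line": 11, "text": "The loop has an off-by-one in loop bound"},
    {"line": 12, "text": "off-by-one in loop bound again"},
    {"line": 90, "text": "variable naming is inconsistent"},
]
result = score_artifact(findings, errors)
```

On the toy inputs, the first finding counts as a true positive, the second as a duplicate of the same error, and the third as a false positive. Averaging the per-artifact values over all artifacts gives the condition-level numbers.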

Figure 1: F1, Precision, and Recall by condition. CCR-1 achieves the highest F1 and Precision. Multi-turn conditions increase Recall but collapse Precision, resulting in lower F1.
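
The condition comparison behind this figure relies on paired tests over per-artifact F1 scores. A minimal standard-library sketch of the paired t statistic and a Bonferroni adjustment follows. The function names and the sample values are hypothetical, and `d_z` (mean difference divided by the standard deviation of the differences) is only one common effect-size convention; the summary does not say which definition produced the reported d.

```python
import math
import statistics


def paired_t(scores_a, scores_b):
    """Return (t statistic, d_z effect size, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample SD of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff                # assumed effect-size convention
    return t_stat, d_z, n - 1


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Toy per-artifact F1 values, invented for illustration only.
multi_turn = [0.30, 0.25, 0.40, 0.35]
single_pass = [0.40, 0.30, 0.45, 0.50]
t_stat, d_z, dof = paired_t(multi_turn, single_pass)
adjusted = bonferroni([0.01, 0.2, 0.5])
```

A negative `t_stat` here means the first condition scored lower than the second on the same artifacts. Turning `t_stat` into a p-value needs the t distribution with `dof` degrees of freedom, which a statistics library would normally supply.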

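Experimental Results

Research Questions

RQ1. Does multi-turn D-CCR outperform single-pass CCR?
RQ2. Do author responses help the reviewer, or anchor it?
RQ3. In multi-round review, is continuity better than independence?
RQ4. Does independent repetition improve on a single pass?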

Key Results

Condition	Findings	TP	FP	Dup	Precision	Recall	F1	F1 SD
CCR-1	9.3	2.64	5.23	1.43	0.297	n/a	0.376	0.136
D-CCR-2a	15.4	2.96	9.17	3.27	0.197	n/a	0.293	0.102
D-CCR-2b	15.2	3.03	8.47	3.70	0.204	n/a	0.303	0.110
D-CCR-2c	18.4	3.10	9.70	5.60	0.168	n/a	0.263	0.091
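
The per-artifact means in this table can be cross-checked with a little arithmetic. The sketch below copies the first four numbers of each row and reads them as Findings, TP, FP, and Dup; that column assignment is an assumption, supported by the fact that TP + FP + Dup adds up to Findings in every row. The recall estimate further assumes that every artifact holds exactly 5 injected errors (150 errors across 30 artifacts), which the summary does not state.

```python
# Values copied from the table; the column assignment is an assumption.
rows = {
    "CCR-1":    {"findings": 9.3,  "tp": 2.64, "fp": 5.23, "dup": 1.43},
    "D-CCR-2a": {"findings": 15.4, "tp": 2.96, "fp": 9.17, "dup": 3.27},
    "D-CCR-2b": {"findings": 15.2, "tp": 3.03, "fp": 8.47, "dup": 3.70},
    "D-CCR-2c": {"findings": 18.4, "tp": 3.10, "fp": 9.70, "dup": 5.60},
}

# Assumption: injected errors are spread evenly, 150 / 30 = 5 per artifact.
ERRORS_PER_ARTIFACT = 150 / 30


def components_sum(condition):
    """TP + FP + Dup, which should reproduce the Findings column."""
    row = rows[condition]
    return row["tp"] + row["fp"] + row["dup"]


def fp_increase(condition, baseline="CCR-1"):
    """Relative growth in false positives over the baseline condition."""
    return (rows[condition]["fp"] - rows[baseline]["fp"]) / rows[baseline]["fp"]


def approx_recall(condition):
    """Mean TP divided by the assumed number of errors per artifact."""
    return rows[condition]["tp"] / ERRORS_PER_ARTIFACT


for name in rows:
    print(
        f"{name}: sum={components_sum(name):.2f} "
        f"fp_increase={fp_increase(name):+.0%} "
        f"approx_recall={approx_recall(name):.3f}"
    )
```

Under these assumptions, D-CCR-2b shows roughly 62% more false positives than CCR-1 and a recall gain of about 0.08, both in line with the figures quoted in the abstract.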

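Single-pass CCR beat every multi-turn variant on F1 (CCR-1 F1 = 0.376 vs. 0.263–0.303 for multi-turn; p < 0.001 for all comparisons but one, which was not significant).
Multi-turn variants raise recall but cut precision sharply, from 0.297 for CCR-1 to 0.168–0.204, and F1 collapses as a result.
False positive pressure drives the degradation: Round 2 findings add 3–4 extra false positives per artifact.
Review Target Drift describes how Q&A content pulls the reviewer's attention away from errors in the artifact and toward the quality of the conversation.
An ensemble (majority vote) of independent CCR reviews reaches a higher F1 (0.393) than any multi-turn variant, suggesting that parallel independent reviews are preferable to sequential repetition.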
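
A majority-vote ensemble over independent reviews can be sketched as below. This is only an illustration under stated assumptions: the summary does not describe how the paper merged reviews, and the sketch assumes each finding has already been reduced to a comparable key (the `"L10:off-by-one"` style keys are hypothetical), which is the hard part in practice.

```python
from collections import Counter


def majority_vote(reviews):
    """Keep the issue keys reported by more than half of the reviewers.

    reviews: list of sets of issue keys, one set per independent review.
    """
    counts = Counter(key for review in reviews for key in set(review))
    quorum = len(reviews) // 2 + 1
    return {key for key, votes in counts.items() if votes >= quorum}


# Toy reviews, invented for illustration only.
reviews = [
    {"L10:off-by-one", "L40:null-check"},
    {"L10:off-by-one", "L77:style"},
    {"L10:off-by-one", "L40:null-check", "L90:naming"},
]
consensus = majority_vote(reviews)
```

Findings raised by only one reviewer are dropped, so the vote filters out idiosyncratic false positives while keeping issues that independent sessions agree on.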

Figure 2: False positive accumulation across conditions. The dashed line marks CCR-1’s FP level. All multi-turn conditions add substantial Round 2 FPs (red), with D-CCR-2c generating the most noise.


This review was generated by AI and checked by a human editor.