QUICK REVIEW

[Paper Review] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Avni Mittal, Rauno Arike|arXiv (Cornell University)|Mar 5, 2026
Explainable Artificial Intelligence (XAI) | Citations: 0
One-sentence summary

C2-Faith benchmarks LLM judges on two faithfulness dimensions, causality and coverage, using controlled perturbations of PRM800K chains. It evaluates GPT-4.1, DeepSeek-V3.1, and o4-mini on binary causal detection, causal step localization, and coverage scoring, and finds that performance depends on task framing and that every judge detects errors more reliably than it localizes them.

ABSTRACT

Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.

Research motivation and objectives

  • Motivate the need to distinguish process-level faithfulness from answer plausibility in CoT explanations.
  • Define two faithfulness axes—causality (logical consistency of each step) and coverage (presence of essential intermediate inferences).
  • Create a diagnostic benchmark (C2-Faith) with ground-truth perturbations to measure judge reliability on both axes.

Proposed method

  • Construct causality perturbations by replacing a middle step with an acausal variant generated by an LLM.
  • Construct coverage perturbations by deleting a fraction of middle-region steps from perfect chains.
  • Derive datasets from PRM800K with ground-truth labels for causal errors and coverage deletions.
  • Evaluate three frontier judges (GPT-4.1, DeepSeek-V3.1, o4-mini) on three tasks: binary causal detection, causal step localization, and coverage scoring.
  • Use controlled perturbations with exact causal error indices and reference-scored coverage levels (0.1, 0.3, 0.5, 0.7).
  • Report metrics including detection rate, exact match for localization, mean absolute error, and Spearman correlation for coverage.

Research questions

  • Can frontier LLM judges reliably detect causal unfaithfulness at a given step?
  • Can judges localize the exact position of a causal unfaithfulness in a full chain?
  • How well do judges score coverage/monitorability when substantial parts of the reasoning are removed?
  • How do judge performance and biases vary with task framing and perturbation type?
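
The two perturbation types described under the proposed method can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the definition of "middle" as every step except the first and last, and the `make_acausal` callable (a stand-in for the LLM that writes the acausal variant) are all assumptions made for this example.

```python
import random
from typing import Callable, List, Tuple


def middle_indices(n_steps: int) -> List[int]:
    """Indices of 'middle' steps: all but the first and last (an assumption)."""
    return list(range(1, n_steps - 1))


def causal_perturbation(
    steps: List[str],
    make_acausal: Callable[[str], str],  # hypothetical stand-in for an LLM rewrite
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Replace one middle step with an acausal variant.

    Returns the perturbed chain and the index of the replaced step,
    which serves as the ground-truth error position.
    """
    candidates = middle_indices(len(steps))
    if not candidates:
        raise ValueError("need at least three steps to perturb a middle step")
    idx = rng.choice(candidates)
    perturbed = list(steps)
    perturbed[idx] = make_acausal(steps[idx])
    return perturbed, idx


def coverage_perturbation(
    steps: List[str],
    deletion_rate: float,
    rng: random.Random,
) -> Tuple[List[str], List[int]]:
    """Delete a fraction of middle steps.

    Returns the shortened chain and the sorted indices that were removed.
    """
    if not 0.0 <= deletion_rate <= 1.0:
        raise ValueError("deletion_rate must be in [0, 1]")
    candidates = middle_indices(len(steps))
    n_delete = round(deletion_rate * len(candidates))
    deleted = sorted(rng.sample(candidates, n_delete))
    deleted_set = set(deleted)
    kept = [s for i, s in enumerate(steps) if i not in deleted_set]
    return kept, deleted


# Toy usage on a made-up ten-step chain (not PRM800K data).
chain = [f"step {i}" for i in range(10)]
rng = random.Random(0)
perturbed, error_idx = causal_perturbation(chain, lambda s: "UNRELATED: " + s, rng)
shortened, deleted = coverage_perturbation(chain, 0.5, rng)
```

Returning the replaced index and the deleted indices alongside the chain is what gives each example a known label to score a judge against.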
Figure 1: Overview of C2-Faith benchmark construction and evaluation tasks.

Experimental results

Key findings

  • No single judge dominates across all tasks; performance is task-framing dependent.
  • All judges detect errors frequently (88.4%–94.2%), but exact localization is substantially harder (exact match ranges 55.8%–68.0%).
  • There is a consistent early-prediction bias in localization (negative mean signed error).
  • Coverage judgments exhibit inflation: scores remain high even with high deletion rates, and correlations with reference coverage weaken at higher deletions.
  • o4-mini is the strongest overall judge for multi-task faithfulness evaluation, with DeepSeek-V3.1 excelling in constrained causal detection and GPT-4.1 showing moderate coverage tracking.
  • DeepSeek-V3.1 shows a ceiling effect on coverage at low deletion rates, collapsing to near-zero correlation with ground truth.
(a) Exp 1: detection rates with 95% bootstrap CIs.

This review was generated by AI and checked by a human editor.