QUICK REVIEW

[Paper Review] CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Zhengqing Yuan, Kaiwen Shi | arXiv (Cornell University) | Feb 26, 2026

Scientific Computing and Data Management | Citations: 0

One-Line Summary

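CiteAudit introduces a multi-agent framework and a large-scale benchmark for verifying the faithfulness and evidence alignment of citations in scientific writing, addressing miscitation in the LLM era. It improves detection accuracy and interpretability over baselines and provides a standardized evaluation protocol.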

ABSTRACT

Scientific research relies on accurate citation for attribution and integrity, yet large language models (LLMs) introduce a new risk: fabricated references that appear plausible but correspond to no real publications. Such hallucinated citations have already been observed in submissions and accepted papers at major machine learning venues, exposing vulnerabilities in peer review. Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation. We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing. Our multi-agent verification pipeline decomposes citation checking into claim extraction, evidence retrieval, passage matching, reasoning, and calibrated judgment to assess whether a cited source truly supports its claim. We construct a large-scale human-validated dataset across domains and define unified metrics for citation faithfulness and evidence alignment. Experiments with state-of-the-art LLMs reveal substantial citation errors and show that our framework significantly outperforms prior methods in both accuracy and interpretability. This work provides the first scalable infrastructure for auditing citations in the LLM era and practical tools to improve the trustworthiness of scientific references.

Motivation and Objectives

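Motivate the need to curb the hallucinated citations that LLMs introduce into academic writing.
Propose a scalable multi-agent verification framework for assessing citation faithfulness and evidence alignment.
Build a large-scale, human-validated benchmark spanning diverse domains and citation types.
Provide a unified evaluation protocol and metrics for citation verification.
Show, in experiments with state-of-the-art LLMs, improved accuracy and interpretability over baselines.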

Proposed Method

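A five-agent pipeline (Claim Extractor, Retriever, Evidence Matcher, Reasoner, Judge) is coordinated by a planning controller.
A large-scale, human-validated dataset is built by combining real-world citation errors with systematically generated hallucinated citations.
Verification is formalized as a multi-stage evidence-alignment task whose strict criterion requires an exact metadata match.
The agents ground their evidence in external knowledge sources, namely web search and scholarly databases.
Models are evaluated on both the generated benchmark and a real-world test set, using standard metrics for citation faithfulness and judgment consistency.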
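
The strict metadata criterion described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Reference` fields, the `normalize` rule, the in-memory `index` standing in for web search and scholarly databases, and the two verdict labels are all assumptions made for illustration, and the LLM-based agents (claim extraction, reasoning) are left out entirely.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    """Hypothetical minimal citation record (fields are assumptions)."""
    title: str
    authors: tuple[str, ...]
    year: int


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so formatting noise is ignored."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def retrieve(cited: Reference, index: list[Reference]) -> list[Reference]:
    """Retriever stand-in: records whose normalized title equals the cited title."""
    key = normalize(cited.title)
    return [rec for rec in index if normalize(rec.title) == key]


def metadata_matches(cited: Reference, found: Reference) -> bool:
    """Strict criterion: title, ordered author list, and year must all agree."""
    return (
        normalize(cited.title) == normalize(found.title)
        and [normalize(a) for a in cited.authors]
        == [normalize(a) for a in found.authors]
        and cited.year == found.year
    )


def judge(cited: Reference, index: list[Reference]) -> str:
    """Judge stand-in: 'verified' only if some retrieved record matches exactly."""
    candidates = retrieve(cited, index)
    if any(metadata_matches(cited, rec) for rec in candidates):
        return "verified"
    return "hallucinated"


# Usage with an obviously made-up index entry.
index = [Reference("A Toy Paper on Graphs", ("A. Author", "B. Writer"), 2020)]
print(judge(Reference("a toy paper on graphs.", ("A. Author", "B. Writer"), 2020), index))
print(judge(Reference("A Toy Paper on Graphs", ("A. Author",), 2021), index))
```

Under these assumptions, a reference whose title exists but whose authors or year disagree is still flagged, which is the kind of plausible-looking fabrication a looser title-only check would let through.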

Experimental Results

Research Questions

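RQ1: Can the multi-agent framework reliably detect hallucinated citations in scientific papers?
RQ2: How do evidence retrieval and reasoning affect faithfulness judgments across different citation types?
RQ3: How does incorporating authoritative scholarly verification affect recall and precision?
RQ4: How closely does the generated benchmark reflect real-world citation error patterns?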

Key Findings

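The benchmark combines real-world citations with synthesized hallucinated citations and exhibits realistic error patterns similar to those observed in practice.
Multi-agent verification substantially improves accuracy and interpretability over single-model baselines.
As the final verification stage, the Scholar Agent reduces the stubborn hallucinations that can slip past web-based checks alone.
On real-world data, the proposed framework achieves the highest accuracy, precision, recall, and F1 among the evaluated methods.
By confining heavy reasoning to the planning and final-judgment stages, the approach keeps cost and latency below those of many proprietary LLM-based solutions.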
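
For readers who want to reproduce this kind of comparison, the four reported metrics can be computed from binary verdicts as in the sketch below. It assumes that "hallucinated" is the positive class and that zero denominators are reported as 0.0; the review does not say how the paper handles either point, and the function name is hypothetical.

```python
def detection_metrics(gold: list[bool], pred: list[bool]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 with True meaning 'hallucinated'."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Confusion-matrix counts for the positive ('hallucinated') class.
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    tn = sum((not g) and (not p) for g, p in zip(gold, pred))

    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(gold) if gold else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Usage: one hit, one miss, one false alarm, one correct pass.
scores = detection_metrics(
    gold=[True, True, False, False],
    pred=[True, False, True, False],
)
print(scores)
```

Reporting precision and recall separately matters here because a detector can reach high accuracy on a mostly clean reference list while still missing most fabricated entries.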


This review was generated by AI and checked by a human editor.