QUICK REVIEW

[Paper Review] Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Rui Ye, Xianghe Pang|arXiv (Cornell University)|Dec 2, 2024

Scientometrics and bibliometrics research | Cited by 7

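One-Sentence Summary

This paper shows that large language models used for scholarly peer review are vulnerable to both explicit and implicit manipulation and carry inherent flaws and biases, and are therefore not yet suited to widespread adoption.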

ABSTRACT

Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlaps between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers compared to full papers and favoring well-known authors in single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.

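Research Motivation and Objectives

Motivate the work by the strain that rising submission volumes and reviewer workload place on traditional peer review.
Assess whether LLMs can review scholarly manuscripts reliably.
Identify the manipulation vectors (explicit and implicit) that can sway LLM reviews.
Investigate flaws and biases inherent to LLM-based review, such as hallucination, length bias, and author bias.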

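Proposed Method

Reproduce three established LLM-based review pipelines that have been linked to alignment with human reviews.
Design an explicit manipulation that steers LLM reviews toward acceptance by inserting invisible white text into the paper body.
Examine implicit manipulation by analyzing how authors' emphasis on limitations affects LLM reviews and human reviews.
Evaluate inherent flaws across multiple LLMs, including hallucination on incomplete content, length bias, and author bias.
Quantify the effects using LLM-human review consistency metrics and a rating-to-paper model that simulates the impact on decisions.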
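The decision-impact simulation can be illustrated with a minimal sketch. It is not the authors' model: the rating distribution, the number of reviews per paper, the inflated rating value, and the uniform random choice of manipulated reviews are all assumptions made here for illustration, and the function names are hypothetical. The sketch inflates a share of individual reviews, re-ranks papers by mean rating, and reports the fraction of originally top-ranked papers that drop out of the top group.

```python
import random


def top_fraction(scores, fraction):
    """Return the indices of the papers ranked in the top `fraction` by score."""
    k = round(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def simulate_displacement(num_papers=1000, reviews_per_paper=4,
                          manipulated_share=0.05, top_share=0.30,
                          inflated_rating=10.0, seed=0):
    """Fraction of originally top-ranked papers pushed out after manipulation."""
    rng = random.Random(seed)

    # Assumption: honest ratings follow a clipped Gaussian on a 1-10 scale.
    ratings = [
        [min(10.0, max(1.0, rng.gauss(5.5, 1.5))) for _ in range(reviews_per_paper)]
        for _ in range(num_papers)
    ]
    honest_scores = [sum(row) / len(row) for row in ratings]
    before = top_fraction(honest_scores, top_share)

    # Assumption: manipulated reviews are chosen uniformly at random
    # and each one is replaced by the inflated rating.
    slots = [(p, r) for p in range(num_papers) for r in range(reviews_per_paper)]
    chosen = rng.sample(slots, round(len(slots) * manipulated_share))
    manipulated = [row[:] for row in ratings]
    for p, r in chosen:
        manipulated[p][r] = inflated_rating

    new_scores = [sum(row) / len(row) for row in manipulated]
    after = top_fraction(new_scores, top_share)

    # Papers that were in the top group before but are no longer in it.
    displaced = before - after
    return len(displaced) / len(before)


if __name__ == "__main__":
    share = simulate_displacement()
    print(f"Displaced from the top 30%: {share:.1%}")
```

The number printed depends entirely on the assumed distribution and parameters, so it should not be expected to reproduce the 12% figure reported in the abstract. The point is only the mechanism: a few inflated reviews lift some papers into the top group, and each of those entries pushes an honestly rated paper out.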

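Experimental Results

Research Questions

RQ1: Can covert content in the paper body manipulate LLM-based reviews into diverging from human judgment?
RQ2: Do author-disclosed limitations bias LLM reviews more than they bias human reviews?
RQ3: What inherent flaws and biases (e.g., hallucination, length, authorship) do LLMs exhibit in a peer review setting?
RQ4: How can manipulated LLM reviews affect paper rankings and acceptance decisions?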

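Key Findings

Explicit manipulation sharply reduces the consistency between LLM reviews and human reviews (e.g., from 53.29 to 15.91).
When manipulative content is injected into a paper, LLM reviews match the injected content at a high rate (the Injection–LLM-Matched / Injection ratio rises to 92.49%).
Manipulating 5% of the reviews could push roughly 12% of papers out of the top 30% of the rankings.
LLM reviews are 4.5 times more consistent with author-disclosed limitations than human reviews are, indicating vulnerability to implicit manipulation.
LLMs hallucinate even on incomplete input (e.g., an empty paper) and may rate incomplete papers as highly as, or potentially higher than, full papers, which points to a lack of reliability when LLMs are used for review.
In single-blind settings, LLMs tend to favor well-known authors and institutions, raising fairness concerns.
LLM consistency with human reviews correlates with overall model capability (e.g., GPT-4o-0806 is the strongest of the models tested).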

This review was generated by AI and checked by a human editor.