QUICK REVIEW

[Paper Review] Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review

Rui Ye, Xianghe Pang | arXiv (Cornell University) | Dec 2, 2024

scientometrics and bibliometrics research | Cited by 7

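One-Sentence Summary

The paper shows that large language models used for scholarly peer review are susceptible to both explicit and implicit manipulation and exhibit inherent flaws and biases, so they are not yet ready for widespread adoption.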

ABSTRACT

Scholarly peer review is a cornerstone of scientific advancement, but the system is under strain due to increasing manuscript submissions and the labor-intensive nature of the process. Recent advancements in large language models (LLMs) have led to their integration into peer review, with promising results such as substantial overlaps between LLM- and human-generated reviews. However, the unchecked adoption of LLMs poses significant risks to the integrity of the peer review system. In this study, we comprehensively analyze the vulnerabilities of LLM-generated reviews by focusing on manipulation and inherent flaws. Our experiments show that injecting covert deliberate content into manuscripts allows authors to explicitly manipulate LLM reviews, leading to inflated ratings and reduced alignment with human reviews. In a simulation, we find that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings. Implicit manipulation, where authors strategically highlight minor limitations in their papers, further demonstrates LLMs' susceptibility compared to human reviewers, with a 4.5 times higher consistency with disclosed limitations. Additionally, LLMs exhibit inherent flaws, such as potentially assigning higher ratings to incomplete papers compared to full papers and favoring well-known authors in single-blind review process. These findings highlight the risks of over-reliance on LLMs in peer review, underscoring that we are not yet ready for widespread adoption and emphasizing the need for robust safeguards.

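Research Motivation and Objectives

Traditional peer review is under strain from rising submission volumes and the labor-intensive nature of reviewing.
Assess whether LLMs can reliably review scholarly manuscripts.
Identify the manipulation vectors, both explicit and implicit, that can influence LLM reviews.
Investigate inherent flaws and biases in LLM-based reviewing, such as hallucination, length bias, and authorship bias.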

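Proposed Method

Replicate three established LLM-based review pipelines that have been reported to align with human reviews.
Perform explicit manipulation by injecting invisible white text into manuscripts to steer LLM reviews toward acceptance.
Examine implicit manipulation by analyzing the limitations that authors choose to highlight and how they affect LLM reviews compared with human reviews.
Evaluate inherent flaws across multiple LLMs, including hallucination on incomplete content, length bias, and authorship bias.
Quantify the effects using LLM-human review consistency metrics and a rating-to-paper model that simulates the impact on decisions.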
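The decision-impact simulation mentioned in the last step can be illustrated with a small sketch. This summary does not describe the paper's actual model, so everything below is an assumption made for illustration: ratings are drawn uniformly from 1 to 10, each paper has three reviews, papers are ranked by mean rating, and a manipulated review is simply set to the maximum rating. The function name `simulate_displacement` and all of its parameters are hypothetical, and the numbers it produces are not expected to reproduce the paper's results.

```python
import random


def top_k_ids(scores, fraction):
    """Return the ids of the papers in the top `fraction` by score."""
    k = round(len(scores) * fraction)
    if k == 0:
        raise ValueError("top fraction selects no papers")
    ranked = sorted(scores, key=lambda pid: scores[pid], reverse=True)
    return set(ranked[:k])


def simulate_displacement(n_papers=1000, reviews_per_paper=3,
                          manipulated_fraction=0.05, top_fraction=0.30,
                          max_rating=10, seed=0):
    """Fraction of originally top-ranked papers that drop out of the top
    group after a share of reviews is inflated to the maximum rating.

    Assumptions (not from the paper): uniform random ratings, mean-score
    ranking, and manipulation modeled as a maximum rating.
    """
    rng = random.Random(seed)
    ratings = {
        pid: [rng.randint(1, max_rating) for _ in range(reviews_per_paper)]
        for pid in range(n_papers)
    }

    def mean_scores():
        return {pid: sum(r) / len(r) for pid, r in ratings.items()}

    before = top_k_ids(mean_scores(), top_fraction)

    # Pick individual reviews (not whole papers) to manipulate.
    slots = [(pid, i) for pid in range(n_papers)
             for i in range(reviews_per_paper)]
    n_manipulated = round(len(slots) * manipulated_fraction)
    for pid, i in rng.sample(slots, n_manipulated):
        ratings[pid][i] = max_rating

    after = top_k_ids(mean_scores(), top_fraction)
    displaced = before - after
    return len(displaced) / len(before)


if __name__ == "__main__":
    share = simulate_displacement()
    print(f"Share of top-30% papers displaced: {share:.1%}")
```

Ranking by mean rating is only one possible aggregation rule; swapping in a different rule or rating distribution changes the displaced share, which is why the sketch should be read as a way to reason about the mechanism rather than as an estimate.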

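Experimental Results

Research Questions

RQ1: Can LLM reviews be manipulated through hidden input in the manuscript so that they deviate from human judgment?
RQ2: Do author-disclosed limitations bias LLM reviews more readily than human reviews?
RQ3: What inherent flaws or biases (e.g., hallucination, length, authorship) do LLMs exhibit in a peer review setting?
RQ4: How do manipulated LLM reviews affect paper rankings and acceptance decisions?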

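Key Findings

Explicit manipulation can sharply reduce the consistency between LLM and human reviews (e.g., from 53.29 to 15.91).
Manipulative content injected into a manuscript can make the LLM review closely match the injected content (the Injection–LLM-Matched / Injection ratio rises to 92.49%).
Manipulating 5% of the reviews could cause roughly 12% of papers to fall out of the top 30% of the ranking.
LLM reviews are 4.5 times more consistent with author-disclosed limitations than human reviews are, indicating susceptibility to implicit manipulation.
LLMs may hallucinate when given incomplete input (such as a blank paper) and may rate incomplete papers similarly to complete ones, which shows how unreliable LLMs can be as reviewers.
In the single-blind setting, LLMs are biased in favor of well-known authors or institutions, which raises fairness concerns.
Consistency between LLM and human reviews correlates with overall model capability (e.g., GPT-4o-0806 was the strongest of the models tested).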
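The "Injection–LLM-Matched / Injection" figure above reads as a ratio: the share of injected points that also show up in the LLM review. A minimal sketch of that arithmetic follows. How the paper decides that two points match is not stated in this summary, so the matcher here (`normalized_equal`, a case- and whitespace-insensitive string comparison) is a stand-in assumption, and `matched_ratio` is a hypothetical helper name.

```python
def normalized_equal(a, b):
    """Stand-in matcher (an assumption): equal after lowercasing and
    collapsing whitespace."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def matched_ratio(injected_points, review_points, is_match=normalized_equal):
    """Share of injected points with at least one match in the review."""
    if not injected_points:
        raise ValueError("injected_points must not be empty")
    matched = sum(
        1 for point in injected_points
        if any(is_match(point, other) for other in review_points)
    )
    return matched / len(injected_points)


if __name__ == "__main__":
    injected = ["Novel method", "Strong results", "Clear writing"]
    review = ["novel   method", "clear writing", "Limited baselines"]
    # Two of the three injected points appear in the review.
    print(f"{matched_ratio(injected, review):.2%}")
```

Passing the matcher as a parameter keeps the ratio itself separate from the matching rule, so a semantic matcher could be substituted without changing the calculation.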


This review was generated by AI and checked by a human editor.