QUICK REVIEW

[Paper Review] ReviewerGPT? An Exploratory Study on Using Large Language Models for Paper Reviewing

Ryan Liu, Nihar B. Shah|arXiv (Cornell University)|Jun 1, 2023

Topic Modeling|Citations: 26

One-line summary

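This paper evaluates GPT-4 as a reviewing assistant on three tasks (error detection, checklist verification, and pairwise abstract comparison). It shows promise on specific tasks but cannot yet be applied to complete reviews. The paper also provides a small dataset, already evaluated with LLMs, for peer-review research.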

ABSTRACT

Given the rapid ascent of large language models (LLMs), we study the question: (How) can large language models help in reviewing of scientific papers or proposals? We first conduct some pilot studies where we find that (i) GPT-4 outperforms other LLMs (Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM), and (ii) prompting with a specific question (e.g., to identify errors) outperforms prompting to simply write a review. With these insights, we study the use of LLMs (specifically, GPT-4) for three tasks: 1. Identifying errors: We construct 13 short computer science papers each with a deliberately inserted error, and ask the LLM to check for the correctness of these papers. We observe that the LLM finds errors in 7 of them, spanning both mathematical and conceptual errors. 2. Verifying checklists: We task the LLM to verify 16 closed-ended checklist questions in the respective sections of 15 NeurIPS 2022 papers. We find that across 119 {checklist question, paper} pairs, the LLM had an 86.6% accuracy. 3. Choosing the "better" paper: We generate 10 pairs of abstracts, deliberately designing each pair in such a way that one abstract was clearly superior than the other. The LLM, however, struggled to discern these relatively straightforward distinctions accurately, committing errors in its evaluations for 6 out of the 10 pairs. Based on these experiments, we think that LLMs have a promising use as reviewing assistants for specific reviewing tasks, but not (yet) for complete evaluations of papers or proposals.

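Motivation and objectives

Examine whether LLMs can reduce the peer-review burden as the number of submissions grows.
Evaluate whether GPT-4 can identify errors in short papers that contain deliberately inserted flaws.
Evaluate whether the LLM can accurately verify author-provided submission checklists against the paper text.
Test whether the LLM can choose the better paper from a pair of contrasting abstracts.
Provide a small dataset that enables future evaluation of LLMs on reviewing tasks.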

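Proposed method

A pilot comparison of error detection across multiple LLMs (GPT-4, Bard, Vicuna, Koala, Alpaca, LLaMa, Dolly, OpenAssistant, StableLM).
Developed three prompting strategies (Prompt-Direct, Prompt-OneShot, Prompt-Parts) to elicit the targeted reviewing behavior.
Constructed 13 short CS papers, each containing a deliberately inserted error, to test GPT-4's error detection.
Evaluated 16 NeurIPS 2022 checklist questions on 15 papers (119 {checklist question, paper} pairs) to measure the LLM's verification accuracy.
Created 10 pairs of abstracts, each designed so that one abstract is clearly better, to test whether the LLM can choose the better paper.
Analyzed the results to identify the strengths, limitations, and potential role of LLMs in the reviewing workflow.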

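Experimental results

Research questions

RQ1: Can LLMs identify errors in artificially flawed CS papers?
RQ2: How accurate are LLMs at verifying author-provided submission checklists?
RQ3: Can LLMs consistently choose the better paper from a pair of abstracts?
RQ4: How much potential do LLMs have for assisting with specific reviewing tasks, short of producing a complete review?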

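Key findings

GPT-4 identified the error in 7 of the 13 short papers containing deliberately inserted flaws.
Across the 119 pairs (checklist question × paper), accuracy was 86.6% when each answer was taken as the majority vote of three responses.
The LLM struggled to consistently identify the better abstract, making errors on 6 of the 10 pairs.
The other models failed to identify the errors in the 13 papers, and some produced unhelpful critiques.
Prompting with a targeted question produces more useful reviews than asking for a complete review.
LLMs are promising as reviewing assistants for specific tasks, but they are not yet adequate for producing a comprehensive review on their own.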
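
The majority-vote accuracy reported for the checklist task can be illustrated with a minimal Python sketch. This is not the authors' code: the helper names `majority_vote` and `accuracy` and the toy answers are hypothetical, and only the figures 119 and 86.6% come from the paper. The last lines check that 86.6% of 119 pairs is consistent with 103 correct pairs, which is an inference from the reported numbers rather than a count stated in the source.

```python
from collections import Counter


def majority_vote(responses):
    """Return the most common answer among the sampled responses.

    With three binary answers there is never a tie.
    """
    return Counter(responses).most_common(1)[0][0]


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical toy data: three sampled answers for each
# {checklist question, paper} pair, plus a ground-truth label.
sampled = [
    ["yes", "yes", "no"],
    ["no", "no", "no"],
    ["yes", "no", "no"],
    ["yes", "yes", "yes"],
]
labels = ["yes", "no", "yes", "yes"]

voted = [majority_vote(r) for r in sampled]
toy_accuracy = accuracy(voted, labels)

# Consistency check on the reported figures (an inference, not a
# count given in the source): 86.6% of 119 pairs rounds to 103.
implied_correct = round(0.866 * 119)
implied_accuracy = round(implied_correct / 119 * 100, 1)
```

Voting over several sampled responses smooths out run-to-run variation in the model's answers, so the accuracy reflects the model's typical judgment rather than a single sample.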


This review was generated by AI and checked by a human editor.