QUICK REVIEW

[Paper Review] AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews

Keith Tyser, Ben Segev | arXiv (Cornell University) | 2024. 08. 19.

Scientific Computing and Data Management | Citations: 7

One-Line Summary

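This paper presents three LLM-based review systems (OpenReviewer, Papers with Reviews, Reviewer Arena) and four evaluation methods that assess their alignment with human preferences, their biases, and their limitations in scalable academic reviewing.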

ABSTRACT

Automatic reviewing helps handle a large volume of papers, provides early feedback and quality control, reduces bias, and allows the analysis of trends. We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons. Gathering human preference may be time-consuming; therefore, we also use an LLM to automatically evaluate reviews to increase sample efficiency while reducing bias. In addition to evaluating human and LLM preferences among LLM reviews, we fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs. We artificially introduce errors into papers and analyze the LLM's responses to identify limitations, use adaptive review questions, meta prompting, role-playing, integrate visual and textual analysis, use venue-specific reviewing materials, and predict human preferences, improving upon the limitations of the traditional review processes. We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality. This work develops proof-of-concept LLM reviewing systems that quickly deliver consistent, high-quality reviews and evaluate their quality. We mitigate the risks of misuse, inflated review scores, overconfident ratings, and skewed score distributions by augmenting the LLM with multiple documents, including the review form, reviewer guide, code of ethics and conduct, area chair guidelines, and previous year statistics, by finding which errors and shortcomings of the paper may be detected by automated reviews, and evaluating pairwise reviewer preferences. This work identifies and addresses the limitations of using LLMs as reviewers and evaluators and enhances the quality of the reviewing process.

Research Motivation and Goals

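Motivate the need for scalable, foundation-model-based reviewing that reduces bias while maintaining quality control.
Develop and deploy three review systems to generate, collect, and evaluate reviews of arXiv and open-access Nature papers.
Evaluate the alignment between LLM reviews and human reviews using human preferences, automatic LLM evaluation, and preference prediction.
Identify the limitations and potential risks of LLM-based reviewing and propose mitigation strategies.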

Proposed Method

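Three review systems: OpenReviewer (LLM-assisted reviewing), Papers with Reviews (large-scale review collection and scoring), and Reviewer Arena (pairwise comparison of reviews).
Four evaluation methods: anonymized human evaluation, automatic LLM evaluation, automatic prediction of human preferences by a fine-tuned LLM, and automatic discovery of the limitations of LLM reviews through deliberate modification of papers.
LLM role-playing simulates the human editorial process, with the roles of author, reviewer, area chair, and program chair.
Multiple documents (review form, guidelines, code of ethics, statistics) are supplied as context to calibrate LLM reviews and align them with venue norms.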
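
The fourth evaluation method, discovering limitations by deliberately modifying papers, can be pictured as a small test harness. The sketch below is illustrative only and is not the authors' implementation: `InjectedError`, `inject`, `detection_report`, and the keyword-based detection rule are assumptions, and `toy_reviewer` is a stand-in for a real LLM call that notices only percentages above 100.

```python
import re
from typing import Callable, Dict, List, NamedTuple


class InjectedError(NamedTuple):
    """One deliberate corruption of a paper (hypothetical structure)."""
    name: str       # label for the error type
    original: str   # text expected to appear in the clean paper
    corrupted: str  # text that replaces it
    cue: str        # assumption: a review "detects" the error if it contains this cue


def inject(paper: str, error: InjectedError) -> str:
    """Return a copy of the paper with the first occurrence of the target text corrupted."""
    if error.original not in paper:
        raise ValueError(f"target text for {error.name!r} not found in paper")
    return paper.replace(error.original, error.corrupted, 1)


def detection_report(
    paper: str,
    errors: List[InjectedError],
    reviewer: Callable[[str], str],
) -> Dict[str, bool]:
    """Review each corrupted copy and record whether the review mentions the cue."""
    report = {}
    for error in errors:
        review = reviewer(inject(paper, error))
        report[error.name] = error.cue.lower() in review.lower()
    return report


def toy_reviewer(paper: str) -> str:
    """Stand-in for an LLM reviewer: flags only percentages above 100."""
    values = re.findall(r"(\d+(?:\.\d+)?)%", paper)
    impossible = [v for v in values if float(v) > 100]
    if impossible:
        return f"The reported value {impossible[0]}% is not a valid percentage."
    return "No issues found."


# Toy paper text and toy errors, invented for this illustration.
paper = "We evaluate on 500 papers. Our model reaches 91% accuracy. Code is released."
errors = [
    InjectedError("impossible_percentage", "91%", "191%", "191%"),
    InjectedError("sample_size_mismatch", "500 papers", "5 papers", "sample size"),
]
report = detection_report(paper, errors, toy_reviewer)
print(report)  # {'impossible_percentage': True, 'sample_size_mismatch': False}
```

A report like this shows which kinds of injected errors a reviewer notices and which it misses. Swapping `toy_reviewer` for a function that calls an actual model would keep the rest of the harness unchanged, though a keyword match is a crude detection criterion and a real study would likely need a more careful judgment of each review.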

Figure 1: OpenReviewer: A user uploads their paper, which is automatically reviewed, and receives the review along with instructions for revision. The user may provide feedback and upload a revised version.

Experimental Results

Research Questions

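RQ1: Can LLM-generated reviews align with human preferences in blind evaluations and GPT-4-based comparisons?
RQ2: What are the strengths and limitations of LLMs as academic reviewers across fixed, adaptive, and generated review prompts?
RQ3: How can pairwise preference data, Bradley-Terry (BT) modeling, and automatic evaluation approaches quantify reviewer quality and ranking?
RQ4: What biases and errors do LLM-based reviews exhibit, and how can they be mitigated through prompting, context, and post-processing?
RQ5: How do venue-specific guidelines and supplementary materials affect the quality and reliability of automatic reviews?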

Key Findings

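In blind evaluations and GPT-4-based comparisons, LLM reviews align reasonably well with human reviews, and some models outperform human reviewers in certain settings.
GPT-4 Turbo (2024-04-09) ranked first among the five reviewers in the human preference test; the human reviewer ranked second, followed by the remaining LLMs.
Bradley-Terry modeling yields a strength ranking of the reviewers: GPT-4 Turbo first, then Human, then Command R+, with Claude 3 Opus and Gemini Pro trailing.
PPI-based automatic evaluation can reduce reliance on human data and improve the efficiency of preference prediction.
Automatically discovering limitations by introducing errors into papers helps characterize how sensitive LLMs are to specific content types and shortcomings.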
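
To make the Bradley-Terry ranking concrete: the model assigns each reviewer i a positive strength p_i and sets the probability that i is preferred over j to p_i / (p_i + p_j). The sketch below fits these strengths from pairwise outcomes with the standard minorization-maximization update. It is a minimal illustration, not the paper's code: the function names are hypothetical, and the battle data are invented toy outcomes among three anonymous reviewers, not results from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def fit_bradley_terry(
    battles: Iterable[Tuple[str, str]], iterations: int = 200
) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    then rescales the strengths so they sum to 1.
    """
    battles = [(w, l) for w, l in battles if w != l]  # ignore self-comparisons
    players = sorted({name for pair in battles for name in pair})
    wins = Counter(winner for winner, _ in battles)
    games = Counter(frozenset(pair) for pair in battles)  # comparisons per pair

    strength = {p: 1.0 / len(players) for p in players}
    for _ in range(iterations):
        updated = {}
        for p in players:
            denom = 0.0
            for pair, n in games.items():
                if p in pair:
                    (other,) = pair - {p}
                    denom += n / (strength[p] + strength[other])
            updated[p] = wins[p] / denom
        total = sum(updated.values())
        strength = {p: s / total for p, s in updated.items()}
    return strength


def win_probability(strength: Dict[str, float], a: str, b: str) -> float:
    """Model probability that reviewer a is preferred over reviewer b."""
    return strength[a] / (strength[a] + strength[b])


# Toy outcomes (invented): each pair meets four times, split 3 to 1.
battles = (
    [("reviewer_a", "reviewer_b")] * 3 + [("reviewer_b", "reviewer_a")]
    + [("reviewer_b", "reviewer_c")] * 3 + [("reviewer_c", "reviewer_b")]
    + [("reviewer_a", "reviewer_c")] * 3 + [("reviewer_c", "reviewer_a")]
)
strength = fit_bradley_terry(battles)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)  # ['reviewer_a', 'reviewer_b', 'reviewer_c']
```

Sorting reviewers by fitted strength gives a ranking of the kind reported above. Note that a reviewer who never wins is driven to strength 0 by this update, so real arena data usually needs enough comparisons per reviewer, or some regularization, for the estimates to be meaningful.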

Figure 2: Papers with Reviews: Our system collects papers from arXiv and open-access Nature journals, reviews, ranks, and displays their title, authors, abstract, review, and review score, linking back to the papers on arXiv and Nature. Users provide feedback on the reviews, which is then used to im

This review was generated by AI and checked by a human editor.