QUICK REVIEW

[Paper Review] News Verifiers Showdown: A Comparative Performance Evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in News Fact-Checking

Kevin Matthe Caramancion|arXiv (Cornell University)|2023. 06. 18.

Artificial Intelligence in Healthcare and Education | Citations: 13

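One-Line Summary

This paper evaluates four major LLMs (GPT-3.5, GPT-4, Bard, and Bing AI) on 100 fact-checked news items, classifies each response as True, False, or Partially True/False, and compares the classifications against independent verification.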

ABSTRACT

This study aimed to evaluate the proficiency of prominent Large Language Models (LLMs), namely OpenAI's ChatGPT 3.5 and 4.0, Google's Bard(LaMDA), and Microsoft's Bing AI in discerning the truthfulness of news items using black box testing. A total of 100 fact-checked news items, all sourced from independent fact-checking agencies, were presented to each of these LLMs under controlled conditions. Their responses were classified into one of three categories: True, False, and Partially True/False. The effectiveness of the LLMs was gauged based on the accuracy of their classifications against the verified facts provided by the independent agencies. The results showed a moderate proficiency across all models, with an average score of 65.25 out of 100. Among the models, OpenAI's GPT-4.0 stood out with a score of 71, suggesting an edge in newer LLMs' abilities to differentiate fact from deception. However, when juxtaposed against the performance of human fact-checkers, the AI models, despite showing promise, lag in comprehending the subtleties and contexts inherent in news information. The findings highlight the potential of AI in the domain of fact-checking while underscoring the continued importance of human cognitive skills and the necessity for persistent advancements in AI capabilities. Finally, the experimental data produced from the simulation of this work is openly available on Kaggle.

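Research Motivation and Objectives

Evaluate, through black box testing, how well state-of-the-art LLMs distinguish truth from deception in news items.
Compare four major LLMs against independently verified fact-checks.
Quantify the overall accuracy of AI-based fact-checking and its contextual strengths and weaknesses.
Provide open data (on Kaggle) for reproducibility.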

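Proposed Method

Black box evaluation of four LLMs on 100 fact-checked news items from independent agencies.
Responses are classified into three categories: True, False, and Partially True/False.
Accuracy is measured as agreement with the independent verification.
The experimental data is publicly available on Kaggle.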
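
The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names `score_model` and `confusion_counts` and the five-item toy data are hypothetical, and the only things taken from the paper are the three labels and the idea of counting agreement with the fact-checkers' verdicts.

```python
from collections import Counter

# The three categories used in the study.
LABELS = ("True", "False", "Partially True/False")


def score_model(predictions, ground_truth):
    """Return how many predicted labels agree with the fact-checker labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    for label in list(predictions) + list(ground_truth):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    return sum(p == g for p, g in zip(predictions, ground_truth))


def confusion_counts(predictions, ground_truth):
    """Count (verified label, model label) pairs to show where a model errs."""
    return Counter(zip(ground_truth, predictions))


# Hypothetical toy data: five items instead of the study's 100.
truth = ["True", "False", "Partially True/False", "False", "True"]
preds = ["True", "False", "False", "False", "Partially True/False"]

score = score_model(preds, truth)
accuracy_pct = 100 * score / len(truth)
confusion = confusion_counts(preds, truth)
```

With 100 items, as in the study, the raw count of matches is already a score out of 100, which is presumably why the paper can report scores such as 71 directly.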

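Experimental Results

Research Questions

RQ1: How accurately can each model classify news items as True, False, or Partially True/False?
RQ2: Which model performs best overall in this setting?
RQ3: How does the performance of the AI models compare with that of human fact-checkers on this dataset?
RQ4: What are the limitations and contexts in which AI models struggle in news fact-checking?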

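Key Findings

The average score across the models is 65.25 out of 100.
GPT-4.0 achieves the highest score, 71.
All models show moderate proficiency and lag behind human fact-checkers in grasping subtlety and context.
AI shows potential for fact-checking, but continued improvement in AI capabilities and human oversight are needed.
The study's experimental data is publicly available on Kaggle.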
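
The two reported figures allow a small consistency check. The sketch below assumes the 65.25 average is the unweighted mean of the four models' scores, each out of 100. This review does not list the individual scores of GPT-3.5, Bard, and Bing AI, so only their combined total and mean are derived here; the variable names are illustrative.

```python
NUM_MODELS = 4
REPORTED_MEAN = 65.25   # average score stated in the abstract
GPT4_SCORE = 71         # top score stated in the abstract

# Total correct classifications across all four models (out of 400).
total_correct = REPORTED_MEAN * NUM_MODELS

# Combined score of the three remaining models, and their mean.
others_total = total_correct - GPT4_SCORE
others_mean = others_total / (NUM_MODELS - 1)

# How far GPT-4.0 sits above the mean of the other three models.
gap = GPT4_SCORE - others_mean

print(f"total correct: {total_correct:.0f} / {NUM_MODELS * 100}")
print(f"other models: total {others_total:.0f}, mean {others_mean:.2f}")
print(f"GPT-4.0 lead over that mean: {gap:.2f} points")
```

Under that assumption, the other three models together account for 190 correct classifications, an average of about 63.33 each, so GPT-4.0's lead is roughly 7.7 points.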

This review was generated by AI and reviewed by a human editor.