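[Paper Review] News Verifiers Showdown: A Comparative Performance Evaluation of ChatGPT 3.5, ChatGPT 4.0, Bing AI, and Bard in News Fact-Checking
The paper evaluates four well-known large language models (GPT-3.5, GPT-4, Bard, and Bing AI) on 100 fact-checked news items, classifying each response as True, False, or Partially True/False and comparing it against independent verification.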
This study aimed to evaluate the proficiency of prominent Large Language Models (LLMs), namely OpenAI's ChatGPT 3.5 and 4.0, Google's Bard(LaMDA), and Microsoft's Bing AI in discerning the truthfulness of news items using black box testing. A total of 100 fact-checked news items, all sourced from independent fact-checking agencies, were presented to each of these LLMs under controlled conditions. Their responses were classified into one of three categories: True, False, and Partially True/False. The effectiveness of the LLMs was gauged based on the accuracy of their classifications against the verified facts provided by the independent agencies. The results showed a moderate proficiency across all models, with an average score of 65.25 out of 100. Among the models, OpenAI's GPT-4.0 stood out with a score of 71, suggesting an edge in newer LLMs' abilities to differentiate fact from deception. However, when juxtaposed against the performance of human fact-checkers, the AI models, despite showing promise, lag in comprehending the subtleties and contexts inherent in news information. The findings highlight the potential of AI in the domain of fact-checking while underscoring the continued importance of human cognitive skills and the necessity for persistent advancements in AI capabilities. Finally, the experimental data produced from the simulation of this work is openly available on Kaggle.
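Research Motivation and Objectives
- Evaluate how well leading LLMs distinguish truth from deception in news items, using black-box testing.
- Compare the four LLMs against fact-checks verified by independent agencies.
- Quantify the overall accuracy of AI-based fact-checking, along with its contextual strengths and weaknesses.
- Release the data openly on Kaggle to support reproducibility.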
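Proposed Method
- Black-box evaluation of four LLMs on 100 fact-checked news items sourced from independent fact-checking agencies.
- Each response is classified as True, False, or Partially True/False.
- Accuracy is measured as agreement with the independent agencies' verified verdicts.
- The experimental data is publicly available on Kaggle.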
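The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes one point per item whose label exactly matches the fact-checkers' verdict, and the function names and the four-item toy data are hypothetical, invented only to show the mechanics.

```python
from collections import Counter

# The three categories used in the study.
LABELS = ("True", "False", "Partially True/False")


def score_model(predictions, ground_truth):
    """Count items where the model's label matches the fact-checkers' verdict.

    Assumption: one point per exact match; the summary does not say
    whether partial credit was awarded.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    for label in list(predictions) + list(ground_truth):
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    return sum(p == g for p, g in zip(predictions, ground_truth))


def confusion_counts(predictions, ground_truth):
    """Tally (verified label, model label) pairs to see where errors cluster."""
    return Counter(zip(ground_truth, predictions))


# Hypothetical toy data, not taken from the paper's dataset.
truth = ["True", "False", "Partially True/False", "False"]
preds = ["True", "False", "False", "False"]

score = score_model(preds, truth)
accuracy = score / len(truth)
confusion = confusion_counts(preds, truth)
```

With 100 items, as in the study, the raw match count and the percentage accuracy coincide, which is why a score such as 71 can be read as 71%. The confusion tally is an optional extra that shows, for example, how often a Partially True/False item was labeled plainly False.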
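Experimental Results
Research Questions
- RQ1: How accurately does each model classify news items as True, False, or Partially True/False?
- RQ2: Which model performs best overall in this setting?
- RQ3: How does the AI models' performance compare with that of human fact-checkers on this dataset?
- RQ4: Under what limitations and in which contexts do AI models struggle with news fact-checking?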
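Key Findings
- The average score across all four models is 65.25 out of 100.
- GPT-4.0 achieved the highest score, 71.
- All models showed moderate proficiency and lagged behind human fact-checkers in grasping nuance and context.
- AI shows promise for fact-checking, but it needs continued capability improvements and human oversight.
- The experimental data from the study is publicly available on Kaggle.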
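The two reported figures also constrain the scores that this summary does not list. The short calculation below is a sketch that assumes 65.25 is an unweighted mean over the four models; it derives only the combined total and the mean of the other three models, not any individual score.

```python
NUM_MODELS = 4        # GPT-3.5, GPT-4.0, Bard, Bing AI
MEAN_SCORE = 65.25    # reported average, out of 100
GPT4_SCORE = 71       # reported best score

# Sum of all four scores implied by the mean.
total = NUM_MODELS * MEAN_SCORE

# Combined score of the three models other than GPT-4.0.
others_total = total - GPT4_SCORE

# Their mean; individual values cannot be recovered from these figures.
others_mean = others_total / (NUM_MODELS - 1)

print(f"total={total}, others_total={others_total}, others_mean={others_mean:.2f}")
```

Under that assumption the other three models average roughly 63.3, so GPT-4.0's lead over them is under eight points on average.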
This review was generated by AI and checked by a human editor.