QUICK REVIEW

[Paper Review] Accuracy and Political Bias of News Source Credibility Ratings by Large Language Models

Kai‐Cheng Yang, Filippo Menczer|arXiv (Cornell University)|Apr 1, 2023

Misinformation and Its Impacts | Cited by 41

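One-Sentence Summary

ChatGPT can rate the credibility of news outlets, and its ratings align moderately with those of human experts (ρ = 0.54); in binary classification it reaches AUC = 0.89, and it handles multilingual and satirical domains at low cost (about $3 for 7,523 domains).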

ABSTRACT

Search engines increasingly leverage large language models (LLMs) to generate direct answers, and AI chatbots now access the Internet for fresh data. As information curators for billions of users, LLMs must assess the accuracy and reliability of different sources. This paper audits nine widely used LLMs from three leading providers -- OpenAI, Google, and Meta -- to evaluate their ability to discern credible and high-quality information sources from low-credibility ones. We find that while LLMs can rate most tested news outlets, larger models more frequently refuse to provide ratings due to insufficient information, whereas smaller models are more prone to making errors in their ratings. For sources where ratings are provided, LLMs exhibit a high level of agreement among themselves (average Spearman's $ρ= 0.79$), but their ratings align only moderately with human expert evaluations (average $ρ= 0.50$). Analyzing news sources with different political leanings in the US, we observe a liberal bias in credibility ratings yielded by all LLMs in default configurations. Additionally, assigning partisan roles to LLMs consistently induces strong politically congruent bias in their ratings. These findings have important implications for the use of LLMs in curating news and political information.

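Research Motivation and Objectives

Assess whether ChatGPT can rate the credibility of a large number of news outlets.
Quantify the agreement between ChatGPT's ratings and human expert judgments (Lin et al., MBFC, NewsGuard).
Evaluate performance on non-English and satirical domains.
Discuss the implications of using large language models in misinformation research and media literacy.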

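Proposed Method

Compile a subset of 7,523 news domains from the Tranco popularity list, and prompt ChatGPT in a zero-shot setting to rate each domain's credibility on a 0–1 scale.
Use the OpenAI API model gpt-3.5-turbo-0301 with temperature set to 0, plus additional JSON formatting instructions, to obtain the domain ratings.
Process the 7,523 domains with five parallel processes, taking roughly 2 hours at a cost of about $3.
Rescale the human expert ratings (the Lin et al. aggregate, MBFC, NewsGuard) to 0–1 for comparison.
Measure agreement using Spearman's ρ, and evaluate binary classification performance using AUC and F1 scores.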
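The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the prompt wording, the use of -1 for unknown domains, and the helper names (`build_prompt`, `parse_rating`, `rescale`, `spearman_rho`) are all assumptions, since the summary does not reproduce the actual prompt or pipeline. The API call itself is left out; the sketch covers only prompt construction, parsing of the JSON reply, rescaling of expert scores, and the rank correlation.

```python
import json

# Hypothetical prompt wording; the summary does not give the actual prompt.
# The "-1 if unknown" convention is likewise an assumption of this sketch.
PROMPT_TEMPLATE = (
    "Rate the credibility of the website {domain} on a scale from 0 to 1, "
    "where 0 means very low credibility and 1 means very high credibility. "
    'Answer in JSON: {{"url": "<domain>", "rating": <number, or -1 if unknown>}}'
)


def build_prompt(domain):
    """Fill the (assumed) zero-shot prompt for one domain."""
    return PROMPT_TEMPLATE.format(domain=domain)


def parse_rating(raw):
    """Return a float in [0, 1], or None when the reply is unusable."""
    try:
        rating = float(json.loads(raw)["rating"])
    except (ValueError, KeyError, TypeError):
        return None
    return rating if 0.0 <= rating <= 1.0 else None


def rescale(values, low, high):
    """Min-max map expert scores from [low, high] onto [0, 1]."""
    return [(v - low) / (high - low) for v in values]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up replies and expert scores (on an assumed 0-100 scale), for illustration only.
replies = ['{"url": "a.example", "rating": 0.9}',
           '{"url": "b.example", "rating": 0.4}',
           '{"url": "c.example", "rating": 0.1}']
llm = [parse_rating(r) for r in replies]
expert = rescale([95, 30, 40], low=0, high=100)
print(spearman_rho(llm, expert))
```

Because Spearman's ρ depends only on ranks, the rescaling step does not change the correlation itself; it matters when ratings from different sources are compared on a common threshold.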

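Experimental Results

Research Questions

RQ1: Can ChatGPT rate the credibility of a large number of news outlets in a zero-shot setting?
RQ2: How well do ChatGPT's ratings agree with human expert judgments, including on multilingual and satirical sources?
RQ3: Can ChatGPT's ratings serve as an effective classifier for identifying low-credibility news outlets?
RQ4: How does performance differ between English-language and non-English domains (including satirical websites)?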

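Key Findings

ChatGPT rated 7,282 of the 7,523 domains; the remaining 241 returned errors because the model lacked information about them.
ChatGPT's ratings correlate moderately with human expert ratings (Spearman ρ = 0.54, p < 0.001).
Against the binary labels from NewsGuard and MBFC, ChatGPT achieves an AUC of 0.89.
The best F1 scores occur at a threshold near 0.5 (about 0.73 against NewsGuard and 0.63 against MBFC).
Correlation varies by language: ratings for English-language outlets correlate significantly with NewsGuard (ρ ≈ 0.51) and MBFC (ρ ≈ 0.60 overall), and ratings for non-English outlets correlate significantly as well (e.g., MBFC non-English ρ ≈ 0.65; Italian ρ ≈ 0.38).
ChatGPT shows some ability to identify satirical websites (a recognition rate of 77.4% on the MBFC satire list) and can support its answers with contextual reasoning.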
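To make the AUC and F1-threshold findings concrete, here is a small Python sketch of how such numbers can be computed from ratings and binary expert labels. It is illustrative only: the toy data are made up, the function names are hypothetical, and treating "low credibility" as the positive class (predicted whenever the rating falls below the threshold) is an assumption of this sketch, not something the summary specifies.

```python
def auc_low_cred(ratings, is_low):
    """Probability that a random low-credibility source receives a lower
    rating than a random other source (ties count as half)."""
    low = [r for r, y in zip(ratings, is_low) if y]
    rest = [r for r, y in zip(ratings, is_low) if not y]
    wins = sum((n > p) + 0.5 * (n == p) for p in low for n in rest)
    return wins / (len(low) * len(rest))


def f1_at(ratings, is_low, threshold):
    """F1 for the low-credibility class when predicting rating < threshold."""
    pred = [r < threshold for r in ratings]
    tp = sum(1 for p, y in zip(pred, is_low) if p and y)
    fp = sum(1 for p, y in zip(pred, is_low) if p and not y)
    fn = sum(1 for p, y in zip(pred, is_low) if not p and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(ratings, is_low, steps=101):
    """Grid-search the threshold in [0, 1] that maximizes F1."""
    candidates = [i / (steps - 1) for i in range(steps)]
    return max(candidates, key=lambda t: f1_at(ratings, is_low, t))


# Made-up ratings and labels (1 = low credibility), for illustration only.
ratings = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
is_low = [0, 0, 0, 1, 1, 1]

print(auc_low_cred(ratings, is_low))
print(f1_at(ratings, is_low, 0.5))
print(best_threshold(ratings, is_low))
```

AUC is threshold-free, which is why a single value can summarize the classifier, whereas F1 must be reported at a chosen threshold and can differ across label sources.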


This review was generated by AI and reviewed by human editors.