QUICK REVIEW

[Paper Review] The simulation of judgment in LLMs

Edoardo Loru, Jacopo Nudo|arXiv (Cornell University)|Feb 6, 2025

Law, AI, and Intellectual Property | Cited by 3

One-Sentence Summary

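The paper studies how large language models judge news credibility and bias, compares their outputs with expert standards, analyzes linguistic markers, and introduces an agentic workflow to study the decision process.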

ABSTRACT

Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check--and against human judgments collected through a controlled experiment. We use news domains purely as a controlled benchmark for evaluative tasks, focusing on the underlying mechanisms rather than on news classification per se. To enable direct comparison, we implement a structured agentic framework in which both models and nonexpert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, our findings show consistent differences in the observable criteria guiding model evaluations, suggesting that lexical associations and statistical priors could influence evaluations in ways that differ from contextual reasoning. This reliance is associated with systematic effects: political asymmetries and a tendency to confuse linguistic form with epistemic reliability--a dynamic we term epistemia, the illusion of knowledge that emerges when surface plausibility replaces verification. Indeed, delegating judgment to such systems may affect the heuristics underlying evaluative processes, suggesting a shift from normative reasoning toward pattern-based approximation and raising open questions about the role of LLMs in evaluative processes.

Research Motivation and Objectives

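Benchmark how state-of-the-art LLMs encode credibility and political orientation relative to expert assessments.
Compare LLM classifications with NewsGuard and MBFC over a large set of domains.
Identify the linguistic markers and keywords that drive LLM reliability judgments.
Use an agentic workflow to explore whether LLMs rely on internal priors or external information when making judgments.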

Proposed Method

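Evaluate homepage content from 2,302 domains with three LLMs (Gemini 1.5 Flash, GPT-4o mini, LLaMA 3.1 405B) using zero-shot, closed-book prompting.
Compare LLM outputs against the NewsGuard and MBFC expert benchmarks for reliability and political orientation.
Analyze keywords: classification, determinant, and summary keywords, along with their rank-frequency distributions.
Explore an agentic workflow in which LLMs retrieve external information and interact with other models to refine their judgments.
Evaluate performance when only the domain URL is given as the prompt, to separate content-based effects from prior-knowledge effects.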
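
The rank-frequency analysis of keywords mentioned above can be illustrated with a minimal sketch. It assumes each domain evaluation yields a list of keyword strings; the function name `rank_frequency` and the sample keywords are hypothetical and are not taken from the paper.

```python
from collections import Counter

def rank_frequency(keyword_lists):
    """Return (rank, keyword, count) tuples, most frequent keyword first.

    keyword_lists: one list of keyword strings per evaluated domain
    (an assumed input format, not the paper's actual data layout).
    """
    # Normalize case and whitespace so "Neutral tone" and "neutral tone" merge.
    counts = Counter(
        kw.strip().lower() for kws in keyword_lists for kw in kws
    )
    # most_common() sorts by descending count.
    return [
        (rank, kw, n)
        for rank, (kw, n) in enumerate(counts.most_common(), start=1)
    ]

# Hypothetical keywords for three domains.
sample = [
    ["Neutral tone", "local news"],
    ["sensationalism", "local news"],
    ["neutral tone", "local news"],
]
table = rank_frequency(sample)
for rank, kw, n in table:
    print(rank, kw, n)
```

Plotting count against rank on log-log axes is the usual way to inspect such a distribution; that step is omitted here.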

Experimental Results

Research Questions

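RQ1: How do state-of-the-art LLMs classify credibility and political orientation relative to expert benchmarks?
RQ2: Which linguistic markers and keywords drive LLM credibility judgments?
RQ3: Do LLM classifications of reliability and political orientation agree with expert assessments, including in their misclassification patterns?
RQ4: Can an agentic information-retrieval workflow reveal how LLMs arrive at credibility judgments, and whether they rely on external data or internal priors?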

Key Findings

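LLMs accurately identify unreliable sources, with agreement ranging from 85% to 97% across models; the classification of reliable sources is more variable, especially for GPT-4o mini.
On MBFC's credibility levels, the models agree on Low and High in more than 90% of cases, but the classification of medium-credibility sources is unstable (GPT-4o mini and LLaMA 3.1 tend to label them unreliable).
Right-leaning outlets are more often misclassified as unreliable, while center and left-leaning outlets are more often rated reliable by the models.
Keyword analysis shows that reliable domains are typically associated with neutral or transparent language and objective reporting, while unreliable domains are associated with sensationalism and bias; determinant keywords link local news to reliability and politicized terms to unreliability.
An agentic workflow suggests that models can improve their judgments by gathering external information; the criteria they use are largely consistent across the reliable and unreliable groups but vary with political orientation.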
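
The per-class agreement figures and the political asymmetry reported above could be computed along the following lines. This is a sketch under stated assumptions, not the authors' pipeline: the label strings, the record layout `(orientation, expert_label, model_label)`, and both function names are hypothetical, and the sample records are invented for illustration.

```python
from collections import defaultdict

def per_class_agreement(expert, model):
    """For each expert class, the fraction of domains where the model agrees."""
    total = defaultdict(int)
    match = defaultdict(int)
    for e, m in zip(expert, model):
        total[e] += 1
        match[e] += int(e == m)
    return {c: match[c] / total[c] for c in total}

def false_unreliable_rate(records):
    """Among expert-reliable domains, the share the model labels unreliable,
    grouped by political orientation."""
    total = defaultdict(int)
    wrong = defaultdict(int)
    for orientation, expert, model in records:
        if expert == "reliable":
            total[orientation] += 1
            wrong[orientation] += int(model == "unreliable")
    return {o: wrong[o] / total[o] for o in total}

# Invented records: (orientation, expert label, model label).
records = [
    ("right", "reliable", "unreliable"),
    ("right", "reliable", "reliable"),
    ("left", "reliable", "reliable"),
    ("center", "reliable", "reliable"),
    ("right", "unreliable", "unreliable"),
]
agreement = per_class_agreement(
    [r[1] for r in records], [r[2] for r in records]
)
asymmetry = false_unreliable_rate(records)
print(agreement)
print(asymmetry)
```

Reporting agreement per expert class, rather than a single overall accuracy, keeps a large and easily detected class from masking weaker performance on the other.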


This review was generated by AI and reviewed by a human editor.