QUICK REVIEW

[Paper Review] Assessing Large Language Models on Climate Information

Jannis Bulian, Mike S. Schäfer|arXiv (Cornell University)|Oct 4, 2023

Topic Modeling | Cited by 17

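One-Sentence Summary

The paper proposes a principled evaluation framework for LLMs on climate information that distinguishes presentational adequacy from epistemological adequacy. Results show that the models are fluent but lag in content quality, particularly in accuracy, completeness, and uncertainty.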

ABSTRACT

As Large Language Models (LLMs) rise in popularity, it is necessary to assess their capability in critically relevant domains. We present a comprehensive evaluation framework, grounded in science communication research, to assess LLM responses to questions about climate change. Our framework emphasizes both presentational and epistemological adequacy, offering a fine-grained analysis of LLM generations spanning 8 dimensions and 30 issues. Our evaluation task is a real-world example of a growing number of challenging problems where AI can complement and lift human performance. We introduce a novel protocol for scalable oversight that relies on AI Assistance and raters with relevant education. We evaluate several recent LLMs on a set of diverse climate questions. Our results point to a significant gap between surface and epistemological qualities of LLMs in the realm of climate communication.

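Research Motivation and Objectives

Develop a science-communication-oriented framework for evaluating how LLMs perform on climate information.
Assess both how information is presented (presentational adequacy) and how accurately it reflects scientific knowledge (epistemological adequacy).
Provide a scalable human-AI collaboration protocol (AI Assistance) that improves rating quality with raters who have relevant education.
Compare several recent LLMs to identify strengths and limitations in climate communication.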

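Proposed Method

Define eight evaluation dimensions covering presentational and epistemological aspects, used to identify up to 30 distinct issues in model outputs.
Assemble a 300-question dataset from Wikipedia-derived prompts, Skeptical Science myths, and Google Trends questions.
Have LLMs (primarily GPT-4) answer each question in a 3-4 sentence paragraph, then extract key points and supporting evidence.
Collect ratings from educated non-expert raters, supported by AI Assistance, after a short tutorial and a qualification step.
Analyze ratings across models (e.g., GPT-4, ChatGPT-3.5, InstructGPT variants, PaLM2, Falcon-180B-Chat) to assess presentational and epistemological performance.
Explore attribution-based evaluation (AIS) to examine whether attribution to cited sources aligns with epistemological quality.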
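
The split of the eight dimensions into two groups can be sketched as a small data model. This is a minimal illustration, not the authors' code: the dimension names follow the research questions in this review, while the `Rating` type, the `mean_by_group` helper, the 1-5 score scale, and the example ratings are all assumptions introduced here.

```python
from dataclasses import dataclass
from statistics import mean

# Dimension names as listed in this review ("correctness" stands for
# linguistic correctness). The grouping follows the framework's
# presentational / epistemological split.
PRESENTATIONAL = ("style", "clarity", "correctness", "tone")
EPISTEMOLOGICAL = ("accuracy", "specificity", "completeness", "uncertainty")


@dataclass(frozen=True)
class Rating:
    """One rater's score for one answer on one dimension (hypothetical schema)."""
    answer_id: str
    rater_id: str
    dimension: str
    score: int  # assumed 1-5 scale; this review does not state the scale


def mean_by_group(ratings):
    """Average scores separately over the two dimension groups."""
    groups = {"presentational": PRESENTATIONAL, "epistemological": EPISTEMOLOGICAL}
    result = {}
    for name, dims in groups.items():
        scores = [r.score for r in ratings if r.dimension in dims]
        result[name] = mean(scores) if scores else None
    return result


# Made-up ratings, for illustration only.
example = [
    Rating("a1", "r1", "style", 5),
    Rating("a1", "r1", "clarity", 4),
    Rating("a1", "r1", "accuracy", 3),
    Rating("a1", "r2", "completeness", 2),
]
print(mean_by_group(example))
```

Keeping the two groups as separate averages, rather than one overall score, is what lets a gap between surface and epistemological quality show up at all.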

Figure 12: Screenshot of the last of 4 tutorial questions with the correct answer selected.

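Experimental Results

Research Questions

RQ1: How do current LLMs perform on climate information in terms of presentational adequacy (style, clarity, linguistic correctness, tone) and epistemological adequacy (accuracy, specificity, completeness, uncertainty)?
RQ2: How does AI Assistance affect human raters' ability to detect issues in LLM outputs, and the overall quality of their ratings?
RQ3: Does attribution-based evaluation (AIS) correlate with the epistemological quality of model outputs?
RQ4: How do different LLMs differ in providing localized, up-to-date, and comprehensive climate information across diverse question sources?
RQ5: What limitations and potential improvements exist in using LLMs to communicate climate information?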

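Key Findings

LLMs are fluent and strong in surface quality, but epistemological quality lags across all models.
Even where presentation is strong, tone and usefulness remain notably lacking.
Accuracy, specificity, completeness, and uncertainty are generally below average, and short 3-4 sentence answers struggle to achieve comprehensive coverage.
AI Assistance increases the number of issues raters detect, which improves evaluation quality.
Attribution-based signals (AIS) do not reliably predict overall epistemological quality, indicating a need for broader evaluation methods.
Falcon-180B-Chat stands out among the tested models for epistemological quality.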
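
One way to check a finding like the AI Assistance effect is to compare how many issues raters flag per rating in each condition. The sketch below assumes a hypothetical record format of `(condition, issues_flagged)` pairs and uses made-up numbers. It does not reproduce the paper's analysis or data.

```python
from collections import defaultdict


def mean_issues_per_rating(records):
    """Return the mean number of flagged issues per rating for each condition.

    `records` is an iterable of (condition, issues_flagged) pairs; this
    format is an assumption made for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # condition -> [issue sum, rating count]
    for condition, issues_flagged in records:
        totals[condition][0] += issues_flagged
        totals[condition][1] += 1
    return {cond: total / count for cond, (total, count) in totals.items()}


# Made-up records, for illustration only.
records = [
    ("with_assistance", 3),
    ("with_assistance", 2),
    ("without_assistance", 1),
    ("without_assistance", 1),
]
print(mean_issues_per_rating(records))
```

A higher mean in the assisted condition would match the direction of the finding above. It would not by itself show that the extra flagged issues are valid.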

Figure 13: Screenshot of the instructions to the raters, provided at the beginning of the first rating session.


This review was generated by AI and reviewed by a human editor.