QUICK REVIEW

[Paper Digest] Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

Keyon Vafa, Ashesh Rambachan | arXiv (Cornell University) | Jun 3, 2024

Computational and Text Analysis Methods | Cited by 5

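One-Sentence Summary

This paper defines and quantifies a human generalization function that shapes how people deploy LLMs, collects 18,972 examples across 79 tasks to model these beliefs, shows that belief changes are predictable (using NLP methods, especially BERT), and finds that in high-stakes settings larger models can be misaligned with how people deploy them.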

ABSTRACT

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these deployment decisions are made by people, and in particular, people's beliefs about where an LLM will perform well. We model such beliefs as the consequence of a human generalization function: having seen what an LLM gets right or wrong, people generalize to where else it might succeed. We collect a dataset of 19K examples of how humans make generalizations across 79 tasks from the MMLU and BIG-Bench benchmarks. We show that the human generalization function can be predicted using NLP methods: people have consistent structured ways to generalize. We then evaluate LLM alignment with the human generalization function. Our results show that -- especially for cases where the cost of mistakes is high -- more capable models (e.g. GPT-4) can do worse on the instances people choose to use them for, exactly because they are not aligned with the human generalization function.

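Research Motivation and Goals

Explain why evaluating LLMs requires modeling human deployment decisions rather than relying on fixed benchmarks.
Define and formalize the human generalization function that governs how beliefs are updated after observing LLM outputs.
Collect and empirically analyze a large-scale dataset of human generalizations on MMLU and BBH tasks.
Show that human generalization can be predicted by NLP models, and evaluate how well LLMs align with these generalizations.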

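Proposed Method

A formal framework linking deployment to human beliefs b(x|f) and the human deployment distribution h(x|f).
A bandit-guided survey design used to collect 18,972 human generalization examples from 79 MMLU and BBH tasks.
Belief change modeled as a binary prediction task Δ(x|x′,f), with several predictors evaluated (prior correctness, fixed embeddings + XGBoost, BERT, Llama-2 variants, GPT-3.5, GPT-4).
Evaluation with NLL and AUC to measure how well each model predicts human belief updates (with a focus on the benchmarks where belief changes are observed).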
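
The evaluation step above can be illustrated with a minimal sketch. It assumes each example carries a binary label (1 if a person's belief about question x changed after seeing the model's answer on x′, 0 otherwise) and each predictor outputs a probability of change. The function names, the toy labels, and the scores are hypothetical and only illustrate how NLL and AUC compare a text-based predictor against a constant base-rate baseline; they are not taken from the paper.

```python
import math


def mean_nll(y_true, p_pred, eps=1e-12):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)


def auc(y_true, p_pred):
    """AUC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: most pairs show no belief change (sparse positives).
labels = [0, 0, 0, 1, 0, 1, 0, 0]

# Baseline that ignores the text and always predicts the base rate of change.
base_rate = sum(labels) / len(labels)
baseline_scores = [base_rate] * len(labels)

# Made-up scores standing in for a text-based predictor.
text_scores = [0.1, 0.2, 0.1, 0.8, 0.3, 0.6, 0.2, 0.1]

for name, scores in [("baseline", baseline_scores), ("text model", text_scores)]:
    print(f"{name}: NLL={mean_nll(labels, scores):.3f}  AUC={auc(labels, scores):.2f}")
```

Lower NLL and higher AUC indicate a better predictor of belief change. In practice, library routines such as scikit-learn's `log_loss` and `roc_auc_score` would be the usual choice; the pure-Python versions here only keep the sketch self-contained.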

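Experimental Results

Research Questions

RQ1: After observing a single answer, how do people generalize an LLM's capabilities across related questions?
RQ2: Can NLP models predict when human beliefs about LLM capabilities will change?
RQ3: Under different assumptions about deployment risk, how well do different LLMs align with the human generalization function?

Key Findings

The human generalization function is sparse: for most question pairs, observing an answer does not update beliefs.
Predicting belief change is feasible: text-based models (especially BERT) outperform non-text baselines, with a best AUC of about 0.81 on held-out data.
In high-stakes settings, larger models can be misaligned with human generalization: despite stronger overall capability, their deployed performance can be worse (e.g., GPT-4).
Using textual information improves belief-change prediction, suggesting that existing NLP representations contain structure relevant to how people reason about model capabilities.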
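
The high-stakes finding can be made concrete with a small sketch. It assumes a simplified deployment rule in which a person uses the model on a question only when their belief that it will answer correctly is at least some threshold, and a higher threshold stands in for a higher cost of mistakes. This rule, the function name `deployed_accuracy`, and all numbers below are illustrative assumptions rather than the paper's exact formulation or data.

```python
def deployed_accuracy(beliefs, correct, threshold):
    """Accuracy on the questions a person chooses to deploy the model on.

    beliefs[i]: the person's belief that the model answers question i correctly.
    correct[i]: 1 if the model actually answers question i correctly, else 0.
    A question is deployed only when the belief is at least `threshold`.
    """
    chosen = [c for b, c in zip(beliefs, correct) if b >= threshold]
    if not chosen:
        return None  # the person never deploys the model
    return sum(chosen) / len(chosen)


# Hypothetical weaker model: beliefs track where it actually succeeds.
small_correct = [1, 1, 0, 0, 0, 1]
small_beliefs = [0.9, 0.9, 0.2, 0.1, 0.2, 0.6]

# Hypothetical stronger model: more accurate overall, but people over-generalize
# from its successes and also trust it on the one question it gets wrong.
large_correct = [1, 1, 1, 1, 0, 1]
large_beliefs = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6]

threshold = 0.8  # high cost of mistakes: deploy only when quite confident

small_overall = sum(small_correct) / len(small_correct)
large_overall = sum(large_correct) / len(large_correct)
small_deployed = deployed_accuracy(small_beliefs, small_correct, threshold)
large_deployed = deployed_accuracy(large_beliefs, large_correct, threshold)

print(f"small: overall={small_overall:.2f}, deployed={small_deployed:.2f}")
print(f"large: overall={large_overall:.2f}, deployed={large_deployed:.2f}")
```

In this toy setup the stronger model has higher overall accuracy but lower accuracy on the questions people actually choose to deploy it on, which is the kind of gap the finding describes.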


This digest was generated by AI and reviewed by a human editor.