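[Paper Digest] Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
This paper introduces BIG-bench, a large and diverse benchmark of 204 tasks (contributed by 450 authors across 132 institutions) for quantifying and extrapolating the capabilities of language models. It goes beyond existing benchmarks and compares dense and sparse transformers at different scales against human raters.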
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
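Research Motivation and Goals
- Motivate the need to understand the capabilities and limitations of language models, now and in the near future, to inform research and anticipate societal impact
- Develop a large-scale, difficult, and diverse benchmark that reveals new capabilities and potential harms that existing benchmarks miss
- Provide a human baseline, including expert raters, against which model performance at different scales can be compared
- Enable extrapolation of performance as model scale increases, in order to anticipate breakthroughs and guide research directions
- Encourage open task contributions and transparent evaluation through an open-source GitHub workflow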
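Proposed Method
- Build BIG-bench, comprising 204 tasks spanning linguistics, math, common-sense reasoning, biology, physics, social bias, software development, and more
- Define an API that supports both JSON and programmatic tasks, for zero-shot and few-shot evaluation
- Evaluate dense and sparse transformer models (BIG-G, BIG-G sparse, GPT-3-class models, PaLM) at scales spanning six orders of magnitude
- Normalize task metrics to a common 0–100 scale to allow aggregate analysis
- Compute calibration metrics (expected calibration error, Brier score) for model predictions at different scales
- Bring in human expert raters to establish a strong baseline and measure the gap between models and humans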
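The normalization step in the list above can be illustrated with a short sketch. It assumes a linear rescaling between a task-specific low reference score (for example, chance level) and a high reference score (excellent performance); the function names and the example values are hypothetical and are not taken from the BIG-bench codebase.

```python
from typing import Sequence


def normalize_score(raw: float, low: float, high: float) -> float:
    """Linearly rescale a raw metric so that `low` maps to 0 and `high` to 100.

    `low` and `high` are assumed per-task reference scores (illustrative names).
    Scores below `low` come out negative; they are not clipped here.
    """
    if high == low:
        raise ValueError("high and low reference scores must differ")
    return 100.0 * (raw - low) / (high - low)


def aggregate(normalized_scores: Sequence[float]) -> float:
    """Average normalized scores across tasks (a simple unweighted mean)."""
    if not normalized_scores:
        raise ValueError("need at least one score")
    return sum(normalized_scores) / len(normalized_scores)


# Hypothetical example: a 4-way multiple-choice task with chance accuracy 0.25.
chance_level = normalize_score(0.25, low=0.25, high=1.0)   # 0.0
halfway = normalize_score(0.625, low=0.25, high=1.0)       # 50.0
```

Putting every task on the same scale is what makes an unweighted mean across heterogeneous metrics meaningful; whether to clip negative values is a design choice this sketch leaves open.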
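Experimental Results
Research Questions
- RQ1: How does the aggregate and task-specific performance of language models on diverse tasks change as model scale and the number of shots increase?
- RQ2: Do model capabilities differ qualitatively from human performance, and do scaling trends predict potential breakthroughs?
- RQ3: How do model calibration and confidence relate to scale, and how do they compare with human raters?
- RQ4: How large are the differences in performance and efficiency between the dense and sparse model classes on BIG-bench tasks?
- RQ5: What are the limitations of existing benchmarks in capturing future capabilities and the dynamics of social bias?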
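Key Findings
- Aggregate model performance improves as scale and the number of shots increase, but remains well below human performance on BIG-bench tasks
- Model calibration improves with scale, but remains poor in absolute terms across tasks
- Dense and sparse model classes show similar performance trends, with some benefit from sparsity
- Some tasks improve gradually and predictably, typically those with a large knowledge or memorization component; others show "breakthrough" behavior at a critical scale, typically those involving multiple steps or components, or brittle metrics
- In settings with ambiguous context, social bias tends to increase with scale, although prompting can mitigate this effect
- Even large language models remain brittle, and performance on non-English languages varies by task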
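The calibration finding above is measured with expected calibration error (ECE) and the Brier score. As a minimal sketch of how these two metrics are commonly computed for multiple-choice predictions, assuming equal-width confidence bins for ECE and the multi-class form of the Brier score (the function names and toy inputs are illustrative, not the paper's code):

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """ECE: weighted average gap between accuracy and mean confidence per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1
    ece = 0.0
    for cs, hs, k in zip(conf_sum, hit_sum, count):
        if k:
            ece += (k / n) * abs(hs / k - cs / k)
    return ece


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multi-class Brier score: mean squared error against one-hot targets."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2 for k, pk in enumerate(p))
    return total / len(probs)


# Toy data: the model is slightly overconfident on the 0.6-confidence examples.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
brier = brier_score([[0.8, 0.2], [0.3, 0.7]], [0, 1])
```

Lower is better for both metrics. ECE depends on the binning scheme, so values are only comparable when the number of bins is held fixed.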
This digest was generated by AI and reviewed by a human editor.