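[Paper Explained] Partial success in closing the gap between human and machine vision
The study shows that modern models increasingly match or even exceed human robustness to out-of-distribution (OOD) distortions, especially data-rich and transformer-based models, but a gap in image-level error patterns between humans and machines remains. A large-scale psychophysical benchmark comprising 17 OOD datasets and 85,120 trials evaluates multiple model families to quantify progress toward human-like vision.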
A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants. We then investigated a range of promising machine learning developments that crucially deviate from standard supervised CNNs along three axes: objective function (self-supervised, adversarially trained, CLIP language-image training), architecture (e.g. vision transformers), and dataset size (ranging from 1M to 1B). Our findings are threefold. (1.) The longstanding distortion robustness gap between humans and CNNs is closing, with the best models now exceeding human feedforward performance on most of the investigated OOD datasets. (2.) There is still a substantial image-level consistency gap, meaning that humans make different errors than models. In contrast, most models systematically agree in their categorisation errors, even substantially different ones like contrastive self-supervised vs. standard supervised models. (3.) In many cases, human-to-model consistency improves when training dataset size is increased by one to three orders of magnitude. Our results give reason for cautious optimism: While there is still much room for improvement, the behavioural difference between human and machine vision is narrowing. In order to measure future progress, 17 OOD datasets with image-level human behavioural data and evaluation code are provided as a toolbox and benchmark at: https://github.com/bethgelab/model-vs-human/
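Motivation and Goals
- Assess whether the robustness gap between human and machine vision on out-of-distribution data is closing.
- Assess how different ML developments (objective function, architecture, dataset size) affect human-machine alignment.
- Provide a benchmark toolbox and datasets for tracking future progress in the field.
Proposed Method
- Collected 85,120 psychophysical trials from 90 human observers across 17 OOD datasets designed to test distortion robustness.
- Compared 52 models spanning CNNs, self-supervised models, adversarially trained models, vision transformers, and models trained on large-scale or noisily labelled data.
- Evaluated models using OOD accuracy and three alignment metrics: accuracy difference A(m), observed consistency O(m), and error consistency E(m).
- Released the model-vs-human toolbox for benchmarking new models against the human data.
- Mapped the 1,000 ImageNet classes to 16 categories using the WordNet hierarchy so that human and model responses are comparable.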
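The 1,000-to-16 class mapping in the last step can be sketched as follows. This is a minimal illustration, not the toolbox's code: the category names and class indices are made up, and averaging the fine-grained probabilities within each coarse category is an assumption about the aggregation rule.

```python
from typing import Dict, List

# Hypothetical toy mapping from coarse category to fine-grained class indices.
# The real mapping covers 16 categories derived from WordNet; these indices
# are invented for illustration only.
COARSE_TO_FINE: Dict[str, List[int]] = {
    "dog": [0, 1, 2],
    "cat": [3, 4],
    "car": [5],
}


def coarse_decision(probs: List[float], mapping: Dict[str, List[int]]) -> str:
    """Return the coarse category with the highest mean fine-grained probability.

    Averaging (rather than summing) is an assumed aggregation rule; it keeps
    categories with many fine-grained classes from winning by count alone.
    """
    scores = {
        name: sum(probs[i] for i in idxs) / len(idxs)
        for name, idxs in mapping.items()
    }
    return max(scores, key=scores.get)


# Toy softmax output over six fine-grained classes.
probs = [0.10, 0.05, 0.15, 0.30, 0.20, 0.20]
print(coarse_decision(probs, COARSE_TO_FINE))
```

With this rule, a model's 1,000-way output reduces to a single choice among the same categories that human observers pick from, so the two sets of responses can be compared trial by trial.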
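Experimental Results
Research Questions
- RQ1: Across a broad range of OOD conditions, have modern ML models narrowed the distortion robustness gap to humans?
- RQ2: How do objective function, architecture, and training dataset size affect human-machine alignment across images?
- RQ3: Under out-of-distribution conditions, to what extent do machines and humans share error patterns on individual images?
Key Findings
- The best models trained on large-scale data match or exceed human feedforward accuracy on most of the OOD datasets.
- A substantial image-level consistency gap remains: models and humans make errors on different images, although data-rich models narrow this gap on some datasets.
- Self-supervised models offer limited robustness gains over supervised baselines; the notable improvements are mainly attributable to data augmentation choices.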
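- Adversarially trained models show mixed results: robustness improves on some OOD datasets but can worsen under other, non-adversarial perturbations, and these models exhibit a stronger shape bias.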
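- Vision transformers and large-scale data substantially improve OOD performance, with CLIP approaching human-like error patterns on some metrics.
- The paper provides a toolbox and 17 OOD datasets for benchmarking future progress and quantifying human-machine behavioural alignment.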
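The image-level consistency gap in the findings above can be made concrete with a small sketch. It assumes error consistency is Cohen's kappa computed on per-trial correctness, and that observed consistency is the fraction of trials on which two observers are both right or both wrong. The response vectors are invented toy data, and the paper's aggregate accuracy difference A(m) is not reproduced here.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of trials answered correctly."""
    return sum(correct) / len(correct)


def observed_consistency(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Fraction of trials where both observers are right or both are wrong."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def error_consistency(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa on correctness: agreement beyond what accuracy alone predicts."""
    p_a, p_b = accuracy(a), accuracy(b)
    # Agreement expected from two independent observers with these accuracies.
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)
    c_obs = observed_consistency(a, b)
    if c_exp == 1.0:
        # Kappa is undefined here; returning 0.0 is a choice made for this sketch.
        return 0.0
    return (c_obs - c_exp) / (1 - c_exp)


# Toy per-image correctness (True = correct) on the same eight images.
human = [True, True, True, False, True, False, True, True]
model_diff = [False, True, True, True, True, True, False, True]

print(accuracy(human), accuracy(model_diff))
print(observed_consistency(human, model_diff))
print(error_consistency(human, model_diff))
```

Both toy observers reach the same accuracy, yet their errors fall on different images, so the error consistency comes out below zero. Equal accuracy therefore says little about whether a model errs on the same images as humans, which is why the two measures are reported separately.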
This summary was generated by AI and reviewed by human editors.