QUICK REVIEW

[Paper Review] Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks

Andrey Malinin, Neil Band|arXiv (Cornell University)|Jul 15, 2021

Anomaly Detection Techniques and Applications | Cited by 45

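One-Sentence Summary

A large multimodal dataset (tabular weather prediction, machine translation, and vehicle motion prediction) with real-world distributional shift for benchmarking uncertainty estimation and robustness; baseline ensembles show better robustness and better uncertainty estimates than single models across tasks.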

ABSTRACT

There has been significant research done on developing methods for improving robustness to distributional shift and uncertainty estimation. In contrast, only limited work has examined developing standard datasets and benchmarks for assessing these approaches. Additionally, most work on uncertainty estimation and robustness has developed new techniques based on small-scale regression or image classification tasks. However, many tasks of practical interest have different modalities, such as tabular data, audio, text, or sensor data, which offer significant challenges involving regression and discrete or continuous structured prediction. Thus, given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary. This will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines. In this work, we propose the Shifts Dataset for evaluation of uncertainty estimates and robustness to distributional shift. The dataset, which has been collected from industrial sources and services, is composed of three tasks, with each corresponding to a particular data modality: tabular weather prediction, machine translation, and self-driving car (SDC) vehicle motion prediction. All of these data modalities and tasks are affected by real, "in-the-wild" distributional shifts and pose interesting challenges with respect to uncertainty estimation. In this work we provide a description of the dataset and baseline results for all tasks.

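Motivation and Objectives

Introduce a standardized real-world benchmark (Shifts) spanning multiple data modalities for studying robustness to distributional shift and predictive uncertainty.
Provide canonical data partitions with in-domain and shifted splits to simulate the distributional changes encountered at deployment.
Provide baseline results using ensembles to establish performance and uncertainty reference points across tasks.
Propose a joint assessment of robustness and uncertainty via retention curves (error-retention and F1-retention) and the associated AUC metrics.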

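Proposed Method

Build three large-scale tasks from industrial sources: tabular weather prediction, machine translation, and self-driving vehicle motion prediction.
Adopt ensemble-based baselines to obtain robust uncertainty estimates and competitive predictive performance.
Use error-retention and F1-retention curves to jointly assess robustness to distributional shift and uncertainty quality (R-AUC, F1-AUC, F1@95%).
Define canonical partitions that split the data into in-domain and shifted sets to reflect real distributional shift.
Measure predictive performance with task-specific metrics (e.g., RMSE/MAE for regression, accuracy/macro-F1 for classification, BLEU/eGLEU/maxGLEU for MT, and cNLL/minADE/minFDE for motion prediction).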
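
To make the error-retention idea above concrete, here is a minimal sketch in NumPy. It assumes the common convention that the most uncertain predictions are handed to an oracle (their error becomes zero), so the curve plots the error of the retained predictions, normalized by the full dataset size, against the retained fraction. The function names `error_retention_curve` and `r_auc` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def error_retention_curve(errors, uncertainties):
    """Return (retention fractions, error at each fraction).

    Assumption: rejected (most uncertain) inputs are replaced by oracle
    answers with zero error, so only retained inputs contribute error.
    """
    errors = np.asarray(errors, dtype=float)
    uncertainties = np.asarray(uncertainties, dtype=float)
    n = errors.size
    # Retain the least uncertain predictions first.
    order = np.argsort(uncertainties, kind="stable")
    cumulative = np.concatenate([[0.0], np.cumsum(errors[order])])
    retention = np.arange(n + 1) / n
    return retention, cumulative / n

def r_auc(errors, uncertainties):
    """Area under the error-retention curve (trapezoidal rule); lower is better."""
    retention, err = error_retention_curve(errors, uncertainties)
    return float(np.sum((err[1:] + err[:-1]) / 2.0 * np.diff(retention)))

# Toy per-example errors: an uncertainty that ranks errors well gives a smaller area.
errors = [0.1, 0.4, 0.2, 0.9]
good = r_auc(errors, uncertainties=errors)                 # uncertainty tracks error
bad = r_auc(errors, uncertainties=[-e for e in errors])    # uncertainty is inverted
```

Under this convention the area can shrink for two reasons: the model makes smaller errors overall, or its uncertainty ranks its errors better. That is why a single number can summarize both robustness and uncertainty quality.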

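Experimental Results

Research Questions

RQ1: How does model robustness to distributional shift degrade on real-world multimodal tasks?
RQ2: How strongly do ensemble-based uncertainty estimates correlate with actual error under shift?
RQ3: Which uncertainty measures are best at detecting out-of-distribution (OOD) inputs across modalities?
RQ4: What are the relative trade-offs in performance and uncertainty between ensembles and single models on the weather, translation, and vehicle motion tasks?
RQ5: Can retention-based evaluation reliably characterize hybrid human-machine decision making under distributional shift?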
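
For RQ3, one way to score an uncertainty measure as an OOD detector is the ROC-AUC obtained when the uncertainty value is used to separate shifted from in-domain inputs. The sketch below is an illustration only: it computes the ROC-AUC through its pairwise-ranking interpretation, and the function name `ood_roc_auc` and the toy scores are hypothetical.

```python
import numpy as np

def ood_roc_auc(unc_in_domain, unc_shifted):
    """ROC-AUC of uncertainty as an OOD score.

    Equals the probability that a randomly chosen shifted input receives a
    higher uncertainty than a randomly chosen in-domain input, with ties
    counted as one half.
    """
    unc_in = np.asarray(unc_in_domain, dtype=float)
    unc_out = np.asarray(unc_shifted, dtype=float)
    # Compare every shifted score against every in-domain score.
    diff = unc_out[:, None] - unc_in[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

perfect = ood_roc_auc([0.1, 0.2], [0.3, 0.4])   # shifted inputs always more uncertain
chance = ood_roc_auc([0.1, 0.4], [0.2, 0.3])    # no separation between the two sets
```

The pairwise form needs memory proportional to the product of the two set sizes, so it suits small illustrations; a rank-based implementation would be preferable at the scale of the datasets described here.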

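Key Findings

In the weather prediction and machine translation baselines, ensembles consistently outperform single models, improving RMSE/MAE (weather) and BLEU/eGLEU (MT).
In weather prediction, the ensemble's relative RMSE improvement holds across the dev-in, dev-out, eval-in, eval-out, and eval partitions, and the ensemble also clearly beats single models on the retention-based uncertainty metrics (R-AUC and F1-AUC).
Measures that capture knowledge uncertainty (e.g., EPKL, MI, RMI) generally yield higher ROC-AUC for OOD detection in both regression and classification tasks, whereas total-uncertainty measures (e.g., tvar, Conf, Entropy) perform better on F1-AUC and F1@95%.
For MT, ensembles achieve better R-AUC and F1-AUC than single models, and the BLEU/eGLEU results point to improved robustness; ROC-AUC for separating in-domain from shifted data also favors ensembles.
The vehicle motion prediction task brings multi-domain uncertainty evaluation to 600,000 scenes (cNLL, minADE, minFDE, and weighted variants), comparing BC and DIM RIP baselines across ensemble sizes and uncertainty methods, and highlights the role of ensemble uncertainty in continuous multi-trajectory prediction.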
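
The distinction between total and knowledge uncertainty in the findings above can be illustrated for a classification ensemble. The following is a minimal sketch, assuming each ensemble member outputs class probabilities: total uncertainty is taken as the entropy of the averaged prediction, data uncertainty as the average entropy of the members, and their difference is the mutual information (MI). The function name `ensemble_uncertainties` is hypothetical, and EPKL and RMI are not shown.

```python
import numpy as np

def ensemble_uncertainties(probs, eps=1e-12):
    """Decompose ensemble uncertainty for classification.

    probs: array of shape (members, examples, classes) with class probabilities.
    eps:   small constant to avoid log(0); an implementation detail of this sketch.
    """
    probs = np.asarray(probs, dtype=float)
    mean_probs = probs.mean(axis=0)
    # Total uncertainty: entropy of the averaged predictive distribution.
    total = -np.sum(mean_probs * np.log(mean_probs + eps), axis=-1)
    # Expected data uncertainty: average entropy of the individual members.
    data = -np.sum(probs * np.log(probs + eps), axis=-1).mean(axis=0)
    return {
        "entropy": total,
        "expected_entropy": data,
        "mutual_information": total - data,   # knowledge uncertainty
        "confidence": mean_probs.max(axis=-1),
    }

# Two members, two examples: they agree on the first and disagree on the second.
probs = [
    [[0.9, 0.1], [0.99, 0.01]],
    [[0.9, 0.1], [0.01, 0.99]],
]
u = ensemble_uncertainties(probs)
```

In this toy case the mutual information is near zero where the members agree and large where they disagree, which matches the intuition that knowledge uncertainty reflects disagreement across the ensemble rather than noise in the data.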


This review was generated by AI and reviewed by human editors.