QUICK REVIEW

[Paper Review] Underspecification Presents Challenges for Credibility in Modern Machine Learning

Alexander D’Amour, Katherine Heller|arXiv (Cornell University)|Nov 6, 2020

Machine Learning in Healthcare|References: 117|Cited by: 430

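One-Sentence Summary

The paper argues that underspecification in ML pipelines can cause predictors with similar iid performance to behave very differently in deployment, and it presents stress-test evidence from multiple domains to encourage disciplined evaluation and design.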

ABSTRACT

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

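Motivation and Objectives

Clearly define underspecification in ML pipelines and its effect on deployment credibility.
Show that predictors with near-optimal iid performance can encode different inductive biases, leading to divergent behavior in deployment.
Demonstrate underspecification empirically in computer vision, medical imaging, natural language processing, prediction from electronic health records, and genomics.
Propose stress tests and constraints as remedies that ensure credible inductive biases in real-world deployment.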

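Proposed Method

Formalize underspecification in an ML pipeline as the coexistence of many predictors with near-optimal iid performance.
Use simple analytical models (an epidemiological model, a random feature model, aggregate genetic risk scores) to illustrate how different predictors with similar training performance can produce different deployment outcomes.
Apply a stress-testing protocol comprising stratified, shifted, and contrastive evaluations to production-grade deep learning pipelines across domains.
Document empirical evidence of underspecification in computer vision, medical imaging, NLP, and electronic health records.
Advocate training and evaluation techniques that constrain pipelines toward credible inductive biases without sacrificing iid performance.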
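
As a minimal sketch of the idea in the list above (not the paper's experiments), the following toy example assumes a linear model with two features that are identical in the training domain. Gradient descent from different starting points, which stand in here for different random seeds, reaches predictors with the same held-out iid error but different error once the two features decouple. All names (`fit_gd`, `make_data`, `run_demo`) and the data-generating setup are illustrative assumptions.

```python
import numpy as np


def fit_gd(X, y, w0, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error, starting from w0."""
    w = np.asarray(w0, dtype=float).copy()
    n = len(y)
    for _ in range(steps):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w -= lr * grad
    return w


def mse(X, y, w):
    """Mean squared error of the linear predictor X @ w."""
    return float(np.mean((X @ w - y) ** 2))


def make_data(n, rng, correlated):
    """Toy data: the target depends only on x1.

    In the (assumed) training domain x2 is an exact copy of x1, so the
    data cannot tell the two features apart. In the shifted domain x2
    is drawn independently of x1.
    """
    x1 = rng.normal(size=n)
    x2 = x1.copy() if correlated else rng.normal(size=n)
    X = np.column_stack([x1, x2])
    y = x1
    return X, y


def run_demo():
    """Return one (iid_mse, shifted_mse) pair per initialization."""
    rng = np.random.default_rng(0)
    X_train, y_train = make_data(500, rng, correlated=True)
    X_iid, y_iid = make_data(500, rng, correlated=True)
    X_shift, y_shift = make_data(500, rng, correlated=False)

    # Hypothetical starting points standing in for different random seeds.
    inits = [[0.4, -0.6], [0.0, 0.0], [1.5, -1.0], [-1.0, 1.0]]

    results = []
    for w0 in inits:
        w = fit_gd(X_train, y_train, w0)
        results.append((mse(X_iid, y_iid, w), mse(X_shift, y_shift, w)))
    return results


if __name__ == "__main__":
    for iid_err, shift_err in run_demo():
        print(f"iid MSE = {iid_err:.2e}   shifted MSE = {shift_err:.3f}")
```

Calling `run_demo()` returns one `(iid_mse, shifted_mse)` pair per initialization. In this setup the iid errors are all essentially zero, while the shifted error depends on how much weight each predictor happened to place on the redundant feature, which the training data never constrained.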

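Experimental Results

Research Questions

RQ1: What is underspecification in an ML pipeline, and how does it affect deployment credibility?
RQ2: Can predictors with similar iid performance diverge in deployment because they encode different inductive biases?
RQ3: How do stress tests reveal underspecification across diverse ML applications?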

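Key Findings

Underspecification is widespread in modern machine learning and leads to deployment-sensitive behavior that iid evaluation does not capture.
Stress tests (stratified, shifted, and contrastive evaluations) reveal variability in predictor behavior that standard iid testing misses.
Under distribution shift or adversarial shift, different predictors with nearly identical iid risk can exhibit substantially different risk.
Even when iid performance is maintained, some predictors can be brittle to specific shifts, which undermines credibility.
The problem is pervasive across domains: computer vision, medical imaging, NLP, EHR-based risk prediction, and medical genomics.
Addressing underspecification through targeted training and evaluation strategies can improve credibility without necessarily harming iid performance.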
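
To make the stratified evaluation mentioned in the findings concrete, here is a small sketch that assumes a classification setting with a known subgroup label per example. It computes accuracy within each stratum so that two predictors with the same overall accuracy can be compared group by group. The function name `stratified_accuracy` and the tiny hand-made arrays are hypothetical and are not data from the paper.

```python
import numpy as np


def stratified_accuracy(y_true, y_pred, groups):
    """Return {group: accuracy}, computed separately within each stratum."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    scores = {}
    for g in np.unique(groups):
        mask = groups == g
        scores[str(g)] = float(np.mean(y_true[mask] == y_pred[mask]))
    return scores


# Hypothetical labels, subgroup tags, and two predictors' outputs.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
pred_1 = [1, 1, 0, 0, 1, 0, 0, 1]  # both errors fall in group "b"
pred_2 = [1, 0, 0, 0, 1, 0, 1, 1]  # one error in each group

overall_1 = float(np.mean(np.asarray(y_true) == np.asarray(pred_1)))
overall_2 = float(np.mean(np.asarray(y_true) == np.asarray(pred_2)))
by_group_1 = stratified_accuracy(y_true, pred_1, groups)
by_group_2 = stratified_accuracy(y_true, pred_2, groups)

if __name__ == "__main__":
    print("overall:", overall_1, overall_2)
    print("predictor 1 by group:", by_group_1)
    print("predictor 2 by group:", by_group_2)
```

Both hypothetical predictors score 0.75 overall, yet the first is perfect on group "a" and at chance on group "b" while the second is uniform across groups. An aggregate iid metric hides exactly this kind of difference.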


This review was generated by AI and reviewed by a human editor.