QUICK REVIEW

[Paper Review] Underspecification Presents Challenges for Credibility in Modern Machine Learning

Alexander D’Amour, Katherine Heller | arXiv (Cornell University) | Nov 6, 2020

Machine Learning in Healthcare | References: 117 | Citations: 430

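One-line summary

This paper argues that underspecification in ML pipelines causes predictors with similar iid performance to behave very differently at deployment, and it provides stress-test evidence across multiple domains to encourage disciplined evaluation and design.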

ABSTRACT

ML models often exhibit unexpectedly poor behavior when they are deployed in real-world domains. We identify underspecification as a key reason for these failures. An ML pipeline is underspecified when it can return many predictors with equivalently strong held-out performance in the training domain. Underspecification is common in modern ML pipelines, such as those based on deep learning. Predictors returned by underspecified pipelines are often treated as equivalent based on their training domain performance, but we show here that such predictors can behave very differently in deployment domains. This ambiguity can lead to instability and poor model behavior in practice, and is a distinct failure mode from previously identified issues arising from structural mismatch between training and deployment domains. We show that this problem appears in a wide variety of practical ML pipelines, using examples from computer vision, medical imaging, natural language processing, clinical risk prediction based on electronic health records, and medical genomics. Our results show the need to explicitly account for underspecification in modeling pipelines that are intended for real-world deployment in any domain.

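Motivation and objectives

Define underspecification in ML pipelines and explain its effect on credibility at deployment.
Show that predictors with near-optimal iid performance can encode different inductive biases, so that their deployment behavior diverges.
Demonstrate underspecification empirically across computer vision, medical imaging, NLP, EHR-based prediction, and genomics.
Propose stress tests and constraints as remedies for ensuring credible inductive biases in real-world deployment.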

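Proposed method

Formalize underspecification in an ML pipeline as the existence of multiple predictors that achieve near-optimal iid performance.
Use simple analytical models (an epidemiological model, random feature models, polygenic risk scores) to illustrate how distinct predictors with similar training performance can produce different outcomes at deployment.
Apply a stress-testing protocol to production-grade deep learning pipelines in vision, medical imaging, NLP, and EHR-based prediction.
Document empirical evidence of underspecification in computer vision, medical imaging, NLP, EHR-based risk prediction, and genomics.
Propose training and evaluation methods that constrain pipelines toward credible inductive biases without degrading iid performance.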
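The formalization above can be illustrated with a toy example. The following is a minimal sketch, not taken from the paper: it assumes a two-feature linear setting in which the second feature is almost a copy of the first in the training domain but becomes independent at deployment. The names `make_data`, `w_a`, and `w_b`, and all noise scales, are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, correlated):
    """Sample (X, y). The label depends only on the first feature x1."""
    x1 = rng.normal(size=n)
    if correlated:
        # Training domain: x2 is almost an exact copy of x1.
        x2 = x1 + 0.01 * rng.normal(size=n)
    else:
        # Deployment domain: the x1/x2 link is broken.
        x2 = rng.normal(size=n)
    y = x1 + 0.1 * rng.normal(size=n)
    return np.column_stack([x1, x2]), y

def mse(w, X, y):
    """Mean squared error of the linear predictor X @ w."""
    return float(np.mean((X @ w - y) ** 2))

# Held-out data from the training domain, and data from a shifted domain.
X_iid, y_iid = make_data(5000, correlated=True)
X_shift, y_shift = make_data(5000, correlated=False)

# Two hypothetical predictors the pipeline cannot tell apart on iid data:
w_a = np.array([1.0, 0.0])  # relies on x1, the feature that drives y
w_b = np.array([0.0, 1.0])  # relies on x2, which only tracks x1 in training

iid_a, iid_b = mse(w_a, X_iid, y_iid), mse(w_b, X_iid, y_iid)
shift_a, shift_b = mse(w_a, X_shift, y_shift), mse(w_b, X_shift, y_shift)

print(f"iid MSE:     w_a={iid_a:.4f}  w_b={iid_b:.4f}")
print(f"shifted MSE: w_a={shift_a:.4f}  w_b={shift_b:.4f}")
```

By construction, the two predictors have nearly the same held-out error in the training domain, while only `w_b` degrades once the correlation between the features is removed. This is the sense in which iid performance alone leaves the choice between them unspecified.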

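Experimental results

Research questions

RQ1: What is underspecification in ML pipelines, and how does it affect deployment credibility?
RQ2: Can predictors with similar iid performance diverge at deployment because they encode different inductive biases?
RQ3: How can stress tests reveal underspecification across a range of ML applications?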

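Key findings

Underspecification is widespread in modern ML and produces deployment-sensitive behavior that iid evaluation does not capture.
Stress tests (including stratified, shifted, and contrastive evaluations) reveal variation in predictor behavior that standard iid tests miss.
Different predictors with nearly identical iid risk can show substantially different risk under distribution shift or adversarial shift.
Even when iid performance is maintained, some predictors become fragile to particular shifts, which undermines credibility.
The problem persists across domains, spanning computer vision, medical imaging, NLP, EHR-based risk prediction, and medical genomics.
Addressing underspecification through targeted training and evaluation strategies can improve credibility without necessarily sacrificing iid performance.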
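A shifted-performance stress test of the kind described in these findings can be sketched as follows: fit several predictors that differ only in their random initialization, then compare the seed-to-seed spread of an iid metric with the spread of the same metric on shifted data. This is a toy setup under stated assumptions, not the paper's pipelines: an overparameterized linear model whose training features lie in a low-dimensional subspace, with `fit_from_init`, `sample`, and all dimensions and scales chosen here for illustration.

```python
import numpy as np

def fit_from_init(X, y, w0):
    """Least-squares solution closest to the initialization w0.

    For squared loss this is the point gradient descent started at w0
    converges to (assuming a sufficiently small step size).
    """
    correction = np.linalg.lstsq(X, y - X @ w0, rcond=1e-10)[0]
    return w0 + correction

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

data_rng = np.random.default_rng(0)
d, k = 20, 5                       # d observed features, k latent factors
A = data_rng.normal(size=(k, d))   # embeds latent factors into feature space
beta = data_rng.normal(size=k)     # true latent coefficients

def sample(n, off_subspace_scale):
    """Draw data; a nonzero scale pushes features off the training subspace."""
    Z = data_rng.normal(size=(n, k))
    X = Z @ A + off_subspace_scale * data_rng.normal(size=(n, d))
    y = Z @ beta + 0.1 * data_rng.normal(size=n)
    return X, y

X_train, y_train = sample(200, 0.0)
X_iid, y_iid = sample(1000, 0.0)       # held-out data, same domain
X_shift, y_shift = sample(1000, 1.0)   # stress test: shifted features

iid_scores, shift_scores = [], []
for seed in range(10):
    # Only the initialization changes between runs.
    w0 = np.random.default_rng(100 + seed).normal(size=d)
    w = fit_from_init(X_train, y_train, w0)
    iid_scores.append(mse(w, X_iid, y_iid))
    shift_scores.append(mse(w, X_shift, y_shift))

iid_spread = float(np.std(iid_scores))
shift_spread = float(np.std(shift_scores))
print(f"spread across seeds, iid:     {iid_spread:.2e}")
print(f"spread across seeds, shifted: {shift_spread:.2e}")
```

In this construction every seed yields the same predictions on iid data, because the initialization only affects directions the training features never exercise; the shifted data does exercise those directions, so the spread across seeds becomes visible there. Reporting the spread rather than a single score is the design choice that makes the comparison informative.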


This review was generated by AI and checked by a human editor.