QUICK REVIEW

[Paper Review] Towards Ground Truth Evaluation of Visual Explanations

Ahmed Osman, Leila Arras|arXiv (Cornell University)|Mar 16, 2020

Multimodal Machine Learning Applications | References: 51 | Cited by: 12

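One-Sentence Summary

This paper introduces a synthetic, CLEVR-like visual question answering dataset with explicit ground-truth pixel-level relevance, which allows visual explanations to be evaluated in a controlled setting. Using this benchmark, the authors propose two new metrics and show that Layer-wise Relevance Propagation outperforms Gradient x Input and Integrated Gradients when explaining the predictions of a Relation Network model.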

ABSTRACT

Several methods have been proposed to explain the decisions of neural networks in the visual domain via saliency heatmaps (aka relevances/feature importance scores). Thus far, these methods were mainly validated on real-world images, using either pixel perturbation experiments or bounding box localization accuracies. In the present work, we propose instead to evaluate explanations in a restricted and controlled setup using a synthetic dataset of rendered 3D shapes. To this end, we generate a CLEVR-alike visual question answering benchmark with around 40,000 questions, where the ground truth pixel coordinates of relevant objects are known, which allows us to validate explanations in a fair and transparent way. We further introduce two straightforward metrics to evaluate explanations in this setup, and compare their outcomes to standard pixel perturbation using a Relation Network model and three decomposition-based explanation methods: Gradient x Input, Integrated Gradients and Layer-wise Relevance Propagation. Among the tested methods, Layer-wise Relevance Propagation was shown to perform best, followed by Integrated Gradients. More generally, we expect the release of our dataset and code to support the development and comparison of methods on a well-defined common ground.

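Motivation and Objectives

Address the lack of reliable, transparent validation for visual explanation methods in deep learning.
Build a controlled synthetic dataset in which the ground-truth relevance is known at the pixel level for every question.
Develop and validate new evaluation metrics for visual explanations in a setting free of the ambiguity of real-world data.
Compare representative explanation methods (Gradient x Input, Integrated Gradients, and Layer-wise Relevance Propagation) under controlled conditions.
Provide a publicly available benchmark to support the development and fair comparison of future explanation methods.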

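Proposed Method

The authors generated a CLEVR-like synthetic benchmark of approximately 40,000 questions over rendered 3D scenes.
For each question, the ground-truth relevant pixels (that is, the pixels of the objects the question refers to) are labeled explicitly at the pixel level.
Two new evaluation metrics were introduced to quantify how well an explanation heatmap agrees with the ground-truth relevance.
A Relation Network model was trained on the dataset, and its predictions were explained with three decomposition-based methods: Gradient x Input, Integrated Gradients, and Layer-wise Relevance Propagation.
The outcomes of the proposed metrics were compared with those of standard pixel perturbation.
The dataset and code were released to support reproducibility and future benchmarking.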
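
The review does not spell out how the two metrics are defined. The sketch below assumes two simple mask-based scores that fit the description: the share of positive relevance that falls inside the ground-truth mask, and the share of the K highest-scoring pixels that lie inside the mask, where K is the mask size. The function names, the toy heatmap, and the choice to ignore negative relevance are illustrative assumptions, not the authors' code; the paper has the exact definitions.

```python
def mass_inside_mask(heatmap, mask):
    """Share of positive relevance that falls inside the ground-truth mask."""
    total = 0.0
    inside = 0.0
    for heat_row, mask_row in zip(heatmap, mask):
        for relevance, is_relevant in zip(heat_row, mask_row):
            relevance = max(relevance, 0.0)  # assumption: negative relevance is ignored
            total += relevance
            if is_relevant:
                inside += relevance
    return inside / total if total > 0 else 0.0


def top_k_inside_mask(heatmap, mask):
    """Share of the K highest-scoring pixels that lie inside the mask (K = mask size)."""
    pixels = [
        (relevance, bool(is_relevant))
        for heat_row, mask_row in zip(heatmap, mask)
        for relevance, is_relevant in zip(heat_row, mask_row)
    ]
    k = sum(1 for _, is_relevant in pixels if is_relevant)
    if k == 0:
        return 0.0
    top_k = sorted(pixels, key=lambda p: p[0], reverse=True)[:k]
    return sum(1 for _, is_relevant in top_k if is_relevant) / k


# Toy 3x3 heatmap and ground-truth mask (hypothetical values).
heatmap = [
    [0.0, 0.1, 0.0],
    [0.2, 0.9, 0.6],
    [0.0, 0.2, 0.0],
]
mask = [
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 0],
]

mass = mass_inside_mask(heatmap, mask)   # about 0.75: 1.5 of 2.0 units of relevance
rank = top_k_inside_mask(heatmap, mask)  # 1.0: both top-2 pixels are in the mask
```

Both scores lie in [0, 1], and neither needs to query the model again, which is what separates this kind of evaluation from pixel perturbation.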

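Experimental Results

Research Questions

RQ1: How do different explanation methods perform when evaluated on a synthetic dataset with known ground-truth relevance?
RQ2: Can new metrics based on pixel-level relevance make the evaluation of explanations fairer and more transparent?
RQ3: How do the rankings of explanation methods differ between ground-truth-based evaluation and perturbation-based evaluation?
RQ4: Which explanation method produces heatmaps that agree most closely with the truly relevant objects in the image?
RQ5: To what extent does a controlled synthetic environment enable reliable and interpretable evaluation of visual explanations?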
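
For contrast with ground-truth-based evaluation, pixel perturbation can be sketched as follows: pixels are replaced in order of decreasing relevance, and the model output is recorded after each step. This is a minimal illustration under stated assumptions. `toy_model`, its weights, the zero baseline, and the flat four-pixel "image" are hypothetical stand-ins, not the Relation Network or the protocol used in the paper.

```python
def perturbation_curve(model, image, relevance, steps, baseline=0.0):
    """Model output after replacing the k most relevant pixels, for k = 0..steps."""
    # Most relevant pixel first; ties keep their original order.
    order = sorted(range(len(image)), key=lambda i: relevance[i], reverse=True)
    perturbed = list(image)
    scores = [model(perturbed)]
    for index in order[:steps]:
        perturbed[index] = baseline  # assumption: "removal" means setting to a baseline value
        scores.append(model(perturbed))
    return scores


# Hypothetical stand-in for a trained network: a fixed linear scorer.
WEIGHTS = [0.0, 2.0, 0.0, 1.0]


def toy_model(pixels):
    return sum(w * p for w, p in zip(WEIGHTS, pixels))


image = [1.0, 1.0, 1.0, 1.0]
good_relevance = [0.0, 2.0, 0.0, 1.0]  # matches what the toy model actually uses
poor_relevance = [2.0, 0.0, 1.0, 0.0]  # points at pixels the toy model ignores

good_curve = perturbation_curve(toy_model, image, good_relevance, steps=2)
poor_curve = perturbation_curve(toy_model, image, poor_relevance, steps=2)
# good_curve -> [3.0, 1.0, 0.0]
# poor_curve -> [3.0, 3.0, 3.0]
```

A faithful heatmap makes the score drop quickly, while a poor one leaves it flat. Unlike a mask-based score, this signal depends on how the model reacts to the perturbed inputs.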

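Key Findings

Among the evaluated methods, Layer-wise Relevance Propagation agreed most closely with the ground-truth relevance.
Integrated Gradients performed well and ranked second in explanation accuracy.
Gradient x Input was the least effective, indicating a limited ability to capture fine-grained relevance.
The proposed metrics detected performance differences between methods more reliably than standard pixel perturbation.
Because the ground-truth relevance is known, the synthetic dataset enables transparent, reproducible, and fair evaluation of explanation methods.
The release of the dataset and code is expected to promote standardized benchmarking and method development in visual explanation research.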
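
To make the difference between two of the compared methods concrete, the sketch below computes Gradient x Input and Integrated Gradients for a one-unit ReLU model with an analytic gradient. The model, its weights, the zero baseline, and the midpoint-rule approximation are all illustrative assumptions; none of this comes from the paper's code. Layer-wise Relevance Propagation is omitted because it requires layer-specific propagation rules.

```python
W = (1.0, 1.0)  # hypothetical weights
B = -1.0        # hypothetical bias


def toy_model(x):
    """One-unit ReLU network: f(x) = max(0, w.x + b)."""
    return max(0.0, sum(w * xi for w, xi in zip(W, x)) + B)


def toy_gradient(x):
    """Analytic gradient of toy_model with respect to x."""
    active = sum(w * xi for w, xi in zip(W, x)) + B > 0
    return [w if active else 0.0 for w in W]


def gradient_x_input(x):
    """Gradient at the input, multiplied elementwise by the input."""
    return [g * xi for g, xi in zip(toy_gradient(x), x)]


def integrated_gradients(x, baseline=None, steps=300):
    """Midpoint-rule approximation of Integrated Gradients along a straight path."""
    if baseline is None:
        baseline = [0.0] * len(x)  # assumption: all-zero baseline
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(toy_gradient(point)):
            grad_sum[i] += g
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]


x = [2.0, 1.0]
gxi = gradient_x_input(x)     # [2.0, 1.0], which sums to 3.0
ig = integrated_gradients(x)  # roughly [1.33, 0.67], which sums to about 2.0
output_change = toy_model(x) - toy_model([0.0, 0.0])  # 2.0
```

In this toy case the Integrated Gradients attributions add up to the change in model output relative to the baseline, while the Gradient x Input attributions do not, because the bias keeps the unit inactive along part of the path. This only illustrates how the two methods can differ; it does not explain the ranking reported above.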

This review was generated by AI and checked by a human editor.