QUICK REVIEW

[Paper Review] IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers

Bowen Pan, Rameswar Panda|arXiv (Cornell University)|Jun 23, 2021

Explainable Artificial Intelligence (XAI) | References: 64 | Citations: 68

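One-Line Summary

IA-RED$^2$ introduces interpretable, input-dependent redundancy reduction for vision transformers by dynamically dropping uninformative patches. It reports up to 1.4x speed-up on images and up to 4x on video with minimal accuracy loss (<0.7%).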

ABSTRACT

The self-attention-based model, transformer, is recently becoming the leading backbone in the field of computer vision. In spite of the impressive success made by transformers in a variety of vision tasks, it still suffers from heavy computation and intensive memory costs. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module to dynamically and gracefully drop these redundant patches. This novel framework is then extended to a hierarchical structure, where uncorrelated tokens at different stages are gradually removed, resulting in a considerable shrinkage of computational cost. We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4x speed-up for state-of-the-art models like DeiT and TimeSformer, by only sacrificing less than 0.7% accuracy. More importantly, contrary to other acceleration approaches, our method is inherently interpretable with substantial visual evidence, making vision transformer closer to a more human-understandable architecture while being lighter. We demonstrate that the interpretability that naturally emerged in our framework can outperform the raw attention learned by the original visual transformer, as well as those generated by off-the-shelf interpretation methods, with both qualitative and quantitative results. Project Page: http://people.csail.mit.edu/bpan/ia-red/.

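Motivation and Objectives

Reduce redundant computation in vision transformers to improve efficiency without sacrificing interpretability.
Propose an interpretable module that dynamically discards uninformative input patches based on their per-input importance.
Extend IA-RED$^2$ to a hierarchical framework that prunes tokens at multiple transformer stages.
Demonstrate model-agnostic applicability across image and video tasks and across different backbones.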

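Proposed Method

Introduce a multi-head interpreter that assigns each patch token a score indicating how informative it is.
Drop tokens whose scores fall at or below a threshold, shortening the input sequence before the MSA/FFN blocks.
Train the interpreter on top of a pretrained ViT in a hierarchical, curriculum-based manner, using REINFORCE with a reward that balances accuracy and efficiency.
Aggregate the interpretability signals across layers to produce patch-level heatmaps (visual evidence).
Compare against baselines (random, MemNet, raw attention) and data-dependent sparse transformers in terms of speed, accuracy, and interpretability metrics.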
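
The scoring, dropping, and REINFORCE steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the averaging of per-head scores, the 0.5 threshold, and the scalar `reward` argument are all hypothetical choices made for this example. How the reward trades off accuracy against efficiency is left to the caller.

```python
import math


def interpreter_scores(head_scores):
    """Combine per-head scores into one keep-probability per token.

    head_scores: list of heads, each a list of per-token scores in [0, 1].
    Averaging across heads is an assumption made for this sketch.
    """
    num_heads = len(head_scores)
    num_tokens = len(head_scores[0])
    return [sum(head[i] for head in head_scores) / num_heads
            for i in range(num_tokens)]


def drop_tokens(tokens, scores, threshold=0.5):
    """Keep tokens whose score is strictly above `threshold`.

    Tokens at or below the threshold are dropped, so the blocks that
    follow see a shorter sequence. Returns (kept_indices, kept_tokens).
    """
    kept = [(i, t) for i, (t, s) in enumerate(zip(tokens, scores))
            if s > threshold]
    return [i for i, _ in kept], [t for _, t in kept]


def reinforce_loss(scores, actions, reward):
    """REINFORCE surrogate loss for sampled keep (1) / drop (0) actions.

    Treats each score as a Bernoulli keep-probability and returns
    -reward * log pi(actions). `reward` is a hypothetical scalar.
    """
    eps = 1e-8  # guards against log(0)
    log_prob = sum(math.log(s + eps) if a else math.log(1.0 - s + eps)
                   for s, a in zip(scores, actions))
    return -reward * log_prob


# Two hypothetical heads scoring four patch tokens.
head_scores = [[1.0, 0.25, 0.75, 0.0],
               [0.5, 0.25, 0.75, 0.5]]
scores = interpreter_scores(head_scores)
kept_indices, kept_tokens = drop_tokens(["p0", "p1", "p2", "p3"], scores)
print(kept_indices)  # [0, 2]
```

In a real model the scores would come from a learned module and the loss would be backpropagated through it. Plain lists stand in for tensors here to keep the control flow visible.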

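Experimental Results

Research Questions

RQ1: How much redundancy in a vision transformer can be safely removed per input without hurting accuracy?
RQ2: Can interpretability emerge as a by-product of efficiency-driven token pruning?
RQ3: Does the IA-RED$^2$ framework generalize across image and video tasks and across different transformer backbones?
RQ4: What is the trade-off between speed-up and accuracy under hierarchical, input-dependent pruning?
RQ5: How does IA-RED$^2$ compare with existing interpretability methods on standard vision benchmarks?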

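Key Findings

Up to 1.4x speed-up on image recognition with DeiT, with less than 0.7% accuracy loss.
Up to 4x acceleration on video action recognition with TimeSformer, with accuracy largely preserved.
IA-RED$^2$ produces interpretable heatmaps that can outperform raw attention and GradCAM on weakly supervised segmentation on ImageNet-Seg (pixel accuracy 70.36, mAcc 64.86, mIoU 49.42).
In ablations, the 3-group IA-RED$^2$ (D=3) offers a favorable accuracy-speed trade-off on ImageNet-1K (79.1% top-1).
Combined with weight pruning, it reaches a 1.7x speed-up with a 1.7% accuracy drop, without fine-tuning.
Data-level redundancy reduction is complementary to model-level pruning, and combining the two yields additive gains.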
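
To see why dropping tokens translates into speed-ups of this order, it helps to estimate the compute of a transformer block as a function of sequence length. The sketch below uses the standard approximation of multiply-accumulates (MACs) per block, $4ND^2 + 2N^2D$ for MSA plus $8ND^2$ for an FFN with expansion ratio 4, where $N$ is the number of tokens and $D$ the embedding width. The function names and the token schedule in the usage example are hypothetical and not taken from the paper. The result is a theoretical compute ratio, not a measured wall-clock speed-up.

```python
def block_macs(num_tokens, dim, mlp_ratio=4):
    """Approximate multiply-accumulates for one transformer block."""
    # MSA: Q, K, V, and output projections (4 * N * D^2)
    # plus attention scores and the weighted sum of values (2 * N^2 * D).
    msa = 4 * num_tokens * dim ** 2 + 2 * num_tokens ** 2 * dim
    # FFN: two linear layers with hidden width mlp_ratio * D.
    ffn = 2 * mlp_ratio * num_tokens * dim ** 2
    return msa + ffn


def theoretical_speedup(tokens_per_block, full_tokens, dim):
    """Ratio of dense compute to pruned compute over all blocks.

    tokens_per_block: number of tokens entering each block after pruning.
    full_tokens: number of tokens per block in the unpruned model.
    """
    dense = sum(block_macs(full_tokens, dim) for _ in tokens_per_block)
    pruned = sum(block_macs(n, dim) for n in tokens_per_block)
    return dense / pruned


# Hypothetical 12-block, DeiT-S-like setting (197 tokens, width 384)
# with three groups of blocks that see progressively fewer tokens.
schedule = [197] * 4 + [140] * 4 + [100] * 4
print(round(theoretical_speedup(schedule, 197, 384), 2))
```

Because the FFN and projection terms scale linearly in $N$ and only the attention term scales quadratically, the gain depends on how early tokens are dropped as well as on how many.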

This review was generated by AI and checked by a human editor.