QUICK REVIEW

[Paper Review] Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks

Yige Li, Xixiang Lyu|arXiv (Cornell University)|Jan 15, 2021

Adversarial Robustness in Machine Learning|References: 53|Citations: 139

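One-Line Summary

The paper introduces Neural Attention Distillation (NAD), a distillation-based defense in which a teacher network guides the fine-tuning of a backdoored student network so that the student's attention maps match the teacher's. Using only 5% of the clean training data, NAD effectively erases backdoor triggers across multiple attacks while preserving clean accuracy.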

ABSTRACT

Deep neural networks (DNNs) are known vulnerable to backdoor attacks, a training time attack that injects a trigger pattern into a small proportion of training data so as to control the model's prediction at the test time. Backdoor attacks are notably dangerous since they do not affect the model's performance on clean examples, yet can fool the model to make incorrect prediction whenever the trigger pattern appears during testing. In this paper, we propose a novel defense framework Neural Attention Distillation (NAD) to erase backdoor triggers from backdoored DNNs. NAD utilizes a teacher network to guide the finetuning of the backdoored student network on a small clean subset of data such that the intermediate-layer attention of the student network aligns with that of the teacher network. The teacher network can be obtained by an independent finetuning process on the same clean subset. We empirically show, against 6 state-of-the-art backdoor attacks, NAD can effectively erase the backdoor triggers using only 5% clean training data without causing obvious performance degradation on clean examples. Code is available in https://github.com/bboylyg/NAD.

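Motivation and Objectives

Motivate a robust defense against backdoor attacks, in which a trigger hijacks predictions without affecting accuracy on clean inputs.
Develop a fine-tuning framework that uses knowledge distillation and neural attention transfer to erase backdoor triggers.
Show that NAD works against multiple types of backdoor attack with limited clean data while maintaining clean accuracy.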

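Proposed Method

Define an attention operator A that produces an attention map from a layer's activations.
Compute the NAD loss as the L2 distance between the normalized teacher and student attention maps, across the residual groups (Equation 2).
Obtain the teacher network by fine-tuning the backdoored model on a small subset of clean data.
Optimize the total loss L_total, which combines the student's cross-entropy loss with the NAD loss (Equation 3).
Use 5% of the clean training data and 10 training epochs when fine-tuning the backdoored student to form the teacher.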
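
The attention operator and the two losses listed above can be sketched in a few lines. This is a minimal NumPy illustration of the forward computation only, not the authors' implementation (their released code is at the URL in the abstract). The function names `attention_map`, `nad_loss`, and `total_loss`, the weighting factor `beta`, the stabilizer `eps`, and the random activations are all assumptions made for this example. The choice p=2 corresponds to the A_sum^2 variant mentioned in the findings.

```python
import numpy as np

def attention_map(activation, p=2):
    """Collapse a (C, H, W) activation into an (H, W) attention map
    by summing |F_c|^p over the channel axis (the A_sum^p operator)."""
    return np.sum(np.abs(activation) ** p, axis=0)

def nad_loss(teacher_act, student_act, p=2, eps=1e-12):
    """L2 distance between the L2-normalized, flattened attention maps
    of the teacher and the student at one layer."""
    a_t = attention_map(teacher_act, p).ravel()
    a_s = attention_map(student_act, p).ravel()
    a_t = a_t / (np.linalg.norm(a_t) + eps)  # eps is a hypothetical guard against division by zero
    a_s = a_s / (np.linalg.norm(a_s) + eps)
    return float(np.linalg.norm(a_t - a_s))

def total_loss(cross_entropy, teacher_acts, student_acts, beta=1.0):
    """Student cross-entropy plus beta times the NAD loss summed over layers.
    beta is an assumed weighting hyperparameter for this sketch."""
    distill = sum(nad_loss(t, s) for t, s in zip(teacher_acts, student_acts))
    return cross_entropy + beta * distill

# Toy usage: three "residual group" activations per network, random values.
rng = np.random.default_rng(0)
teacher_acts = [rng.normal(size=(8, 4, 4)) for _ in range(3)]
student_acts = [rng.normal(size=(8, 4, 4)) for _ in range(3)]
print(total_loss(0.7, teacher_acts, student_acts, beta=2.0))
```

Because both maps are normalized before the comparison, the loss ignores the overall scale of the activations and only penalizes differences in where the two networks place their attention. In an actual training loop the same computation would be written in an autodiff framework so that gradients flow to the student only.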

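Experimental Results

Research Questions

RQ1: Can NAD remove backdoor triggers across diverse backdoor attacks using minimal clean data?
RQ2: How does attention-based distillation compare with standard fine-tuning and other trigger-removal methods in terms of ASR and clean ACC?
RQ3: Which combination of attention representation and teacher-student configuration yields the best removal performance?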

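Key Findings

Across the 6 attacks, NAD lowers the average attack success rate (ASR) from about 99 to 100% down to about 7.2%, while limiting the drop in clean accuracy to about 2.7%.
Across attack types, NAD consistently outperforms standard fine-tuning, fine-tuning combined with pruning, and mode connectivity repair in erasing backdoors.
Attention-based distillation (especially A_sum^2) removes backdoors more strongly than direct feature distillation and separates backdoor regions from benign regions more clearly.
NAD converges quickly, within a few epochs, and remains effective under different teacher-student architecture choices and varying amounts of clean data.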
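
The findings above are stated in terms of ASR and clean accuracy. As a rough illustration of how such metrics are commonly computed, the sketch below assumes ASR is the fraction of triggered inputs from non-target classes that the model assigns to the attacker's target label; the review does not spell out the exact definition, so treat this as an assumption. The function names and the toy predictions are hypothetical.

```python
def clean_accuracy(predictions, labels):
    """Fraction of clean inputs classified correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def attack_success_rate(triggered_predictions, labels, target_label):
    """Fraction of triggered inputs, taken from non-target classes,
    that the model classifies as the attacker's target label."""
    hits = [
        p == target_label
        for p, y in zip(triggered_predictions, labels)
        if y != target_label  # samples already in the target class are skipped
    ]
    return sum(hits) / len(hits)

# Toy data: true labels, predictions on clean inputs, predictions with the trigger applied.
labels = [0, 1, 2, 1, 0, 2]
clean_preds = [0, 1, 2, 1, 0, 1]
triggered_preds = [0, 0, 0, 1, 0, 0]

print(clean_accuracy(clean_preds, labels))
print(attack_success_rate(triggered_preds, labels, target_label=0))
```

A successful defense drives the second number toward zero while keeping the first close to its value before the defense was applied.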


This review was generated by AI and checked by a human editor.