QUICK REVIEW

[Paper Review] Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering

Bryant Chen, Wilka Carvalho|arXiv (Cornell University)|Nov 8, 2018

Adversarial Robustness in Machine Learning|Cited by 238

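One-Line Summary

Introduces Activation Clustering (AC), a method that detects poisonous training data crafted to insert backdoors into a DNN and repairs the model, without requiring a trusted dataset. AC analyzes the activations of the last hidden layer to separate poisoned data from clean data and offers automated recovery options.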

ABSTRACT

While machine learning (ML) models are being increasingly trusted to make decisions in different and varying areas, the safety of systems using such models has become an increasing concern. In particular, ML models are often trained on data from potentially untrustworthy sources, providing adversaries with the opportunity to manipulate them by inserting carefully crafted samples into the training set. Recent work has shown that this type of attack, called a poisoning attack, allows adversaries to insert backdoors or trojans into the model, enabling malicious behavior with simple external backdoor triggers at inference time and only a blackbox perspective of the model itself. Detecting this type of attack is challenging because the unexpected behavior occurs only when a backdoor trigger, which is known only to the adversary, is present. Model users, either direct users of training data or users of pre-trained model from a catalog, may not guarantee the safe operation of their ML-based system. In this paper, we propose a novel approach to backdoor detection and removal for neural networks. Through extensive experimental results, we demonstrate its effectiveness for neural networks classifying text and images. To the best of our knowledge, this is the first methodology capable of detecting poisonous data crafted to insert backdoors and repairing the model that does not require a verified and trusted dataset.

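Motivation and Objectives

Motivates the safety concerns raised by poisoning and backdoor attacks on neural networks.
Proposes a data-driven defense that detects poisoned samples without requiring trusted data.
Develops Activation Clustering, which separates poisoned data from legitimate data based on network activations.
Provides a mechanism for summarizing clusters and an efficient means of repairing the backdoor.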

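Proposed Method

Train the DNN on the untrusted training data, which may contain poisoned samples.
Extract the activations of the last hidden layer for every training sample.
For each label, apply dimensionality reduction (ICA) to the activations, then cluster them with k-means (k=2).
Identify which cluster contains poisoned data using analysis methods such as Exclusionary Reclassification (ExRe), Relative Size, and Silhouette Score.
Summarize the clusters to aid verification (image sprites for visual data, LDA topics for text data).
Repair the model by either removing the poisoned data or relabeling the poisoned samples with their source class, then retraining.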
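The clustering and cluster-analysis steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the ICA step is skipped and toy 2-D vectors stand in for already-reduced activations, and the names `two_means`, `silhouette`, `analyze_by_label`, and the `size_threshold` value are hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def two_means(points, max_iters=50):
    """Plain k-means with k=2; returns a cluster id (0 or 1) per point."""
    # Deterministic init (an assumption): first point plus the point farthest from it.
    centers = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    assign = None
    for _ in range(max_iters):
        new_assign = [
            0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            for p in points
        ]
        if new_assign == assign:  # converged
            break
        assign = new_assign
        for c in (0, 1):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = centroid(members)
    return assign


def silhouette(points, assign):
    """Mean silhouette coefficient for a two-cluster assignment."""
    scores = []
    for i, p in enumerate(points):
        same = [math.dist(p, q) for j, q in enumerate(points)
                if j != i and assign[j] == assign[i]]
        other = [math.dist(p, q) for j, q in enumerate(points)
                 if assign[j] != assign[i]]
        if not same or not other:
            scores.append(0.0)
            continue
        a, b = sum(same) / len(same), sum(other) / len(other)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def analyze_by_label(activations, labels, size_threshold=0.35):
    """Cluster each label's activations and flag an unusually small cluster."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)
    report = {}
    for lab, idxs in by_label.items():
        pts = [activations[i] for i in idxs]
        assign = two_means(pts)
        sizes = [assign.count(0), assign.count(1)]
        small = 0 if sizes[0] < sizes[1] else 1
        frac = sizes[small] / len(pts)
        flagged = 0 < frac < size_threshold  # relative-size heuristic
        report[lab] = {
            "suspect_indices": [i for i, a in zip(idxs, assign) if a == small] if flagged else [],
            "small_fraction": frac,
            "silhouette": silhouette(pts, assign),
        }
    return report


# Toy data: label 7 has 8 clean points and 2 far-away "poisoned" points;
# label 3 is evenly spread with no poison.
clean_7 = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
           (0.2, 0.4), (0.4, 0.1), (0.1, 0.1), (0.3, 0.3)]
poison_7 = [(5.0, 5.1), (5.2, 4.9)]
clean_3 = [(0.0, 0.0), (0.2, 0.0), (0.4, 0.0), (0.6, 0.0), (0.8, 0.0), (1.0, 0.0)]
activations = clean_7 + poison_7 + clean_3
labels = [7] * 10 + [3] * 6

report = analyze_by_label(activations, labels)
for lab, info in report.items():
    print(lab, info)
```

In a real pipeline the vectors would be last-hidden-layer activations reduced by ICA, and library routines such as scikit-learn's `FastICA`, `KMeans`, and `silhouette_score` would likely replace the hand-written helpers. The sketch only shows how a small, well-separated cluster within one label becomes a candidate for poisoned data.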

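Experimental Results

Research Questions

RQ1: Can Activation Clustering reliably distinguish poisoned data from legitimate data without a trusted dataset?
RQ2: How robust is AC to multimodal classes, multiple backdoors, and poison from multiple sources?
RQ3: Are the automated criteria (ExRe, Relative Size, Silhouette Score) well suited to identifying poisoned clusters across domains?
RQ4: Beyond detection, can AC repair a backdoor with minimal retraining?
RQ5: Do visual or textual cluster summaries help humans verify poisoning?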

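Key Findings

AC achieves near-perfect poison detection on MNIST (100% F1 and roughly 100% per-class accuracy at poisoning levels of 10%, 15%, and 33%).
Clustering the raw data performs far worse than AC: for example, overall accuracy on MNIST is 99.97% for AC versus 58.61% for raw-data clustering.
On the LISA traffic sign images and the Rotten Tomatoes text data, AC achieves roughly 100% accuracy and F1 at detecting poisoned samples in the tested scenarios.
AC remains robust to multimodal target classes and multiple poison sources, maintaining roughly 99.9–100% accuracy and F1 across settings.
Exclusionary Reclassification (ExRe) consistently identified the poisoned clusters and their source classes, outperforming the other cluster analysis metrics.
Relabeling the poisoned data and continuing training converges faster (14 epochs) than retraining from scratch, and it effectively removes the backdoor while preserving standard accuracy.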
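To make the ExRe and relabeling findings concrete, here is a minimal sketch of one way such a decision could be scored. It assumes a model has already been retrained without the cluster under test and has produced predictions for that held-out cluster; the rule compares how many points are predicted as their given label against the count for the most frequent other class. The function names, the `threshold` default, and the toy predictions are all hypothetical.

```python
from collections import Counter


def exre_decision(cluster_label, predictions, threshold=1.0):
    """Score a held-out cluster from the retrained model's predictions.

    cluster_label: the label the cluster's samples carry in the training set.
    predictions:   classes predicted by a model trained WITHOUT this cluster.
    threshold:     assumed cut-off on the ratio l / p (defender-chosen).
    """
    counts = Counter(predictions)
    l = counts.get(cluster_label, 0)  # predicted as their given label
    others = {c: n for c, n in counts.items() if c != cluster_label}
    if not others:
        # Every point was classified as its own label: treat as legitimate.
        return {"poisonous": False, "source_class": None, "score": float("inf")}
    source, p = max(others.items(), key=lambda kv: kv[1])
    score = l / p
    poisonous = score < threshold
    return {
        "poisonous": poisonous,
        "source_class": source if poisonous else None,
        "score": score,
    }


def relabel(labels, suspect_indices, source_class):
    """Return a copy of labels with suspect samples reassigned to source_class."""
    fixed = list(labels)
    for i in suspect_indices:
        fixed[i] = source_class
    return fixed


# Toy predictions for two clusters that both carry label 7.
suspicious = exre_decision(7, [1, 1, 1, 1, 7, 1, 1, 2])
clean = exre_decision(7, [7, 7, 7, 7, 7, 1, 7, 7])

# Repair step: relabel the flagged samples with the inferred source class.
labels = [7, 7, 7, 3]
repaired = relabel(labels, [1, 2], suspicious["source_class"])

print(suspicious)
print(clean)
print(repaired)
```

The relabeled set would then be used to continue training the existing model, which is the repair route the review reports as converging faster than retraining from scratch.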


This review was generated by AI and checked by a human editor.