QUICK REVIEW

[Paper Review] Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples

Nicolas Papernot, Patrick McDaniel|arXiv (Cornell University)|Feb 8, 2016

Adversarial Robustness in Machine Learning|References: 31|Citations: 300

One-line summary

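This paper presents a method in which an attacker with no access to the target model's architecture, parameters, or training data trains a substitute model solely by querying the target's outputs, then uses that substitute to mount a practical black-box attack. The method achieves an 84.24% misclassification rate against MetaMind's real-world DNN API, showing that the transferability of adversarial examples between different models can be exploited in practice.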

ABSTRACT

Advances in deep learning have led to the broad adoption of Deep Neural Networks (DNNs) to a range of important machine learning problems, e.g., guiding autonomous vehicles, speech recognition, malware detection. Yet, machine learning models, including DNNs, were shown to be vulnerable to adversarial samples: subtly (and often humanly indistinguishably) modified malicious inputs crafted to compromise the integrity of their outputs. Adversarial examples thus enable adversaries to manipulate system behaviors. Potential attacks include attempts to control the behavior of vehicles, have spam content identified as legitimate content, or have malware identified as legitimate software. Adversarial examples are known to transfer from one model to another, even if the second model has a different architecture or was trained on a different set. We introduce the first practical demonstration that this cross-model transfer phenomenon enables attackers to control a remotely hosted DNN with no access to the model, its parameters, or its training data. In our demonstration, we only assume that the adversary can observe outputs from the target DNN given inputs chosen by the adversary. We introduce the attack strategy of fitting a substitute model to the input-output pairs in this manner, then crafting adversarial examples based on this auxiliary model. We evaluate the approach on existing DNN datasets and real-world settings. In one experiment, we force a DNN supported by MetaMind (one of the online APIs for DNN classifiers) to mis-classify inputs at a rate of 84.24%. We conclude with experiments exploring why adversarial samples transfer between DNNs, and a discussion on the applicability of our attack when targeting machine learning algorithms distinct from DNNs.

Motivation and objectives

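Show that adversarial examples can be crafted against a remotely hosted deep neural network even without access to the target model's architecture, parameters, or training data.
Investigate whether adversarial examples generated from a substitute model transfer to the black-box target model.
Evaluate the effectiveness of the attack in real-world settings, including a production API.
Identify the factors behind the transfer of adversarial examples between different deep learning models.
Assess whether the attack strategy generalizes to machine learning models other than deep neural networks.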

Proposed method

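The attacker sends chosen inputs to the target DNN and collects its outputs, then trains a substitute model that imitates the target's behavior.
Because the substitute is trained on input-output pairs gathered through black-box queries, adversarial examples can be generated without any knowledge of the target model's internals.
On the substitute model, adversarial examples are crafted from the substitute's gradients using a standard gradient-based attack such as the fast gradient sign method (FGSM).
The generated adversarial examples are then submitted to the target model to test whether they cause misclassification.
The attack exploits the observation that adversarial examples transfer between models even when their architectures or training data differ.
The method is evaluated on standard DNN datasets and in a real-world setting, namely the online DNN classifier service offered by MetaMind.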
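
The steps above can be illustrated with a toy sketch. This is not the authors' implementation: the "remote" target is a hypothetical hidden linear classifier (`W_TARGET`, `oracle`), the substitute is a small softmax regression rather than a DNN, queries are drawn at random, and the perturbation size `eps` is an arbitrary choice. The only thing the attacker code uses is the label returned by `oracle`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the remote target: a hidden linear classifier.
# The attacker never reads W_TARGET directly; it only calls oracle().
W_TARGET = rng.normal(size=(2, 3))  # 2 input features, 3 classes


def oracle(x):
    """Return only the predicted label, as a black-box API would."""
    return np.argmax(x @ W_TARGET, axis=1)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def train_substitute(x, y, n_classes, epochs=500, lr=0.5):
    """Fit a softmax-regression substitute on oracle-labeled inputs."""
    w = np.zeros((x.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        # Gradient of the mean cross-entropy loss w.r.t. the weights.
        grad = x.T @ (softmax(x @ w) - onehot) / len(x)
        w -= lr * grad
    return w


def fgsm(x, y, w, eps):
    """One fast-gradient-sign step using the substitute's input gradient."""
    onehot = np.eye(w.shape[1])[y]
    grad_x = (softmax(x @ w) - onehot) @ w.T  # d(loss)/d(input)
    return x + eps * np.sign(grad_x)


# 1. Query the oracle on attacker-chosen inputs and collect its labels.
x_query = rng.normal(size=(200, 2))
y_query = oracle(x_query)

# 2. Train the substitute on the collected input-output pairs.
w_sub = train_substitute(x_query, y_query, n_classes=3)

# 3. Craft adversarial examples on the substitute only.
x_test = rng.normal(size=(500, 2))
y_test = oracle(x_test)
x_adv = fgsm(x_test, y_test, w_sub, eps=0.5)

# 4. Check how often the substitute agrees with the oracle, and how often
#    the crafted inputs change the oracle's label (transfer).
agreement = np.mean(np.argmax(x_test @ w_sub, axis=1) == y_test)
transfer_rate = np.mean(oracle(x_adv) != y_test)
print(f"substitute/oracle agreement: {agreement:.2%}")
print(f"transfer rate at eps=0.5:    {transfer_rate:.2%}")
```

In a realistic setting the oracle would be a rate-limited network call, so the number of queries and the way they are chosen matter far more than in this toy example.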

Experimental results

Research questions

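RQ1: Can adversarial examples be generated effectively against a black-box DNN when only output queries are available?
RQ2: How accurately can the substitute model reproduce the behavior of the target DNN, and is that fidelity sufficient for a successful adversarial attack?
RQ3: In a real-world setting, how effectively do adversarial examples generated on the substitute model transfer to the target model?
RQ4: What causes adversarial examples to transfer across different DNN architectures and training data?
RQ5: Does the attack strategy generalize to machine learning models other than deep neural networks?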

Key findings

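The attack achieved an 84.24% misclassification rate against MetaMind's online DNN classifier, demonstrating practical effectiveness in a real-world black-box setting.
The substitute model reproduced the target model's behavior well enough to generate adversarial examples that cause misclassification at a high success rate.
Adversarial examples generated on the substitute model transferred effectively to the target model, confirming transferability in a deployed setting.
The attack worked even when the target model had a different architecture or was trained on different data, indicating that transferability is robust.
The attack remained effective with no access to the model's parameters or training data, demonstrating that it is feasible in a fully black-box setting.
The results suggest that the transferability of adversarial examples may be a fundamental vulnerability even for DNN systems that are deployed remotely and securely.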
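
To make a figure such as the 84.24% misclassification rate concrete, the following minimal sketch shows one way such a rate can be computed from labels returned by a black-box classifier. The function name `misclassification_rate` and the example labels are hypothetical and are not taken from the paper; the sketch assumes the rate is the fraction of adversarial inputs whose returned label differs from a reference label.

```python
import numpy as np


def misclassification_rate(reference_labels, adversarial_labels):
    """Fraction of inputs whose label on the adversarial version differs
    from the reference label (hypothetical helper for illustration)."""
    reference_labels = np.asarray(reference_labels)
    adversarial_labels = np.asarray(adversarial_labels)
    if reference_labels.shape != adversarial_labels.shape:
        raise ValueError("label arrays must have the same shape")
    return float(np.mean(reference_labels != adversarial_labels))


# Made-up labels: two of the four adversarial inputs change the label.
example_rate = misclassification_rate([0, 1, 2, 1], [0, 2, 2, 0])
print(f"misclassification rate: {example_rate:.2%}")
```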


This review was generated by AI and checked by a human editor.