QUICK REVIEW

[Paper Review] A View on Vulnerabilites: The Security Challenges of XAI (Academic Track)

Moustafa Alzantot, Yash Sharma|arXiv (Cornell University)|Apr 21, 2018

Adversarial Robustness in Machine Learning|References: 23|Citations: 141

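SUMMARY IN BRIEF

This paper proposes a black-box, population-based genetic algorithm that generates semantically and syntactically similar adversarial examples in natural language. The attack achieves a 97% success rate on sentiment analysis and 70% on textual entailment. The perturbations largely preserve what a human reader perceives: 20 human annotators classified 92.3% of the successful sentiment analysis adversarial examples to their original label. Adversarial training did not improve robustness, which underscores the strength and diversity of the attack.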

ABSTRACT

Deep neural networks (DNNs) are vulnerable to adversarial examples, perturbations to correctly classified examples which can cause the model to misclassify. In the image domain, these perturbations are often virtually indistinguishable to human perception, causing humans and state-of-the-art models to disagree. However, in the natural language domain, small perturbations are clearly perceptible, and the replacement of a single word can drastically alter the semantics of the document. Given these challenges, we use a black-box population-based optimization algorithm to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively. We additionally demonstrate that 92.3% of the successful sentiment analysis adversarial examples are classified to their original label by 20 human annotators, and that the examples are perceptibly quite similar. Finally, we discuss an attempt to use adversarial training as a defense, but fail to yield improvement, demonstrating the strength and diversity of our adversarial examples. We hope our findings encourage researchers to pursue improving the robustness of DNNs in the natural language domain.

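MOTIVATION AND OBJECTIVES

To address the challenge of generating adversarial examples in the natural language domain, where, unlike in images, perturbations are discrete, word-level, and perceptible to humans.
To develop a black-box attack that does not rely on gradients, so that it can be applied even to opaque models.
To ensure that adversarial examples stay semantically and syntactically consistent with the original text, so that humans still interpret them the same way.
To evaluate model robustness against such attacks, in particular by testing adversarial training as a defense.
To show that examples humans judge to be similar to the originals can still reliably fool well-trained models.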

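PROPOSED METHOD

Uses a genetic algorithm that performs black-box, gradient-free optimization to generate adversarial examples.
Employs a Perturb subroutine that selects synonym replacements using GloVe embeddings post-processed with counter-fitting.
Constrains semantic and syntactic similarity through embedding proximity and context-aware filtering.
Applies crossover and mutation operations to evolve a population of candidate adversarial sentences toward higher attack success.
Caps the fraction of words that may be modified (20% for IMDB, 25% for SNLI) to control the size of the perturbation.
Verifies attack success using the model's predictions together with human evaluation of sentiment and similarity.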
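
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `SYNONYMS` is a hypothetical hand-written table standing in for nearest neighbours in the counter-fitted embedding space, `positive_prob` is a toy scorer standing in for the black-box model, and the context-aware filtering step is omitted. All function names, the population size, and the generation count are illustrative choices.

```python
import math
import random

# ASSUMPTION: hand-written synonym table standing in for nearest neighbours
# in a counter-fitted GloVe embedding space.
SYNONYMS = {
    "terrible": ["awful", "poor", "dreadful"],
    "boring": ["dull", "slow", "tedious"],
    "movie": ["film", "picture"],
}

# ASSUMPTION: toy black-box model. The attack only sees its output probability.
NEGATIVE_WEIGHTS = {"terrible": 2.0, "awful": 1.5, "dreadful": 1.8, "poor": 0.3,
                    "boring": 1.5, "dull": 1.2, "tedious": 1.4, "slow": 0.2}

def positive_prob(words):
    """Return P(positive) for a token list; stands in for one model query."""
    score = sum(NEGATIVE_WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(score - 1.0))

def n_changed(words, original):
    """Count positions where the candidate differs from the original."""
    return sum(a != b for a, b in zip(words, original))

def perturb(words, original, score_fn, rng, max_changes):
    """Pick one position at random and substitute the best-scoring synonym."""
    changed = {i for i, (a, b) in enumerate(zip(words, original)) if a != b}
    positions = [i for i, w in enumerate(original) if w in SYNONYMS]
    if len(changed) >= max_changes:
        # Budget exhausted: only positions that are already modified may change.
        positions = [i for i in positions if i in changed]
    if not positions:
        return list(words)
    i = rng.choice(positions)
    options = [words[:i] + [s] + words[i + 1:] for s in SYNONYMS[original[i]]]
    return max(options, key=score_fn)

def crossover(parent1, parent2, rng):
    """Build a child by taking each word from one of the two parents."""
    return [rng.choice(pair) for pair in zip(parent1, parent2)]

def genetic_attack(original, score_fn, pop_size=8, generations=20,
                   max_changes=2, seed=0):
    """Evolve candidates until the target-label probability exceeds 0.5."""
    rng = random.Random(seed)
    population = [perturb(original, original, score_fn, rng, max_changes)
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [score_fn(p) for p in population]
        best = population[fitness.index(max(fitness))]
        if max(fitness) > 0.5:
            return best
        children = [best]  # elitism: the best member always survives
        while len(children) < pop_size:
            # Parents are sampled in proportion to their fitness.
            p1, p2 = rng.choices(population, weights=fitness, k=2)
            child = crossover(p1, p2, rng)
            if n_changed(child, original) > max_changes:
                child = p1  # crossover exceeded the word budget; keep a parent
            children.append(perturb(child, original, score_fn, rng, max_changes))
        population = children
    best = max(population, key=score_fn)
    return best if score_fn(best) > 0.5 else None

original = "this movie was terrible and boring".split()
adversarial = genetic_attack(original, positive_prob)
print(" ".join(adversarial) if adversarial else "attack failed")
```

The attack never touches gradients: every decision is driven by `score_fn` outputs alone, which is what makes the approach applicable in a black-box setting. The `max_changes` argument plays the role of the modification cap, expressed here as an absolute count rather than a percentage for simplicity.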

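EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Given that word-level perturbations are discrete and perceptible to humans, can effective adversarial examples still be generated in the natural language domain?
RQ2: Can a gradient-free, population-based optimization method generate adversarial examples effectively under a black-box attack model?
RQ3: How similar are the generated adversarial examples to the original text in terms of human perception and semantic consistency?
RQ4: Does adversarial training improve the robustness of NLP models against such attacks?
RQ5: Do human annotators perceive the adversarial examples as equivalent to the originals in both sentiment and meaning?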

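KEY FINDINGS

On the IMDB sentiment analysis task, the attack achieved a 97% success rate, flipping predictions with only a small number of word changes.
On the SNLI textual entailment task, it achieved a 70% success rate, showing that it remains effective even on short hypothesis sentences.
In an evaluation with 20 annotators, 92.3% of the adversarial examples were assigned the same sentiment as the original text, indicating that the perceived sentiment was largely preserved.
The mean similarity rating between original and adversarial examples was 2.23 on a four-point scale, indicating a slight perceptible difference.
Adversarial training did not improve robustness: a model trained with adversarial examples remained vulnerable to the same attack on the test set.
The genetic algorithm substantially outperformed the greedy baseline in both success rate and the number of word changes required.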
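
For readers who want to reproduce figures of this kind, the following is a rough sketch of how an attack success rate and a mean fraction of modified words could be computed from a list of attack outcomes. The function name `attack_metrics` and the toy token lists are hypothetical and invented purely for illustration; they are not data from the paper, and the paper's exact metric definitions may differ.

```python
def attack_metrics(pairs):
    """Summarise attack outcomes.

    pairs: list of (original_tokens, adversarial_tokens), where
    adversarial_tokens is None when the attack failed.
    Returns (success_rate, mean_fraction_modified over successful attacks).
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    successes = [(o, a) for o, a in pairs if a is not None]
    success_rate = len(successes) / len(pairs)
    fractions = [sum(x != y for x, y in zip(o, a)) / len(o)
                 for o, a in successes]
    mean_modified = sum(fractions) / len(fractions) if fractions else 0.0
    return success_rate, mean_modified

# Toy outcomes (invented): two successful attacks and two failures.
toy_pairs = [
    ("a truly awful film".split(), "a truly poor film".split()),
    ("the plot was very dull".split(), "the plot was very slow".split()),
    ("not worth the ticket".split(), None),
    ("a waste of time".split(), None),
]
rate, modified = attack_metrics(toy_pairs)
print(f"success rate: {rate:.2f}, mean fraction modified: {modified:.3f}")
```

Reporting both numbers together matters because a high success rate obtained by rewriting most of the text would not count as a semantically similar adversarial example.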

This review was generated by AI and checked by a human editor.