QUICK REVIEW

[Paper Review] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Erfan Shayegani, Md Abdullah Al Mamun|arXiv (Cornell University)|Oct 16, 2023

Adversarial Robustness in Machine Learning | Citations: 35

One-line summary

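This survey examines adversarial attacks on large language models and classifies them by learning structure, attack type, threat model, and defense. It focuses on literature from 2023–2024 and covers both closed-source and open-source models.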

ABSTRACT

Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of "jailbreak" attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

Motivation and objectives

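Motivate and frame the security concerns that arise as increasingly capable LLMs are integrated into complex systems.
Classify adversarial attack papers by learning structure (unimodal, multimodal, augmented, federated learning, multi-agent).
Characterize attack types, threat models, and end-to-end attack goals to guide robust design.
Summarize defenses and open resources to support researchers who are new to LLM security.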

Proposed approach

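A systematic literature review of adversarial attack research on LLMs from the perspectives of Natural Language Processing and Security.
Provides a structured typology and taxonomy of adversarial attack concepts (learning structure, injection source, attack type, attacker access, goal).
Integrates findings across jailbreaks, prompt injection, and multimodal and complex-system attacks.
Relates safety alignment to the ways attacks exploit its weaknesses, and discusses defenses at the text-based, multimodal, and federated learning levels.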
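The taxonomy dimensions listed above (learning structure, injection source, attack type, attacker access, goal) can be pictured as a small record type. The following is a minimal sketch for illustration only and is not taken from the survey: the class names, the white-box/black-box access values, the injection-source strings, and the placeholder entries are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class LearningStructure(Enum):
    # The five learning structures named in this review.
    UNIMODAL = "unimodal"
    MULTIMODAL = "multimodal"
    AUGMENTED = "augmented"
    FEDERATED = "federated learning"
    MULTI_AGENT = "multi-agent"


class AttackType(Enum):
    # Jailbreak and prompt injection are named in this review;
    # OTHER is a catch-all added for this sketch.
    JAILBREAK = "jailbreak"
    PROMPT_INJECTION = "prompt injection"
    OTHER = "other"


class AttackerAccess(Enum):
    # Assumption: a common white-box / black-box split. The review
    # only says "attacker access" and does not list its values.
    WHITE_BOX = "white-box"
    BLACK_BOX = "black-box"


@dataclass(frozen=True)
class AttackRecord:
    """One surveyed attack, tagged along the five taxonomy dimensions."""
    label: str
    structure: LearningStructure
    injection_source: str  # free text; values below are hypothetical
    attack_type: AttackType
    access: AttackerAccess
    goal: str              # free text; values below are hypothetical


def count_by_structure(records):
    """Count records per learning structure."""
    return Counter(r.structure for r in records)


# Placeholder entries, not real papers.
EXAMPLE_RECORDS = [
    AttackRecord("example A", LearningStructure.UNIMODAL, "user prompt",
                 AttackType.JAILBREAK, AttackerAccess.BLACK_BOX,
                 "bypass safety alignment"),
    AttackRecord("example B", LearningStructure.UNIMODAL, "external content",
                 AttackType.PROMPT_INJECTION, AttackerAccess.BLACK_BOX,
                 "hijack instructions"),
    AttackRecord("example C", LearningStructure.MULTIMODAL, "image input",
                 AttackType.JAILBREAK, AttackerAccess.WHITE_BOX,
                 "bypass safety alignment"),
]

for structure, n in count_by_structure(EXAMPLE_RECORDS).items():
    print(f"{structure.value}: {n}")
```

Tagging each paper along independent dimensions, rather than placing it in a single category, is what lets the same collection be regrouped by learning structure, by attacker access, or by goal.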

Results

Research questions

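RQ1: What are the main classes of adversarial attack that affect LLMs across different learning structures?
RQ2: How do attack modalities differ between unimodal LLMs, multimodal LLMs, and emerging system architectures?
RQ3: Which threat models and defense strategies have been proposed, and where do gaps remain?
RQ4: What resources and frameworks are available to support researchers studying LLM vulnerabilities?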

Key findings

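Jailbreaks and prompt injection are the central unimodal attack categories and have led adversarial research on LLMs from its early stages onward.
Adversarial attack research is organized around learning structures: unimodal LLMs, multimodal LLMs, augmented LLMs, federated learning, and multi-agent LLMs.
A classification that combines attacker access, injection source, attack type, and attack goal provides the threat-model framework used to study LLM vulnerabilities.
The survey links weaknesses in safety alignment to practical attack surfaces and discusses defense strategies for text, multimodal, and federated learning settings.
It highlights the progression from hand-written jailbreak prompts to automated, scalable attack generation and the study of defenses.
Resources and presentations (e.g., ACL'24 materials) are provided to help new researchers enter this interdisciplinary area.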

This review was generated by AI and checked by a human editor.