QUICK REVIEW

[Paper Review] Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks

Erfan Shayegani, Md Abdullah Al Mamun | arXiv (Cornell University) | October 16, 2023

Adversarial Robustness in Machine Learning | Citations: 35

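One-Line Summary

This survey categorizes adversarial attacks on large language models, focusing on learning structures, attack types, threat models, and defenses; it covers literature from 2023–2024 and addresses both closed and open-source models.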

ABSTRACT

Large Language Models (LLMs) are swiftly advancing in architecture and capability, and as they integrate more deeply into complex systems, the urgency to scrutinize their security properties grows. This paper surveys research in the emerging interdisciplinary field of adversarial attacks on LLMs, a subfield of trustworthy ML, combining the perspectives of Natural Language Processing and Security. Prior work has shown that even safety-aligned LLMs (via instruction tuning and reinforcement learning through human feedback) can be susceptible to adversarial attacks, which exploit weaknesses and mislead AI systems, as evidenced by the prevalence of 'jailbreak' attacks on models like ChatGPT and Bard. In this survey, we first provide an overview of large language models, describe their safety alignment, and categorize existing research based on various learning structures: textual-only attacks, multi-modal attacks, and additional attack methods specifically targeting complex systems, such as federated learning or multi-agent systems. We also offer comprehensive remarks on works that focus on the fundamental sources of vulnerabilities and potential defenses. To make this field more accessible to newcomers, we present a systematic review of existing works, a structured typology of adversarial attack concepts, and additional resources, including slides for presentations on related topics at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL'24).

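Research Motivation and Goals

Motivate and frame the security concerns that arise as increasingly capable LLMs are integrated into complex systems.
Categorize the adversarial attack literature by learning structure (unimodal, multimodal, augmented, federated, multi-agent).
Characterize attack types, threat models, and end-to-end attack goals to inform guidance for robust design.
Summarize defenses and public resources to help researchers who are new to LLM security.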

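Proposed Method

A systematic literature review of adversarial attack research on LLMs, from the perspectives of natural language processing (NLP) and security.
Provides a structured typology and taxonomy of adversarial attack concepts (learning structure, injection source, attack type, attacker access, and goal).
Synthesizes findings across jailbreaking, prompt injection, and attacks on multimodal and complex systems.
Relates safety alignment to the ways attacks exploit alignment weaknesses, and discusses defenses at the text, multimodal, and federated learning levels.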
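
The five taxonomy dimensions listed above (learning structure, injection source, attack type, attacker access, goal) can be pictured as fields of a single record. The following is a minimal sketch for illustration only, not the survey's own schema: the class names, the `white-box`/`black-box` access levels, the free-text `injection_source` and `goal` fields, and the three placeholder entries are all assumptions introduced here. Only the learning-structure and attack-type labels come from the review text.

```python
from dataclasses import dataclass
from enum import Enum


class LearningStructure(Enum):
    # Labels taken from the review's list of learning structures.
    UNIMODAL = "unimodal"
    MULTIMODAL = "multimodal"
    AUGMENTED = "augmented"
    FEDERATED = "federated"
    MULTI_AGENT = "multi-agent"


class AttackType(Enum):
    # The two attack categories the review names explicitly.
    JAILBREAK = "jailbreak"
    PROMPT_INJECTION = "prompt injection"


class AttackerAccess(Enum):
    # Assumption: common security terminology, not quoted from the review.
    WHITE_BOX = "white-box"
    BLACK_BOX = "black-box"


@dataclass(frozen=True)
class AttackRecord:
    """One surveyed work, described along the five taxonomy dimensions."""
    title: str
    structure: LearningStructure
    injection_source: str  # assumption: free text, e.g. "user prompt"
    attack_type: AttackType
    access: AttackerAccess
    goal: str              # assumption: free-text description of the goal


def group_by_structure(records):
    """Group record titles by learning structure, preserving input order."""
    groups = {}
    for record in records:
        groups.setdefault(record.structure, []).append(record.title)
    return groups


# Hypothetical placeholder entries; they do not refer to real papers.
EXAMPLES = [
    AttackRecord("example-A", LearningStructure.UNIMODAL, "user prompt",
                 AttackType.JAILBREAK, AttackerAccess.BLACK_BOX,
                 "bypass safety alignment"),
    AttackRecord("example-B", LearningStructure.UNIMODAL, "external content",
                 AttackType.PROMPT_INJECTION, AttackerAccess.BLACK_BOX,
                 "override instructions"),
    AttackRecord("example-C", LearningStructure.MULTIMODAL, "image input",
                 AttackType.JAILBREAK, AttackerAccess.WHITE_BOX,
                 "bypass safety alignment"),
]

if __name__ == "__main__":
    for structure, titles in group_by_structure(EXAMPLES).items():
        print(structure.value, titles)
```

Grouping by `structure` mirrors how the review says the literature is organized. Filtering on any of the other four fields would give a different slice of the same threat-model space.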

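Results

Research Questions

RQ1: Which major classes of adversarial attacks affect LLMs across different learning structures?
RQ2: How do attack patterns differ between unimodal LLMs, multimodal LLMs, and emerging system architectures?
RQ3: What threat models and defense strategies have been proposed, and where do gaps remain?
RQ4: What resources and frameworks help researchers study LLM vulnerabilities?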

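Key Findings

Jailbreaking and prompt injection are the core unimodal attack categories that have driven both early and ongoing adversarial research.
Adversarial attack research is organized by learning structure, covering unimodal LLMs, multimodal LLMs, and emerging systems such as augmented, federated, and multi-agent LLMs.
A taxonomy combining attacker access, injection source, attack type, and attack goal shapes the threat models used in LLM vulnerability research.
The survey links weaknesses in safety alignment to practical attack surfaces and discusses defense strategies for text, multimodal, and federated learning settings.
It highlights the progression from manually crafted jailbreak prompts to automated, scalable attack generation, along with the corresponding defense considerations.
Resources and presentation materials (e.g., the ACL'24 materials) are provided to help new researchers enter this interdisciplinary field.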

This review was generated by AI and reviewed by a human editor.