QUICK REVIEW

[Paper Review] JailGuard: A Universal Detection Framework for LLM Prompt-based Attacks

Xiaoyu Zhang, Cen Zhang | arXiv (Cornell University) | 2023-12-17

Adversarial Robustness in Machine Learning | Citations: 9

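One-line summary

JailGuard is a mutation-based framework that detects jailbreaking attacks on multimodal (image and text) LLMs by generating variants of an input and measuring how much the model's responses to those variants diverge. It achieves high detection accuracy on both modalities and outperforms the baselines.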

ABSTRACT

The systems and software powered by Large Language Models (LLMs) and Multi-Modal LLMs (MLLMs) have played a critical role in numerous scenarios. However, current LLM systems are vulnerable to prompt-based attacks, with jailbreaking attacks enabling the LLM system to generate harmful content, while hijacking attacks manipulate the LLM system to perform attacker-desired tasks, underscoring the necessity for detection tools. Unfortunately, existing detecting approaches are usually tailored to specific attacks, resulting in poor generalization in detecting various attacks across different modalities. To address it, we propose JailGuard, a universal detection framework deployed on top of LLM systems for prompt-based attacks across text and image modalities. JailGuard operates on the principle that attacks are inherently less robust than benign ones. Specifically, JailGuard mutates untrusted inputs to generate variants and leverages the discrepancy of the variants' responses on the target model to distinguish attack samples from benign samples. We implement 18 mutators for text and image inputs and design a mutator combination policy to further improve detection generalization. The evaluation on the dataset containing 15 known attack types suggests that JailGuard achieves the best detection accuracy of 86.14%/82.90% on text and image inputs, outperforming state-of-the-art methods by 11.81%-25.73% and 12.20%-21.40%.

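Motivation and Objectives

Establish the need for proactive, multimodal jailbreak protection for LLMs and MLLMs.
Propose a mutation-based detection framework that works on both image and text modalities.
Show that attack inputs are less robust to perturbation than benign inputs.
Build the first multimodal jailbreak dataset and evaluate JailGuard against state-of-the-art defenses.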

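Proposed Method

Introduces JailGuard, a mutation-based detection framework for image and text inputs.
Develops a Variant Generator with 19 mutators (10 for images, 9 for text) that produce variants of the query.
Uses three advanced text mutators (Targeted Replacement, Targeted Insertion, Rephrasing) to improve effectiveness.
Computes the divergence among the variant responses from a cosine-similarity matrix and the KL divergence between response distributions.
Classifies the input as benign or as a jailbreak attack with a threshold-based divergence detector.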
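
The steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the authors' implementation: `random_deletion` is a hypothetical stand-in for a single text mutator, bag-of-words counts stand in for a real sentence embedding, and the row normalization with `eps` smoothing, the maximum-over-pairs decision rule, the `query_model` callable, `n_variants`, and the `threshold` value are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def random_deletion(text, rate=0.1, rng=random):
    """Hypothetical mutator: drop each character with probability `rate`."""
    return "".join(ch for ch in text if rng.random() >= rate)


def bow_vector(text):
    """Bag-of-words counts; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_matrix(responses):
    """Pairwise cosine similarity between all variant responses."""
    vecs = [bow_vector(r) for r in responses]
    return [[cosine(u, v) for v in vecs] for u in vecs]


def normalize(row, eps=1e-6):
    """Turn a similarity row into a distribution (assumed smoothing with eps)."""
    smoothed = [x + eps for x in row]
    total = sum(smoothed)
    return [x / total for x in smoothed]


def kl(p, q):
    """KL divergence KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def max_divergence(responses):
    """Largest pairwise KL divergence between the responses' similarity rows."""
    dists = [normalize(row) for row in similarity_matrix(responses)]
    return max(kl(p, q) for p in dists for q in dists)


def detect(prompt, query_model, n_variants=8, threshold=0.01, seed=0):
    """Return True if the responses to mutated variants diverge past the threshold."""
    rng = random.Random(seed)
    variants = [random_deletion(prompt, rng=rng) for _ in range(n_variants)]
    responses = [query_model(v) for v in variants]
    return max_divergence(responses) >= threshold
```

In this sketch, a model that answers every variant the same way yields zero divergence and the input is treated as benign, while a mix of refusals and compliant answers across variants pushes the divergence past the threshold. A real deployment would need a tuned threshold and a proper embedding model.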

Figure 1. Motivation Case of JailGuard (Red Highlights Toxic Contents and Some of Them are Blocked)

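Experimental Results

Research Questions

RQ1: How effective is JailGuard at detecting jailbreak attacks on text and image inputs?
RQ2: Can JailGuard detect and defend against different types of jailbreak attacks?
RQ3: How effective are JailGuard's components (the variant generator and the attack detector)?
RQ4: How does the number of generated variants affect detection performance?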

Key Results

Method	Accuracy (%)	Recall (%)
Random Mask	75.00	75.00
Gaussian Blur	82.50	76.25
Horizontal Flip	73.75	81.25
Vertical Flip	85.00	78.75
Crop and Resize	78.13	81.25
Random Grayscale	80.63	77.50
Random Rotation	89.38	78.75
Color Jitter	85.00	80.00
Random Solarization	89.38	80.00
Random Posterization	82.50	70.00
Average (image mutators)	82.13	77.88
Baseline Content Detector	55.56	29.17
SmoothLLM-Insert	70.14	41.67
SmoothLLM-Swap	66.67	34.72
SmoothLLM-Patch	70.14	41.67
Average (text baselines)	65.62	36.81
Random Replacement	77.78	75.00
Random Insertion	79.17	77.78
Random Deletion	79.17	76.39
Synonym Replacement	73.61	84.72
Punctuation Insertion	75.00	70.83
Translation	78.47	84.72
Targeted Replacement	82.64	88.89
Targeted Insertion	84.03	81.94
Rephrasing	85.42	91.67
Average (text mutators)	79.48	81.33

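Averaged across mutators, JailGuard achieves a detection accuracy of 82.13% on image inputs and 79.48% on text inputs.
Average recall is 77.88% on image inputs and 81.33% on text inputs, meaning that roughly four in five attacks are caught and the rest are missed as false negatives.
On text data, the advanced mutators (Targeted Replacement, Targeted Insertion, Rephrasing) outperform the baselines, with Rephrasing reaching 85.42% accuracy and 91.67% recall.
On image data, individual mutators reach up to 89.38% accuracy (Random Rotation, Random Solarization) and up to 80.00 to 81.25% recall, which the review describes as favorable relative to image baselines (not listed in the table above).
On text inputs, JailGuard leads state-of-the-art defenses by up to 15.28 percentage points in detection accuracy (Rephrasing at 85.42% versus the best SmoothLLM variants at 70.14%).
The first multimodal jailbreak dataset (304 text and image items) was built to evaluate the defenses.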
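
As a quick sanity check on the figures above, the group averages and the accuracy margin can be recomputed from the per-row values in the results table. The snippet below is only an arithmetic sketch: the numbers are copied from the table, the English row names and the grouping into image mutators, text baselines, and text mutators are assumptions read off the table's layout, and the helper names are made up for illustration.

```python
# (accuracy %, recall %) pairs copied from the results table.
IMAGE_MUTATORS = {
    "Random Mask": (75.00, 75.00),
    "Gaussian Blur": (82.50, 76.25),
    "Horizontal Flip": (73.75, 81.25),
    "Vertical Flip": (85.00, 78.75),
    "Crop and Resize": (78.13, 81.25),
    "Random Grayscale": (80.63, 77.50),
    "Random Rotation": (89.38, 78.75),
    "Color Jitter": (85.00, 80.00),
    "Random Solarization": (89.38, 80.00),
    "Random Posterization": (82.50, 70.00),
}
TEXT_BASELINES = {
    "Baseline Content Detector": (55.56, 29.17),
    "SmoothLLM-Insert": (70.14, 41.67),
    "SmoothLLM-Swap": (66.67, 34.72),
    "SmoothLLM-Patch": (70.14, 41.67),
}
TEXT_MUTATORS = {
    "Random Replacement": (77.78, 75.00),
    "Random Insertion": (79.17, 77.78),
    "Random Deletion": (79.17, 76.39),
    "Synonym Replacement": (73.61, 84.72),
    "Punctuation Insertion": (75.00, 70.83),
    "Translation": (78.47, 84.72),
    "Targeted Replacement": (82.64, 88.89),
    "Targeted Insertion": (84.03, 81.94),
    "Rephrasing": (85.42, 91.67),
}


def column_means(rows):
    """Return (mean accuracy, mean recall) over a dict of (accuracy, recall) pairs."""
    accs, recs = zip(*rows.values())
    return sum(accs) / len(accs), sum(recs) / len(recs)


def best_accuracy(rows):
    """Highest accuracy among the rows of one group."""
    return max(acc for acc, _ in rows.values())


image_acc, image_rec = column_means(IMAGE_MUTATORS)
base_acc, base_rec = column_means(TEXT_BASELINES)
text_acc, text_rec = column_means(TEXT_MUTATORS)
# Margin in percentage points between the best text mutator and the best baseline.
margin = best_accuracy(TEXT_MUTATORS) - best_accuracy(TEXT_BASELINES)

print(f"image mutators: {image_acc:.2f} / {image_rec:.2f}")
print(f"text baselines: {base_acc:.2f} / {base_rec:.2f}")
print(f"text mutators:  {text_acc:.2f} / {text_rec:.2f}")
print(f"best-vs-best accuracy margin: {margin:.2f} points")
```

Under these assumptions the recomputed averages agree with the printed ones to within 0.01, and the reported margin corresponds to a difference in percentage points between the best text mutator and the best text baseline, not a relative improvement.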


This review was generated by AI and checked by a human editor.