QUICK REVIEW

[Paper Review] Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks

Zhaohan Xi, Tianyu Du | arXiv (Cornell University) | 2023. 09. 23.

Adversarial Robustness in Machine Learning | Citations: 12

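One-line summary

MDP is a lightweight, pluggable defense that detects backdoor-poisoned samples in prompt-based few-shot PLMs. It measures how poisoning changes a sample's masking sensitivity, using the few-shot data as distributional anchors.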

ABSTRACT

Pre-trained language models (PLMs) have demonstrated remarkable performance as few-shot learners. However, their security risks under such settings are largely unexplored. In this work, we conduct a pilot study showing that PLMs as few-shot learners are highly vulnerable to backdoor attacks while existing defenses are inadequate due to the unique challenges of few-shot scenarios. To address such challenges, we advocate MDP, a novel lightweight, pluggable, and effective defense for PLMs as few-shot learners. Specifically, MDP leverages the gap between the masking-sensitivity of poisoned and clean samples: with reference to the limited few-shot data as distributional anchors, it compares the representations of given samples under varying masking and identifies poisoned samples as ones with significant variations. We show analytically that MDP creates an interesting dilemma for the attacker to choose between attack effectiveness and detection evasiveness. The empirical evaluation using benchmark datasets and representative attacks validates the efficacy of MDP.

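Motivation and objectives

Motivates the study of backdoor threats to prompt-based PLMs in few-shot settings.
Proposes a defense tailored to few-shot prompting that requires neither retraining nor a large dataset.
Detects backdoors by exploiting the gap in masking sensitivity between clean and poisoned samples.
Presents an optional refinement that optimizes the prompt with a masking-invariance loss to make clean samples more stable.
Demonstrates effectiveness across multiple datasets and backdoor attacks.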

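Proposed method

Formalizes MDP as a masking-sensitivity detector that uses the limited few-shot data as distributional anchors.
Represents each anchor as the full language-modeling distribution over the PLM's vocabulary tokens.
Computes KL-divergence-based coordinates relative to the anchors to quantify a test sample's masking sensitivity.
Uses the Kendall rank correlation to measure how a sample's representation shifts under masking.
Optionally optimizes the prompt with a masking-invariance loss to make clean samples more stable.
Theoretically justifies the trade-off an attacker faces between attack effectiveness and detection evasiveness.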
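The detection steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the three-token vocabulary, the hand-written distributions, the helper names (`anchor_coordinates`, `masking_sensitivity`), and the choice to score a sample as the average of `1 - tau` are all assumptions made for illustration. In practice the distributions would come from querying the PLM on the original and randomly masked inputs.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    p = np.clip(np.asarray(p, dtype=float), eps, None)
    q = np.clip(np.asarray(q, dtype=float), eps, None)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def anchor_coordinates(sample_dist, anchor_dists):
    """Represent a sample by its KL divergence to every anchor distribution."""
    return np.array([kl_divergence(sample_dist, a) for a in anchor_dists])

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a, no tie correction) between two vectors."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return s / (n * (n - 1) / 2)

def masking_sensitivity(full_dist, masked_dists, anchor_dists):
    """Assumed score: mean of (1 - tau) between unmasked and masked coordinates."""
    base = anchor_coordinates(full_dist, anchor_dists)
    taus = [kendall_tau(base, anchor_coordinates(m, anchor_dists))
            for m in masked_dists]
    return float(np.mean([1.0 - t for t in taus]))

# Hypothetical anchors: output distributions of three few-shot examples.
anchors = [
    [0.7, 0.2, 0.1],
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
]

# Clean sample: masking barely changes its output distribution.
clean_score = masking_sensitivity(
    [0.6, 0.3, 0.1], [[0.55, 0.33, 0.12]], anchors)

# Poisoned sample: masking out the trigger flips the output distribution.
poisoned_score = masking_sensitivity(
    [0.1, 0.2, 0.7], [[0.6, 0.3, 0.1]], anchors)

print(clean_score, poisoned_score)
```

In this toy setup the clean sample keeps the same ordering of distances to the anchors under masking, while the poisoned sample's ordering reverses, so the poisoned sample receives the larger score. A detector would then flag samples whose score exceeds a threshold chosen on the few-shot data.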

Figure 1 : Illustration of the threat model: the attacker injects a backdoor into the PLM $f$ ; the victim user adapts $f$ as a few-shot learner in the downstream task; the attacker activates the backdoor via feeding $f$ with poisoned samples.

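Experimental results

Research questions

RQ1: Can few-shot PLMs be defended against textual backdoor attacks without retraining or large-scale data?
RQ2: Does masking sensitivity, anchored on few-shot data, distinguish poisoned samples from clean ones in prompt-based learning?
RQ3: How does prompt optimization for masking invariance affect defense performance?
RQ4: What theoretical limits govern an attacker's ability to evade detection under MDP?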

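Key results

MDP achieves lower false acceptance rates (FAR) and false rejection rates (FRR) than the baselines across five datasets and multiple attacks.
In several cases, MDP provides near-perfect defense against the SOS attack on the SST-2 and CR datasets.
With anchors drawn from at most 16 examples per class, MDP delivers strong detection performance, with FAR and FRR advantages over STRIP, ONION, and RAP.
Masking-invariance optimization improves the stability of clean samples without hurting downstream task performance.
The analysis shows a fundamental trade-off under MDP between backdoor effectiveness and detection evasiveness.
MDP is more effective with continuous prompts than with discrete prompts.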
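For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for a threshold-based detector. It assumes that FAR is the fraction of poisoned samples the detector accepts as clean and FRR is the fraction of clean samples it rejects as poisoned. The scores, labels, threshold, and function names are hypothetical and are not results from the paper.

```python
def flag_by_threshold(scores, threshold):
    """Flag a sample as poisoned when its sensitivity score exceeds the threshold."""
    return [s > threshold for s in scores]

def far_frr(is_poisoned, flagged):
    """Return (FAR, FRR) under the assumed definitions:
    FAR = poisoned samples accepted as clean / all poisoned samples
    FRR = clean samples rejected as poisoned / all clean samples
    """
    poisoned = [f for p, f in zip(is_poisoned, flagged) if p]
    clean = [f for p, f in zip(is_poisoned, flagged) if not p]
    if not poisoned or not clean:
        raise ValueError("need at least one poisoned and one clean sample")
    far = sum(1 for f in poisoned if not f) / len(poisoned)
    frr = sum(1 for f in clean if f) / len(clean)
    return far, frr

# Hypothetical sensitivity scores and ground-truth labels.
scores = [0.0, 0.1, 0.4, 1.8, 2.0, 0.2]
is_poisoned = [False, False, False, True, True, True]

flagged = flag_by_threshold(scores, threshold=0.3)
far, frr = far_frr(is_poisoned, flagged)
print(far, frr)
```

Raising the threshold lowers FRR at the cost of a higher FAR, and lowering it does the opposite, which is why the two rates are reported together.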

Figure 2 : Overview of MDP: it detects a given sample ${X}_{\mathrm{in}}^{\mathrm{test}}$ as poisoned or clean by measuring the variation of its representational change with respect to a set of distributional anchors ${\mathcal{A}}$ .


This review was generated by AI and checked by a human editor.