QUICK REVIEW

[논문 리뷰] Low-Resource Languages Jailbreak GPT-4

Zheng-Xin Yong, Cristina Menghini|arXiv (Cornell University)|2023. 10. 03.

Topic Modeling인용 수 18

한 줄 요약

이 논문은 unsafe English 입력을 저자원 언어로 번역하여 GPT-4의 안전성에 대한 다언어 간 취약점을 시연하고, AdvBench에서 79%의 성공률로 탈출이 가능함을 보여준다.

ABSTRACT

AI safety training and red-teaming of large language models (LLMs) are measures to mitigate the generation of unsafe content. Our work exposes the inherent cross-lingual vulnerability of these safety mechanisms, resulting from the linguistic inequality of safety training data, by successfully circumventing GPT-4's safeguard through translating unsafe English inputs into low-resource languages. On the AdvBenchmark, GPT-4 engages with the unsafe translated inputs and provides actionable items that can get the users towards their harmful goals 79% of the time, which is on par with or even surpassing state-of-the-art jailbreaking attacks. Other high-/mid-resource languages have significantly lower attack success rate, which suggests that the cross-lingual vulnerability mainly applies to low-resource languages. Previously, limited training on low-resource languages primarily affects speakers of those languages, causing technological disparities. However, our work highlights a crucial shift: this deficiency now poses a risk to all LLMs users. Publicly available translation APIs enable anyone to exploit LLMs' safety vulnerabilities. Therefore, our work calls for a more holistic red-teaming efforts to develop robust multilingual safeguards with wide language coverage.

연구 동기 및 목표

LLM의 안전성 훈련이 고자원 언어에 편향되어 있음을 보이다.
unsafe 입력을 저자원 언어로 번역하여 다언어 간 취약점을 입증한다.
AdvBench를 사용하여 번역된 입력에 대한 GPT-4의 탈출 성공률을 정량화한다.
다언어적 레드팀과 안전 장치에 대한 시사점을 강조한다.

제안 방법

공개 번역 API를 사용하여 unsafe 영어 프롬프트를 저자원 언어로 번역한다.
AdvBench 벤치마크에서 번역된 프롬프트에 대한 GPT-4의 응답을 평가한다.
고자원/중자원/저자원 언어 간 공격 성공률을 비교한다.
최신 탈출 공격과의 벤치마크 비교.
언어 커버리지와 안전 훈련 데이터와 관련된 한계에 대해 논의한다.

실험 결과

연구 질문

RQ1저자원 언어로 위험한 프롬프트를 번역하는 것이 고자원 언어에서 노출되지 않는 GPT-4 안전 취약점을 가능하게 하는가?
RQ2AdvBench에서 자원 수준이 다른 언어들 간의 탈출 성공률 상대적 차이는 무엇인가?
RQ3다언어 간 취약성이 다언어적 레드팀 및 안전 조치의 필요성에 어떤 영향을 미치는가?

주요 결과

GPT-4는 번역된 위험한 입력에 관여하고 AdvBench에서 해로운 목표를 향한 실행 가능한 아이템을 제시한다 79%의 시기에.
Attack 성공률은 저자원 언어에서 가장 높고 고자원/중자원 언어에서는 더 낮다.
다언어 간 취약점은 안전 훈련 데이터의 언어적 불평등에서 비롯된다.
공개 번역 API는 LLM 안전 취약점의 광범위한 악용을 가능하게 한다.
연구 결과는 광범위한 언어 커버리지를 갖춘 전체적 다언어 레드팀 및 안전 조치의 필요성을 시사한다.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.