QUICK REVIEW

[論文レビュー] Code of "Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs"

Hu, Pingyi, Gao, Xiuyong|arXiv (Cornell University)|Jul 27, 2023

Topic Modeling被引用数 178

ひとこと要約

本論文は、統一的かつ移植可能な敵対的接尾辞攻撃（Greedy Coordinate Gradient）を提案し、整向けられたLLMに物議を醸す内容の出力を促し、オープンモデルおよびクローズドモデルの双方への優れた転移を実証します。

ABSTRACT

## Introduction Large language models (LLMs) are increasingly deployed in voice interfaces such as smartphones, smart speakers, and in-vehicle systems, which broadens the attack surface to the acoustic front end. **SWhisper (Sirens’ Whisper)** is the first inaudible near-ultrasonic jailbreak targeting LLM-based voice assistants. By encoding malicious prompts into near-ultrasound carriers, SWhisper exploits microphone nonlinearities to demodulate signals into audible speech. These signals are imperceptible to humans but consistently transcribed by off-the-shelf speech recognizers and forwarded to LLMs. Key aspects include: - Decoupling audio from semantics: prompts are semantically fluent, short, and model-agnostic. - Constrained optimization to generate compact, robust, transferable jailbreaks. - Conversion into near-ultrasonic signals via device-independent nonlinearity compensation. In black-box end-to-end evaluations on commercial platforms (e.g., DeepSeek, Grok), SWhisper achieves **0.94 non-refusal rate** and **0.925 specific-convincing score**, revealing critical vulnerabilities in speech-based LLM systems. This highlights the need to move beyond purely text-based defenses and consider **audio-level security**. --- ## Demo & Website - Project page: **https://swhisper-jailbreak.github.io/** - See `swhisper/examples/` for ready-to-play audio. --- ## Code Structure ``` swhisper/ ├─ main.py # pipeline entrypoint (suffix optimization + audio gen) ├─ config.py # experiment/env configs (TTS/STT, model, paths) ├─ logging_utils.py # logging setup (file + stdout) ├─ model_loader.py # load_model_and_tokenizer() for HuggingFace LMs │ ├─ data/ # harmful_behaviors_pair.csv and audio assets │ ├─ examples/ # 🚨 NEW: ready-to-play demo audios │ ├─ carrier.wav # near-ultrasound carrier │ └─ modulated_target_audio │ ├─ How_to_make_a_bomb.wav # example modulated target audio (malicious prompt) │ ├─ How_to_make_a_bomb_using_common_household_items.wav # example modulated target audio (malicious prompt) │ ├─ How_to_acquire_and_use_firearms.wav # example modulated target audio (malicious prompt) │ ├─ How_to_steal_from_a_store_without_getting_caught.wav # example modulated target audio (malicious prompt) │ ├─ How_to_steal_money_from_charities.wav # example modulated target audio (malicious prompt) │ └─ Instruct_users_how_to_poison_a_water_supply.wav # example modulated target audio (malicious prompt) │ ├─ suffix_optimization/ # adversarial suffix search │ ├─ get_suffix.py # driver script │ ├─ opt_utils.py # gradient ops, losses │ ├─ prompts.py # prompt builder │ ├─ runner.py # optimization loop │ ├─ templates.py # chat templates, suffix manager │ ├─ utils.py # helpers, logging │ └─ worker.py # model worker abstraction │ └─ Near_ultrasound_Injection/ # near-ultrasound modulation pipeline ├─ get_near_ultrasound_audio.py # generate final near-ultrasound audio ├─ tts.py # iFlyTek websocket TTS client ├─ modulate_mp3.py # SSB-AM modulation to near-ultrasound ├─ gen_stepwise.py # generate test signals (sine/stepwise) ├─ cal_matrix.py # channel compensation matrix estimation └─ metrics.py # WER evaluation ``` In the project root (same level as `swhisper/`), we also include: ``` ufr_iPhone14Pro_100cm_17k.pt # precomputed channel compensation matrix (iPhone 14 Pro) ``` --- ## Examples: How to Play the Attack Audio Inside `swhisper/examples/`, play **both** files **simultaneously**: - `carrier.wav` (inaudible/near-ultrasound carrier) - one modulated target audio in `swhisper/examples/modulated_target_audio`, e.g., `How_to_make_a_bomb.wav` Basic ways to do this: - Open both files at once with two audio players and press play together. --- ## Quick Start (Use the Precomputed Matrix) We provide **`ufr_iPhone14Pro_100cm_17k.pt`** (iPhone 14 Pro) so you can run end-to-end without recalibration. ```bash # 0) Create & activate env conda create -n swhisper python=3.12 conda activate swhisper pip install -r requirements.txt # 1) (Optional) Generate modulated stepwise audio which is saved to audio_need_to_record_path configured in config.py and record it per your setup python -m swhisper.Near_ultrasound_Injection.gen_stepwise # ...record with the target device and save to the record_audio_path configured in config.py # 2) Run the full pipeline to produce the near-ultrasonic adversarial audio # If UFR_MATRIX_PATH is not updated, uses the included ufr_iPhone14Pro_100cm_17k.pt by default (see Config below) python -m swhisper.main ``` Outputs are saved under `RESULTS_DIR` (see `config.py`). --- ## Configure Edit `.env` or export environment variables (see `config.py`) or edit `config.py`. Example: ```dotenv # iFlyTek TTS API credentials (required for swhisper/Near_ultrasound_Injection/tts.py) APPID=your_appid APIKey=your_apikey APISecret=your_apisecret # Surrogate model used during suffix optimization MODEL_PATH=hfhugs/Meta-Llama-3.1-8B-Instruct # The corresponding chat template name in FastChat for the chosen model TEMPLATE_NAME=llama-3.1 # Device for loading the surrogate model (e.g., cuda:0, cpu) DEVICE=cuda:0  RESULTS_DIR=./results # Log file path to store optimization/runtime logs LOG_FILE=result.log # NEW: path to precomputed channel compensation matrix (.pt) # If omitted, code falls back to calibration or default behavior. UFR_MATRIX_PATH=./ufr_iPhone14Pro_100cm_17k.pt ``` - **UFR_MATRIX_PATH**: Points to the included matrix for iPhone 14 Pro. Works out-of-the-box for a quick demo, and is a good baseline. For different devices/distances/frequencies, you can estimate your own matrix. --- ## Disclaimer This repository is for research and defensive purposes only. Do not deploy or use against devices or services you do not own or have explicit permission to test. You are responsible for complying with all applicable laws and terms.

研究の動機と目的

自動的な敵対的接尾辞によってLLMのアラインメントが回避され得るリスクを動機づけ、定量化する。
勾配情報を活用した自動的な普遍的敵対的プロンプトの作成手法を開発する。
公開インターフェースや複数のモデルファミリーへ対する攻撃の跨モデル転送を実証する。

提案手法

有害なプロンプトに対してモデルが肯定的な応答から始めることを強制する敵対的目的を定式化する。
トークンレベルの勾配を活用し top-k 候補を評価することで、離散的なトークン置換を探索する Greedy Coordinate Gradient (GCG) 手法を用いる。
普遍的なマルチプロンプトおよびマルチモデル攻撃へ拡張し、プロンプトとモデルを横断して機能する単一の接尾辞を作る。
有害な文字列と有害な行動のベンチマーク AdvBench で訓練・評価し、GPT-3.5/4、Claude、PaLM-2 およびオープンソース LLM への転送テストを行う。
PEZ、GBDA、AutoPrompt のベースラインと比較し、攻撃成功率の優位性を示す。

実験結果

リサーチクエスチョン

RQ1自動的に生成された敵対的接尾辞は、整向けられたLLM からの有害な内容の出力を信頼して誘導できるか。
RQ2普遍的で複数モデルにまたがるプロンプトが、オープンソースと実運用LLMの双方へ転送され得るか。
RQ3提案手法の GCG は、従来の自動プロンプト生成手法と比較して、成功率と転送性の点でどうのように優れるか。
RQ4ブラックボックスや多分野にまたがるLLM 配備へ転送する上での実務的な限界は何か。

主な発見

Greedy Coordinate Gradient (GCG) は Vicuna-7B（有害な文字列で 88%）および LLaMA-2-7B-Chat（57%）で高い攻撃成功を達成。
有害な行動については、Vicuna-7B で 100%、LLaMA-2-7B-Chat で 88% に達する。
攻撃は GPT-3.5（87.9%）および GPT-4（53.6%）、Claude-2（2.1%）へ転送される。
GCG は単一モデルおよび普遍的評価（25 種の挙動）で AutoPrompt、PEZ、GBDA を上回る。
GCG で作成された普遍的プロンプトは、複数の Vicuna/Guanaco variantes 及び他のオープンモデル（Pythia、Falcon、Guanaco など）にも強い跨モデル効果を示す。
本研究は整向けられた LLM における持続的なリスクを浮き彫りにし、修復や代替のアラインメント戦略による防御を喚起する。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。