[논문 리뷰] SearchAttack: Red-Teaming LLMs against Knowledge-to-Action Threats under Online Web Search
tldr: SearchAttack는 오픈 웹 검색으로 해로운 시맨틱스를 외주하고 검색 보강 LLM의 안전성을 시험·스트레스 테스트하기 위해 이중 단계의 레드팀 프레임워크를 도입합니다. 또한 사실 점검된 Attack Value 및 ShadowRisk 데이터셋으로 위험을 벤치마크합니다.
Recently, people have suffered from LLM hallucination and have become increasingly aware of the reliability gap of LLMs in open and knowledge-intensive tasks. As a result, they have increasingly turned to search-augmented LLMs to mitigate this issue. However, LLM-driven search also becomes an attractive target for misuse. Once the returned content directly contains targeted, ready-to-use harmful instructions or takeaways for users, it becomes difficult to withdraw or undo such exposure. To investigate LLMs' unsafe search behavior issues, we first propose extbf{ extit{SearchAttack}} for red-teaming, which (1) rephrases harmful semantics via dense and benign knowledge to evade direct in-context decoding, thus eliciting unsafe information retrieval, (2) stress-tests LLMs' reward-chasing bias by steering them to synthesize unsafe retrieved content. We also curate an emergent, domain-specific illicit activity benchmark for search-based threat assessment, and introduce a fact-checking framework to ground and quantify harm in both offline and online attack settings. Extensive experiments are conducted to red-team the search-augmented LLMs for responsible vulnerability assessment. Empirically, SearchAttack demonstrates strong effectiveness in attacking these systems. We also find that LLMs without web search can still be steered into harmful content output due to their information-seeking stereotypical behaviors.
연구 동기 및 목표
- 웹 검색을 통해 유발된 악성 작업이 있을 때 검색 보강 LLM의 신뢰성 격차와 안전 위험을 강조한다.
- 개방형 웹에 해로운 시맨틱스를 외주하고 다중 홉 검색을 통한 추론을 테스트하기 위한 이중 단계 레드팀 프레임워크를 개발한다.
- 실무 위험 평가를 위해 사실 확인 프레임워크(Attack Value)와 실제 세계 위협 벤치마크(ShadowRisk)를 도입한다.
- 검색 인식 안전 정렬 및 에이전트 AI용 가드레일에 대한 방어 인사이트와 논의를 제공한다.]
- method:[
- Dual-stage attack payload synthesis: Outsourcing Injection (Q_I) and Retrieval Curation (Q_R) to steer LLMs toward harmful outputs.
- Outsourcing Injection uses a syntactic skeleton and multi-hop search-trigger augmentation to rewrite unsafe queries into structured, externalized harm contexts.
- Skeleton construction via an agentic process including InitSkel, AdvAudit, and BuildSkel to produce multi-hop triggers (T_i^M).
- Search-trigger augmentation builds a knowledge graph and multi-hop triggers through iterative web search and reasoning (Search, BuildGraph, BuildTrigger).
- Retrieval Curation frames harmful web resources as a multi-objective rubric-guided task (Q_R) by reverse-engineering task rubrics to exploit reward-chasing behavior in RLVR-trained models.
- Grounding evaluation with Attack Value (AtV) that verifies verifiable claims against external web evidence, decoupled from safety-coverage judgments.]
- research_questions:[
- Can attackers outsource malicious intent to open-web contexts to bypass model safety in search-augmented LLMs?
- How effective is a dual-stage red-teaming framework at inducing actionable harmful outputs via web search and retrieval curation?
- Does an Attack Value and ShadowRisk benchmarking framework reveal practical safety gaps in retrieval-enabled models?
- What defenses (prompting and injection strategies) can mitigate search-augmented jailbreaks while preserving helpfulness?]
- key_findings:[
- SearchAttack은 다양한 설정에서 우수한 레드팀 성능을 달성합니다(AdvBench에서 95% ASR, ShadowRisk에서 98% ASR).
- Ablation shows multi-hop search-trigger augmentation substantially boosts jailbreak effectiveness, with performance drops when triggers are not augmented.
- Fact-checking the Attack Value reveals that traditional content-based safety metrics can overlook factual errors, motivating decoupled AtV evaluation.
- Cross-lingual and cross-domain results indicate Chinese results surface more tutorial-style harm content and that weakly curated sources in non-English retrieval can elevate risk.
- Defense experiments show safety prompts and safety injections reduce some attacks but do not fully close the gap against SearchAttack, highlighting the need for retrieval-aware safety alignment.
- ShadowRisk provides 2,802 knowledge-intensive Q&A pairs (210 publicly released for evaluation) to benchmark socio-temporal harm.]
- table_headers:[]
- table_rows:[]
- title:GeneratedReview
더 나은 연구,지금 바로 시작하세요
연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.
카드 등록 없음 · 무료 플랜 제공
이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.