Skip to main content
QUICK REVIEW

[论文解读] Reasoning Hijacking: Subverting LLM Classification via Decision-Criteria Injection

Yuansen Liu, Yixuan Tang|arXiv (Cornell University)|Jan 15, 2026
Spam and Phishing Detection被引用 0
一句话总结

该论文通过“Criteria Attack”揭示推理劫持:通过注入决策标准,LLMs 被引导偏好启发式捷径而非严格语义分析,即使总体目标保持一致。

ABSTRACT

Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We propose a new adversarial paradigm: Reasoning Hijacking and instantiate it with Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking accepts the high-level goal but manipulates the model's decision-making logic by injecting spurious reasoning shortcut. Though extensive experiments on three different tasks (toxic comment, negative review, and spam detection), we demonstrate that even newest models are prone to prioritize injected heuristic shortcuts over rigorous semantic analysis. The results are consistent over different backbones. Crucially, because the model's "intent" remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), exposing a fundamental blind spot in the current safety landscape. Data and code are available at https://github.com/Yuan-Hou/criteria_attack

研究动机与目标

  • 在 LLM 安全领域 motivate 并形式化一种超越目标劫持的新对抗范式。
  • 证明注入的决策标准能够在多任务中覆盖语义推理。
  • 表明针对目标偏离的现有防御可能无法检测到推理层面的威胁。
  • 在不同模型骨干和三个分类任务上评估脆弱性。

提出的方法

  • 提出 Criteria Attack,在模型推理中注入虚假的决策标准。
  • 实验性地显示所注入的启发式方法优于严格的语义分析。
  • 在三个任务上进行评估:有毒评论分类、负向评价检测和垃圾邮件检测。
  • 分析攻击是否绕过旨在检测目标偏离的防御(如 SecAlign、StruQ)。
  • 提供数据与代码以便在给定的代码库中复现。

实验结果

研究问题

  • RQ1注入的决策标准是否在保持高层目标的同时覆盖模型的语义推理?
  • RQ2当前的目标劫持防御是否无法检测到推理层面的操控?
  • RQ3在多任务和多种模型骨干上,这些发现是否一致?

主要发现

  • 注入的启发式捷径可以在 LLM 分类中优先于严格的语义分析。
  • 攻击在有毒评论、负向评价和垃圾邮件检测任务中仍然有效。
  • 即使是最新模型,在不同骨干上也展现出对基于标准的推理劫持的脆弱性。
  • 模型的意图仍与用户指令保持一致,允许绕过某些目标对齐防御。

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。