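[Paper Review] TABOR: A Highly Accurate Approach to Inspecting and Restoring Trojan Backdoors in AI Systems
TABOR reformulates trojan backdoor detection as a non-convex optimization problem, adding new regularization terms and a fidelity-focused trigger-restoration metric. Across varied trigger settings, it improves on Neural Cleanse in both detection and trigger restoration.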
A trojan backdoor is a hidden pattern typically implanted in a deep neural network. It is activated, forcing the infected model to behave abnormally, only when an input sample containing a particular trigger is fed to the model. As such, given a deep neural network model and clean input samples, it is very challenging to inspect the model and determine whether a trojan backdoor exists. Recently, researchers have designed and developed several pioneering solutions to address this acute problem, and have demonstrated that the proposed techniques have great potential in trojan detection. However, we show that none of these existing techniques completely addresses the problem. On the one hand, they mostly work under an unrealistic assumption (e.g., assuming availability of the contaminated training database). On the other hand, the proposed techniques can neither accurately detect the existence of trojan backdoors nor restore high-fidelity trojan backdoor images, especially when the triggers pertaining to the trojan vary in size, shape and position. In this work, we propose TABOR, a new trojan detection technique. Conceptually, it formalizes a trojan detection task as a non-convex optimization problem, and the detection of a trojan backdoor as the task of resolving the optimization through an objective function. Unlike the existing technique that also models trojan detection as an optimization problem, TABOR designs a new objective function, under the guidance of explainable AI techniques as well as heuristics, that can guide the optimization to identify a trojan backdoor more effectively. In addition, TABOR defines a new metric to measure the quality of an identified trojan backdoor. Using an anomaly detection method, we show that the new metric better helps TABOR identify intentionally injected triggers in an infected model and filter out false alarms......
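Research Motivation and Objectives
- Motivate robust trojan backdoor detection that requires neither the training data nor knowledge of the model's internals.
- Develop an optimization-based detection framework that uses regularization to reduce false positives.
- Propose a trigger-restoration metric and technique for accurately recovering trojan triggers.
- Evaluate TABOR against existing methods across different models and backdoor configurations.
- Demonstrate TABOR's robustness to variations in the trojan injection technique and in model complexity.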
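Proposed Method
- Formulate trojan detection as a non-convex optimization problem over a mask M and a trigger pattern Delta.
- Introduce four regularization terms that penalize overly large and scattered triggers and suppress blocking triggers, overlaying triggers, and irrelevant features.
- Design regularizers R1 and R2 to shrink the adversarial subspace and encourage compact, contiguous triggers.
- Add regularizer R3 to avoid blocking key image features, so that classification remains correct once the trigger region is removed.
- Introduce regularizer R4, inspired by explainable AI, which uses feature-importance insights to improve trigger fidelity.
- Solve the optimization with a customized procedure guided by observations about false positives and trigger overlaying.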
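The mask-and-trigger formulation and the first two kinds of penalty can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes a single-channel image, the blending rule x' = (1 - M) * x + M * Delta familiar from Neural Cleanse, an elastic-net penalty for trigger size, and a squared-difference penalty for scatter. The function names `stamp`, `elastic_net`, and `smoothness` are hypothetical, and the model-dependent terms (R3, R4) are omitted because they need a classifier.

```python
import numpy as np

def stamp(x, mask, delta):
    """Blend trigger `delta` into image `x` wherever `mask` is close to 1."""
    return (1.0 - mask) * x + mask * delta

def elastic_net(v):
    """Size penalty (assumed form): L1 norm plus squared L2 norm."""
    return float(np.abs(v).sum() + np.square(v).sum())

def smoothness(m):
    """Scatter penalty (assumed form): squared differences between
    vertically and horizontally adjacent pixels."""
    dh = np.square(m[1:, :] - m[:-1, :]).sum()
    dw = np.square(m[:, 1:] - m[:, :-1]).sum()
    return float(dh + dw)

# Two 4x4 masks that both switch on exactly four pixels.
compact = np.zeros((4, 4))
compact[0:2, 0:2] = 1.0              # one contiguous 2x2 patch
scattered = np.zeros((4, 4))
scattered[0::2, 0::2] = 1.0          # four isolated pixels

# Stamping an all-ones trigger onto a black image changes only masked pixels.
stamped = stamp(np.zeros((4, 4)), compact, np.ones((4, 4)))
```

Both masks receive the same size penalty, but the scattered mask receives a larger smoothness penalty, which is the sense in which such a term favors compact, contiguous triggers.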
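Experimental Results
Research Questions
- RQ1: Can TABOR reliably detect the presence of a trojan backdoor without training data or knowledge of the model's internals?
- RQ2: Can TABOR accurately restore high-fidelity trojan triggers across different trigger shapes, sizes, and positions?
- RQ3: How does TABOR's performance differ from that of Neural Cleanse across diverse trojan configurations and model complexities?
- RQ4: Does the regularization-guided objective reduce false positives and improve trigger fidelity on both infected and clean models?
Key Findings
- TABOR achieves better detection performance and trigger-restoration fidelity than the state-of-the-art Neural Cleanse.
- The regularization terms shrink the adversarial subspace and suppress false positives caused by scattered or overly large triggers.
- A blocking-trigger regularizer eliminates triggers that occlude key information.
- An overlaying-trigger regularizer helps extract a higher-fidelity representation of the intended trojan trigger.
- A regularizer inspired by explainable AI refines the restored trigger by pruning irrelevant features.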
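The abstract mentions that an anomaly detection method is applied to the quality metric to filter out false alarms, but this summary does not say which one. As a hedged illustration only, the sketch below uses median absolute deviation (MAD) outlier detection, the approach Neural Cleanse is known for. The function name `flag_suspicious_labels`, the threshold of 2.0, the example scores, and the convention that a lower score is more suspicious are all assumptions.

```python
import numpy as np

def flag_suspicious_labels(scores, threshold=2.0):
    """Flag labels whose score is an unusually LOW outlier under MAD.

    `scores` holds one quality score per candidate target label
    (assumption: lower means more trigger-like).
    """
    scores = np.asarray(scores, dtype=float)
    median = np.median(scores)
    # 1.4826 makes MAD a consistent estimate of the standard
    # deviation when the scores are normally distributed.
    mad = 1.4826 * np.median(np.abs(scores - median))
    if mad == 0.0:
        return np.zeros(scores.shape, dtype=bool)
    anomaly_index = (median - scores) / mad
    return anomaly_index > threshold

# Hypothetical scores for six labels; the last one is far below the rest.
flags = flag_suspicious_labels([10.0, 11.0, 9.0, 10.0, 12.0, 1.0])
```

Only the last label is flagged here; a clean model would ideally produce no flags at all.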
This review was generated by AI and checked by a human editor.