QUICK REVIEW

[Paper Review] Backdoor Defense, Learnability and Obfuscation

Paul F. Christiano, Jacob Hilton|arXiv (Cornell University)|Sep 4, 2024

Adversarial Robustness in Machine Learning | Cited by 1

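One-Sentence Summary

This paper proposes a formal game-theoretic framework for defining backdoor defendability in which the attacker's strategy must succeed on a randomly chosen trigger, a constraint that makes defense possible and, for some function classes, strictly easier than learning. It shows that statistical defendability is determined by the VC dimension, whereas computational defendability separates from PAC learnability: polynomial-size decision trees can be defended faster than they can be learned, yet under cryptographic assumptions polynomial-size circuits cannot be defended efficiently.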

ABSTRACT

We introduce a formal notion of defendability against backdoors using a game between an attacker and a defender. In this game, the attacker modifies a function to behave differently on a particular input known as the "trigger", while behaving the same almost everywhere else. The defender then attempts to detect the trigger at evaluation time. If the defender succeeds with high enough probability, then the function class is said to be defendable. The key constraint on the attacker that makes defense possible is that the attacker's strategy must work for a randomly-chosen trigger. Our definition is simple and does not explicitly mention learning, yet we demonstrate that it is closely connected to learnability. In the computationally unbounded setting, we use a voting algorithm of Hanneke et al. (2022) to show that defendability is essentially determined by the VC dimension of the function class, in much the same way as PAC learnability. In the computationally bounded setting, we use a similar argument to show that efficient PAC learnability implies efficient defendability, but not conversely. On the other hand, we use indistinguishability obfuscation to show that the class of polynomial size circuits is not efficiently defendable. Finally, we present polynomial size decision trees as a natural example for which defense is strictly easier than learning. Thus, we identify efficient defendability as a notable intermediate concept in between efficient learnability and obfuscation.
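The game described in the abstract can be made concrete with a small toy. The following is a minimal sketch, assuming a finite input domain with the uniform distribution and a tolerance `eps` on how often the backdoored function may differ from the original. The names `disagreement`, `is_valid_backdoor` and `play_round`, and the threshold functions used as the example, are hypothetical illustrations rather than the paper's notation. In the actual game the trigger is sampled at random before the attacker picks the modified function; here it is fixed by hand for readability.

```python
import random
from typing import Callable, Sequence

Func = Callable[[int], int]
Defender = Callable[[Func, int], bool]  # returns True to flag "backdoor trigger"

def disagreement(f: Func, g: Func, domain: Sequence[int]) -> float:
    """Fraction of a uniformly weighted finite domain on which f and g differ."""
    return sum(f(x) != g(x) for x in domain) / len(domain)

def is_valid_backdoor(f: Func, f_star: Func, trigger: int,
                      domain: Sequence[int], eps: float) -> bool:
    """Attacker constraint (assumed form): f_star must flip the output on the
    trigger while differing from f on at most an eps fraction of inputs."""
    return f_star(trigger) != f(trigger) and disagreement(f, f_star, domain) <= eps

def play_round(defender: Defender, f: Func, f_star: Func, trigger: int,
               domain: Sequence[int], rng: random.Random) -> bool:
    """One round: with probability 1/2 the defender sees (f_star, trigger),
    otherwise (f, random input). Returns True if the defender guesses right."""
    backdoored = rng.random() < 0.5
    func, x = (f_star, trigger) if backdoored else (f, rng.choice(domain))
    return defender(func, x) == backdoored

# Toy instance: threshold functions on {0, ..., 99} (an illustrative choice).
domain = range(100)

def f(x: int) -> int:
    return int(x >= 50)

def f_star(x: int) -> int:
    return int(x >= 53)  # differs from f exactly on 50, 51 and 52

trigger = 52
```

In this toy, `f_star` is a valid backdoor for `eps = 0.05` but not for `eps = 0.01`, since it disagrees with `f` on 3 of 100 inputs. A defender that never flags anything wins a round only about half the time, which is the baseline any real defense has to beat.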

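Research Motivation and Objectives

Formalize backdoor defendability as a game between an attacker and a defender in which the attacker must succeed on a randomly chosen trigger.
Study how defendability, learnability and obfuscation relate to one another in both the statistical and the computational settings.
Identify natural function classes for which defense is strictly easier than learning, such as polynomial-size decision trees.
Explore the limits of efficient defendability when indistinguishability obfuscation exists.
Connect backdoor defense to AI alignment problems, in particular deceptive alignment.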

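Proposed Method

Propose a game-theoretic model in which the attacker modifies a function so that it behaves differently on a random trigger, while the defender must detect that behavior at inference time.
Use the voting algorithm of Hanneke et al. (2022) to show that statistical defendability is determined by the VC dimension of the function class.
Introduce efficient defendability as a computational-complexity notion and show that it is implied by efficient PAC learnability, although the two are not equivalent.
Use puncturable pseudorandom functions and indistinguishability obfuscation to show that, under cryptographic assumptions, polynomial-size circuits cannot be defended efficiently.
Develop a runtime defense for polynomial-size decision trees whose running time is proportional to that of a single evaluation, showing that defense can be faster than learning.
Analyze what these results imply for AI alignment, especially for detecting deceptively aligned models.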

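Results

Research Questions

RQ1: Can defendability be defined formally in a way that keeps defense possible even though the original model and the backdoored model are symmetric?
RQ2: How does defendability relate to statistical learnability, particularly in terms of the VC dimension?
RQ3: Is efficient defendability strictly weaker than efficient PAC learnability, or are the two equivalent?
RQ4: Can function classes such as polynomial-size decision trees be defended more efficiently than they can be learned?
RQ5: To what extent does obfuscation rule out efficient defendability, particularly in the context of neural network alignment?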

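Main Findings

Statistical defendability is equivalent to ε = o(1/VC(F)), where ε is the tolerance on how often the backdoored function may differ from the original, so in the computationally unbounded setting defendability is determined by the VC dimension.
Efficient PAC learnability implies efficient defendability, but the converse does not hold, so in the computational setting defense is strictly easier than learning.
Under standard cryptographic assumptions, the class of polynomial-size circuits cannot be defended efficiently, because of the existence of indistinguishability obfuscation.
Under the uniform input distribution, polynomial-size decision trees can be defended efficiently, with a defense whose running time is proportional to that of a single evaluation.
The results suggest that, in the alignment setting, mechanistic defenses that inspect a model's internals may be more robust than learning-based defenses.
The framework provides a formal basis for analyzing backdoor detection in AI alignment, especially in the context of deceptive alignment.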
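To give a feel for a decision-tree defense that costs about as much as one evaluation, here is a minimal sketch. The review only states that such a defense exists for uniform inputs and runs in time proportional to a single evaluation; the depth-threshold rule below is an assumption of this sketch about how such a check could look, not a statement of the paper's construction. The intuition it relies on: under uniform inputs a leaf at depth d is reached with probability 2^-d (assuming no variable is tested twice on a path), so a change confined to very few inputs tends to sit on an unusually deep path. The names `Leaf`, `Node`, `evaluate_with_depth`, `flag_input` and the parameter `max_depth` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union

@dataclass
class Leaf:
    label: int

@dataclass
class Node:
    var: int        # index of the input bit tested at this node
    low: "Tree"     # subtree followed when the bit is 0
    high: "Tree"    # subtree followed when the bit is 1

Tree = Union[Leaf, Node]

def evaluate_with_depth(tree: Tree, x: Sequence[int]) -> Tuple[int, int]:
    """Evaluate the tree on x and also return the depth of the leaf reached."""
    depth = 0
    while isinstance(tree, Node):
        tree = tree.high if x[tree.var] else tree.low
        depth += 1
    return tree.label, depth

def flag_input(tree: Tree, x: Sequence[int], max_depth: int) -> bool:
    """Assumed rule: flag x when it reaches a leaf deeper than max_depth.
    The cost is one root-to-leaf walk, the same as a single evaluation."""
    _, depth = evaluate_with_depth(tree, x)
    return depth > max_depth

# Toy example on 4-bit inputs: the clean tree just returns bit 0.
clean = Node(0, Leaf(0), Leaf(1))

# A backdoored variant that agrees with the clean tree everywhere except on
# the trigger (1, 0, 1, 1), where it outputs 0 instead of 1.
trigger = (1, 0, 1, 1)
backdoored = Node(0, Leaf(0),
                  Node(1,
                       Node(2, Leaf(1), Node(3, Leaf(1), Leaf(0))),
                       Leaf(1)))
```

With `max_depth=3`, `flag_input(backdoored, trigger, 3)` flags the trigger because it reaches a depth-4 leaf, while an input such as `(1, 1, 0, 0)` stops at depth 2 and passes. In a realistic setting the threshold would have to be chosen relative to the tree size and the acceptable false-positive rate; in this 4-bit toy it is picked by hand.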

This review was generated by AI and checked by human editors.