QUICK REVIEW

[Paper Review] Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning

Nicolas Papernot, Patrick McDaniel|arXiv (Cornell University)|Mar 13, 2018

Adversarial Robustness in Machine Learning | References: 88 | Cited by: 368

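One-Sentence Summary

The paper proposes Deep k-Nearest Neighbors (DkNN), a hybrid classifier that applies nearest-neighbor search to the representations learned at each layer of a DNN and uses conformal prediction to provide confidence estimates, interpretability, and robustness, including robustness to adversarial inputs.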

ABSTRACT

Deep neural networks (DNNs) enable innovative applications of machine learning like image recognition, machine translation, or malware detection. However, deep learning is often criticized for its lack of robustness in adversarial settings (e.g., vulnerability to adversarial inputs) and general inability to rationalize its predictions. In this work, we exploit the structure of deep learning to enable new learning-based inference and decision strategies that achieve desirable properties such as robustness and interpretability. We take a first step in this direction and introduce the Deep k-Nearest Neighbors (DkNN). This hybrid classifier combines the k-nearest neighbors algorithm with representations of the data learned by each layer of the DNN: a test input is compared to its neighboring training points according to the distance that separates them in the representations. We show the labels of these neighboring points afford confidence estimates for inputs outside the model's training manifold, including on malicious inputs like adversarial examples--and therein provides protections against inputs that are outside the model's understanding. This is because the nearest neighbors can be used to estimate the nonconformity of, i.e., the lack of support for, a prediction in the training data. The neighbors also constitute human-interpretable explanations of predictions. We evaluate the DkNN algorithm on several datasets, and show the confidence estimates accurately identify inputs outside the model, and that the explanations provided by nearest neighbors are intuitive and useful in understanding model failures.

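Research Motivation and Goals

Use the modular, layer-wise representations of a DNN to assess, at every layer, how consistent a prediction is with the training data.
Provide reliable confidence estimates that reflect an input's nonconformity with the training manifold.
Improve interpretability by surfacing the training examples that explain a prediction.
Strengthen robustness to adversarial inputs by detecting nonconformal predictions at each layer.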

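Proposed Method

For a test input, compute the representation output by each of the l layers of the trained DNN.
At each layer, use locality-sensitive hashing (LSH) to find the k nearest training representations.
Collect the labels of the k nearest neighbors at each layer λ into a multiset Ω_λ.
Use conformal prediction to compute the nonconformity α(x, j) from Ω_λ and the calibration data.
For each class j, compute the p-value p_j(z); output the class with the highest p-value, along with the associated confidence and credibility.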
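
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it replaces LSH with a brute-force Euclidean search, it takes the per-layer representations as precomputed arrays, and every name in it (`knn_labels`, `nonconformity`, `dknn_predict`, the toy data) is hypothetical. It also assumes the usual conformal-prediction definitions, which this summary does not spell out: credibility is the p-value of the predicted class, and confidence is one minus the second-largest p-value.

```python
import numpy as np


def knn_labels(train_reps, train_labels, query_rep, k):
    """Labels of the k training points closest to query_rep.

    Brute-force Euclidean search, used here in place of LSH for simplicity.
    """
    dists = np.linalg.norm(train_reps - query_rep, axis=1)
    nearest = np.argsort(dists, kind="stable")[:k]
    return train_labels[nearest]


def nonconformity(neighbor_labels_per_layer, n_classes):
    """alpha[j]: number of neighbors, summed over layers, whose label is not j."""
    all_labels = np.concatenate(neighbor_labels_per_layer)
    counts = np.bincount(all_labels, minlength=n_classes)
    return all_labels.size - counts


def dknn_predict(layer_train_reps, train_labels, layer_query_reps,
                 calib_alphas, n_classes, k):
    """Return (prediction, confidence, credibility, p_values) for one input."""
    neighbor_labels = [
        knn_labels(reps, train_labels, q, k)
        for reps, q in zip(layer_train_reps, layer_query_reps)
    ]
    alpha = nonconformity(neighbor_labels, n_classes)
    # Empirical p-value: fraction of calibration scores >= alpha[j].
    p_values = np.array([np.mean(calib_alphas >= a) for a in alpha])
    order = np.argsort(p_values)
    prediction = int(order[-1])
    credibility = float(p_values[order[-1]])       # p-value of the prediction
    confidence = float(1.0 - p_values[order[-2]])  # 1 - second-largest p-value
    return prediction, confidence, credibility, p_values


# Toy data (made up): two "layers" with 2-D representations, two classes.
train_labels = np.array([0, 0, 0, 1, 1, 1])
layer_train_reps = [
    np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]]),
    np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2],
              [2.0, 2.0], [2.2, 2.0], [2.0, 2.2]]),
]
# Assumed nonconformity scores of held-out calibration points at their true labels.
calib_alphas = np.array([0, 0, 1, 3])

# Query 1: close to class 0 at both layers.
query_1 = [np.array([0.05, 0.05]), np.array([0.1, 0.1])]
pred_1, conf_1, cred_1, p_1 = dknn_predict(
    layer_train_reps, train_labels, query_1, calib_alphas, n_classes=2, k=3)

# Query 2: close to class 0 at layer 1 but to class 1 at layer 2.
query_2 = [np.array([0.05, 0.05]), np.array([2.1, 2.1])]
pred_2, conf_2, cred_2, p_2 = dknn_predict(
    layer_train_reps, train_labels, query_2, calib_alphas, n_classes=2, k=3)

print(pred_1, conf_1, cred_1)
print(conf_2, cred_2)
```

In this toy setup the first query is supported by class-0 neighbors at both layers and receives high credibility. The second query has neighbors that disagree across layers, so its credibility drops, which is the signal the method relies on to flag inputs that lack support in the training data.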

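Experimental Results

Research Questions

RQ1: How can the layer-wise representations of a DNN be used to assess how consistent a prediction is with the training data?
RQ2: Can the method produce a calibrated confidence measure that reflects nonconformity with the training manifold?
RQ3: Do layer-wise nearest-neighbor explanations improve interpretability and help detect adversarial or out-of-distribution inputs?
RQ4: Does requiring a prediction to be supported across multiple representations within the network improve robustness?

Key Findings

DkNN's credibility estimates identify inputs far from the training manifold better than the confidence of a standard DNN.
On out-of-distribution or geometrically transformed inputs, DkNN credibility is below 10%, whereas DNN confidence is 20%–50%.
Nearest neighbors at each layer provide intuitive, human-interpretable explanations of predictions.
DkNN flags adversarial examples through low credibility, and adaptive attacks usually have to perturb the input's semantics to change the prediction.
When a prediction is supported by the training manifold at every layer, it retains its integrity, which indicates both robustness and interpretability.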


This review was generated by AI and reviewed by human editors.