QUICK REVIEW

[Paper Digest] Use of a Capsule Network to Detect Fake Images and Videos

Huy H. Nguyen, Junichi Yamagishi|arXiv (Cornell University)|Oct 28, 2019

Digital Media Forensic Detection|References: 69|Cited by: 132

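One-Sentence Summary

This paper proposes Capsule-Forensics, a capsule-network-based detector that generalizes across several kinds of fake image and video attacks, reaches competitive accuracy with far fewer parameters than CNN baselines, and is analyzed through visualization of its capsule activations.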

ABSTRACT

The revolution in computer hardware, especially in graphics processing units and tensor processing units, has enabled significant advances in computer graphics and artificial intelligence algorithms. In addition to their many beneficial applications in daily life and business, computer-generated/manipulated images and videos can be used for malicious purposes that violate security systems, privacy, and social trust. The deepfake phenomenon and its variations enable a normal user to use his or her personal computer to easily create fake videos of anybody from a short real online video. Several countermeasures have been introduced to deal with attacks using such videos. However, most of them are targeted at certain domains and are ineffective when applied to other domains or new attacks. In this paper, we introduce a capsule network that can detect various kinds of attacks, from presentation attacks using printed images and replayed videos to attacks using fake videos created using deep learning. It uses many fewer parameters than traditional convolutional neural networks with similar performance. Moreover, we explain, for the first time ever in the literature, the theory behind the application of capsule networks to the forensics problem through detailed analysis and visualization.

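Motivation and Goals

Address the need for a general-purpose, lightweight detector that handles diverse image and video manipulations (deepfakes, facial reenactment, and CGI) and transfers across attack types.
Use capsule networks to preserve hierarchical features and improve performance on forensic tasks with fewer parameters than traditional CNNs.
Provide a theoretical and visual analysis of how capsule networks behave on forensic inputs, to justify their suitability for forensics problems.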

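Proposed Method

Preprocess the input by splitting images into patches or extracting frames from videos; optionally crop the face region for face-focused detection.
Place a VGG-19-based feature extractor, taken up to the third max-pooling layer, in front of the capsule network; this front end also acts as a regularizer.
Implement the Capsule-Forensics architecture with several primary capsules (3 or 10), each consisting of a 2D convolution, a statistical pooling layer, and a 1D convolution, feeding into two output capsules (real and fake).
During training, apply dynamic routing with two regularizers (random noise in the routing matrix and dropout), together with the squash activation to stabilize learning.
Train with cross-entropy loss and the Adam optimizer; aggregate the frame or patch scores (for video, average over frames) to produce the final decision.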
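The building blocks named in the steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the squash formula is the standard one from the capsule-network literature, the statistical pooling is assumed to reduce each feature map to its mean and variance, and the names `squash`, `stat_pool`, and `video_score`, as well as the 0.5 decision threshold, are hypothetical.

```python
import math
from statistics import mean, variance


def squash(s):
    """Shrink a capsule vector so its length lies in [0, 1), keeping its direction."""
    sq_norm = sum(x * x for x in s)
    if sq_norm == 0.0:
        return [0.0 for _ in s]
    # Standard squash: (|s|^2 / (1 + |s|^2)) * (s / |s|)
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in s]


def stat_pool(feature_map):
    """Reduce one 2D feature map (a list of rows) to [mean, variance].

    Assumption: sample variance (n - 1 denominator) is used.
    """
    flat = [v for row in feature_map for v in row]
    return [mean(flat), variance(flat)]


def video_score(frame_fake_probs, threshold=0.5):
    """Average per-frame 'fake' probabilities, then threshold the average."""
    avg = mean(frame_fake_probs)
    return avg, ("fake" if avg >= threshold else "real")


if __name__ == "__main__":
    print(squash([3.0, 4.0]))
    print(stat_pool([[1.0, 2.0], [3.0, 4.0]]))
    print(video_score([0.9, 0.8, 0.4]))
```

Averaging frame scores before thresholding means that a few misclassified frames do not flip the video-level decision, which is one plausible reason the video row in the results table scores higher than the frame-level rows.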

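Experimental Results

Research Questions

RQ1: Can Capsule-Forensics detect many kinds of manipulated content, including CGI, print/replay attacks, and deepfake/reenactment videos, within a single framework?
RQ2: Does a regularized capsule network with larger inputs improve detection across attacks while using fewer parameters than CNN baselines?
RQ3: How do the learned capsules correspond to manipulated regions, and what does dynamic routing reveal about the agreement among capsules over time?
RQ4: Beyond binary real/fake classification, is it feasible to give the capsule network multi-class capability so that it can distinguish specific manipulation types (Deepfakes, Face2Face, FaceSwap)?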

Key Findings

Network	Binary classification accuracy (%)	Binary classification EER (%)	Multi-class classification accuracy (%)	Number of parameters
XceptionNet (299×299) [27]	91.46	9.98	91.33	20,811,050
Capsule-Forensics (old) (128×128) [28]	87.73	15.69	85.89	2,796,889
Capsule-Forensics (old) + Noise (128×128) [28]	88.11	15.71	87.12	2,796,889
Capsule-Forensics light (300×300)	90.02	10.95	87.51	2,796,889
Capsule-Forensics light + Noise (300×300)	91.12	11.60	87.54	2,796,889
Capsule-Forensics (300×300)	91.65	11.36	88.51	3,896,638
Capsule-Forensics + Noise (300×300)	91.48	11.62	89.98	3,896,638
Capsule-Forensics light + Dropout (300×300)	91.36	11.61	89.19	2,796,889
Capsule-Forensics light + Dropout + Noise (300×300)	91.28	11.38	88.44	2,796,889
Capsule-Forensics + Dropout (300×300)	92.20	10.96	90.51	3,896,638
Capsule-Forensics + Dropout + Noise (300×300)	92.02	10.26	91.22	3,896,638
Capsule-Forensics + Dropout + Noise (video)	93.11	10.26	92.90	3,896,638
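For readers unfamiliar with the EER column: the equal error rate is the operating point at which the rate of real samples flagged as fake equals the rate of fake samples that are missed, so lower is better. The sketch below approximates it by sweeping thresholds over the observed scores. It is illustrative only; the function name `equal_error_rate` and the toy scores are hypothetical, and the digest does not state how the paper computed its EER values.

```python
def equal_error_rate(real_scores, fake_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Scores are 'fake' probabilities; a sample is called fake when
    score >= threshold.
    """
    best_gap, eer = None, None
    for t in sorted(set(real_scores) | set(fake_scores)):
        # Fraction of real samples wrongly flagged as fake at this threshold.
        false_alarm = sum(s >= t for s in real_scores) / len(real_scores)
        # Fraction of fake samples that slip through as real.
        miss = sum(s < t for s in fake_scores) / len(fake_scores)
        gap = abs(false_alarm - miss)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (false_alarm + miss) / 2
    return eer


if __name__ == "__main__":
    real = [0.1, 0.2, 0.3, 0.6]   # toy scores for genuine samples
    fake = [0.4, 0.7, 0.8, 0.9]   # toy scores for manipulated samples
    print(equal_error_rate(real, fake))  # 0.25
```

Because accuracy and EER are measured at different operating points, a model can lead on one and trail on the other, as several rows in the table do.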

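With 300×300 inputs and the enhanced settings, Capsule-Forensics reaches a competitive binary accuracy of 91.65% and an EER of 11.36% (XceptionNet: 91.46% and 9.98%) while using far fewer parameters (≈3.9M vs. ≈20.8M).
Adding random noise and dropout regularization in the routing stage generally improves performance, especially with the larger input size and more primary capsules; in the table, dropout gives the more consistent gain.
Increasing the number of primary capsules to 10 and applying dropout and noise yields strong multi-class performance: Capsule-Forensics + Dropout + Noise (300×300) reaches 91.22% multi-class accuracy and a binary EER of 10.26%.
Aggregating over video frames further improves both binary and multi-class accuracy: Capsule-Forensics + Dropout + Noise (video) reaches 93.11% binary accuracy and 92.90% multi-class accuracy.
Compared with XceptionNet, the tuned Capsule-Forensics reaches nearly the same binary accuracy with roughly one fifth of the parameters, and in the multi-class setting its performance is more balanced across manipulation types.
On the large-scale task of separating fully computer-generated images (CGI) from real photographs (PI), both the old and new versions of Capsule-Forensics outperform the baselines, reaching 100% accuracy on the large CGI/PI dataset in the reported settings.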
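The parameter comparison can be checked directly from the table. The short calculation below uses only the parameter counts and accuracies listed there; the variable names are illustrative.

```python
# Parameter counts and binary accuracies copied from the results table.
XCEPTION_PARAMS = 20_811_050
CAPSULE_PARAMS = 3_896_638        # Capsule-Forensics (10 primary capsules)
CAPSULE_LIGHT_PARAMS = 2_796_889  # Capsule-Forensics light

XCEPTION_ACC = 91.46              # XceptionNet (299x299)
CAPSULE_ACC = 92.02               # Capsule-Forensics + Dropout + Noise (300x300)

full_ratio = XCEPTION_PARAMS / CAPSULE_PARAMS
light_ratio = XCEPTION_PARAMS / CAPSULE_LIGHT_PARAMS
acc_gap = CAPSULE_ACC - XCEPTION_ACC

print(f"XceptionNet / Capsule-Forensics parameters: {full_ratio:.2f}x")   # 5.34x
print(f"XceptionNet / light variant parameters:     {light_ratio:.2f}x")  # 7.44x
print(f"Binary accuracy gap (percentage points):    {acc_gap:.2f}")       # 0.56
```

So "about five times fewer parameters" corresponds to a ratio of roughly 5.3 for the full model and roughly 7.4 for the light variant.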


This digest was generated by AI and reviewed by a human editor.