QUICK REVIEW

[Paper Review] Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Solomon Atnafu|arXiv (Cornell University)|Feb 22, 2021

Digital Media Forensic Detection | References: 65 | Cited by: 138

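One-Sentence Summary

This paper proposes a Convolutional Vision Transformer (CViT) that combines CNN-based feature learning with a Vision Transformer for Deepfake detection, reaching 91.5% accuracy and an AUC of 0.91 on the DFDC dataset. It emphasizes data preprocessing and training on a diverse, DFDC-derived dataset.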

ABSTRACT

The rapid advancement of deep learning models that can generate and synthesis hyper-realistic videos known as Deepfakes and their ease of access to the general public have raised concern from all concerned bodies to their possible malicious intent use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential use in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scam. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: Convolutional Neural Network (CNN) and Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes in the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and have achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.

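Motivation and Objectives

Motivate robust Deepfake detection in a setting where generation tools are easy to obtain and video conditions vary widely.
Develop a general-purpose detector that learns local and global features jointly through a CNN and a Transformer.
Emphasize thorough data preprocessing and diverse training data to improve generalization.
Evaluate CViT on several Deepfake datasets and compare it with existing models.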

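Proposed Method

Two-component CViT: a CNN-based feature learning module (17 convolutional layers, output of 512x7x7) followed by a Vision Transformer (ViT) classifier.
Faces are extracted and resized to 224x224 RGB, and data augmentation is applied to prepare the input.
The ViT component embeds the patches (seven) into a 1x1024 sequence with position embeddings; the encoder has 8 attention heads.
Training uses binary cross-entropy loss with the Adam optimizer (learning rate = 0.001, weight decay = 1e-7) for 50 epochs; batch size 32.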
Dataset preparation: 162,174 images in total, split roughly 70/15/15 into 112,378 training / 24,898 validation / 24,898 test images (data augmentation expands the total to 308,130).
Evaluation covers accuracy, AUC, and log loss; face extraction became more reliable after filtering the detected faces with face_recognition.
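
The summary gives the input size (224x224) and the CNN output (512x7x7) but not the layer-by-layer layout. As a rough sketch, assuming VGG-style stages in which 3x3 convolutions (stride 1, padding 1) are followed by 2x2 max pooling with stride 2, five pooling stages would reduce 224 to 7. The kernel size, stride, padding, and stage count below are assumptions made for illustration; only the 224 input, the 512 channels, and the 7x7 output come from the text above.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - window) // stride + 1


def feature_map_size(input_size=224, num_pool_stages=5):
    """Trace the spatial size through the assumed conv + pool stages.

    num_pool_stages=5 is an assumption: it is the number of halvings
    needed to go from 224 to 7, not a value stated in the summary.
    """
    size = input_size
    for _ in range(num_pool_stages):
        size = conv_out(size)  # 3x3 conv with padding 1 keeps the size
        size = pool_out(size)  # 2x2 pooling with stride 2 halves it
    return size


side = feature_map_size()            # spatial side of the CNN output
channels = 512                       # channel count from the summary
flattened = channels * side * side   # values per face in the feature map
print(side, flattened)
```

Under these assumptions the trace is 224, 112, 56, 28, 14, 7, which matches the 512x7x7 output reported for the feature learning module.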

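Experimental Results

Research Questions

RQ1: Can CViT effectively detect Deepfakes across diverse real-world scenarios and datasets?
RQ2: Does combining CNN-based local feature learning with Transformer-based global attention outperform the baselines in detection performance?
RQ3: How does data preprocessing affect Deepfake detection performance, and what role does the reliability of face detection play?
RQ4: How well does CViT generalize to Deepfake datasets other than DFDC?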

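Key Findings

On 400 unseen DFDC videos, CViT achieves 91.5% accuracy and an AUC of 0.91, with a loss of 0.32.
On the FaceForensics++ variants, CViT's accuracy varies: 69% (FaceSwap), 91% (DeepFakeDetection), 93% (Deepfake), 46% (FaceShifter), 60% (NeuralTextures).
Compared with the CNN+RNN-GRU baseline, CViT is competitive on DFDC (91.5% vs. 91.88% for CNN+RNN-GRU; see Table 2).
Running several face detectors (BlazeFace, MTCNN, face_recognition) and keeping the best filter (face_recognition) raised DFDC accuracy from 69.5% (no filtering) to 91.5%.
The authors acknowledge that there is still room for improvement and propose adding more datasets to increase diversity and robustness.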
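
The three reported metrics (accuracy, AUC, and log loss) can all be computed from per-sample labels and predicted probabilities. The following is a minimal standard-library sketch, not the authors' evaluation code: the labels and scores are made-up toy values, treating label 1 as "fake" is an assumption, and the 0.5 decision threshold is likewise assumed rather than taken from the paper.

```python
import math


def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


def roc_auc(labels, probs):
    """AUC as the probability that a positive outscores a negative.

    Ties count as half a win (the Mann-Whitney formulation of ROC AUC).
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = fake, 0 = real (assumed labeling convention).
toy_labels = [1, 1, 0, 0, 1, 0]
toy_probs = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]

print(accuracy(toy_labels, toy_probs))
print(roc_auc(toy_labels, toy_probs))
print(log_loss(toy_labels, toy_probs))
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank fake samples against real ones, which is one reason summaries like this one tend to report both.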

This review was generated by AI and reviewed by a human editor.