QUICK REVIEW

[Paper Review] Self-supervised Transformer for Deepfake Detection

Hanqing Zhao, Wenbo Zhou|arXiv (Cornell University)|Mar 2, 2022
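Face recognition and analysis | Cited by 20
One-sentence summary

A self-supervised audio-visual contrastive learning framework learns robust lip-motion representations with a video encoder that combines a 3D CNN, a 2D CNN, and a Transformer back-end. This improves the generalization and robustness of deepfake detection without requiring supervised lipreading pre-training.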

ABSTRACT

The fast evolution and widespread of deepfake techniques in real-world scenarios require stronger generalization abilities of face forgery detectors. Some works capture the features that are unrelated to method-specific artifacts, such as clues of blending boundary, accumulated up-sampling, to strengthen the generalization ability. However, the effectiveness of these methods can be easily corrupted by post-processing operations such as compression. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection. For example, lip movement has been proved to be a kind of robust and good-transferring high-level semantic feature, which can be learned from the lipreading task. However, the existing method pre-trains the lip feature extraction model in a supervised manner, which requires plenty of human resources in data annotation and increases the difficulty of obtaining training data. In this paper, we propose a self-supervised transformer based audio-visual contrastive learning method. The proposed method learns mouth motion representations by encouraging the paired video and audio representations to be close while unpaired ones to be diverse. After pre-training with our method, the model will then be partially fine-tuned for deepfake detection task. Extensive experiments show that our self-supervised method performs comparably or even better than the supervised pre-training counterpart.

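Research motivation and objectives

  • Advance robust deepfake detection that generalizes to unseen forgery methods and post-processing operations.
  • Use self-supervised pre-training to reduce annotation cost relative to supervised lipreading pre-training.
  • Learn lip-motion representations from audio-visual consistency and transfer them to deepfake detection.
  • Evaluate cross-dataset and cross-manipulation generalization, as well as robustness to common corruptions.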

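Proposed method

  • A two-stage spatio-temporal video encoder for lip-motion representation: a front-end of 3D and 2D convolutions followed by a temporal Transformer back-end.
  • Cross-modal contrastive learning with InfoNCE aligns audio features (based on wav2vec2) and visual lip-motion features in a common space.
  • The video and audio encoders project their features into the shared space through MLP heads; positive pairs are synchronized audio-video clips, and all other pairings serve as negatives.
  • During fine-tuning, the front-end and the adapter are frozen, and the classification head is trained with controlled, per-layer learning rates for the Transformer layers so that the pre-trained knowledge is preserved.
  • Pre-training uses VoxCeleb2 and AVSpeech-scale data; the model is fine-tuned on FaceForensics++ and evaluated on cross-dataset benchmarks.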
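
The contrastive objective described in the list above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function name `info_nce`, the temperature of 0.07, and the symmetric averaging of the video-to-audio and audio-to-video directions are all assumptions made for the example. The inputs are taken to be the already-projected outputs of the two MLP heads, with row i of each array coming from the same synchronized clip.

```python
import numpy as np


def info_nce(video_emb, audio_emb, temperature=0.07):
    """InfoNCE over a batch of paired video/audio embeddings (illustrative).

    Row i of video_emb and row i of audio_emb are assumed to be a positive
    (synchronized) pair; every other row in the batch acts as a negative.
    The temperature value and the symmetric form are assumptions.
    """
    # L2-normalize so that dot products are cosine similarities.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    logits = v @ a.T / temperature  # shape: (batch, batch)

    def cross_entropy_diag(z):
        # Numerically stable log-softmax; the target of row i is column i.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the video-to-audio and audio-to-video directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(8, 16))
    audio = video + 0.01 * rng.normal(size=video.shape)  # nearly aligned pairs
    print("aligned loss:   ", info_nce(video, audio))
    print("misaligned loss:", info_nce(video, audio[::-1]))
```

Minimizing this loss pulls each synchronized video/audio pair together and pushes mismatched pairs apart, which is the behavior the method relies on to learn lip-motion features without lipreading labels.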

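Experimental results

Research questions

  • RQ1: Without supervised lipreading data, can self-supervised audio-visual pre-training on lip motion transfer to robust deepfake detection?
  • RQ2: Does the proposed pre-training improve generalization to unseen manipulation methods and across datasets?
  • RQ3: How does the scale of the pre-training data affect detection performance and cross-dataset transfer?
  • RQ4: For this task, does an architecture with a Transformer back-end yield better lip-motion representations than MSTCN?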

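Key findings

  • The proposed self-supervised pre-training performs comparably to, or better than, supervised pre-training on deepfake detection.
  • The pre-trained model generalizes better than several baselines to unseen forgery methods and in cross-dataset scenarios.
  • Increasing the amount of pre-training data improves both in-dataset and cross-dataset AUC, especially with larger backbones.
  • The method is robust to common video corruptions, most notably when self-supervised pre-training is used.
  • Under similar settings and data conditions, the Transformer-based architecture outperforms the MSTCN baseline, and a larger front-end further improves cross-dataset performance.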
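
The findings above are reported in terms of AUC. As a reminder of what that metric measures, here is a small pure-Python sketch. The function name `roc_auc` and the convention that label 1 means "fake" are assumptions made for illustration, and this is not the evaluation code used in the paper. AUC is computed here as the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """Pairwise-comparison AUC (illustrative; label 1 is assumed to mean fake)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # fake ranked above real
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(labels, scores))  # 3 of 4 fake/real pairs are ordered correctly: 0.75
```

Because AUC depends only on how samples are ranked, it needs no decision threshold, which makes it convenient for comparing a detector across datasets whose score distributions differ.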

This review was generated by AI and reviewed by a human editor.