QUICK REVIEW

[Paper Review] Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Solomon Atnafu | arXiv (Cornell University) | Feb 22, 2021
Digital Media Forensic Detection | 65 references | 138 citations
TL;DR

The paper proposes a Convolutional Vision Transformer (CViT) that feeds CNN-learned features into a Vision Transformer (ViT) to detect Deepfakes, reporting 91.5% accuracy and an AUC of 0.91 on the DFDC dataset. It also stresses data preprocessing and training on a diverse, DFDC-derived dataset.

ABSTRACT

The rapid advancement of deep learning models that can generate and synthesis hyper-realistic videos known as Deepfakes and their ease of access to the general public have raised concern from all concerned bodies to their possible malicious intent use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential use in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scam. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: Convolutional Neural Network (CNN) and Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes in the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and have achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.

Motivation & Objective

  • Motivate robust Deepfake detection amid accessible generation tools and varied settings.
  • Develop a generalized detector that jointly learns local and global features via CNN and Transformer.
  • Emphasize comprehensive data preprocessing and diverse training data to improve generalization.
  • Evaluate CViT on multiple Deepfake datasets and compare with existing models.

Proposed method

  • Two-component CViT: a CNN feature-learning module (17 conv layers, 512x7x7 output feature map) followed by a Vision Transformer (ViT) classifier.
  • Face extraction to 224x224 RGB and data augmentation to prepare inputs.
  • The ViT component splits the CNN feature map into seven patches, embeds them into a 1x1024 sequence with position embeddings, and uses 8 attention heads in the encoder.
  • Training uses binary cross-entropy loss with Adam optimizer (lr=0.001, weight decay=1e-7) for 50 epochs; batch size 32.
  • Dataset preparation: 162,174 face images split into 112,378 train/24,898 val/24,898 test (roughly a 70/15/15 split, with augmentation to 308,130 total).
  • Evaluation includes accuracy, AUC, and log loss; filtering faces with face_recognition improved face extraction reliability.
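
The split figures in the dataset bullet can be sanity-checked with a few lines of arithmetic. The sketch below assumes that 162,174 is the pre-augmentation image count to which the 70/15/15 ratio applies; that reading is an assumption, and all variable names are illustrative rather than taken from the authors' code.

```python
# Hypothetical sanity check of the reported split; names are illustrative.
TOTAL_IMAGES = 162_174  # assumed to be the pre-augmentation total
VAL_IMAGES = 24_898
TEST_IMAGES = 24_898

# Whatever is not held out for validation or testing is used for training.
train_images = TOTAL_IMAGES - VAL_IMAGES - TEST_IMAGES


def share(count, total=TOTAL_IMAGES):
    """Return count as a percentage of total, rounded to one decimal."""
    return round(100 * count / total, 1)


split = {
    "train": (train_images, share(train_images)),
    "val": (VAL_IMAGES, share(VAL_IMAGES)),
    "test": (TEST_IMAGES, share(TEST_IMAGES)),
}

for name, (count, pct) in split.items():
    print(f"{name:5s} {count:7d}  {pct:4.1f}%")
```

Under that assumption the three subsets come out close to, but not exactly at, the nominal 70/15/15 ratio.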

Experimental results

Research questions

  • RQ1: Can CViT effectively detect Deepfakes across diverse real-world settings and datasets?
  • RQ2: Does combining CNN-based local feature learning with Transformer-based global attention improve detection performance over baselines?
  • RQ3: How does data preprocessing impact Deepfake detection performance, and what role does face-detection reliability play?
  • RQ4: What is CViT’s generalization performance on multiple Deepfake datasets beyond DFDC?

Key findings

  • CViT achieves 91.5% accuracy and AUC of 0.91 with a loss of 0.32 on 400 unseen DFDC videos.
  • On FaceForensics++ variants, CViT shows varying performance: 69% (FaceSwap), 91% (DeepFakeDetection), 93% (Deepfake), 46% (FaceShifter), 60% (NeuralTextures).
  • CViT is competitive with the CNN+RNN-GRU baseline on DFDC (91.5% vs. 91.88% in Table 2).
  • Using multiple face detectors (BlazeFace, MTCNN, face_recognition) and selecting the best filter (face_recognition) improves accuracy from 69.5% (no filtering) to 91.5% on DFDC.
  • The authors acknowledge room for improvement and propose adding more datasets to enhance diversity and robustness.
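
For readers unfamiliar with the three reported metrics (accuracy, AUC, and log loss), here is a minimal pure-Python sketch of how each is computed from binary labels and predicted probabilities. The toy data, the 0.5 threshold, and the convention that label 1 means "fake" are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math


def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def roc_auc(labels, probs):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Toy example (assumed convention: 1 = fake, 0 = real).
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

print(f"accuracy = {accuracy(labels, probs):.3f}")
print(f"log loss = {log_loss(labels, probs):.3f}")
print(f"AUC      = {roc_auc(labels, probs):.3f}")
```

Accuracy depends on the chosen threshold, whereas AUC and log loss are computed from the raw probabilities, which is one reason to report all three together.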

This review was created by AI and reviewed by human editors.