QUICK REVIEW

[Paper Review] Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Solomon Atnafu | arXiv (Cornell University) | Feb 22, 2021

Digital Media Forensic Detection | References: 65 | Citations: 138

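One-line summary

Summary: This paper proposes the Convolutional Vision Transformer (CViT), which combines CNN-based feature learning with a Vision Transformer for Deepfake detection and reaches 91.5% accuracy and an AUC of 0.91 on the DFDC dataset. It emphasizes data preprocessing and training on diverse DFDC-derived data.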

ABSTRACT

The rapid advancement of deep learning models that can generate and synthesis hyper-realistic videos known as Deepfakes and their ease of access to the general public have raised concern from all concerned bodies to their possible malicious intent use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential use in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scam. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: Convolutional Neural Network (CNN) and Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes in the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and have achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.

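Motivation and objectives

Motivate robust Deepfake detection, given easily accessible generation tools and widely varying capture settings.
Develop a generalized detector that jointly learns local features with a CNN and global features with a Transformer.
Emphasize comprehensive data preprocessing and diverse training data to improve generalization.
Evaluate CViT on multiple Deepfake datasets and compare it with existing models.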

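Proposed method

Two-component CViT: a CNN-based feature learning module (17 convolutional layers, output 512x7x7) followed by a Vision Transformer (ViT) classifier.
For input preparation, faces are extracted as 224x224 RGB images and data augmentation is applied.
The ViT component embeds patches into a 1x1024 sequence and adds position embeddings; the encoder uses 8-head attention.
Training uses binary cross-entropy loss and the Adam optimizer (lr=0.001, weight decay=1e-7) for 50 epochs with batch size 32.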
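Dataset preparation: 162,174 face images split into 112,378 training / 24,898 validation / 24,898 test images (approximately 70/15/15), with augmentation bringing the total to 308,130 images.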
Evaluation covers accuracy, AUC, and log loss; filtering detected faces with face_recognition makes the extracted face crops more reliable.
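The shapes quoted above (224x224 input, 512x7x7 feature map, 1024-wide embedding, 8 heads) can be cross-checked with a little arithmetic. The sketch below is only an illustration: it assumes size-preserving convolutions and five 2x2, stride-2 max-pool stages (`NUM_POOLS` is a hypothetical value chosen so that 224 reduces to 7), and it assumes the embedding is split evenly across heads. Neither assumption is stated in this review.

```python
# Shape bookkeeping for the two CViT stages described above.
# Assumptions (not from the review): convolutions preserve spatial size,
# and downsampling comes from five 2x2 stride-2 max-pool stages.

def pooled_size(size: int, num_pools: int, kernel: int = 2, stride: int = 2) -> int:
    """Spatial size after `num_pools` unpadded max-pool stages."""
    for _ in range(num_pools):
        size = (size - kernel) // stride + 1
    return size

INPUT_SIZE = 224        # face crops are 224x224 RGB (from the review)
FEATURE_CHANNELS = 512  # CNN output channels (from the review)
EMBED_DIM = 1024        # ViT embedding width (from the review)
NUM_HEADS = 8           # attention heads (from the review)
NUM_POOLS = 5           # hypothetical: chosen so that 224 -> 7

feature_hw = pooled_size(INPUT_SIZE, NUM_POOLS)
feature_shape = (FEATURE_CHANNELS, feature_hw, feature_hw)

# With an even split, each head attends over EMBED_DIM / NUM_HEADS features.
head_dim = EMBED_DIM // NUM_HEADS

print("CNN feature map:", feature_shape)
print("per-head dimension:", head_dim)
```

Under these assumptions the 7x7 spatial size follows from halving 224 five times, which is consistent with the 512x7x7 output reported for the CNN module.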

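Experimental results

Research questions

RQ1: Can CViT effectively detect Deepfakes across diverse real-world settings and datasets?
RQ2: Does combining CNN-based local feature learning with Transformer-based global attention improve detection performance over the baselines?
RQ3: How does data preprocessing affect Deepfake detection performance, and what role does face-detection reliability play?
RQ4: How well does CViT generalize to Deepfake datasets other than DFDC?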

Key findings

CViT achieves 91.5% accuracy, an AUC of 0.91, and a loss of 0.32 on 400 unseen DFDC videos.
On the FaceForensics++ variants, CViT's performance varies: FaceSwap 69%, DeepFakeDetection 91%, Deepfake 93%, FaceShifter 46%, NeuralTextures 60%.
Compared with the CNN+RNN-GRU baseline, CViT is competitive on DFDC (Table 2: CNN+RNN-GRU 91.88% vs. CViT 91.5%).
Trying multiple face detectors (BlazeFace, MTCNN, face_recognition) and selecting the best filter (face_recognition) raised DFDC accuracy from 69.5% without filtering to 91.5%.
The authors acknowledge room for further improvement and propose adding datasets to increase diversity and robustness.
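For readers who want to see how figures such as accuracy, AUC, and log loss are obtained from a detector's outputs, the following is a minimal pure-Python sketch. The labels and probabilities are made-up toy values, not results from the paper, and the 0.5 decision threshold is an assumption; the paper's own evaluation code may differ.

```python
import math

def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return correct / len(labels)

def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

def roc_auc(labels, probs):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
               for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# Toy data (hypothetical): 1 = fake, 0 = real.
labels = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7]

print("accuracy:", accuracy(labels, probs))
print("AUC:", roc_auc(labels, probs))
print("log loss:", log_loss(labels, probs))
```

Accuracy depends on the chosen threshold, while AUC and log loss use the raw probabilities, which is why a model can score similarly on accuracy and AUC yet still show a relatively high loss.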

This review was generated by AI and checked by a human editor.