QUICK REVIEW

[Paper Review] DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms

Qi Hua, Qing Guo|arXiv (Cornell University)|Jun 13, 2020

Non-Invasive Vital Sign Monitoring|References: 68|Cited by: 52

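One-Sentence Summary

DeepRhythm detects DeepFakes by monitoring heartbeat rhythms in face videos, using motion-magnified visual signals and a dual-spatial-temporal attention mechanism, and improves accuracy and robustness across datasets.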

ABSTRACT

As the GAN-based face image and video generation techniques, widely known as DeepFakes, have become more and more matured and realistic, there comes a pressing and urgent demand for effective DeepFakes detectors. Motivated by the fact that remote visual photoplethysmography (PPG) is made possible by monitoring the minuscule periodic changes of skin color due to blood pumping through the face, we conjecture that normal heartbeat rhythms found in the real face videos will be disrupted or even entirely broken in a DeepFake video, making it a potentially powerful indicator for DeepFake detection. In this work, we propose DeepRhythm, a DeepFake detection technique that exposes DeepFakes by monitoring the heartbeat rhythms. DeepRhythm utilizes dual-spatial-temporal attention to adapt to dynamically changing face and fake types. Extensive experiments on FaceForensics++ and DFDC-preview datasets have confirmed our conjecture and demonstrated not only the effectiveness, but also the generalization capability of DeepRhythm over different datasets by various DeepFakes generation techniques and multifarious challenging degradations.
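
To make the remote PPG premise in the abstract concrete, the sketch below recovers a pulse rate from a per-frame mean green-channel trace by picking the strongest frequency inside a plausible heart-rate band. It only illustrates the principle and is not DeepRhythm's pipeline: the function name `estimate_bpm`, the 0.7 to 4.0 Hz band, and the synthetic 72 bpm trace are all assumptions introduced here.

```python
import numpy as np

def estimate_bpm(green_means, fps, lo_hz=0.7, hi_hz=4.0):
    """Estimate the dominant pulse rate (beats per minute) of a 1-D trace.

    green_means: per-frame mean green value of a skin region (assumed input).
    fps: video frame rate in frames per second.
    lo_hz, hi_hz: assumed heart-rate band, i.e. 42 to 240 bpm.
    """
    x = np.asarray(green_means, dtype=float)
    x = x - x.mean()                              # drop the constant skin-tone offset
    spectrum = np.abs(np.fft.rfft(x))             # magnitude spectrum of the trace
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fps)  # frequency of each bin in Hz
    band = (freqs >= lo_hz) & (freqs <= hi_hz)    # keep physiologically plausible bins
    peak_hz = freqs[band][np.argmax(spectrum[band])]
    return 60.0 * peak_hz

# Synthetic stand-in for a real trace: a 1.2 Hz pulse plus noise, 10 s at 30 fps.
rng = np.random.default_rng(0)
fps = 30
t = np.arange(10 * fps) / fps
trace = 120.0 + 0.5 * np.sin(2 * np.pi * 1.2 * t) + 0.2 * rng.standard_normal(t.size)
bpm = estimate_bpm(trace, fps)
print(f"estimated pulse: {bpm:.1f} bpm")
```

The paper's conjecture is that such a periodic component is disrupted in a DeepFake video; DeepRhythm tests this with a learned detector rather than a fixed spectral rule like the one above.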

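Motivation and Objectives

Motivate robust DeepFake detection that goes beyond pixel-domain artifacts by exploiting the heartbeat rhythms present in face videos.
Introduce a motion-magnified spatial-temporal representation (MMSTR) that accentuates heartbeat signals.
Design a dual-spatial-temporal attention network that adapts to facial dynamics and to the variety of fake types.
Show the method's effectiveness and robustness on the FaceForensics++ and DFDC-preview datasets.
Show that DeepRhythm generalizes across DeepFake generation techniques and degradations.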

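Proposed Method

Compute the motion-magnified spatial-temporal representation (MMSTR), which produces an MMST map that accentuates heartbeat signals.
Model a dual-spatial-temporal attention mechanism that factorizes attention into spatial components (prior and adaptive) and temporal components (block-level and frame-level).
Feed the MMST map into a CNN (ResNet18) for real/fake classification inside an end-to-end network that incorporates auxiliary components such as Meso-4 for frame-level attention and an LSTM for block-level temporal attention.
Decompose attention into t (temporal) and s (spatial), written as y = phi((t · s^T) ⊙ X), with s = s_p + s_a (prior plus adaptive) and t = t_b + t_f (block-level plus frame-level).
To test generalization and robustness to degradations such as JPEG compression, blur, noise, and temporal sampling, train on FaceForensics++ subsets and perform cross-dataset evaluation on DFDC-preview.
The ablations compare ST versus MMSTR input, single versus dual attention, and end-to-end versus stage-wise training, showing the gains from MMSTR and the dual-attention design.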
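
The factorized attention y = phi((t · s^T) ⊙ X) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the map X is assumed to have shape (T frames, N face blocks, C channels), motion magnification is skipped, the classifier phi (ResNet18 in the review) is left out, and all attention weights are random placeholders where the real model would predict them. The helper names `st_map` and `dual_attention` are hypothetical.

```python
import numpy as np

def st_map(video, grid=(5, 5)):
    """Average every grid block of every frame: (T, H, W, C) -> (T, N, C).

    Plain spatial-temporal map; the motion magnification that turns it
    into an MMST map is omitted in this sketch.
    """
    T, H, W, C = video.shape
    gh, gw = grid
    # Crop so H and W divide evenly into the grid (a simplification).
    v = video[:, : H - H % gh, : W - W % gw, :]
    v = v.reshape(T, gh, v.shape[1] // gh, gw, v.shape[2] // gw, C)
    return v.mean(axis=(2, 4)).reshape(T, gh * gw, C)

def dual_attention(X, s_p, s_a, t_b, t_f):
    """Return (t · s^T) ⊙ X with s = s_p + s_a and t = t_b + t_f.

    X: (T, N, C) map; s_p, s_a: (N,) spatial weights; t_b, t_f: (T,) temporal weights.
    """
    s = s_p + s_a                # prior + adaptive spatial attention
    t = t_b + t_f                # block-level + frame-level temporal attention
    A = np.outer(t, s)           # (T, N) rank-one attention map
    return A[:, :, None] * X     # broadcast the map over the channel axis

rng = np.random.default_rng(0)
video = rng.random((8, 20, 20, 3))         # synthetic stand-in for a face clip
X = st_map(video, grid=(5, 5))             # shape (8, 25, 3)
T, N = X.shape[0], X.shape[1]
s_p = np.full(N, 0.5)                      # placeholder prior attention
s_a = rng.random(N)                        # placeholder adaptive attention
t_b = np.repeat(rng.random(T // 4), 4)     # one weight per block of 4 frames
t_f = rng.random(T)                        # one weight per frame
Y = dual_attention(X, s_p, s_a, t_b, t_f)  # attended map that would be passed to phi
print(X.shape, Y.shape)
```

Because t · s^T is an outer product, the attention map needs only T + N weights instead of T × N, which is one plausible reading of why the attention is factorized this way.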

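Experimental Results

Research Questions

RQ1: Can heartbeat-rhythm cues captured from video distinguish real from fake faces across multiple DeepFake methods?
RQ2: Does the motion-magnified representation (MMSTR) reveal heartbeat differences better than the conventional spatial-temporal representation?
RQ3: Does dual-spatial-temporal attention improve robustness to facial dynamics, occlusion, and degradation compared with single-attention and no-attention baselines?
RQ4: How well does DeepRhythm generalize across datasets (FaceForensics++ and DFDC-preview) and across forgery techniques?
RQ5: How much do end-to-end training and modular training each contribute within the proposed framework?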

Key Findings

Accuracy on each test set; "-" means no value is given.

Training data	Method	DFD	DF	F2F	FS	ALL	DFDC
train on sub-datasets	Bayer & Stamm (baseline)	0.52	0.503	0.505	0.505	0.501	-
train on sub-datasets	Inception ResNet V1	0.794	0.783	0.788	0.778	0.919	-
train on sub-datasets	Xception	0.98	0.995	0.985	0.98	0.965	-
train on sub-datasets	MesoNet	0.804	0.979	0.985	0.995	0.958	-
train on sub-datasets	DeepRhythm (ours)	0.987	1.0	0.995	1.0	0.975	-
train on ALL dataset	Bayer & Stamm (baseline)	0.5	0.5	0.5	0.5	0.5	0.5
train on ALL dataset	Inception ResNet V1	0.638	0.566	0.462	0.774	0.597	0.5
train on ALL dataset	Xception	0.984	0.984	0.97	0.978	0.612	0.5
train on ALL dataset	MesoNet	0.822	0.813	0.783	0.909	0.745	-
train on ALL dataset	DeepRhythm (ours)	0.975	0.997	0.989	0.978	0.98	0.641

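When trained on the sub-datasets, DeepRhythm achieves higher accuracy on the FaceForensics++ subsets than state-of-the-art baselines such as Bayer & Stamm, Inception ResNet V1, Xception, and MesoNet.
When trained on the ALL data, it reaches competitive accuracy on DFDC-preview (0.641), exceeding Xception (0.5) and several other baselines in the cross-dataset setting.
MMSTR (the motion-magnified STR) is substantially more discriminative than the standard STR, with a marked gain over the ST baseline.
Dual spatial attention (prior + adaptive) and dual temporal attention (block-level + frame-level) yield large performance gains, and end-to-end training gives the best result (DR-mmst-APBF-e2e).
In the ablations, MMSTR alone raises average accuracy by about 0.217 over ST, adaptive and prior spatial attention each contribute about 0.061 to 0.0632, and dual temporal attention adds a further large gain, culminating in the strongest end-to-end model.
DeepRhythm is robust to degradations such as JPEG compression, blur, noise, and temporal sampling, and it maintains higher performance than the baselines under these conditions.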
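
A robustness check of this kind needs degraded copies of each test clip. The sketch below shows one way to produce three of the four degradations named above. It is an assumed setup, not the paper's protocol: the box filter stands in for whatever blur the authors used, the noise level and sampling step are arbitrary, and JPEG compression is left out because it needs an image codec. All function names are hypothetical.

```python
import numpy as np

def add_gaussian_noise(video, sigma, rng):
    """Add zero-mean Gaussian noise and clip back to the [0, 1] range."""
    return np.clip(video + rng.normal(0.0, sigma, video.shape), 0.0, 1.0)

def box_blur(video, k=3):
    """Blur each frame of a (T, H, W, C) clip with a k x k mean filter, k odd."""
    pad = k // 2
    # Edge padding keeps the output the same size as the input.
    p = np.pad(video, ((0, 0), (pad, pad), (pad, pad), (0, 0)), mode="edge")
    H, W = video.shape[1:3]
    out = np.zeros_like(video, dtype=float)
    for dy in range(k):          # sum the k*k shifted copies of each frame
        for dx in range(k):
            out += p[:, dy:dy + H, dx:dx + W, :]
    return out / (k * k)

def temporal_subsample(video, step):
    """Keep every step-th frame, lowering the effective frame rate."""
    return video[::step]

rng = np.random.default_rng(0)
video = rng.random((12, 16, 16, 3))    # synthetic clip with values in [0, 1]
noisy = add_gaussian_noise(video, 0.05, rng)
blurred = box_blur(video, k=3)
sparse = temporal_subsample(video, 3)
print(noisy.shape, blurred.shape, sparse.shape)
```

Each degraded clip would then be passed through the same detector as the clean clip, and the accuracies compared per degradation type and strength.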


This review was generated by AI and checked by a human editor.