QUICK REVIEW

[Paper Review] Recurrent Convolutional Strategies for Face Manipulation Detection in Videos

Ekraam Sabir, Jiaxin Cheng|arXiv (Cornell University)|May 2, 2019

Digital Media Forensic Detection|References: 47|Citations: 337

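One-line summary

This paper proposes a recurrent-convolutional framework with face alignment for detecting manipulated faces in videos. By exploiting temporal information, it achieves state-of-the-art accuracy on FaceForensics++.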

ABSTRACT

The spread of misinformation through synthetically generated yet realistic images and videos has become a significant problem, calling for robust manipulation detection methods. Despite the predominant effort of detecting face manipulation in still images, less attention has been paid to the identification of tampered faces in videos by taking advantage of the temporal information present in the stream. Recurrent convolutional models are a class of deep learning models which have proven effective at exploiting the temporal information from image streams across domains. We thereby distill the best strategy for combining variations in these models along with domain specific face preprocessing techniques through extensive experimentation to obtain state-of-the-art performance on publicly available video-based facial manipulation benchmarks. Specifically, we attempt to detect Deepfake, Face2Face and FaceSwap tampered faces in video streams. Evaluation is performed on the recently introduced FaceForensics++ dataset, improving the previous state-of-the-art by up to 4.55% in accuracy.

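Motivation and objectives

Detect manipulated faces in videos by exploiting temporal consistency across frames in addition to per-frame spatial cues.
Evaluate how face preprocessing (alignment) affects detection accuracy.
Explore the architecture choices (backbone CNN and recurrent design) that maximize detection performance on video manipulation benchmarks.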

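Proposed method

Crop and align face regions from video frames using either landmark-based alignment or Spatial Transformer Networks (STN).
Build a recurrent-convolutional detector that operates on face tubes, i.e. sequences of aligned face crops.
Use backbone CNNs such as DenseNet and ResNet variants, and experiment with adding GRU-based recurrence on top.
Compare a single recurrent layer with multi-stage recurrence intended to capture micro, meso, and macro features.
Train end to end on FF++ with binary real/fake supervision, using the Adam optimizer with a learning rate of 1e-4.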
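
To make the recurrent stage concrete, here is a minimal pure-Python sketch of a bidirectional GRU that scores a face tube from per-frame feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the feature vectors are random stand-ins for backbone CNN outputs, the weights are untrained, biases are omitted, and the tube length of 5, the non-overlapping windowing, and all names (`GRUCell`, `face_tubes`, `score_tube`) are hypothetical choices made for this example.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """GRU cell in the Cho et al. formulation (biases omitted for brevity)."""

    def __init__(self, rng, input_size, hidden_size):
        self.hidden_size = hidden_size
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(rng, hidden_size, input_size) for g in "zrn"}
        self.U = {g: rand_matrix(rng, hidden_size, hidden_size) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], reset_h))]
        # Blend the previous state with the candidate state.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, sequence):
        h = [0.0] * self.hidden_size
        for x in sequence:
            h = self.step(x, h)
        return h


def face_tubes(frame_features, length=5):
    """Split per-frame features into non-overlapping tubes (an assumed windowing)."""
    return [frame_features[i:i + length]
            for i in range(0, len(frame_features) - length + 1, length)]


def score_tube(tube, fwd, bwd, out_weights):
    # Bidirectional: read the tube forwards and backwards, then concatenate.
    h = fwd.run(tube) + bwd.run(tube[::-1])
    return sigmoid(sum(w * v for w, v in zip(out_weights, h)))


rng = random.Random(0)
feature_size, hidden_size = 8, 4
fwd = GRUCell(rng, feature_size, hidden_size)
bwd = GRUCell(rng, feature_size, hidden_size)
out_weights = [rng.uniform(-0.1, 0.1) for _ in range(2 * hidden_size)]

# Random stand-ins for the backbone features of 12 aligned face crops.
frames = [[rng.uniform(-1.0, 1.0) for _ in range(feature_size)] for _ in range(12)]
tubes = face_tubes(frames, length=5)
scores = [score_tube(t, fwd, bwd, out_weights) for t in tubes]
print(len(tubes), scores)
```

Each score is a value in (0, 1) that would be read as the probability of a manipulated face once the weights are trained with the binary real/fake loss. In practice a deep learning framework would supply the backbone, the GRU layers, and the Adam optimizer.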

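Experimental results

Research questions

RQ1: Can temporal information in videos improve face manipulation detection beyond frame-level cues?
RQ2: For this task, is explicit landmark-based alignment better than implicit alignment (STN)?
RQ3: Which backbone (DenseNet vs. ResNet) and temporal strategy (single vs. multi-stage recurrence, bidirectional vs. unidirectional) performs best across manipulation types?
RQ4: Is multi-stage recurrence beneficial, or does it risk overfitting given the amount of FF++ data?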

Key findings

Table 1: Accuracy by model variant, number of frames, and manipulation type (FF++ benchmark).

Manipulation	Frames	Accuracy (%), one column per model variant
Deepfake	1	93.46	94.8	94.5	96.1	96.4	-	-
Deepfake	5	-	94.6	94.7	96.0	96.7	94.9	96.9
Face2Face	1	89.8	90.25	90.65	89.31	87.18	-	-
Face2Face	5	-	90.25	89.8	92.4	93.21	93.05	94.35
FaceSwap	1	92.72	91.34	91.04	93.85	96.1	-	-
FaceSwap	5	-	90.95	93.11	95.07	95.8	95.4	96.3

Table 2: Effect of alignment and recurrence variants on performance.
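
As a quick arithmetic check on the accuracy table above, the sketch below computes, for each manipulation type, the gap between the best 5-frame accuracy and the first single-frame column. Treating the first column as the reference being improved upon is an assumption, since the column labels are not given in this review. The variable names are illustrative.

```python
# Accuracy (%) rows copied from the table above; None marks a "-" entry.
table = {
    ("Deepfake", 1): [93.46, 94.8, 94.5, 96.1, 96.4, None, None],
    ("Deepfake", 5): [None, 94.6, 94.7, 96.0, 96.7, 94.9, 96.9],
    ("Face2Face", 1): [89.8, 90.25, 90.65, 89.31, 87.18, None, None],
    ("Face2Face", 5): [None, 90.25, 89.8, 92.4, 93.21, 93.05, 94.35],
    ("FaceSwap", 1): [92.72, 91.34, 91.04, 93.85, 96.1, None, None],
    ("FaceSwap", 5): [None, 90.95, 93.11, 95.07, 95.8, 95.4, 96.3],
}


def best(row):
    """Highest accuracy in a row, ignoring missing entries."""
    return max(v for v in row if v is not None)


gains = {}
for name in ("Deepfake", "Face2Face", "FaceSwap"):
    reference = table[(name, 1)][0]  # ASSUMPTION: first column is the reference
    gains[name] = round(best(table[(name, 5)]) - reference, 2)

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

Under that assumption the largest gap is on Face2Face (94.35 vs. 89.8), which agrees with the "up to 4.55%" improvement quoted in the abstract.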

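DenseNet combined with landmark-based alignment and bidirectional GRU recurrence achieves the best performance.
Face alignment improves detection accuracy over the unaligned baseline.
Using a sequence of frames, such as a 5-frame input, outperforms single-frame input.
Bidirectional recurrence outperforms unidirectional recurrence.
STN-based alignment and multi-recurrence strategies do not improve performance and may hurt stability or lead to overfitting.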
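
Landmark-based alignment, which the findings above credit with improving accuracy, warps each face so that chosen landmarks land at fixed positions in the crop. The sketch below shows one generic way to do this: a two-point similarity transform (rotation, uniform scale, translation) computed from the eye centers using complex arithmetic. The review does not describe the paper's exact alignment procedure, so this is only an illustration; the eye coordinates, the target positions in a hypothetical 112x112 crop, and the function names are all assumptions.

```python
import cmath
import math


def eye_similarity(left_eye, right_eye, target_left, target_right):
    """Return (a, b) such that z -> a*z + b maps both eyes onto their targets."""
    p1, p2 = complex(*left_eye), complex(*right_eye)
    q1, q2 = complex(*target_left), complex(*target_right)
    a = (q2 - q1) / (p2 - p1)  # encodes rotation and uniform scale
    b = q1 - a * p1            # translation
    return a, b


def warp_point(transform, point):
    """Apply the similarity transform to a single (x, y) point."""
    a, b = transform
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical detected eye centers in a frame (the face is slightly tilted).
left_eye, right_eye = (120.0, 90.0), (180.0, 110.0)
# Hypothetical canonical eye positions in a 112x112 aligned crop.
target_left, target_right = (38.0, 52.0), (74.0, 52.0)

transform = eye_similarity(left_eye, right_eye, target_left, target_right)
rotation_deg = math.degrees(cmath.phase(transform[0]))
scale = abs(transform[0])

print(warp_point(transform, left_eye), warp_point(transform, right_eye))
print(f"rotation: {rotation_deg:.2f} deg, scale: {scale:.3f}")
```

Applying the same transform to every pixel of the frame, typically with an image library's affine warp, gives an aligned crop in which the eyes sit at the same coordinates in every frame of a face tube.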

This review was generated by AI and checked by a human editor.