QUICK REVIEW

[Paper Review] Deepfake Video Detection Using Convolutional Vision Transformer

Deressa Wodajo, Solomon Atnafu | arXiv (Cornell University) | 2021-02-22

Digital Media Forensic Detection | References: 65 | Citations: 138

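One-Line Summary

The paper proposes the Convolutional Vision Transformer (CViT), which combines CNN-based feature learning with a Vision Transformer to detect Deepfakes, and reports 91.5% accuracy and an AUC of 0.91 on the DFDC dataset. It emphasizes data preprocessing and training on a variety of DFDC-derived datasets.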

ABSTRACT

The rapid advancement of deep learning models that can generate and synthesis hyper-realistic videos known as Deepfakes and their ease of access to the general public have raised concern from all concerned bodies to their possible malicious intent use. Deep learning techniques can now generate faces, swap faces between two subjects in a video, alter facial expressions, change gender, and alter facial features, to list a few. These powerful video manipulation methods have potential use in many fields. However, they also pose a looming threat to everyone if used for harmful purposes such as identity theft, phishing, and scam. In this work, we propose a Convolutional Vision Transformer for the detection of Deepfakes. The Convolutional Vision Transformer has two components: Convolutional Neural Network (CNN) and Vision Transformer (ViT). The CNN extracts learnable features while the ViT takes in the learned features as input and categorizes them using an attention mechanism. We trained our model on the DeepFake Detection Challenge Dataset (DFDC) and have achieved 91.5 percent accuracy, an AUC value of 0.91, and a loss value of 0.32. Our contribution is that we have added a CNN module to the ViT architecture and have achieved a competitive result on the DFDC dataset.
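
To make the abstract's phrase "categorizes them using an attention mechanism" concrete, the following is a minimal single-head scaled dot-product attention sketch in plain Python. It is an illustration only, not the authors' implementation: the function names and the toy token values are hypothetical, and the CViT encoder described in this review uses multi-head attention (8 heads) rather than the single head shown here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention, softmax(Q K^T / sqrt(d)) V, on nested lists."""
    d = len(keys[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        output.append(mixed)
    return output


# Toy input: 3 tokens with 2 features each (made-up numbers, not from the paper).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each output row is a convex combination of the value vectors, so every token's new representation can draw on all other tokens; this is the sense in which attention captures global context.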

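Motivation and Objectives

Pursue robust Deepfake detection in the face of easily accessible generation tools and varied settings.
Develop a generalized detector that jointly learns local and global features through a CNN and a Transformer.
Emphasize improving generalization through consistent data preprocessing and diverse training data.
Evaluate CViT on multiple Deepfake datasets and compare it with existing models.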

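Proposed Method

Two-component CViT: a CNN-based feature-learning module (17 conv layers, 512x7x7 output) followed by a Vision Transformer (ViT) classifier.
For input preparation, faces are extracted as 224x224 RGB images and data augmentation is applied.
The ViT component embeds the patches (seven of them) into a 1x1024 sequence and uses position embeddings; the encoder has 8 attention heads.
Training uses the Adam optimizer (lr=0.001, weight decay=1e-7) with binary cross-entropy loss for 50 epochs at batch size 32.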
Dataset composition: 162,174 images split roughly 70/15/15 into 112,378 training, 24,898 validation, and 24,898 test images (308,130 in total with augmentation).
Evaluation covers accuracy, AUC, and log loss; filtering faces with face_recognition improved the reliability of face extraction.
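
The shape bookkeeping implied above (224x224 input, 17 conv layers, 512x7x7 output) can be checked with a short sketch. The stage layout below is an assumption made for illustration: it supposes 3x3 convolutions with padding 1 that preserve spatial size, and five 2x2 max-pooling steps that each halve it, with a hypothetical per-stage split of the 17 layers and hypothetical intermediate channel widths. Only the totals come from the review.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel=2, stride=2):
    """Spatial size after max pooling without padding."""
    return (size - kernel) // stride + 1


# Hypothetical stage layout: (number of conv layers, output channels).
# Only the totals (17 conv layers, 512x7x7 output) come from the review.
STAGES = [(3, 32), (3, 64), (3, 128), (4, 256), (4, 512)]


def feature_map_shape(input_size=224):
    """Return (conv layer count, (channels, height, width)) after all stages."""
    size, channels, conv_layers = input_size, 3, 0
    for n_convs, out_channels in STAGES:
        for _ in range(n_convs):
            size = conv_out(size)  # 3x3, padding 1: size unchanged
            conv_layers += 1
        channels = out_channels
        size = pool_out(size)  # 2x2, stride 2: size halved
    return conv_layers, (channels, size, size)


conv_layers, shape = feature_map_shape()
print(conv_layers, shape)  # prints: 17 (512, 7, 7)
```

Halving 224 five times gives 7, which is why a five-stage design of this kind lands on a 7x7 feature map.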

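Experimental Results

Research Questions

RQ1: Can CViT effectively detect Deepfakes across diverse real-world conditions and datasets?
RQ2: Does combining CNN-based local feature learning with Transformer-based global attention improve detection performance over the baseline?
RQ3: How does data preprocessing affect Deepfake detection performance, and what role does face-detection reliability play?
RQ4: How well does CViT generalize to Deepfake datasets beyond DFDC?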

Key Results

CViT achieves 91.5% accuracy, an AUC of 0.91, and a loss of 0.32 on 400 unseen DFDC videos.
CViT's performance varies across the FaceForensics++ subsets: 69% (FaceSwap), 91% (DeepFakeDetection), 93% (Deepfake), 46% (FaceShifter), 60% (NeuralTextures).
Compared with the CNN+RNN-GRU baseline, CViT is competitive on DFDC (Table 2 lists CNN+RNN-GRU at 91.88% versus 91.5% for CViT).
Trying several face detectors (BlazeFace, MTCNN, face_recognition) and selecting the best filter (face_recognition) raises DFDC accuracy from 69.5% (no filtering) to 91.5%.
The authors acknowledge room for improvement and propose adding more datasets to increase diversity and robustness.
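
For readers who want to see how the three reported metrics are defined, here is a small stdlib-only sketch. The labels and scores are made-up toy values, not data from the paper, and the convention that 1 means fake is an assumption. AUC is computed as the fraction of (fake, real) pairs in which the fake sample receives the higher score, with ties counted as one half.

```python
import math


def accuracy(labels, probs, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(labels, probs))
    return hits / len(labels)


def log_loss(labels, probs, eps=1e-12):
    """Mean binary cross-entropy, with scores clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


def roc_auc(labels, probs):
    """Probability that a random positive outscores a random negative."""
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


# Made-up toy scores (assumed convention: 1 = fake, 0 = real).
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]

acc = accuracy(labels, probs)
auc = roc_auc(labels, probs)
loss = log_loss(labels, probs)
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank fake against real samples, so the two can diverge on the same predictions.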

This review was generated by AI and reviewed by a human editor.