QUICK REVIEW

[Paper Review] Interactive Face Video Coding: A Generative Compression Framework

Bolin Chen, Zhao Wang | arXiv (Cornell University) | February 20, 2023

Face recognition and analysis | Citations: 10

One-Line Summary

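This paper introduces Interactive Face Video Coding (IFVC), which encodes faces into an ultra-compact, semantically meaningful 3D face representation based on Internal Dimension Increase (IDI) and uses a GAN-based decoder to render interactive, privacy-preserving face video at ultra-low bitrates. It achieves better rate-distortion performance than VVC and earlier generative methods while enabling direct interaction at the semantic level.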

ABSTRACT

In this paper, we propose a novel framework for Interactive Face Video Coding (IFVC), which allows humans to interact with the intrinsic visual representations instead of the signals. The proposed solution enjoys several distinct advantages, including ultra-compact representation, low delay interaction, and vivid expression/headpose animation. In particular, we propose the Internal Dimension Increase (IDI) based representation, greatly enhancing the fidelity and flexibility in rendering the appearance while maintaining reasonable representation cost. By leveraging strong statistical regularities, the visual signals can be effectively projected into controllable semantics in the three dimensional space (e.g., mouth motion, eye blinking, head rotation, head translation and head location), which are compressed and transmitted. The editable bitstream, which naturally supports the interactivity at the semantic level, can synthesize the face frames via the strong inference ability of the deep generative model. Experimental results have demonstrated the performance superiority and application prospects of our proposed IFVC scheme. In particular, the proposed scheme not only outperforms the state-of-the-art video coding standard Versatile Video Coding (VVC) and the latest generative compression schemes in terms of rate-distortion performance for face videos, but also enables the interactive coding without introducing additional manipulation processes. Furthermore, the proposed framework is expected to shed lights on the future design of the digital human communication in the metaverse.

Research Motivation and Goals

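Pursue interactive face video coding with ultra-low delay and semantically controllable reconstruction.
Develop an ultra-compact, editable representation space for facial semantics.
Enable interactivity at the semantic level without additional manipulation steps.
Leverage deep generative models to synthesize high-quality frames from the compact representation.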

Proposed Method

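Project 2D face frames into a 14-dimensional facial semantic space covering mouth, eye, head pose, translation, and related parameters.
Encode the key reference frame with a standard image codec (VVC intra coding).
Encode inter frames via semantic residuals and context-based entropy coding (PPM) to form the compressed bitstream.
Decode with a WM3DR-based model that reconstructs a 3D face mesh from the semantics, and render frames with a SPADE-guided CSSFT-GAN.
Use mesh-based motion estimation to produce a dense motion field and a facial attention map for frame generation.
Edit the semantic parameters at the decoder side to enable interactive manipulation.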
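
The inter-frame step and the decoder-side editing described above can be illustrated with a small sketch. It is not the authors' implementation: the quantization step, the parameter layout (index 0 standing in for a head-rotation value), and the helper names are all hypothetical, and `zlib` is used as a simple stand-in for the PPM entropy coder named in the review. The point is only to show how per-frame semantic vectors can be coded as residuals against the previous frame and then edited after decoding, before any rendering takes place.

```python
import zlib
import numpy as np

NUM_SEMANTICS = 14   # dimensionality stated in the review
STEP = 0.01          # hypothetical uniform quantization step (assumption)

def encode(frames: np.ndarray, step: float = STEP) -> bytes:
    """Quantize per-frame semantics and code residuals against the previous frame."""
    q = np.round(frames / step).astype(np.int32)           # integer symbols
    first = np.zeros((1, q.shape[1]), dtype=np.int32)      # predictor for frame 0
    residuals = np.diff(q, axis=0, prepend=first)          # q[t] - q[t-1]
    # zlib is a stand-in here; the review names PPM as the entropy coder.
    return zlib.compress(residuals.astype(np.int32).tobytes(), 9)

def decode(payload: bytes, num_semantics: int = NUM_SEMANTICS,
           step: float = STEP) -> np.ndarray:
    """Invert encode(): accumulate residuals and undo the quantization."""
    residuals = np.frombuffer(zlib.decompress(payload), dtype=np.int32)
    residuals = residuals.reshape(-1, num_semantics)
    return np.cumsum(residuals, axis=0) * step

def edit_semantics(semantics: np.ndarray, index: int, offset: float) -> np.ndarray:
    """Decoder-side interaction: shift one semantic parameter in every frame."""
    edited = semantics.copy()
    edited[:, index] += offset
    return edited

# Synthetic, smoothly varying semantics for 50 frames (illustrative data only).
t = np.linspace(0.0, 1.0, 50)[:, None]
frames = 0.5 * np.sin(2 * np.pi * t + np.arange(NUM_SEMANTICS))

payload = encode(frames)
decoded = decode(payload)
# Hypothetical layout: treat index 0 as a head-rotation parameter.
edited = edit_semantics(decoded, index=0, offset=0.2)

print(len(payload), "bytes for", frames.shape[0], "frames")
```

Because the residuals are computed on the quantized integers, the decoder recovers exactly the quantized values and no drift accumulates across frames. The edited array would then be handed to the mesh reconstruction and GAN rendering stages, which are outside the scope of this sketch.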

Experimental Results

Research Questions

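RQ1: Can IFVC achieve ultra-low-bitrate face video coding with interactive control?
RQ2: Does the IDI-based 3D semantic representation provide enough fidelity and flexibility for realistic reconstruction and manipulation?
RQ3: How does IFVC compare with VVC and existing generative compression schemes in rate-distortion performance on face videos?
RQ4: Can decoder-only manipulation of the semantic bitstream preserve privacy while maintaining quality?
RQ5: How effective is the GAN-based decoder at rendering high-quality frames from the semantic representation?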

Key Findings

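IFVC delivers high-quality face video reconstruction at ultra-low bitrates by compressing a 14-dimensional semantic parameter space.
The framework outperforms VVC and recent generative compression methods in rate-distortion performance on face videos.
The IDI representation enables controllable manipulation of mouth motion, eye blinking, head rotation, and head translation.
The editable bitstream supports facial semantic editing without additional manipulation steps.
The decoder achieves vivid reconstruction using GAN-based synthesis with a dense motion field and attention maps.
The approach supports privacy-preserving use by allowing rendering from a texture template or a virtual reference.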
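
To see why a 14-dimensional semantic representation lends itself to ultra-low bitrates, a back-of-the-envelope estimate of the raw parameter cost can help. The frame rate and bits per parameter below are assumptions chosen for illustration, not figures from the paper, and the result is the cost before residual and entropy coding and excluding the key reference frame.

```python
def raw_semantic_bitrate_kbps(num_params: int = 14,
                              bits_per_param: int = 16,
                              fps: int = 25) -> float:
    """Uncompressed cost of sending every semantic parameter for every frame.

    bits_per_param and fps are illustrative assumptions, not values from the paper.
    """
    return num_params * bits_per_param * fps / 1000

rate = raw_semantic_bitrate_kbps()
print(f"{rate} kbps before entropy coding")  # 14 * 16 * 25 / 1000 = 5.6
```

Under these assumed values the semantic stream stays at a few kilobits per second even without any compression, which gives a sense of the headroom the residual and entropy coding stages start from.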


This review was generated by AI and reviewed by a human editor.