QUICK REVIEW

[Paper Review] GPAvatar: Generalizable and Precise Head Avatar from Image(s)

Xuangeng Chu, Li Yu|arXiv (Cornell University)|2024. 01. 18.

Advanced Vision and Imaging | Citations: 5

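One-Line Summary

GPAvatar reconstructs an animatable 3D head avatar from one or more images in a single forward pass. It uses a dynamic point-based expression field and a Multi Tri-planes Attention fusion module to achieve precise expression control and multi-view consistency without test-time optimization.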

ABSTRACT

Head avatar reconstruction, crucial for applications in virtual reality, online meetings, gaming, and film industries, has garnered substantial attention within the computer vision community. The fundamental objective of this field is to faithfully recreate the head avatar and precisely control expressions and postures. Existing methods, categorized into 2D-based warping, mesh-based, and neural rendering approaches, present challenges in maintaining multi-view consistency, incorporating non-facial information, and generalizing to new identities. In this paper, we propose a framework named GPAvatar that reconstructs 3D head avatars from one or several images in a single forward pass. The key idea of this work is to introduce a dynamic point-based expression field driven by a point cloud to precisely and effectively capture expressions. Furthermore, we use a Multi Tri-planes Attention (MTA) fusion module in the tri-planes canonical field to leverage information from multiple input images. The proposed method achieves faithful identity reconstruction, precise expression control, and multi-view consistency, demonstrating promising results for free-viewpoint rendering and novel view synthesis.

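Research Motivation and Goals

Advance accurate, expressive 3D head avatar reconstruction that generalizes to previously unseen identities.
Enable fine-grained expression control while faithfully preserving identity.
Achieve multi-view consistency and robustness to occlusion without test-time optimization.
Use a point-based expression field and multi-image fusion to disentangle identity from expression.
Provide a fast inference pipeline suitable for real-time or interactive applications.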

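Proposed Method

Use a tri-plane canonical feature space as the backbone for representing the 3D head avatar.
Introduce a Point-based Expression Field (PEF), built from FLAME vertex points that carry learnable weights, to capture dynamic expressions.
Bind the PEF to the NeRF sampling process through relative position encoding with nearest-point-based regression.
Fuse information from multiple input images with the Multi Tri-planes Attention (MTA) module to improve occluded or under-observed regions.
Render the volume with two-stage hierarchical sampling, then apply a lightweight super-resolution module for high-quality output.
Train end to end with L1 and perceptual losses on both the low- and high-resolution reenactments, plus a density norm loss, to reduce artifacts.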
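To make the PEF binding step more concrete, here is a minimal NumPy sketch of a nearest-point query: for one NeRF sample position it gathers the k closest points, their relative offsets, and a blended feature. The function name `query_expression_field`, the inverse-distance weighting, and the toy data are assumptions for illustration only. The paper regresses the feature with a learned network over encoded relative positions, which this sketch does not reproduce.

```python
import numpy as np


def query_expression_field(query, points, point_feats, k=4, eps=1e-8):
    """Return a blended feature, relative offsets, and indices of the k nearest points.

    query:       (3,)   position of one sample along a camera ray
    points:      (P, 3) point positions (e.g. mesh vertices for the current expression)
    point_feats: (P, C) per-point feature vectors (learnable in a real model)
    """
    rel = query[None, :] - points          # (P, 3) offset from every point to the query
    dist = np.linalg.norm(rel, axis=1)     # (P,)  Euclidean distances
    nearest = np.argsort(dist)[:k]         # indices of the k closest points

    # ASSUMPTION: inverse-distance weights stand in for the learned regression.
    weights = 1.0 / (dist[nearest] + eps)
    weights = weights / weights.sum()

    feat = weights @ point_feats[nearest]  # (C,) blended feature
    return feat, rel[nearest], nearest


# Toy data (hypothetical): five points with one-hot features.
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [5.0, 5.0, 5.0]])
point_feats = np.eye(5)
query = np.array([0.1, 0.0, 0.0])

feat, rel, nearest = query_expression_field(query, points, point_feats, k=2)
print(nearest)  # indices of the two closest points
print(rel)      # offsets that a real model would positionally encode
print(feat)     # feature dominated by the closest point
```

In a full pipeline the returned offsets would be positionally encoded and passed, together with the point features, to a small network. The inverse-distance blend only marks where that learned step would sit.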

Figure 1: Our GPAvatar is able to reconstruct 3D head avatars from even a single input ( i.e. , one-shot), with strong generalization and precise expression control. The leftmost images are the inputs, and the subsequent images depict reenactment results. Inset images display the corresponding drivi

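Experimental Results

Research Questions

RQ1: Can GPAvatar generalize to unseen identities from one or a few images without test-time optimization?
RQ2: Does the point-based expression field enable finer and more natural expression control than 3DMM-based or NeRF-based alternatives?
RQ3: How does integrating multiple input images through Multi Tri-planes Attention affect reenactment quality under occlusion or extreme poses?
RQ4: How do PEF and MTA affect objective metrics of synthesis quality and expression accuracy across datasets?
RQ5: Is inference fast enough for practical reenactment and free-viewpoint rendering on standard hardware?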

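Key Results

GPAvatar delivers faithful identity reconstruction and precise expression control in a single forward pass.
PEF provides natural cross-identity expression control and improves expression accuracy (AED, AKD) over the baselines.
MTA effectively fuses information from multiple inputs, improving detail and handling occlusion without the blur caused by averaging.
On VFHQ and HDTF, the method achieves state-of-the-art synthesis quality and expression control in both self-reenactment and cross-identity reenactment settings.
Inference runs at about 15 FPS on an A100 GPU, and training completes in about 50 GPU hours (150k iterations).
Ablation studies show that PEF and MTA each provide substantial gains, and that global point sampling outperforms local or purely localized sampling.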
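The contrast between attention-based fusion and plain averaging can be illustrated with a small NumPy sketch. It fuses feature planes from several images with per-pixel softmax weights, so a view that carries signal at a location outweighs a view that is empty there. The function name `fuse_planes`, the fixed `query` array, and the use of identity key and value projections are assumptions made for this example. The actual MTA module uses learned components and operates on all three planes of each tri-plane.

```python
import numpy as np


def fuse_planes(planes, query):
    """Fuse N feature planes with per-pixel attention weights.

    planes: (N, C, H, W) one feature plane per input image
    query:  (C, H, W)    query plane (learnable in a real model)
    """
    n_channels = planes.shape[1]
    # Scaled dot product between the query and each image's features, per pixel.
    scores = (planes * query[None]).sum(axis=1) / np.sqrt(n_channels)  # (N, H, W)

    # Numerically stable softmax over the image axis.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)             # (N, H, W)

    fused = (weights[:, None] * planes).sum(axis=0)                    # (C, H, W)
    return fused, weights


# Toy data (hypothetical): two images, two channels, a 1x2 plane.
# Image 0 only observes the left pixel, image 1 only the right pixel.
planes = np.zeros((2, 2, 1, 2))
planes[0, 0, 0, 0] = 4.0
planes[1, 0, 0, 1] = 4.0

query = np.zeros((2, 1, 2))
query[0] = 1.0  # ASSUMPTION: the query responds to channel 0

fused, weights = fuse_planes(planes, query)
averaged = planes.mean(axis=0)

print(weights[:, 0, 0])  # the left pixel favors image 0
print(fused[0, 0, 0], averaged[0, 0, 0])  # attention keeps more signal than the mean
```

With plain averaging, the empty view halves the feature at each pixel. The attention weights instead shift toward whichever image carries signal there, which is one way to read the "no averaging blur" claim.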

Figure 2: Differences from existing state-of-the-art methods. Existing methods may over-process expression information or use expression features, leading to expression detail loss. Our approach avoids this loss with a point-based expression field, and our method flexibly accepts single or multiple


This review was generated by AI and reviewed by a human editor.