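[Paper Review] Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation
A diffusion-based framework (D3DP) generates multiple 3D pose hypotheses from 2D keypoint input, and a new joint-wise reprojection-based method (JPMA) aggregates them into a single accurate 3D pose. On public benchmarks it outperforms state-of-the-art deterministic and probabilistic methods.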
In this paper, a novel Diffusion-based 3D Pose estimation (D3DP) method with Joint-wise reProjection-based Multi-hypothesis Aggregation (JPMA) is proposed for probabilistic 3D human pose estimation. On the one hand, D3DP generates multiple possible 3D pose hypotheses for a single 2D observation. It gradually diffuses the ground truth 3D poses to a random distribution, and learns a denoiser conditioned on 2D keypoints to recover the uncontaminated 3D poses. The proposed D3DP is compatible with existing 3D pose estimators and supports users to balance efficiency and accuracy during inference through two customizable parameters. On the other hand, JPMA is proposed to assemble multiple hypotheses generated by D3DP into a single 3D pose for practical use. It reprojects 3D pose hypotheses to the 2D camera plane, selects the best hypothesis joint-by-joint based on the reprojection errors, and combines the selected joints into the final pose. The proposed JPMA conducts aggregation at the joint level and makes use of the 2D prior information, both of which have been overlooked by previous approaches. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the state-of-the-art deterministic and probabilistic approaches by 1.5% and 8.9%, respectively. Code is available at https://github.com/paTRICK-swk/D3DP.
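Motivation and Objectives
- Motivates probabilistic 3D human pose estimation as a way to handle depth ambiguity in the monocular setting.
- Proposes the Diffusion-based 3D Pose Estimation (D3DP) framework, which generates multiple pose hypotheses conditioned on 2D keypoints.
- Introduces Joint-wise Reprojection-based Multi-Hypothesis Aggregation (JPMA), which combines hypotheses at the joint level into a single high-quality 3D pose.
- Provides a mechanism for trading off efficiency against accuracy at inference time, and keeps D3DP compatible with existing 3D pose backbones.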
Proposed Method
- Diffusion-based 3D Pose Estimation (D3DP): train a denoiser conditioned on 2D keypoints to recover clean 3D poses from diffused ground-truth poses; at inference, generate H pose hypotheses and refine them over K denoising iterations, with both H and K configurable.
- Training follows DDPM-style loss: L = || y0 - D(y_t, x, t) ||_2 with y_t a noised ground-truth pose and t uniform in [0, T].
- Inference samples H initial poses from Gaussian noise and refines via a denoiser conditioned on 2D keypoints; DDIM-based re-sampling allows iterative refinement with K steps.
- Joint-wise Reprojection-based Multi-Hypothesis Aggregation (JPMA): reproject 3D pose hypotheses to the 2D camera plane using known/estimated intrinsics, compute joint-wise reprojection errors to select the best hypothesis per joint, and assemble the final 3D pose.
- JPMA leverages 2D priors and performs aggregation at the joint level, yielding higher upper-bound performance than pose-level aggregation.
- Architecture: uses MixSTE as the backbone for the denoiser and fuses 2D keypoints with noisy 3D poses via simple concatenation; employs a sinusoidal timestep embedding.
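The training objective in the list above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine noise schedule, the function names (`cosine_alpha_bar`, `diffuse`, `denoiser_loss`), and the placeholder denoiser are assumptions made for illustration, and a real denoiser would be a trained network such as the MixSTE backbone mentioned above.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction alpha_bar[t] for t = 0..T.

    The cosine schedule is an assumption of this sketch.
    """
    steps = np.arange(T + 1)
    f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def diffuse(y0, t, alpha_bar, rng):
    """Sample y_t ~ N(sqrt(a_t) * y0, (1 - a_t) * I), the noised ground-truth pose."""
    a = alpha_bar[t]
    noise = rng.standard_normal(y0.shape)
    return np.sqrt(a) * y0 + np.sqrt(1.0 - a) * noise

def denoiser_loss(denoiser, y0, x2d, t, alpha_bar, rng):
    """L = || y0 - D(y_t, x, t) ||_2, with the pose flattened to one vector."""
    y_t = diffuse(y0, t, alpha_bar, rng)
    y0_hat = denoiser(y_t, x2d, t)
    return np.linalg.norm(y0 - y0_hat)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T = 1000                                  # hypothetical number of timesteps
    alpha_bar = cosine_alpha_bar(T)
    y0 = rng.standard_normal((17, 3))         # hypothetical 17-joint 3D pose
    x2d = rng.standard_normal((17, 2))        # hypothetical 2D keypoints
    t = int(rng.integers(0, T + 1))           # t drawn uniformly from [0, T]
    placeholder = lambda y_t, x, step: y_t    # stand-in for a trained network
    print(denoiser_loss(placeholder, y0, x2d, t, alpha_bar, rng))
```

At t = 0 the sample equals the clean pose, and at t = T it is almost pure Gaussian noise, which matches the inference procedure of starting each hypothesis from noise.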
Experimental Results
Research Questions
- RQ1: Can diffusion models effectively generate multiple plausible 3D pose hypotheses from 2D keypoints for monocular 3D pose estimation?
- RQ2: Does joint-level aggregation via reprojection errors improve final 3D pose accuracy over traditional pose-level aggregation or averaging?
- RQ3: How do the number of hypotheses (H) and iterations (K) affect accuracy and efficiency in practice?
- RQ4: Is the proposed D3DP+JPMA framework compatible with existing deterministic 3D pose estimators as backbones and conditioning schemes?
- RQ5: What gains can be achieved on standard benchmarks (Human3.6M, MPI-INF-3DHP, 3DPW) with joint-wise aggregation?
Key Results
- D3DP achieves state-of-the-art results on Human3.6M in MPJPE for single-hypothesis settings and outperforms several probabilistic baselines when using joint-level aggregation.
- JPMA yields higher upper-bound performance than pose-level aggregation, enabling joint-wise selection of the best hypothesis per joint guided by 2D reprojection errors.
- Increasing hypotheses (H) and iterations (K) improves results under joint-level aggregation, with notable gains when using joint-wise best selections (J-best) over pose-wise best (P-best).
- Compared to deterministic and probabilistic baselines on MPI-INF-3DHP, D3DP reduces MPJPE by notable margins and achieves competitive PCK and AUC scores.
- The method demonstrates compatibility with existing backbones (e.g., MixSTE) and balances accuracy vs. efficiency through controllable H and K parameters.
- Code is released at the authors’ GitHub: https://github.com/paTRICK-swk/D3DP.
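The joint-level selection and the J-best versus P-best comparison described above can be sketched as follows. This is an illustrative sketch rather than the released code: it assumes hypotheses are given in absolute camera coordinates with positive depth, uses a simple pinhole model with hypothetical intrinsics `f` and `c`, and all function names are invented for the example.

```python
import numpy as np

def project(poses3d, f, c):
    """Pinhole projection of camera-frame joints (..., J, 3) to pixels (..., J, 2)."""
    return f * poses3d[..., :2] / poses3d[..., 2:3] + c

def select_per_joint(hyps, err):
    """For each joint, take it from the hypothesis with the lowest error.

    hyps: (H, J, 3) hypotheses, err: (H, J) per-joint errors.
    """
    best = err.argmin(axis=0)                        # (J,) winning hypothesis per joint
    return hyps[best, np.arange(hyps.shape[1])]      # (J, 3) assembled pose

def jpma(hyps, kpts2d, f, c):
    """Joint-wise aggregation guided by 2D reprojection error (no ground truth needed)."""
    err2d = np.linalg.norm(project(hyps, f, c) - kpts2d, axis=-1)
    return select_per_joint(hyps, err2d)

def mpjpe(pred, gt):
    """Mean per-joint position error between two (J, 3) poses."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_best(hyps, gt):
    """Oracle pose-level selection: the single hypothesis closest to ground truth."""
    err3d = np.linalg.norm(hyps - gt, axis=-1)       # (H, J)
    return hyps[err3d.mean(axis=1).argmin()]

def j_best(hyps, gt):
    """Oracle joint-level selection: the closest hypothesis for every joint."""
    err3d = np.linalg.norm(hyps - gt, axis=-1)
    return select_per_joint(hyps, err3d)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(-1, 1, (17, 3))
    gt[:, 2] += 4.0                                  # keep every joint in front of the camera
    f, c = np.array([1000.0, 1000.0]), np.array([500.0, 500.0])
    hyps = gt + 0.05 * rng.standard_normal((20, 17, 3))
    print("P-best:", mpjpe(p_best(hyps, gt), gt))
    print("J-best:", mpjpe(j_best(hyps, gt), gt))
    print("JPMA  :", mpjpe(jpma(hyps, project(gt, f, c), f, c), gt))
```

Because the per-joint minimum can never exceed the error of that joint in any single hypothesis, the J-best error is never larger than the P-best error, which is why joint-level aggregation has the higher upper bound. JPMA replaces the unavailable 3D error with the 2D reprojection error, so its result depends on how well the 2D error tracks the 3D error.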
This review was generated by AI and reviewed by a human editor.