[Paper Walkthrough] TransFusion: Cross-view Fusion with Transformer for 3D Human Pose Estimation
Introduces TransFusion, a lightweight transformer-based framework that fuses cues from multiple views for 2D pose refinement and 3D pose estimation, using an epipolar field and geometry positional encoding to encode cross-view correspondences.
Estimating the 2D human poses in each view is typically the first step in calibrated multi-view 3D pose estimation. But the performance of 2D pose detectors suffers in challenging situations such as occlusions and oblique viewing angles. To address these challenges, previous works derive point-to-point correspondences between different views from epipolar geometry and use the correspondences to merge prediction heatmaps or feature representations. Instead of post-prediction merge/calibration, here we introduce a transformer framework for multi-view 3D pose estimation, aiming at directly improving individual 2D predictors by integrating information from different views. Inspired by previous multi-modal transformers, we design a unified transformer architecture, named TransFusion, to fuse cues from both current views and neighboring views. Moreover, we propose the concept of an epipolar field to encode 3D positional information into the transformer model. The 3D position encoding guided by the epipolar field provides an efficient way of encoding correspondences between pixels of different views. Experiments on Human3.6M and Ski-Pose show that our method is more efficient and improves consistently over other fusion methods. Specifically, we achieve 25.8 mm MPJPE on Human3.6M with only 5M parameters at 256 × 256 resolution.
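The point-to-point correspondences that prior work derives come from standard epipolar geometry. As a minimal sketch (not the authors' code), assuming pinhole cameras with world-to-camera extrinsics x_cam = R X + t, the snippet below builds a fundamental matrix and measures how far a pixel in the second view lies from the epipolar line of a pixel in the first view. All camera values and helper names are illustrative assumptions.

```python
import numpy as np

def skew(v):
    """Return the 3x3 cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_matrix(K1, R1, t1, K2, R2, t2):
    """Fundamental matrix mapping view-1 pixels to epipolar lines in view 2."""
    R = R2 @ R1.T          # relative rotation, camera 1 -> camera 2
    t = t2 - R @ t1        # relative translation
    E = skew(t) @ R        # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def project(K, R, t, X):
    """Project a 3D world point to pixel coordinates."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def epipolar_distance(F, x1, x2):
    """Distance in pixels from x2 (view 2) to the epipolar line of x1 (view 1)."""
    line = F @ np.append(x1, 1.0)
    return abs(line @ np.append(x2, 1.0)) / np.hypot(line[0], line[1])

# Hypothetical two-camera rig (values chosen only for illustration).
K = np.array([[1000.0, 0.0, 128.0], [0.0, 1000.0, 128.0], [0.0, 0.0, 1.0]])
R1, t1 = np.eye(3), np.zeros(3)
a = np.deg2rad(30.0)
R2 = np.array([[np.cos(a), 0.0, np.sin(a)],
               [0.0, 1.0, 0.0],
               [-np.sin(a), 0.0, np.cos(a)]])
t2 = np.array([-1.0, 0.0, 0.5])

X = np.array([0.1, -0.2, 4.0])                 # a 3D joint position
F = fundamental_matrix(K, R1, t1, K, R2, t2)
x1 = project(K, R1, t1, X)
x2 = project(K, R2, t2, X)

d_match = epipolar_distance(F, x1, x2)                         # true correspondence
d_off = epipolar_distance(F, x1, x2 + np.array([0.0, 20.0]))   # shifted pixel
```

The true correspondence lies on the epipolar line (distance near zero), while the shifted pixel does not. Fusion methods that sample only along this line ignore the rest of the reference view, which is the restriction TransFusion is described as lifting.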
Motivation and Goals
- Improve 3D multi-view human pose estimation by directly enhancing 2D detectors through cross-view fusion.
- Leverage a transformer to fuse information from current and neighboring views beyond epipolar-line constraints.
- Introduce an Epipolar Field and Geometry Positional Encoding to encode cross-view 3D correspondences.
- Demonstrate efficiency and accuracy gains over prior fusion methods on standard multi-view datasets.
Proposed Method
- A two-view CNN backbone extracts low-level features from each view.
- A shared transformer encoder fuses features from both views with 2D sine positional encoding and Geometry Positional Encoding (GPE).
- The Epipolar Field guides cross-view attention by modeling pixel correspondences across views beyond the epipolar line; it is enforced via a learned pose loss L_pos.
- GPE encodes 3D directional information using unit rays from camera centers, enabling 3D-aware attention.
- A prediction head outputs 2D joint heatmaps per view; training optimizes heatmap MSE with an additional cross-view correspondence loss.
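The unit rays behind the Geometry Positional Encoding can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: it assumes a pinhole camera with intrinsics K and world-to-camera rotation R, uses integer pixel coordinates, and stops at the raw per-token ray directions. How TransFusion maps these rays into the embedding dimension is not stated in this summary. The function name `pixel_unit_rays` is hypothetical.

```python
import numpy as np

def pixel_unit_rays(K, R, height, width):
    """Unit viewing-ray direction (world frame) for every pixel of a feature map.

    Returns an array of shape (height * width, 3), ordered row by row so that
    token index = v * width + u.
    """
    us, vs = np.meshgrid(np.arange(width), np.arange(height))  # each (H, W)
    pix = np.stack([us.ravel(), vs.ravel(), np.ones(height * width)], axis=0)
    rays_cam = np.linalg.inv(K) @ pix      # back-project pixels to camera rays
    rays_world = R.T @ rays_cam            # rotate rays into the world frame
    rays_world /= np.linalg.norm(rays_world, axis=0, keepdims=True)
    return rays_world.T

# Hypothetical 64 x 64 feature map with the principal point at pixel (32, 32).
K = np.array([[80.0, 0.0, 32.0], [0.0, 80.0, 32.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
rays = pixel_unit_rays(K, R, 64, 64)
center_ray = rays[32 * 64 + 32]            # ray through the principal point
```

Because every ray is expressed in the shared world frame, tokens from different cameras become directly comparable. One plausible use (an assumption here) is to pass `rays` through a learned linear layer and add the result to the token embeddings alongside the 2D sine encoding.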
Experimental Results
Research Questions
- RQ1: Can a unified transformer-based architecture improve 2D pose detection by leveraging information from multiple views?
- RQ2: Does incorporating geometry-aware positional encoding and epipolar-field guidance enhance cross-view attention for 3D pose estimation?
- RQ3: How does TransFusion compare to prior fusion methods in accuracy and parameter efficiency on standard datasets?
Main Findings
- TransFusion consistently matches or surpasses prior fusion methods in 2D and 3D pose metrics on Human3.6M and Ski-Pose.
- It achieves 25.8 mm MPJPE on Human3.6M with only 5M parameters, significantly fewer than competing fusion approaches.
- The method benefits from fusing information across entire reference views rather than restricting to epipolar lines, improving performance on occlusion-heavy sequences.
- Ablation studies show 3D geometry positional encoding and epipolar-field guidance are critical for cross-view correspondence and 3D accuracy.
- On Ski-Pose, TransFusion outperforms single-view baselines and remains lightweight compared to large fusion models.
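The MPJPE figures quoted above refer to mean per-joint position error: the Euclidean distance between predicted and ground-truth 3D joints, averaged over joints and frames. The sketch below shows the standard definition. The array layout (frames, joints, 3) and the toy values are assumptions for illustration, and any alignment step of a specific evaluation protocol is not covered.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error, in the same unit as the inputs (e.g. mm)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if pred.shape != gt.shape:
        raise ValueError("pred and gt must have the same shape")
    # Euclidean distance per joint, then mean over all joints and frames.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example: 2 frames, 17 joints, every joint off by (3, 4, 0) mm.
gt = np.zeros((2, 17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
error_mm = mpjpe(pred, gt)   # 5.0
```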
This walkthrough was generated by AI and reviewed by human editors.