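[Paper Walkthrough] Depth Completion as Parameter-Efficient Test-Time Adaptation
CAPA adapts pre-trained 3D foundation models to depth completion through parameter-efficient test-time adaptation: it updates only a lightweight PEFT component (LoRA or VPT) while keeping the backbone frozen, and achieves state-of-the-art results on indoor and outdoor datasets. It also extends to video through sequence-level parameter sharing, which provides temporal consistency.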
We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
Motivation and Objectives
- Ground the geometric prior of a frozen 3D foundation model using the sparse depth cues available at test time.
- Develop a parameter-efficient adaptation framework that preserves the base model while updating few parameters.
- Extend CAPA to video by sharing parameters across frames to improve temporal consistency.
- Evaluate CAPA across indoor and outdoor datasets and with multiple base models and PEFT strategies.
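Proposed Method
- Freeze the backbone of a ViT-based 3D foundation model and update only a compact PEFT component.
- Apply either LoRA (low-rank updates to W_q, W_k, and W_v) or Visual Prompt Tuning (learnable prompt tokens prepended at the attention layers).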
- Compute a per-sample affine alignment (scale and shift) with the sparse depth to resolve scale ambiguity, then backpropagate the L1 loss on valid pixels.
- For videos, share the same trainable parameters across frames and optimize with mini-batches to enforce temporal consistency.
- Trainable parameter count is 0.39M for both CAPA variants, with 100 optimization steps per sample.
- CAPA is demonstrated as compatible with VGGT and extends to UniDepthV2 and MoGe-2 base models.
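The LoRA update and the aligned L1 objective from the list above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function names (`lora_linear`, `affine_align`, `masked_l1`), the `alpha / r` scaling, and the closed-form least-squares fit for scale and shift are all assumptions. The summary only states that a per-sample scale and shift are computed and that an L1 loss is taken on valid pixels. A real setup would use an autodiff framework so the loss gradient can flow into the PEFT parameters.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=1.0):
    """Apply a frozen weight W plus a trainable low-rank update.

    W: (out, in) frozen; A: (r, in) and B: (out, r) are the only trainable parts.
    The alpha / r scaling is a common LoRA convention, assumed here.
    """
    r = A.shape[0]
    return x @ (W + (alpha / r) * (B @ A)).T

def affine_align(pred, sparse, valid):
    """Fit scale s and shift t so that s * pred + t matches sparse on valid pixels.

    Assumption: ordinary least squares; the source does not name the solver.
    """
    p = pred[valid]
    d = sparse[valid]
    X = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(X, d, rcond=None)
    return s, t

def masked_l1(pred, sparse, valid):
    """L1 loss between the aligned prediction and the sparse depth, valid pixels only."""
    s, t = affine_align(pred, sparse, valid)
    return float(np.abs(s * pred[valid] + t - sparse[valid]).mean())
```

With `B` initialized to zeros, `lora_linear` reproduces the frozen layer exactly, so adaptation starts from the base model's prediction. For the video case described above, the same `A` and `B` would be reused for every frame in a mini-batch.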
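Experimental Results
Research Questions
- RQ1: Can parameter-efficient fine-tuning of a frozen 3D foundation model improve depth completion at test time using sparse cues?
- RQ2: Does sequence-level (shared) adaptation across video frames improve temporal consistency and robustness under sparse observations?
- RQ3: How do LoRA and VPT compare within CAPA in terms of accuracy and efficiency?
- RQ4: How well does CAPA generalize across indoor and outdoor datasets and across different base models?
Key Findings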
| Method | ScanNet AbsRel (%) | 7-Scenes AbsRel (%) | iBims AbsRel (%) | Metropolis AbsRel (%) | Avg Rank |
|---|---|---|---|---|---|
| CAPA LoRA | 1.0 | 0.9 | 1.1 | 2.8 | 1.0 |
| CAPA VPT | 1.1 | 1.0 | 1.0 | 2.6 | 1.1 |
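- CAPA with either LoRA or VPT consistently outperforms the baselines on all four datasets (ScanNet, 7-Scenes, iBims, Metropolis).
- Relative to competing methods, CAPA lowers the base model's AbsRel error by roughly 2× in many settings.
- Sequence-level adaptation improves temporal consistency (lower OPW) compared with per-frame tuning.
- CAPA reaches state-of-the-art results while updating only 0.39M parameters, which makes it efficient compared with full fine-tuning.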
- When CAPA is integrated with VGGT, VGGT's depth error improves by 2–3×.
- Temporal and condition robustness improve: CAPA shows a smaller error gap between conditioned and unconditioned regions.
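For reference, the AbsRel numbers in the table above follow the standard absolute relative error, the mean of |pred - gt| / gt over valid pixels, reported as a percentage. A minimal sketch follows; the function name `abs_rel_percent` and the default validity rule (`gt > 0`) are assumptions, since the summary does not describe the evaluation mask.

```python
import numpy as np

def abs_rel_percent(pred, gt, valid=None):
    """Absolute relative depth error in percent over valid pixels.

    Assumption: pixels with gt > 0 count as valid when no mask is given.
    """
    if valid is None:
        valid = gt > 0
    err = np.abs(pred[valid] - gt[valid]) / gt[valid]
    return float(100.0 * err.mean())
```

Under this definition, a prediction that overestimates every valid depth by 10% gives an AbsRel of about 10%.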
This walkthrough was generated by AI and reviewed by human editors.