QUICK REVIEW

[Paper Review] Depth Completion as Parameter-Efficient Test-Time Adaptation

Bingxin Ke, Qunjie Zhou|arXiv (Cornell University)|Feb 16, 2026
Advanced Vision and Imaging | Cited by 0
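One-Sentence Summary

CAPA adapts pre-trained 3D foundation models to depth completion through parameter-efficient test-time adaptation: it updates only lightweight PEFT components (LoRA or VPT) while keeping the backbone frozen, and achieves state-of-the-art results on indoor and outdoor datasets. It also extends to video through sequence-level parameter sharing for temporal consistency.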

ABSTRACT

We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.
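
To make the "minimal set of parameters" idea concrete, the following is a minimal NumPy sketch of a LoRA-style layer: a frozen projection plus a trainable low-rank update. It is an illustration under stated assumptions, not the CAPA implementation. The function name `lora_linear`, the `alpha / r` scaling, the zero initialisation of `B`, and the toy sizes `d = 64`, `r = 4` are all chosen here for illustration and do not come from the paper.

```python
import numpy as np

def lora_linear(x, W_frozen, A, B, alpha=1.0):
    """Frozen projection plus a trainable low-rank update (LoRA-style).

    x:        (n, d_in) input tokens
    W_frozen: (d_in, d_out) pre-trained weight, never updated
    A:        (d_in, r) trainable down-projection
    B:        (r, d_out) trainable up-projection
    """
    r = A.shape[1]
    return x @ W_frozen + (alpha / r) * (x @ A) @ B

rng = np.random.default_rng(0)
d, r = 64, 4                              # toy sizes (assumption)
W = rng.normal(size=(d, d))               # stands in for a frozen W_q, W_k or W_v
A = rng.normal(scale=0.01, size=(d, r))
B = np.zeros((r, d))                      # zero init: adapter starts as a no-op
x = rng.normal(size=(10, d))

y0 = lora_linear(x, W, A, B)
trainable = A.size + B.size               # 2 * d * r parameters
frozen = W.size                           # d * d parameters
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, so optimisation begins from the base model's prediction. Only `A` and `B` would receive gradients; `W` stays fixed.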

Motivation and Objectives

  • Ground the geometric prior of a frozen 3D foundation model using sparse test-time depth cues.
  • Develop a parameter-efficient adaptation framework that preserves the base model while updating few parameters.
  • Extend CAPA to video by sharing parameters across frames to improve temporal consistency.
  • Evaluate CAPA across indoor and outdoor datasets and with multiple base models and PEFT strategies.

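Proposed Method

  • Freeze the ViT-based 3D foundation model backbone and update only a compact PEFT component.
  • Apply either LoRA (low-rank updates to W_q, W_k, W_v) or Visual Prompt Tuning (learnable prompt tokens prepended at the attention layers).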
  • Compute a per-sample affine alignment (scale and shift) with the sparse depth to resolve scale ambiguity, then backpropagate the L1 loss on valid pixels.
  • For videos, share the same trainable parameters across frames and optimize with mini-batches to enforce temporal consistency.
  • Trainable parameter count is 0.39M for both CAPA variants, with 100 optimization steps per sample.
  • CAPA is demonstrated as compatible with VGGT and extends to UniDepthV2 and MoGe-2 base models.
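
The alignment and loss step in the list above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the per-sample scale and shift are obtained by ordinary least squares on the valid pixels, and the helper names `align_scale_shift` and `masked_l1` are hypothetical. In the real method the loss would be backpropagated into the PEFT parameters; here only the forward computation is shown.

```python
import numpy as np

def align_scale_shift(pred, sparse_depth, valid):
    """Least-squares scale s and shift t so that s * pred + t fits
    the sparse depth on valid pixels (assumed alignment method)."""
    p = pred[valid]
    d = sparse_depth[valid]
    X = np.stack([p, np.ones_like(p)], axis=1)   # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(X, d, rcond=None)
    return s, t

def masked_l1(pred, sparse_depth, valid):
    """L1 loss between the aligned prediction and the sparse depth,
    averaged over valid pixels only."""
    s, t = align_scale_shift(pred, sparse_depth, valid)
    aligned = s * pred + t
    return np.abs(aligned[valid] - sparse_depth[valid]).mean()

# Toy example: the prediction is off by an affine transform only.
rng = np.random.default_rng(0)
pred = rng.uniform(1.0, 5.0, size=(8, 8))
true_depth = 2.0 * pred + 0.5

valid = np.zeros((8, 8), dtype=bool)
valid[::2, ::3] = True                           # 12 sparse measurements
sparse_depth = np.where(valid, true_depth, 0.0)  # 0 where nothing was measured

s, t = align_scale_shift(pred, sparse_depth, valid)
loss = masked_l1(pred, sparse_depth, valid)
```

In this toy case the alignment recovers the scale and shift and the loss vanishes, because the only error is affine. On real data the residual after alignment reflects distortions and misplaced structures, which is the signal the adaptation step would act on.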

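Experimental Results

Research Questions

  • RQ1: Can parameter-efficient fine-tuning of a frozen 3D foundation model improve depth completion at test time using sparse cues?
  • RQ2: Does sequence-level (shared) adaptation across video frames improve temporal consistency and robustness under sparse observations?
  • RQ3: How do LoRA and VPT compare within CAPA in terms of accuracy and efficiency?
  • RQ4: How well does CAPA generalize across indoor and outdoor datasets and across different base models?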

Key Findings

Method    | ScanNet AbsRel (%) | 7-Scenes AbsRel (%) | iBims AbsRel (%) | Metropolis AbsRel (%) | Avg Rank
CAPA LoRA | 1.0                | 0.9                 | 1.1              | 2.8                   | 1.0
CAPA VPT  | 1.1                | 1.0                 | 1.0              | 2.6                   | 1.1

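  • CAPA with either LoRA or VPT consistently outperforms the baselines on all four datasets (ScanNet, 7-Scenes, iBims, Metropolis).
  • Compared with competing methods, CAPA reduces the base model's AbsRel error by roughly 2× in many settings.
  • Sequence-level adaptation improves temporal consistency over per-frame tuning (lower OPW).
  • CAPA reaches state-of-the-art results while updating only 0.39M parameters, making it efficient relative to full fine-tuning.
  • Integrating CAPA with VGGT reduces VGGT's depth error by 2–3×.
  • Temporal and conditioning robustness improve: CAPA shows a smaller error gap between conditioned and unconditioned regions.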
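
For readers unfamiliar with the AbsRel metric reported in these findings, here is a small sketch of the standard absolute relative error, mean(|pred - gt| / gt) over pixels with ground truth. The function name `abs_rel` and the convention that `gt > 0` marks valid pixels are assumptions made for this example; the summary does not describe the paper's evaluation code.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels.
    By default, pixels with gt > 0 are treated as valid (assumption)."""
    if valid is None:
        valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

gt = np.array([[2.0, 4.0],
               [0.0, 5.0]])      # 0 marks a pixel without ground truth
pred = np.array([[2.2, 3.6],
                 [1.0, 5.0]])

score = abs_rel(pred, gt)        # per-pixel errors 0.1, 0.1, 0.0, averaged
score_percent = 100.0 * score    # same unit as the table's "AbsRel (%)"
```

Lower is better, and because the error is divided by the true depth, a 10 cm mistake counts for more on a near surface than on a far one.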


This review was generated by AI and reviewed by a human editor.