QUICK REVIEW

[Paper Review] Depth Completion as Parameter-Efficient Test-Time Adaptation

Bingxin Ke, Qunjie Zhou|arXiv (Cornell University)|Feb 16, 2026

Advanced Vision and Imaging | Cited by 0

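ONE-SENTENCE SUMMARY

CAPA adapts pre-trained 3D foundation models to depth completion through parameter-efficient test-time adaptation: it updates only lightweight PEFT components (LoRA or VPT) while keeping the backbone frozen, and achieves state-of-the-art results on indoor and outdoor datasets. It also extends to video through sequence-level parameter sharing, which enforces temporal consistency.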

ABSTRACT

We introduce CAPA, a parameter-efficient test-time optimization framework that adapts pre-trained 3D foundation models (FMs) for depth completion, using sparse geometric cues. Unlike prior methods that train task-specific encoders for auxiliary inputs, which often overfit and generalize poorly, CAPA freezes the FM backbone. Instead, it updates only a minimal set of parameters using Parameter-Efficient Fine-Tuning (e.g. LoRA or VPT), guided by gradients calculated directly from the sparse observations available at inference time. This approach effectively grounds the foundation model's geometric prior in the scene-specific measurements, correcting distortions and misplaced structures. For videos, CAPA introduces sequence-level parameter sharing, jointly adapting all frames to exploit temporal correlations, improve robustness, and enforce multi-frame consistency. CAPA is model-agnostic, compatible with any ViT-based FM, and achieves state-of-the-art results across diverse condition patterns on both indoor and outdoor datasets. Project page: research.nvidia.com/labs/dvl/projects/capa.

RESEARCH MOTIVATION AND GOALS

Ground the geometric prior of a frozen 3D foundation model using the sparse depth cues available at test time.
Develop a parameter-efficient adaptation framework that preserves the base model while updating few parameters.
Extend CAPA to video by sharing parameters across frames to improve temporal consistency.
Evaluate CAPA across indoor and outdoor datasets and with multiple base models and PEFT strategies.

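PROPOSED METHOD

Freeze the ViT-based 3D foundation model backbone and update only a compact PEFT component.
Apply either LoRA (low-rank updates to W_q, W_k, and W_v) or Visual Prompt Tuning (learnable prompt tokens prepended at the attention layers).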
Compute a per-sample affine alignment (scale and shift) with the sparse depth to resolve scale ambiguity, then backpropagate the L1 loss on valid pixels.
For videos, share the same trainable parameters across frames and optimize with mini-batches to enforce temporal consistency.
Trainable parameter count is 0.39M for both CAPA variants, with 100 optimization steps per sample.
CAPA is demonstrated with VGGT as the base model and also extends to UniDepthV2 and MoGe-2.
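
The alignment-and-loss step in the list above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the scale and shift are fitted by ordinary least squares on the valid pixels (the review does not say how the alignment is solved), it represents missing measurements as None, and the names fit_scale_shift and masked_l1 are hypothetical. Plain Python lists are used to keep the example self-contained; a real pipeline would use tensors and differentiate the loss with respect to the PEFT parameters.

```python
from typing import List, Optional, Sequence, Tuple


def _valid_pairs(pred: Sequence[float],
                 sparse: Sequence[Optional[float]]) -> List[Tuple[float, float]]:
    """Keep (prediction, measurement) pairs only where a measurement exists."""
    return [(p, d) for p, d in zip(pred, sparse) if d is not None]


def fit_scale_shift(pred: Sequence[float],
                    sparse: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Closed-form least-squares fit of s, t so that s * pred + t ~ sparse.

    Assumption: ordinary least squares over valid pixels only.
    """
    pairs = _valid_pairs(pred, sparse)
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid pixels to fit scale and shift")
    mean_p = sum(p for p, _ in pairs) / n
    mean_d = sum(d for _, d in pairs) / n
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    if var_p == 0.0:
        raise ValueError("prediction is constant on valid pixels; scale is undefined")
    cov = sum((p - mean_p) * (d - mean_d) for p, d in pairs)
    scale = cov / var_p
    shift = mean_d - scale * mean_p
    return scale, shift


def masked_l1(pred: Sequence[float],
              sparse: Sequence[Optional[float]],
              scale: float,
              shift: float) -> float:
    """Mean absolute error between the aligned prediction and the sparse depth,
    computed on valid pixels only."""
    pairs = _valid_pairs(pred, sparse)
    if not pairs:
        raise ValueError("no valid pixels")
    return sum(abs(scale * p + shift - d) for p, d in pairs) / len(pairs)


if __name__ == "__main__":
    # Flattened toy "depth map": two of four pixels carry a sparse measurement.
    pred = [1.0, 2.0, 3.0, 4.0]
    sparse = [3.0, None, 7.0, None]
    s, t = fit_scale_shift(pred, sparse)
    print(s, t, masked_l1(pred, sparse, s, t))  # prints: 2.0 1.0 0.0
```

In a test-time adaptation loop, this loss would be recomputed at each of the optimization steps, with gradients flowing only into the LoRA or VPT parameters while the backbone stays frozen.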

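EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can parameter-efficient fine-tuning of a frozen 3D foundation model improve depth completion at test time using sparse cues?
RQ2: Does sequence-level (shared) adaptation across video frames improve temporal consistency and robustness under sparse observations?
RQ3: How do LoRA and VPT compare within CAPA in terms of accuracy and efficiency?
RQ4: How well does CAPA generalize across indoor and outdoor datasets and across different base models?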

KEY FINDINGS

Method	ScanNet AbsRel (%)	7-Scenes AbsRel (%)	iBims AbsRel (%)	Metropolis AbsRel (%)	Avg Rank
CAPA LoRA	1.0	0.9	1.1	2.8	1.0
CAPA VPT	1.1	1.0	1.0	2.6	1.1

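CAPA with either LoRA or VPT consistently outperforms the baselines on all four datasets (ScanNet, 7-Scenes, iBims, Metropolis).
Compared with competing methods, CAPA reduces the base model's AbsRel error by roughly 2× in many settings.
Sequence-level adaptation improves temporal consistency over per-frame tuning (lower OPW).
CAPA reaches state-of-the-art results while updating only 0.39M parameters, which makes it efficient relative to full fine-tuning.
Integrating CAPA with VGGT reduces VGGT's depth error by a factor of 2–3.
Temporal and conditioning robustness improve: CAPA shows a smaller error gap between conditioned and unconditioned regions.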
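
For readers unfamiliar with the AbsRel values reported above, the sketch below computes the metric under its commonly used definition: the mean of |prediction − ground truth| / ground truth over pixels with valid ground truth, expressed as a percentage. The review does not spell out the evaluation protocol, so the validity rule (ground truth present and positive) and the function name abs_rel_percent are assumptions for illustration.

```python
from typing import Optional, Sequence


def abs_rel_percent(pred: Sequence[float],
                    gt: Sequence[Optional[float]]) -> float:
    """Absolute relative depth error in percent over valid ground-truth pixels.

    Assumption: a pixel is valid when its ground truth is present and > 0.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g is not None and g > 0]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * sum(abs(p - g) / g for p, g in valid) / len(valid)


if __name__ == "__main__":
    # One exact pixel, one pixel that is 10% too far, one pixel without ground truth.
    print(round(abs_rel_percent([1.0, 2.2, 5.0], [1.0, 2.0, None]), 3))  # prints: 5.0
```

Under this definition, an AbsRel of 1.0% means the predicted depth is off by about 1% of the true depth on average.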

This review was generated by AI and reviewed by a human editor.