QUICK REVIEW

[Paper Review] S-VGGT: Structure-Aware Subscene Decomposition for Scalable 3D Foundation Models

Xinze Li, Pengxu Chen|arXiv (Cornell University)|Mar 18, 2026
3D Shape Modeling and Analysis | Cited by 0
One-Sentence Summary

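S-VGGT partitions the input frames into a small number of subscenes that share a common anchor frame, which cuts the quadratic cost of global attention through parallel, frame-level processing; it complements token-level acceleration without sacrificing reconstruction quality.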

ABSTRACT

Feed-forward 3D foundation models face a key challenge: the quadratic computational cost introduced by global attention, which severely limits scalability as input length increases. Concurrent acceleration methods, such as token merging, operate at the token level. While they offer local savings, the required nearest-neighbor searches introduce undesirable overhead. Consequently, these techniques fail to tackle the fundamental issue of structural redundancy dominant in dense capture data. In this work, we introduce S-VGGT, a novel approach that addresses redundancy at the structural frame level, drastically shifting the optimization focus. We first leverage the initial features to build a dense scene graph, which characterizes structural scene redundancy and guides the subsequent scene partitioning. Using this graph, we softly assign frames to a small number of subscenes, guaranteeing balanced groups and smooth geometric transitions. The core innovation lies in designing the subscenes to share a common reference frame, establishing a parallel geometric bridge that enables independent and highly efficient processing without explicit geometric alignment. This structural reorganization provides strong intrinsic acceleration by cutting the global attention cost at its source. Crucially, S-VGGT is entirely orthogonal to token-level acceleration methods, allowing the two to be seamlessly combined for compounded speedups without compromising reconstruction fidelity. Code is available at https://github.com/Powertony102/S-VGGT.

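Motivation and Objectives

  • Address the scalability bottleneck that the quadratic complexity of global attention imposes on feed-forward 3D foundation models for densely captured data.
  • Develop a frame-level redundancy reduction strategy by building a scene graph from initial features.
  • Partition frames into a small number of coherent subscenes that share a reference frame, enabling parallel, independent processing.
  • Demonstrate orthogonality to token-level acceleration methods and show compounded speedups when combined with token merging.
  • Evaluate on standard 3D reconstruction datasets to verify speedup and reconstruction fidelity on long sequences.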

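Proposed Method

  • Build a dense scene graph from initial frame features to quantify structural redundancy and guide subscene formation.
  • Compute a frame similarity matrix by averaging each frame's patch tokens and taking the cosine similarity between the averaged features, yielding a frame-level, density-aware affinity matrix.
  • Group frames into K subscenes using a differentiable soft assignment A, regularized by coherence, balance, and sharpness terms (L_coh, L_bal, L_sharp).
  • Assign a common reference frame to all subscenes (anchor-frame sharing) so that the subscenes share a unified coordinate system and can be processed in parallel.
  • Process each subscene independently under the shared anchor frame, reducing attention computation and enabling parallel inference with no post-hoc alignment.
  • Provide a complexity analysis: global attention cost falls from O((NT)^2) to O((NT)^2 / K), the similarity computation adds O(N^2) overhead, and the approach is orthogonal to token-level acceleration methods.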
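The frame-similarity step in the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `frame_similarity`, the `(N, T, D)` tensor layout, and the random inputs are assumptions made for the example, and the density-aware weighting mentioned above is omitted because this review does not specify it.

```python
import numpy as np

def frame_similarity(patch_tokens: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between mean-pooled frame descriptors.

    patch_tokens: array of shape (N, T, D) with N frames, T patch tokens
    per frame, and D feature channels (layout assumed for this sketch).
    Returns an (N, N) similarity matrix.
    """
    # Average the patch tokens of each frame into one descriptor: (N, D).
    frame_desc = patch_tokens.mean(axis=1)
    # Normalise each descriptor to unit length, guarding against zero norm.
    norms = np.linalg.norm(frame_desc, axis=1, keepdims=True)
    unit = frame_desc / np.maximum(norms, eps)
    # Dot products of unit vectors are cosine similarities.
    return unit @ unit.T

# Hypothetical input: 6 frames, 16 patch tokens each, 32 channels.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16, 32))
S = frame_similarity(tokens)
print(S.shape)
```

The matrix product is where the O(N^2) similarity overhead comes from: one dot product per pair of frames, independent of the number of tokens per frame once the descriptors are pooled.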
Figure 1: Comparison of VGGT (2.69 FPS) and S-VGGT (10.13 FPS) on a 500-frame scene. S-VGGT achieves a significant speedup by processing subscenes in parallel while maintaining reconstruction quality.
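As a worked check on the O((NT)^2 / K) attention cost stated in the method summary, the reduction follows from simple counting. The derivation below assumes that N is the number of frames and T the number of tokens per frame, that the frames are split evenly across the K subscenes, and that the one extra shared anchor frame per subscene is ignored; these are simplifying assumptions for illustration, not details confirmed by this review.

```latex
\[
\underbrace{(NT)^{2}}_{\text{global attention}}
\;\longrightarrow\;
\sum_{k=1}^{K}\left(\frac{N}{K}\,T\right)^{2}
= K\cdot\frac{(NT)^{2}}{K^{2}}
= \frac{(NT)^{2}}{K}
\]
```

For reference, the frame rates quoted in Figure 1 correspond to an end-to-end speedup of 10.13 / 2.69 ≈ 3.8×. The review does not state which K produced that number, so it should not be read as a direct measurement of the 1/K factor.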

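Experimental Results

Research Questions

  • RQ1: Can frame-level redundancy reduction via subscene decomposition outperform the full-attention baseline while preserving reconstruction fidelity?
  • RQ2: How much speedup and memory saving can be gained by partitioning long, dense sequences into subscenes with a common anchor frame?
  • RQ3: Does anchor-frame sharing maintain a unified global coordinate system across subscenes and avoid costly alignment post-processing?
  • RQ4: How does S-VGGT interact with token-level acceleration methods, and does combining it with token merging yield compounded speedups?
  • RQ5: Are the gains consistent across different 3D reconstruction benchmarks (ScanNet, Neural RGB-D, 7Scenes) and long input sequences?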

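Key Findings

  • S-VGGT significantly accelerates inference while preserving reconstruction fidelity by processing subscenes in parallel, which lowers the global attention cost.
  • Anchor-frame sharing aligns the subscenes in a unified coordinate system, avoiding expensive geometric optimization.
  • Soft grouping guided by frame-level density lets the number of subscenes adapt to input redundancy, so dense and diverse sequences are both handled efficiently.
  • Experiments show significant improvements in camera pose estimation and denser reconstruction on long sequences; S-VGGT matches or outperforms the VGGT* baseline.
  • The method is orthogonal to token-level acceleration; combining it with token merging techniques (such as FastVGGT) yields compounded speedups.
  • On long sequences (such as 1000-frame scans), S-VGGT runs several times faster than strong baselines while maintaining robust geometric accuracy (ATE/ARE/RPE metrics).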
Figure 2: The framework of S-VGGT. The input frames are first embedded into tokens, and frame similarity is calculated to assess redundancy. Frames are then grouped into subscenes via soft assignment, ensuring parallel processing. A shared reference frame across subscenes enables efficient global an
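The grouping and anchor-sharing stages in the framework can be illustrated with a small sketch. It is a simplified reading, not the released code: the helper name `build_subscenes`, the hard argmax over the soft assignment, the choice of frame 0 as the anchor, and the toy matrix `A` are all assumptions made for this example.

```python
import numpy as np

def build_subscenes(A: np.ndarray, anchor: int = 0) -> list[list[int]]:
    """Turn a soft assignment into frame-index lists that share one anchor.

    A: (N, K) soft assignment of N frames to K subscenes.
    anchor: index of the shared reference frame (frame 0 is an assumption).
    Returns K lists of frame indices, each starting with the anchor.
    """
    # Hard-assign each frame to its most likely subscene (a simplification).
    labels = A.argmax(axis=1)
    subscenes = []
    for k in range(A.shape[1]):
        # Collect the frames of subscene k, leaving the anchor out so it
        # appears exactly once, at the front of every subscene.
        members = [int(i) for i in np.flatnonzero(labels == k) if i != anchor]
        subscenes.append([anchor] + members)
    return subscenes

# Toy soft assignment for 5 frames and K = 2 subscenes.
A = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.6, 0.4],
    [0.3, 0.7],
    [0.1, 0.9],
])
groups = build_subscenes(A, anchor=0)
print(groups)  # [[0, 1, 2], [0, 3, 4]]
```

Because every list starts with the same frame, each subscene can be passed to the model independently and its outputs are expressed relative to that common reference, which is the property the review credits with removing the need for post-hoc alignment.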

This review was generated by AI and reviewed by a human editor.