QUICK REVIEW

[Paper Review] VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

Hongyang Du, Junjie Ye | arXiv (Cornell University) | Jan 30, 2026

3D Shape Modeling and Analysis | Cited by 0

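One-Sentence Summary

VideoGPA introduces a self-supervised framework that uses a geometry foundation model to distill dense 3D-consistency signals into video diffusion models; it improves temporal stability and motion coherence through Direct Preference Optimization, with no human annotations required.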

ABSTRACT

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.

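Motivation and Objectives

Establish the need for 3D structural consistency in video generation, beyond visual fidelity alone.
Propose a data-efficient, self-supervised method that guides video diffusion models toward 3D consistency.
Use a geometry foundation model to derive dense preference signals for annotation-free training.
Demonstrate improved temporal stability and motion coherence over baseline methods.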

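Proposed Method

Introduces VideoGPA (Video Geometric Preference Alignment), a self-supervised framework.
Uses a geometry foundation model to generate dense preference signals automatically.
Applies Direct Preference Optimization (DPO) to steer the generative distribution toward 3D consistency.
Operates in a data-efficient regime that requires only a minimal number of preference pairs.
Improves geometric plausibility without relying on human annotations.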
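The two core steps above (automatic preference pairs, then DPO) can be illustrated with a minimal sketch. It is not the authors' implementation. The function `build_preference_pair` and its `consistency_score` argument are hypothetical names: this review does not say how the geometry foundation model scores a video, so the score is treated as an opaque callable where higher is assumed to mean more 3D-consistent. `dpo_loss` is the standard DPO objective on log-likelihoods; how those log-likelihoods are obtained for a diffusion model is not described here and is left out. The value `beta=0.1` is an arbitrary illustrative default.

```python
import math
from typing import Callable, Sequence, Tuple


def build_preference_pair(
    candidates: Sequence[str],
    consistency_score: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (preferred, rejected) among videos generated for one prompt.

    `consistency_score` is a hypothetical stand-in for a score derived from
    a geometry foundation model; higher is assumed to mean more consistent.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=consistency_score)
    # Highest-scoring candidate is preferred, lowest-scoring is rejected.
    return ranked[-1], ranked[0]


def dpo_loss(
    policy_logp_win: float,
    policy_logp_lose: float,
    ref_logp_win: float,
    ref_logp_lose: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for a single preference pair.

    loss = -log(sigmoid(beta * [(log pi(w) - log pi_ref(w))
                                - (log pi(l) - log pi_ref(l))]))
    """
    margin = beta * (
        (policy_logp_win - ref_logp_win) - (policy_logp_lose - ref_logp_lose)
    )
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    scores = {"video_a": 0.2, "video_b": 0.9, "video_c": 0.5}
    preferred, rejected = build_preference_pair(list(scores), scores.__getitem__)
    print(preferred, rejected)
    print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

When the policy assigns the same log-likelihoods as the reference model, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the preferred sample's likelihood relative to the rejected one, which is what steers generation toward the preferred behavior.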

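Experimental Results

Research Questions

RQ1: Can geometric priors from a foundation model guide video diffusion models toward 3D-consistent generation?
RQ2: Does Direct Preference Optimization with automatically derived preferences improve the temporal stability and motion coherence of generated videos?
RQ3: How data-efficient is the method at achieving 3D structural consistency without annotations?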

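Key Findings

VideoGPA significantly improves temporal stability, physical plausibility, and motion coherence.
In extensive experiments, the method outperforms state-of-the-art baselines using only a minimal number of preference pairs.
3D consistency is achieved through self-supervised signals, without human annotations.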

This review was generated by AI and reviewed by human editors.