QUICK REVIEW

[Paper Review] Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding

Yixuan Lai, He Wang|arXiv (Cornell University)|Jan 4, 2026

Generative Adversarial Networks and Image Synthesis|Cited by 0

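One-Sentence Summary

Slot-ID introduces a fine-tuning-free identity-conditioning method that uses a short reference video and a Sinkhorn-routed slot encoder to generate prompt-faithful, identity-preserving videos on top of a frozen diffusion-transformer backbone.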

ABSTRACT

Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and "average" faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.

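Motivation and Objectives

Improve identity preservation in text-to-video generation beyond single-image conditioning.
Propose a dynamics-aware identity encoding extracted from a short reference video clip.
Integrate a lightweight, backbone-compatible conditioning mechanism into a frozen diffusion-transformer video generator.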

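Proposed Method

Introduce a slot-based temporal identity encoder that extracts S identity slots from a short reference video.
Use a Sinkhorn-routed reader that aligns tokens with reference frames via entropic optimal transport.
Fuse image-anchor tokens and identity slots through a gating mechanism that controls how strongly temporal identity cues influence generation.
Condition the frozen Wan/DiT video backbone for end-to-end generation by prepending identity tokens to the text prompt.
Train with a latent-space v-prediction objective consistent with the base diffusion model.
Apply LoRA to the cross-attention projections for lightweight adaptation while keeping the backbone frozen.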
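
The fusion, prepending, and LoRA steps listed above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the scalar sigmoid gate, the function names (`fuse_identity_tokens`, `prepend_to_prompt`, `lora_linear`), and the use of plain Python lists in place of tensors. The review does not specify the paper's actual gate or layer shapes, so this shows only the general pattern, not the authors' implementation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def fuse_identity_tokens(anchor_tokens, slots, gate_logit):
    """Hypothetical fusion: scale every identity slot by sigmoid(gate_logit),
    then append the gated slots to the image-anchor tokens."""
    g = sigmoid(gate_logit)
    gated_slots = [[g * v for v in slot] for slot in slots]
    return anchor_tokens + gated_slots

def prepend_to_prompt(identity_tokens, text_tokens):
    """Place identity tokens in front of the text tokens, so the backbone's
    cross-attention sees one combined conditioning sequence."""
    return identity_tokens + text_tokens

def lora_linear(x, W, A, B, alpha):
    """Frozen projection W plus a low-rank update:
    y = W x + (alpha / r) * B (A x), with A of shape r x d_in
    and B of shape d_out x r. Only A and B would be trained."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

if __name__ == "__main__":
    anchor = [[1.0, 0.0]]
    slots = [[2.0, 4.0], [6.0, 8.0]]
    identity = fuse_identity_tokens(anchor, slots, gate_logit=0.0)
    sequence = prepend_to_prompt(identity, [[9.0, 9.0]])
    print(len(sequence), sequence[0])
```

With `B` initialized to zeros, `lora_linear` reproduces the frozen projection exactly, which is the usual LoRA starting point: training begins from the unmodified backbone and only gradually departs from it.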

Figure 2: Failures from single-image references. (a, c) Reference portraits. (b) Face deformation: view changes warp facial geometry (stretched cheeks/jawline, eye misalignment).

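Experimental Results

Research Questions

RQ1: Can a short reference video capture and encode identity dynamics that remain robust under large pose and expression changes?
RQ2: Can a Sinkhorn-based slot reader provide stable, motion-robust identity tokens that improve identity preservation without per-subject fine-tuning?
RQ3: How does dynamics-aware identity conditioning affect prompt fidelity and visual realism across subjects and prompts?
RQ4: How does the temporal ordering of reference frames affect identity robustness?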

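Key Findings

Slot-ID achieves state-of-the-art identity preservation while maintaining realism and prompt fidelity.
Sinkhorn-routed identity slots yield stable, motion-robust cues that improve performance under large pose changes and expressive facial behavior.
Slot-ID outperforms single-image baselines and other conditioning methods in face similarity and overall naturalness.
Ablation studies show that the slot-based encoder and temporal ordering are critical for preserving identity under motion and occlusion.
Human evaluation (MOS) rates Slot-ID highest on face similarity, visual quality, and text alignment.
The method remains fine-tuning-free, adding only lightweight conditioning to the frozen backbone.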

Figure 3: Pipeline overview. A text prompt, a background-neutral face reference, and a reference video are encoded to provide conditioning signals for generation. A Sinkhorn-routed slot reader then iteratively refines learnable slot queries: (1) compute query–token similarity scores; (2) apply temp
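
The caption above is cut off after its second step, so the full refinement loop cannot be recovered from this page. As a rough illustration of what Sinkhorn routing generally looks like, the sketch below computes query-token similarity scores, divides them by a temperature, and alternates row and column scaling toward uniform marginals (standard entropic optimal transport). The function names, the dot-product similarity, the uniform marginals, the default temperature and iteration count, and the single weighted read at the end are all assumptions for illustration, not details taken from the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sinkhorn_route(scores, temperature=0.5, n_iters=200):
    """Turn an S x N score matrix into a transport plan whose rows sum to
    about 1/S and whose columns sum to 1/N (uniform marginals, assumed)."""
    S, N = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract the max for stability
    K = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    u = [1.0] * S
    v = [1.0] * N
    for _ in range(n_iters):
        # Rescale rows, then columns, to match the target marginals.
        u = [(1.0 / S) / sum(K[i][j] * v[j] for j in range(N)) for i in range(S)]
        v = [(1.0 / N) / sum(K[i][j] * u[i] for i in range(S)) for j in range(N)]
    return [[u[i] * K[i][j] * v[j] for j in range(N)] for i in range(S)]

def read_slots(queries, tokens, temperature=0.5, n_iters=200):
    """Route reference tokens to slot queries and return (plan, slots),
    where each slot is the plan-weighted mean of the tokens."""
    scores = [[dot(q, t) for t in tokens] for q in queries]
    plan = sinkhorn_route(scores, temperature, n_iters)
    dim = len(tokens[0])
    slots = []
    for row in plan:
        mass = sum(row)
        slots.append([sum(w * t[d] for w, t in zip(row, tokens)) / mass
                      for d in range(dim)])
    return plan, slots

if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical slot queries
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    plan, slots = read_slots(queries, tokens)
    print(slots)
```

Compared with a plain softmax over tokens, the column constraint forces every reference token to distribute its full share of mass across the slots, so no token is ignored and no single slot can absorb everything. That balancing property is the usual reason to prefer optimal-transport routing.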

This review was generated by AI and reviewed by a human editor.