QUICK REVIEW

[Paper Review] Slot-ID: Identity-Preserving Video Generation from Reference Videos via Slot-Based Temporal Identity Encoding

Yixuan Lai, He Wang|arXiv (Cornell University)|Jan 4, 2026

Generative Adversarial Networks and Image Synthesis|Citations: 0

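One-sentence summary

Slot-ID introduces a tuning-free identity-conditioning method that uses a short reference video and a Sinkhorn-routed slot encoder, generating prompt-faithful, identity-preserving videos while keeping the diffusion-transformer backbone frozen.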

ABSTRACT

Producing prompt-faithful videos that preserve a user-specified identity remains challenging: models need to extrapolate facial dynamics from sparse reference while balancing the tension between identity preservation and motion naturalness. Conditioning on a single image completely ignores the temporal signature, which leads to pose-locked motions, unnatural warping, and "average" faces when viewpoints and expressions change. To this end, we introduce an identity-conditioned variant of a diffusion-transformer video generator which uses a short reference video rather than a single portrait. Our key idea is to incorporate the dynamics in the reference. A short clip reveals subject-specific patterns, e.g., how smiles form, across poses and lighting. From this clip, a Sinkhorn-routed encoder learns compact identity tokens that capture characteristic dynamics while remaining pretrained backbone-compatible. Despite adding only lightweight conditioning, the approach consistently improves identity retention under large pose changes and expressive facial behavior, while maintaining prompt faithfulness and visual realism across diverse subjects and prompts.

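Motivation and objectives

Motivate improving identity preservation in text-to-video generation beyond single-image conditioning.
Propose a dynamics-oriented identity encoding extracted from a short reference video clip.
Integrate a lightweight, backbone-compatible conditioning mechanism into a frozen diffusion-transformer video generator.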

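Proposed method

Introduces a slot-based temporal identity encoder that extracts S identity slots from a short reference video.
Uses a Sinkhorn-routed reader that aligns reference frames with tokens via entropic optimal transport.
Fuses image-anchor tokens with the identity slots through a gating mechanism to control how strongly temporal identity influences generation.
Conditions the frozen Wan/DiT video backbone by prepending identity tokens to the text prompt, enabling end-to-end generation.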
Trains with a v-prediction objective in the same latent space as the base diffusion model.
Applies LoRA to the cross-attention projections, enabling lightweight adaptation while the backbone stays frozen.
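
The Sinkhorn-routed slot reader listed above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. It assumes cosine similarity as the slot-token score, uniform marginals on both slots and tokens, and a plan-weighted mean as the slot update; this review states none of those details. The names `cosine`, `sinkhorn_plan`, and `read_slots`, and the toy 2-D data, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def sinkhorn_plan(scores, temperature=1.0, n_iters=100):
    """Entropic-OT plan for an S x N score matrix (slots x tokens).

    Assumption: uniform marginals, so every slot receives mass 1/S and
    every token sends mass 1/N.
    """
    n_slots, n_tokens = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # shift for numerical stability
    kernel = [[math.exp((s - top) / temperature) for s in row] for row in scores]
    u = [1.0] * n_slots
    v = [1.0] * n_tokens
    for _ in range(n_iters):
        # Rescale rows toward mass 1/S, then columns toward mass 1/N.
        u = [(1.0 / n_slots) / sum(k * vj for k, vj in zip(row, v)) for row in kernel]
        v = [
            (1.0 / n_tokens) / sum(kernel[i][j] * u[i] for i in range(n_slots))
            for j in range(n_tokens)
        ]
    return [
        [u[i] * kernel[i][j] * v[j] for j in range(n_tokens)]
        for i in range(n_slots)
    ]


def read_slots(tokens, slots, temperature=1.0, n_refine=3, n_sinkhorn=100):
    """Refine slot queries by repeated Sinkhorn routing over the tokens."""
    dim = len(tokens[0])
    plan = None
    for _ in range(n_refine):
        scores = [[cosine(q, x) for x in tokens] for q in slots]
        plan = sinkhorn_plan(scores, temperature, n_sinkhorn)
        # Each slot becomes the plan-weighted mean of the tokens routed to it.
        slots = [
            [sum(p * x[d] for p, x in zip(row, tokens)) / sum(row) for d in range(dim)]
            for row in plan
        ]
    return slots, plan


# Toy usage: six 2-D "frame tokens" and three initial slot queries.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.2], [-0.8, 0.1]]
init_slots = [[1.0, 0.2], [0.2, 1.0], [-1.0, 0.0]]
slots, plan = read_slots(tokens, init_slots)
for row in plan:
    print([round(p, 3) for p in row])
```

In this sketch the uniform marginals force every slot to take an equal share of the reference tokens, which is one way a routing step can keep slots from collapsing onto the same frames. Lowering `temperature` sharpens the assignment; raising it spreads each token more evenly across slots.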

Figure 2: Failures from single-image references. (a, c) Reference portraits. (b) Face deformation: view changes warp facial geometry (stretched cheeks/jawline, eye misalignment).

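Experimental results

Research questions

RQ1: Can a short reference video capture and encode identity dynamics that remain robust under large pose and expression changes?
RQ2: Can a Sinkhorn-based slot reader provide stable, dynamics-aware identity tokens without per-subject fine-tuning and thereby improve identity preservation?
RQ3: How does dynamics-informed identity conditioning affect prompt fidelity and visual realism?
RQ4: How much does the temporal order of the reference frames affect identity robustness?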

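Key findings

Slot-ID achieves state-of-the-art identity preservation while maintaining realism and prompt fidelity.
Sinkhorn-routed identity slots yield stable, motion-robust cues, improving performance under large pose changes and expressive behavior.
Slot-ID outperforms single-image baselines and other conditioning methods in face similarity and overall naturalness.
Ablations show that the slot-based encoder and temporal ordering are essential for maintaining identity across motion and occlusion.
In human evaluation (MOS), Slot-ID ranks highest in each of the Face Similarity, Visual Quality, and Text Alignment categories.
The method is tuning-free and only adds lightweight conditioning to a frozen backbone.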

Figure 3: Pipeline overview. A text prompt, a background-neutral face reference, and a reference video are encoded to provide conditioning signals for generation. A Sinkhorn-routed slot reader then iteratively refines learnable slot queries: (1) compute query–token similarity scores; (2) apply temp


This review was generated by AI and checked by a human editor.