QUICK REVIEW

[Paper Review] LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Jiahao Zhu, Shanshan Lao|arXiv (Cornell University)|Mar 13, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

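One-line summary

LibraGen balances the intrinsic strengths of the VGFM against its new S2V capability, using quality-focused data curation, Tune-to-Balance post-training, and time-dependent dynamic CFG to achieve strong multi-subject video generation when training data is limited.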

ABSTRACT

With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.
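For readers unfamiliar with direct preference optimization, the sketch below computes the standard pairwise DPO loss for a single (preferred, rejected) pair. It is the generic objective, not the paper's Consis-DPO or Real-Fake DPO formulation, which this review does not spell out. The function name, the `beta` default, and the log-likelihood inputs are illustrative assumptions.

```python
import math


def dpo_loss(policy_logp_win, policy_logp_lose,
             ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative only).

    Inputs are log-likelihoods of the preferred ("win") and rejected
    ("lose") samples under the trainable policy and a frozen reference.
    """
    # How much more the policy favors the preferred sample than the
    # reference model does.
    margin = ((policy_logp_win - ref_logp_win)
              - (policy_logp_lose - ref_logp_lose))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy and the reference agree, the loss equals log 2; it falls as the policy favors the preferred sample more strongly than the reference does.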

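Motivation and objectives

Identify the core trade-off in extending a VGFM to S2V generation (preserving native capabilities vs. subject consistency).
Propose a quality-first data curation pipeline for building high-quality S2V training data.
Develop a Tune-to-Balance post-training paradigm that reconciles the trade-off between in-pair and cross-pair data.
Design two DPO pipelines, Consis-DPO and Real-Fake DPO, and merge them to optimize the balance.
Introduce a time-dependent dynamic classifier-free guidance strategy that varies the influence of reference conditioning and the text prompt over time.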
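The quality-first curation objective can be pictured as a two-stage filter: an automated pass on per-clip scores, followed by a manual approval pass. The sketch below is a minimal illustration under assumed names. The `Clip` fields, the score ranges, and the thresholds are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    aesthetic_score: float  # hypothetical automated score in [0, 1]
    motion_score: float     # hypothetical automated score in [0, 1]
    human_approved: bool    # outcome of a manual review pass


def curate(clips, min_aesthetic=0.6, min_motion=0.5):
    """Keep clips that pass the automated thresholds and manual review."""
    # Stage 1: cheap automated filtering on precomputed scores.
    auto_pass = [
        c for c in clips
        if c.aesthetic_score >= min_aesthetic and c.motion_score >= min_motion
    ]
    # Stage 2: only clips a human reviewer approved survive.
    return [c for c in auto_pass if c.human_approved]
```

Running the automated stage first keeps the manual review workload small, which matters when the raw pool is orders of magnitude larger than the final subset.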

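Proposed method

A data curation pipeline that distills a million-scale raw dataset into a thousand-scale, high-quality, human-aligned subset using automated and manual filtering.
Lightweight subject injection into the MM-DiT diffusion backbone, enabling S2V while keeping constraints on the MM-DiT to a minimum.
Two-stage prompt rephrasing at inference to bridge the gap between user prompts and training captions.
SFT on in-pair and cross-pair data, with LoRA merging to tune the trade-off between subject fidelity and foundation-model capabilities.
Two DPO pipelines (Consis-DPO and Real-Fake DPO), merged to maintain the balance, with carefully constructed positive/negative pairs.
Time-dependent dynamic CFG at inference, adjusting the influence of reference conditioning and the text prompt across denoising steps.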
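The merging step in the list above can be illustrated with plain linear interpolation between two sets of weights, for example one tuned on in-pair data and one on cross-pair data. This is a sketch of a common merging technique, not necessarily the paper's exact procedure. Weights are modeled as dictionaries of flat float lists, and `merge_weights` and `alpha` are hypothetical names.

```python
def merge_weights(weights_a, weights_b, alpha):
    """Linearly interpolate two checkpoints: alpha * A + (1 - alpha) * B.

    Both checkpoints must share parameter names and shapes. Here each
    parameter is a flat list of floats to keep the example dependency-free.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if weights_a.keys() != weights_b.keys():
        raise KeyError("checkpoints have different parameter names")

    merged = {}
    for name, a in weights_a.items():
        b = weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]
    return merged
```

Sweeping `alpha` from 0 to 1 traces a path between the two checkpoints, which is one simple way to pick an operating point between subject fidelity and foundation-model capability.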

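Experimental results

Research questions

RQ1: Can subject consistency be achieved without compromising the VGFM's intrinsic motion and aesthetics?
RQ2: Can a quality-focused data curation approach improve S2V performance with little data?
RQ3: How should in-pair and cross-pair fine-tuning be balanced to optimize subject fidelity and prompt adherence?
RQ4: Do post-training optimization (DPO) strategies improve subject consistency without degrading motion or visual quality?
RQ5: Can a time-varying guidance strategy give finer control over the influence of the reference and the prompt at inference?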
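To make the time-varying guidance question concrete, the sketch below combines three noise predictions with guidance scales that change over the denoising trajectory. It uses a common two-condition classifier-free guidance form (as popularized by InstructPix2Pix) with a linear schedule. The review does not give the paper's actual schedule or combination rule, so the decomposition, the scale values, and all names are assumptions.

```python
def linear_schedule(progress, start, end):
    """Interpolate a guidance scale; progress runs from 0 (first step) to 1 (last)."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    return start + (end - start) * progress


def dynamic_cfg(eps_uncond, eps_ref, eps_ref_text, progress,
                ref_scales=(3.0, 1.5), text_scales=(2.0, 5.0)):
    """Two-condition CFG with time-dependent scales (illustrative values).

    eps_uncond:   prediction with no conditioning
    eps_ref:      prediction with reference images only
    eps_ref_text: prediction with reference images and text prompt
    """
    s_ref = linear_schedule(progress, *ref_scales)
    s_txt = linear_schedule(progress, *text_scales)
    # Push away from the unconditional prediction toward the reference,
    # then from reference-only toward reference plus text.
    return [
        u + s_ref * (r - u) + s_txt * (rt - r)
        for u, r, rt in zip(eps_uncond, eps_ref, eps_ref_text)
    ]
```

With these illustrative defaults, reference guidance dominates early steps and text guidance dominates late steps. Each step needs three forward passes instead of one, so guidance of this kind costs extra inference time.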

Key findings

Motion Quality	Visual Quality	Text Align.
0.5373	0.9924	0.6491
0.4965	0.9865	0.6479
0.3830	0.9853	0.6356
0.3844	0.9873	0.6410
0.5380	0.9930	0.6496

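LibraGen achieves state-of-the-art performance among open-source and commercial S2V models with only a thousand-scale training dataset.
Delivers strong motion smoothness and motion quality, with Motion Smoothness 0.5380 and Motion Quality 0.9930.
Achieves competitive visual aesthetics (AES 0.6496, IQA 71.60) and text alignment (TA 3.594).
Maintains strong subject consistency across single- and multi-subject tasks, with positive GSB ratios against all baselines (e.g., up to 0.700 against MAGREF).
Merging the Consis-DPO and Real-Fake DPO pipelines preserves visual and motion quality while maintaining subject consistency.
Dynamic CFG at inference improves Text Align without sacrificing other metrics, at the cost of increased latency.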
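The GSB figures above come from pairwise human comparisons. The review does not define the ratio, so the sketch below assumes the commonly used form (Good - Bad) / (Good + Same + Bad), where a positive value means the evaluated model is preferred more often than the baseline. The function name is hypothetical, and the counts in the usage note are made up for illustration, not taken from the paper.

```python
def gsb_ratio(good, same, bad):
    """GSB ratio under the assumed definition (G - B) / (G + S + B).

    good: comparisons where the evaluated model is judged better
    same: comparisons judged equal
    bad:  comparisons where the baseline is judged better
    """
    if min(good, same, bad) < 0:
        raise ValueError("counts must be non-negative")
    total = good + same + bad
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (good - bad) / total
```

For example, 75 wins, 20 ties, and 5 losses out of 100 comparisons give a ratio of 0.7 under this definition.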

This review was generated by AI and checked by a human editor.