QUICK REVIEW

[Paper Review] LibraGen: Playing a Balance Game in Subject-Driven Video Generation

Jiahao Zhu, Shanshan Lao | arXiv (Cornell University) | March 13, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

One-Line Summary

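LibraGen frames subject-to-video (S2V) generation as a balance between the intrinsic strengths of a video generation foundation model (VGFM) and its new S2V capability, and uses quality-focused data curation, Tune-to-Balance post-training, and time-dependent dynamic classifier-free guidance (CFG) to achieve strong multi-subject video generation with only a small amount of data.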

ABSTRACT

With the advancement of video generation foundation models (VGFMs), customized generation, particularly subject-to-video (S2V), has attracted growing attention. However, a key challenge lies in balancing the intrinsic priors of a VGFM, such as motion coherence, visual aesthetics, and prompt alignment, with its newly derived S2V capability. Existing methods often neglect this balance by enhancing one aspect at the expense of others. To address this, we propose LibraGen, a novel framework that views extending foundation models for S2V generation as a balance game between intrinsic VGFM strengths and S2V capability. Specifically, guided by the core philosophy of "Raising the Fulcrum, Tuning to Balance," we identify data quality as the fulcrum and advocate a quality-over-quantity approach. We construct a hybrid pipeline that combines automated and manual data filtering to improve overall data quality. To further harmonize the VGFM's native capabilities with its S2V extension, we introduce a Tune-to-Balance post-training paradigm. During supervised fine-tuning, both cross-pair and in-pair data are incorporated, and model merging is employed to achieve an effective trade-off. Subsequently, two tailored direct preference optimization (DPO) pipelines, namely Consis-DPO and Real-Fake DPO, are designed and merged to consolidate this balance. During inference, we introduce a time-dependent dynamic classifier-free guidance scheme to enable flexible and fine-grained control. Experimental results demonstrate that LibraGen outperforms both open-source and commercial S2V models using only thousand-scale training data.

Motivation and Objectives

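Identify the core trade-off in extending a VGFM to S2V generation: preserving the model's native capabilities versus maintaining subject consistency.
Propose a quality-over-quantity curation pipeline for building high-quality S2V training data.
Develop the Tune-to-Balance post-training paradigm to harmonize in-pair and cross-pair data.
Design two DPO pipelines (Consis-DPO and Real-Fake DPO) and merge them for balanced optimization.
Introduce a time-dependent dynamic classifier-free guidance (CFG) strategy for controllable inference.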
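
The merging mentioned in these objectives can be pictured with a small sketch. It assumes the simplest form of model merging, a linear interpolation of two sets of fine-tuned weight deltas on top of a shared base. The review does not state which merging rule LibraGen actually uses, and `merge_deltas`, `alpha`, and the parameter names are hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_deltas(base: Weights, delta_a: Weights, delta_b: Weights,
                 alpha: float) -> Weights:
    """Return base + alpha * delta_a + (1 - alpha) * delta_b per parameter.

    Assumption: both deltas were trained from the same base weights.
    A parameter missing from a delta is treated as unchanged (zero delta).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    merged: Weights = {}
    for name, base_vals in base.items():
        zeros = [0.0] * len(base_vals)
        a = delta_a.get(name, zeros)
        b = delta_b.get(name, zeros)
        merged[name] = [w + alpha * da + (1.0 - alpha) * db
                        for w, da, db in zip(base_vals, a, b)]
    return merged


# Toy weights, invented for illustration only.
base = {"attn.q": [1.0, 2.0], "attn.k": [0.5, -0.5]}
delta_variant_a = {"attn.q": [0.2, 0.0]}
delta_variant_b = {"attn.q": [0.0, 0.4], "attn.k": [0.1, 0.1]}

merged = merge_deltas(base, delta_variant_a, delta_variant_b, alpha=0.5)
```

Sweeping `alpha` between 0 and 1 moves the merged model from one variant toward the other, which is one way a trade-off between two fine-tuned behaviors could be tuned without retraining.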

Proposed Method

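A data curation pipeline that uses automated and manual filtering to distill a million-scale raw dataset into a thousand-scale, high-quality, human-aligned subset.
Lightweight subject injection into the MM-DiT diffusion backbone, enabling S2V with minimal changes to the base model.
A two-stage prompt rewriter that bridges the gap between user prompts at inference and training captions.
SFT on in-pair and cross-pair data, with LoRA merging used to tune the trade-off between subject fidelity and foundation-model capabilities.
Merging of two DPO pipelines (Consis-DPO and Real-Fake DPO), built on carefully constructed positive and negative samples, to reinforce the balance.
Time-dependent dynamic CFG that adjusts the influence of the reference conditioning and the text prompt across denoising steps during inference.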
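
As a rough illustration of the last item, the sketch below applies classifier-free guidance with two scales that change over the denoising trajectory, one for the reference condition and one for the text prompt. The three-branch decomposition, the linear schedules, and the numeric ranges are assumptions chosen for illustration, not the schedule from the paper; `guidance_scales` and `dynamic_cfg` are hypothetical names.

```python
from typing import List, Tuple

Vector = List[float]


def guidance_scales(t: float,
                    ref_range: Tuple[float, float] = (4.0, 1.5),
                    txt_range: Tuple[float, float] = (2.0, 6.0)) -> Tuple[float, float]:
    """Linearly interpolate both guidance scales over t in [0, 1].

    t = 0 is the first (noisiest) denoising step and t = 1 the last.
    The (start, end) ranges are placeholder values, not tuned settings.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    s_ref = ref_range[0] + t * (ref_range[1] - ref_range[0])
    s_txt = txt_range[0] + t * (txt_range[1] - txt_range[0])
    return s_ref, s_txt


def dynamic_cfg(eps_uncond: Vector, eps_ref: Vector, eps_full: Vector,
                t: float) -> Vector:
    """Combine three noise predictions with time-dependent scales.

    eps_uncond: prediction with neither reference nor text
    eps_ref:    prediction with the reference images only
    eps_full:   prediction with reference images and text prompt
    """
    s_ref, s_txt = guidance_scales(t)
    return [u + s_ref * (r - u) + s_txt * (f - r)
            for u, r, f in zip(eps_uncond, eps_ref, eps_full)]


# Toy predictions: the reference shifts dimension 0, the text shifts dimension 1.
eps_u, eps_r, eps_f = [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]
early = dynamic_cfg(eps_u, eps_r, eps_f, t=0.0)
late = dynamic_cfg(eps_u, eps_r, eps_f, t=1.0)
```

With these placeholder ranges the reference direction is weighted more heavily early and the text direction more heavily late. Under this formulation each denoising step needs three model evaluations instead of one.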

Experimental Results

Research Questions

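RQ1: How can robust subject-consistent video generation be achieved without harming the VGFM's intrinsic motion and aesthetics?
RQ2: Can a quality-focused data curation approach improve S2V performance even with limited data?
RQ3: How can in-pair and cross-pair fine-tuning be balanced to optimize subject fidelity and prompt adherence?
RQ4: Do post-training optimization (DPO) strategies improve subject consistency without degrading motion or visual quality?
RQ5: Can a time-varying guidance strategy over reference and prompt influence provide finer control during inference?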

Key Results

Motion Quality	Visual Quality	Text Align.
0.5373	0.9924	0.6491
0.4965	0.9865	0.6479
0.3830	0.9853	0.6356
0.3844	0.9873	0.6410
0.5380	0.9930	0.6496

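Trained on only a thousand-scale dataset, LibraGen achieves state-of-the-art performance among open-source and commercial S2V models.
On the reported metrics, it delivers strong motion smoothness and motion quality, with Motion Smoothness 0.5380 and Motion Quality 0.9930.
It achieves visual aesthetics (AES 0.6496, IQA 71.60) and text alignment (TA 3.594) that are competitive with the baselines.
It maintains strong subject consistency across single-subject and multi-subject tasks, as shown by positive GSB ratios against all baselines (e.g., up to 0.700 against MAGREF).
The two DPO pipelines (Consis-DPO and Real-Fake DPO) are merged to preserve subject consistency while maintaining visual and motion quality.
Dynamic CFG at inference improves text alignment without degrading the other metrics, at the cost of increased latency.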
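
For readers unfamiliar with GSB (Good/Same/Bad) scores, the sketch below uses one common definition, (G - B) / (G + S + B), under which a positive value means raters preferred the evaluated model more often than the baseline. The review does not give the paper's exact formula, so this definition is an assumption, and the counts are invented for illustration rather than taken from the paper.

```python
def gsb_score(good: int, same: int, bad: int) -> float:
    """Return (good - bad) / (good + same + bad).

    good: comparisons where the evaluated model was preferred
    same: comparisons judged a tie
    bad:  comparisons where the baseline was preferred
    Assumed definition; the result lies in [-1, 1].
    """
    if min(good, same, bad) < 0:
        raise ValueError("counts must be non-negative")
    total = good + same + bad
    if total == 0:
        raise ValueError("at least one comparison is required")
    return (good - bad) / total


# Invented counts: 6 wins, 3 ties, 1 loss out of 10 comparisons.
score = gsb_score(good=6, same=3, bad=1)
```

Under this definition a score of 0 means wins and losses cancel out, and ties pull the magnitude toward 0 without changing the sign.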

This review was generated by AI and reviewed by a human editor.