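[Paper Review] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation
ReDiStory is a training-free prompt embedding reorganization method. It separates identity prompts from frame-specific prompts to reduce inter-frame interference, improving subject consistency in multi-frame visual stories without modifying the diffusion model.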
Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory
Research Motivation and Goals
- Identify why inter-frame semantic interference causes identity drift in multi-frame visual storytelling.
- Propose a training-free framework that decouples identity and frame semantics at inference time.
- Show that prompt embedding reorganization improves identity consistency without compromising prompt fidelity.
Proposed Method
- Decompose joint identity+frame prompt embeddings into identity and frame-specific components.
- Compute frame-specific embedding decorrelation by removing shared directions across frames via projection onto other frames' embeddings.
- Reconstruct reorganized prompt embeddings and generate each frame with the diffusion model without changing its parameters.
- Operate entirely at inference time with no additional supervision or optimization.
- Analyze computational overhead, which scales quadratically with the number of frames but remains modest relative to diffusion inference.
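The decorrelation step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each frame prompt has already been reduced to a single embedding vector, and the names `decorrelate_frames`, `reorganize`, and the `strength` parameter are hypothetical. The exact projection rule and the way identity and frame embeddings are recombined in ReDiStory may differ.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decorrelate_frames(frames: List[Vector], strength: float = 1.0) -> List[Vector]:
    """Subtract from each frame embedding its projections onto the other frames.

    Assumption: projections are computed from the ORIGINAL embeddings and
    scaled by `strength` (a hypothetical knob; 1.0 removes the full projection).
    """
    result = []
    for i, f_i in enumerate(frames):
        residual = list(f_i)
        for j, f_j in enumerate(frames):
            if i == j:
                continue
            norm_sq = dot(f_j, f_j)
            if norm_sq == 0.0:
                continue  # a zero vector defines no direction to remove
            coeff = strength * dot(f_i, f_j) / norm_sq
            residual = [r - coeff * x for r, x in zip(residual, f_j)]
        result.append(residual)
    return result


def reorganize(identity: Vector, frames: List[Vector], strength: float = 1.0) -> List[Vector]:
    """Rebuild one prompt embedding per frame: identity part, then the decorrelated frame part.

    Assumption: recombination is plain concatenation along the feature axis.
    """
    return [identity + f for f in decorrelate_frames(frames, strength)]


if __name__ == "__main__":
    identity = [0.2, 0.9]
    frames = [[1.0, 0.0], [1.0, 1.0]]
    for embedding in reorganize(identity, frames):
        print(embedding)
```

The nested loop performs one projection per ordered pair of frames, which is where the quadratic scaling in the number of frames comes from. Each projection is a handful of vector operations, so the cost stays small next to the diffusion sampling itself.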
Experimental Results
Research Questions
- RQ1: Does decoupling identity-related and frame-specific embeddings reduce cross-frame interference in multi-frame generation?
- RQ2: Can inference-time prompt embedding reorganization improve identity consistency while preserving prompt fidelity?
- RQ3: What is the computational trade-off of the proposed method compared to baseline training-free approaches?
Key Results
- ReDiStory yields consistent improvements in identity consistency over the strongest baseline among training-free methods.
- On the ConsiStory+ benchmark, ReDiStory achieves higher CLIP-I and lower DreamSim than 1Prompt1Story while maintaining prompt fidelity (CLIP-T).
- The method adds a small memory and inference-time overhead relative to the baselines.
- Ablation shows that removing reorganization or using only identity-related reorganization degrades performance, with full ReDiStory providing the best results.
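To make the identity-consistency numbers more concrete, the sketch below scores a story by the mean pairwise cosine similarity of its frames' image features, which is the general idea behind CLIP-I-style metrics. It assumes the feature vectors were already extracted by an image encoder. The function names are hypothetical, and the benchmark's actual evaluation protocol may differ in its details.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_pairwise_similarity(features: List[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of frame features.

    Assumption: `features` holds one precomputed image embedding per frame.
    """
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two frames to measure consistency")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy features: two frames agree, the third points elsewhere.
    story = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(mean_pairwise_similarity(story))
```

A higher score means the frames look more alike in feature space. Distance-style metrics such as DreamSim read the other way, which is why a lower DreamSim counts as an improvement.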
This review was generated by AI and reviewed by a human editor.