QUICK REVIEW

[Paper Review] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

Ayushman Sarkar, Zhenyu Yu | arXiv (Cornell University) | 2026-02-01
Generative Adversarial Networks and Image Synthesis | Citations: 0
One-Line Summary

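ReDiStory is a training-free prompt embedding reorganization method that separates identity prompts from frame-specific prompts, reducing inter-frame interference and improving subject consistency in multi-frame visual stories without modifying the diffusion model.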

ABSTRACT

Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory

Research Motivation and Objectives

  • Identify why inter-frame semantic interference causes identity drift in multi-frame visual storytelling.
  • Propose a training-free framework that decouples identity and frame semantics at inference time.
  • Show that prompt embedding reorganization improves identity consistency without compromising prompt fidelity.

Proposed Method

  • Decompose joint identity+frame prompt embeddings into identity and frame-specific components.
  • Compute frame-specific embedding decorrelation by removing shared directions across frames via projection onto other frames' embeddings.
  • Reconstruct reorganized prompt embeddings and generate each frame with the diffusion model without changing its parameters.
  • Operate entirely at inference time with no additional supervision or optimization.
  • Analyze computational overhead, which scales quadratically with the number of frames but remains modest relative to diffusion inference.
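
The decorrelation step listed above can be illustrated with a minimal sketch. It assumes each frame prompt has already been reduced to a single embedding vector (plain Python lists here for clarity) and that "suppressing shared directions" means subtracting each frame's projection onto every other frame's embedding. The function name `decorrelate_frames` and the `strength` parameter are hypothetical and not taken from the ReDiStory code; the paper's exact formulation (for example per-token handling or weighting) may differ.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def decorrelate_frames(frames, strength=1.0, eps=1e-12):
    """Subtract from each frame vector its projections onto the other frames.

    frames   : list of frame embedding vectors (lists of floats)
    strength : hypothetical scaling of the subtracted projection (1.0 = full)
    Returns a new list; the input vectors are not modified.
    """
    cleaned = []
    for i, frame in enumerate(frames):
        result = list(frame)
        for j, other in enumerate(frames):
            if i == j:
                continue
            norm_sq = dot(other, other)
            if norm_sq < eps:
                continue  # skip near-zero vectors to avoid division by zero
            # Projection coefficient of the ORIGINAL frame onto the other frame.
            coeff = strength * dot(frame, other) / norm_sq
            result = [x - coeff * y for x, y in zip(result, other)]
        cleaned.append(result)
    return cleaned


# Toy example with two 2-D "frame embeddings" that share a direction.
frames = [[1.0, 0.0], [1.0, 1.0]]
cleaned = decorrelate_frames(frames)
print(cleaned)  # [[0.5, -0.5], [0.0, 1.0]]
```

The nested loop visits every ordered pair of frames, which is where the quadratic scaling in the number of frames comes from. With only two frames, each cleaned vector is exactly orthogonal to the other frame's original embedding; with more frames this simple pairwise subtraction reduces overlap but does not guarantee exact orthogonality.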

Experimental Results

Research Questions

  • RQ1: Does decoupling identity-related and frame-specific embeddings reduce cross-frame interference in multi-frame generation?
  • RQ2: Can inference-time prompt embedding reorganization improve identity consistency while preserving prompt fidelity?
  • RQ3: What is the computational trade-off of the proposed method compared to baseline training-free approaches?

Key Results

  • ReDiStory consistently improves identity consistency over 1Prompt1Story, the strongest training-free baseline.
  • On the ConsiStory+ benchmark, ReDiStory achieves higher CLIP-I and lower DreamSim than 1Prompt1Story while maintaining prompt fidelity (CLIP-T).
  • The method adds a modest memory and inference-time overhead relative to the baseline.
  • Ablation shows that removing reorganization or using only identity-related reorganization degrades performance, with full ReDiStory providing the best results.
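
To make the identity-consistency numbers above more concrete, here is a minimal sketch of how a CLIP-I style score is commonly computed: the mean pairwise cosine similarity between image embeddings of the frames in one story. This is an assumption about the usual metric definition, not the benchmark's actual evaluation code. The embeddings are assumed to be precomputed by an image encoder, and `mean_pairwise_cosine` is a hypothetical helper name.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of frame embeddings.

    Higher values mean the frames look more alike in embedding space,
    which is read as better identity consistency.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy example: two identical "frames" and one unrelated frame.
story = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = mean_pairwise_cosine(story)  # one pair scores 1, two pairs score 0
```

DreamSim, by contrast, is a distance-style measure, which is why lower values are reported as better in the results above.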

This review was generated by AI and reviewed by a human editor.