QUICK REVIEW

[Paper Review] ReDiStory: Region-Disentangled Diffusion for Consistent Visual Story Generation

Ayushman Sarkar, Zhenyu Yu | arXiv (Cornell University) | 2026. 02. 01.

Generative Adversarial Networks and Image Synthesis | Citations: 0

One-line summary

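ReDiStory is a training-free prompt embedding reorganization method that separates identity prompts from frame-specific prompts, reducing inter-frame interference and improving subject consistency in multi-frame visual stories without changing the diffusion model.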

ABSTRACT

Generating coherent visual stories requires maintaining subject identity across multiple images while preserving frame-specific semantics. Recent training-free methods concatenate identity and frame prompts into a unified representation, but this often introduces inter-frame semantic interference that weakens identity preservation in complex stories. We propose ReDiStory, a training-free framework that improves multi-frame story generation via inference-time prompt embedding reorganization. ReDiStory explicitly decomposes text embeddings into identity-related and frame-specific components, then decorrelates frame embeddings by suppressing shared directions across frames. This reduces cross-frame interference without modifying diffusion parameters or requiring additional supervision. Under identical diffusion backbones and inference settings, ReDiStory improves identity consistency while maintaining prompt fidelity. Experiments on the ConsiStory+ benchmark show consistent gains over 1Prompt1Story on multiple identity consistency metrics. Code is available at: https://github.com/YuZhenyuLindy/ReDiStory

Research motivation and goals

Identify why inter-frame semantic interference causes identity drift in multi-frame visual storytelling.
Propose a training-free framework that decouples identity and frame semantics at inference time.
Show that prompt embedding reorganization improves identity consistency without compromising prompt fidelity.

Proposed method

Decompose joint identity+frame prompt embeddings into identity and frame-specific components.
Compute frame-specific embedding decorrelation by removing shared directions across frames via projection onto other frames' embeddings.
Reconstruct reorganized prompt embeddings and generate each frame with the diffusion model without changing its parameters.
Operate entirely at inference time with no additional supervision or optimization.
Analyze computational overhead, which scales quadratically with the number of frames but remains modest relative to diffusion inference.
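
The decorrelation step listed above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each frame prompt is already reduced to a single vector, subtracts from every frame embedding its projection onto each other frame's embedding, and then pairs the result with an unchanged identity embedding. The names `decorrelate_frames`, `reorganize_prompts`, and the `strength` parameter are hypothetical and introduced here only for illustration. The nested loop over frame pairs is what makes the cost grow quadratically with the number of frames.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decorrelate_frames(frame_embs: List[Vector], strength: float = 1.0) -> List[Vector]:
    """Suppress directions that a frame embedding shares with the other frames.

    Assumption: one pooled vector per frame. `strength` is a hypothetical
    knob (0.0 = no change, 1.0 = subtract each full pairwise projection).
    """
    decorrelated = []
    for i, emb in enumerate(frame_embs):
        residual = list(emb)
        for j, other in enumerate(frame_embs):
            if i == j:
                continue
            norm_sq = dot(other, other)
            if norm_sq == 0.0:
                continue  # a zero vector defines no direction to remove
            # Projection coefficient of the ORIGINAL embedding onto frame j.
            coeff = strength * dot(emb, other) / norm_sq
            residual = [r - coeff * o for r, o in zip(residual, other)]
        decorrelated.append(residual)
    return decorrelated


def reorganize_prompts(identity_emb: Vector, frame_embs: List[Vector]) -> List[List[Vector]]:
    """Build one prompt per frame as [identity token, decorrelated frame token].

    Assumption: the identity component is kept unchanged and simply placed
    in front of the frame component; the actual layout may differ.
    """
    return [[list(identity_emb), frame] for frame in decorrelate_frames(frame_embs)]
```

With exactly two frames, each output vector is orthogonal to the other frame's original embedding. With more frames, summing pairwise projections can remove more than the shared subspace, so a smaller `strength` or a joint least-squares projection may be a better fit; which variant the paper uses is not stated in this review.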

Experimental results

Research questions

RQ1: Does decoupling identity-related and frame-specific embeddings reduce cross-frame interference in multi-frame generation?
RQ2: Can inference-time prompt embedding reorganization improve identity consistency while preserving prompt fidelity?
RQ3: What is the computational trade-off of the proposed method compared to baseline training-free approaches?

Key results

ReDiStory yields consistent improvements in identity consistency over the strongest baseline among training-free methods.
Under the ConsiStory+ benchmark, ReDiStory achieves higher CLIP-I and lower DreamSim than 1Prompt1Story while maintaining prompt fidelity (CLIP-T).
The method adds a modest memory and inference-time overhead relative to the baselines.
Ablation shows that removing reorganization or using only identity-related reorganization degrades performance, with full ReDiStory providing the best results.
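
To make the identity-consistency numbers above easier to read, the sketch below shows one common way a CLIP-I style score is computed: the mean pairwise cosine similarity between image embeddings of the frames in a story. This is an assumption about the general recipe, not the ConsiStory+ evaluation code. The embeddings are taken as precomputed vectors, and `mean_pairwise_cosine` is a hypothetical helper name.

```python
import math
from itertools import combinations
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mean_pairwise_cosine(image_embs: List[Sequence[float]]) -> float:
    """Average cosine similarity over all unordered pairs of frame embeddings.

    Higher values mean the frames look more alike in embedding space,
    which is read as better identity consistency.
    """
    pairs = list(combinations(image_embs, 2))
    if not pairs:
        raise ValueError("need at least two frame embeddings")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

DreamSim is read in the opposite direction: it is a distance, so lower values indicate more consistent frames, which matches the "higher CLIP-I and lower DreamSim" result reported above.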


This review was generated by AI and reviewed by a human editor.