QUICK REVIEW

[Paper Review] SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

Youngwoo Shin, Jiwan Hur | arXiv (Cornell University) | 2026. 02. 05.

Generative Adversarial Networks and Image Synthesis | Citations: 0

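One-Line Summary

SSG provides training-free, inference-time guidance for multi-scale visual autoregressive generation by emphasizing the high-frequency semantic residual isolated from a coarser prior, which is obtained with the frequency-domain Discrete Spatial Enhancement (DSE) procedure.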

ABSTRACT

Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.

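Motivation and Objectives

Identify and address the train-inference drift in multi-scale visual autoregressive (VAR) generation, which arises from limited model capacity and accumulated error.
Develop a method that preserves the coarse-to-fine hierarchy by ensuring each scale contributes high-frequency content not explained by earlier scales.
Propose a frequency-domain procedure for constructing the coarse prior (from which the semantic residual is isolated) and an inference-time guidance mechanism, both applicable across VAR models.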

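Proposed Method

Define the target high-frequency signal as the semantic residual, isolated from a coarser prior.
Introduce Discrete Spatial Enhancement (DSE), a frequency-domain procedure that sharpens and better isolates the semantic residual.
Apply Scaled Spatial Guidance (SSG) as training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence.
Remain compatible with any VAR model that uses discrete visual tokens, regardless of tokenization design or conditioning modality.
Demonstrate gains in fidelity and diversity with low latency overhead.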
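
The steps above can be illustrated with a small NumPy sketch. This review does not give the paper's formulas, so everything here is an assumption: the coarse prior is approximated by a plain FFT low-pass filter (a stand-in, not the paper's DSE), and the guidance is assumed to be a simple extrapolation along the residual, similar in form to classifier-free guidance. The names `lowpass_prior`, `guide`, `keep`, and `beta` are hypothetical.

```python
import numpy as np


def lowpass_prior(x: np.ndarray, keep: int) -> np.ndarray:
    """Hypothetical coarse prior: keep only frequencies |f| <= keep on each axis of a 2D map.

    This is a plain FFT low-pass filter, used here as a stand-in for DSE.
    """
    h, w = x.shape
    # Clamp so the kept band stays symmetric around DC and inside the array.
    keep = max(0, min(keep, (min(h, w) - 1) // 2))
    spec = np.fft.fftshift(np.fft.fft2(x))  # DC moves to (h // 2, w // 2)
    cy, cx = h // 2, w // 2
    mask = np.zeros((h, w), dtype=bool)
    mask[cy - keep: cy + keep + 1, cx - keep: cx + keep + 1] = True
    # A symmetric mask on a real input gives a real output (up to rounding).
    return np.fft.ifft2(np.fft.ifftshift(spec * mask)).real


def guide(x: np.ndarray, keep: int, beta: float) -> np.ndarray:
    """Assumed guidance rule: amplify the residual not explained by the coarse prior."""
    prior = lowpass_prior(x, keep)
    residual = x - prior  # high-frequency "semantic residual" in this sketch
    return x + beta * residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(8, 8))  # e.g. one channel of a scale's logits
    guided = guide(logits, keep=2, beta=1.0)
    # The low-frequency band is left untouched; only the residual is scaled.
    print(np.allclose(lowpass_prior(guided, 2), lowpass_prior(logits, 2)))
```

In this sketch `beta = 0` leaves the input unchanged, and larger values push the current scale further from what the coarse prior already explains while the low-frequency content stays fixed.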

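Experimental Results

Research Questions

RQ1: How can each scale of a multi-scale VAR model contribute high-frequency semantic content not captured by earlier scales, thereby mitigating the train-inference discrepancy?
RQ2: Can SSG improve fidelity and diversity across different tokenization designs and conditioning modalities without sacrificing inference speed?
RQ3: Is the proposed frequency-domain procedure for constructing the prior (DSE) generally effective across VAR architectures?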

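Key Findings

SSG yields consistent gains in the fidelity of multi-scale visual autoregressive generation.
SSG yields consistent gains in the diversity of multi-scale visual autoregressive generation.
SSG improves generation quality while preserving low latency.
SSG applies broadly, without training, to VAR models that use discrete visual tokens, regardless of tokenization design or conditioning modality.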

This review was generated by AI and reviewed by a human editor.