QUICK REVIEW

[Paper Review] SSG: Scaled Spatial Guidance for Multi-Scale Visual Autoregressive Generation

Youngwoo Shin, Jiwan Hur|arXiv (Cornell University)|Feb 5, 2026

Generative Adversarial Networks and Image Synthesis|Citations: 0

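One-line summary

SSG is a training-free, inference-time guidance method for multi-scale visual autoregressive generation. It emphasizes the high-frequency semantic residual isolated from a coarser prior, and that prior is built with a frequency-domain procedure called Discrete Spatial Enhancement (DSE).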

ABSTRACT

Visual autoregressive (VAR) models generate images through next-scale prediction, naturally achieving coarse-to-fine, fast, high-fidelity synthesis mirroring human perception. In practice, this hierarchy can drift at inference time, as limited capacity and accumulated error cause the model to deviate from its coarse-to-fine nature. We revisit this limitation from an information-theoretic perspective and deduce that ensuring each scale contributes high-frequency content not explained by earlier scales mitigates the train-inference discrepancy. With this insight, we propose Scaled Spatial Guidance (SSG), training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence. SSG emphasizes target high-frequency signals, defined as the semantic residual, isolated from a coarser prior. To obtain this prior, we leverage a principled frequency-domain procedure, Discrete Spatial Enhancement (DSE), which is devised to sharpen and better isolate the semantic residual through frequency-aware construction. SSG applies broadly across VAR models leveraging discrete visual tokens, regardless of tokenization design or conditioning modality. Experiments demonstrate SSG yields consistent gains in fidelity and diversity while preserving low latency, revealing untapped efficiency in coarse-to-fine image generation. Code is available at https://github.com/Youngwoo-git/SSG.

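Motivation and objectives

Motivate and address the train-inference drift in multi-scale visual autoregressive (VAR) generation, which arises from limited model capacity and accumulated error.
Develop a method that preserves the coarse-to-fine hierarchy by ensuring each scale contributes high-frequency content not explained by earlier scales.
Propose a frequency-domain procedure for constructing the coarse prior (used to isolate the semantic residual), together with an inference-time guidance mechanism that applies across VAR models.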

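Proposed method

Define the target high-frequency signal as the semantic residual, isolated from a coarser prior.
Introduce Discrete Spatial Enhancement (DSE), a frequency-domain procedure that constructs this prior so as to sharpen and better isolate the semantic residual.
Apply Scaled Spatial Guidance (SSG) as training-free, inference-time guidance that steers generation toward the intended hierarchy while maintaining global coherence.
Keep the method compatible across VAR models that use discrete visual tokens, independent of tokenization design or conditioning modality.
Demonstrate gains in fidelity and diversity while preserving low latency.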
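The review does not give the actual SSG update rule or the DSE construction, so the following is only a rough sketch of the general idea under stated assumptions: a coarse-scale logit map is upsampled in the frequency domain (zero-padding its spectrum) to act as a prior, the residual between the current-scale logits and that prior is treated as the "semantic residual", and the logits are pushed along that residual. The function names `freq_upsample` and `guide`, the strength parameter `beta`, and the additive extrapolation form are all hypothetical illustrations, not the authors' implementation; the linked repository is the place to look for the real method.

```python
import numpy as np

def freq_upsample(coarse, size):
    """Upsample an (h, w, ...) map to (H, W, ...) by zero-padding its centered 2D spectrum.

    Hypothetical stand-in for a frequency-aware prior; NOT the paper's DSE.
    """
    h, w = coarse.shape[:2]
    H, W = size
    # Move the zero-frequency (DC) term to the center of the spectrum.
    spec = np.fft.fftshift(np.fft.fft2(coarse, axes=(0, 1)), axes=(0, 1))
    # Embed the coarse spectrum in a larger all-zero spectrum, keeping DC aligned.
    padded = np.zeros((H, W) + coarse.shape[2:], dtype=complex)
    top, left = H // 2 - h // 2, W // 2 - w // 2
    padded[top:top + h, left:left + w] = spec
    up = np.fft.ifft2(np.fft.ifftshift(padded, axes=(0, 1)), axes=(0, 1)).real
    # Rescale so that amplitudes (e.g. a constant map) are preserved.
    return up * (H * W) / (h * w)

def guide(logits, coarse_logits, beta=0.5):
    """Push current-scale logits along their residual w.r.t. the upsampled coarse prior.

    Assumed form: logits + beta * (logits - prior). beta = 0 disables guidance.
    """
    prior = freq_upsample(coarse_logits, logits.shape[:2])
    residual = logits - prior  # content the coarser scale does not explain
    return logits + beta * residual

# Toy usage with random logits: a 4x4 coarse scale, an 8x8 current scale, 16 token classes.
rng = np.random.default_rng(0)
coarse = rng.normal(size=(4, 4, 16))
fine = rng.normal(size=(8, 8, 16))
guided = guide(fine, coarse, beta=0.5)
```

In this sketch the guidance costs one small FFT pair per scale and no extra forward pass through the model, which fits the review's description of a low-latency, training-free method, though the real overhead depends on the authors' actual procedure.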

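Experimental results

Research questions

RQ1: How can each scale of a multi-scale VAR model contribute high-frequency semantic content not captured by earlier scales, and thereby mitigate the train-inference discrepancy?
RQ2: Can SSG improve fidelity and diversity across different tokenization designs and conditioning modalities without sacrificing inference speed?
RQ3: Is the proposed frequency-domain prior construction (DSE) effective across VAR architectures in general?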

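Key findings

SSG yields consistent gains in fidelity for multi-scale visual autoregressive generation.
SSG yields consistent gains in diversity for multi-scale visual autoregressive generation.
SSG improves generation quality while preserving low latency.
SSG is training-free and applies broadly to VAR models that use discrete visual tokens, across a range of conditioning modalities.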

This review was generated by AI and checked by a human editor.