[Paper Explained] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis
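tl;dr: SHARP introduces a spectrum-aware, training-free dynamic RoPE schedule that raises the output resolution of large-scale remote sensing (RS) text-to-image synthesis. It is applied on top of RS-FLUX, a FLUX prior fine-tuned on RS imagery. It preserves high-frequency detail and global layout while producing better multi-scale RS outputs.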
Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.
Research Motivation and Goals
- Motivate large-scale RS text-to-image generation and address the lack of a domain-specialized diffusion prior.
- Develop a training-free resolution-promotion method that preserves high-frequency RS content during denoising.
- Propose a spectrum-aware dynamic RoPE scheduler to align extrapolation strength with diffusion denoising stages.
- Demonstrate robust multi-scale RS synthesis from a single hyperparameter set across square and rectangular resolutions.
Proposed Method
- Fine-tune FLUX on a large RS image corpus to obtain a domain-specialized prior (RS-FLUX).
- Introduce SHARP, a training-free method that dynamically adapts RoPE frequencies during denoising via a Rational Decay Scheduler (RDS) and a time-dependent frequency ramp.
- RDS maps denoising time to a decay factor that governs promotion strength across steps.
- A dynamic ramp function allocates promotion across RoPE dimensions based on a time-varying frequency ratio r(d).
- SHARP is resolution-agnostic: a single configuration works for multiple target sizes by updating the promotion factor s only.
- Provide empirical and analytical evidence showing static RoPE extrapolation harms RS due to its dense high-frequency content.
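The interplay between the scheduler and the ramp can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the exact form of k_rs(t), so `rational_decay` below uses a hypothetical rational function t / (t + alpha * (1 - t)), and the per-dimension ramp is a fixed YaRN-style linear ramp standing in for the paper's time-dependent one. The names `rational_decay`, `promotion_strength`, `ramp`, and `promoted_frequencies`, the thresholds `lo` and `hi`, the default `alpha`, and the choice of 64 tokens per side as the native length are all assumptions made for illustration. Time follows the convention t = 1 for pure noise and t = 0 for the final image.

```python
import math


def rational_decay(t: float, alpha: float = 2.0) -> float:
    """Hypothetical rational decay: 1 at t=1 (pure noise), 0 at t=0 (clean image)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return t / (t + alpha * (1.0 - t))


def promotion_strength(t: float, s: float, alpha: float = 2.0) -> float:
    """Effective scale k(t): equals the promotion factor s early, relaxes to 1 late."""
    return 1.0 + (s - 1.0) * rational_decay(t, alpha)


def ramp(r: float, lo: float = 1.0, hi: float = 8.0) -> float:
    """YaRN-style linear ramp (assumed): 0 for low-frequency dims, 1 for high-frequency dims."""
    return min(1.0, max(0.0, (r - lo) / (hi - lo)))


def promoted_frequencies(t, s, dim=64, base=10000.0, train_len=64, alpha=2.0):
    """Per-pair RoPE frequencies at denoising time t for promotion factor s."""
    k = promotion_strength(t, s, alpha)
    freqs = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)          # standard RoPE frequency
        r = train_len * theta / (2.0 * math.pi)   # rotations over the native length
        g = ramp(r)
        # High-frequency pairs (g=1) stay untouched; low-frequency pairs (g=0)
        # are compressed by the current strength k.
        freqs.append(theta * (g + (1.0 - g) / k))
    return freqs


if __name__ == "__main__":
    s = 2048 / 1024  # assumed definition: target side length over native side length
    for t in (1.0, 0.75, 0.5, 0.25, 0.0):
        print(f"t={t:.2f}  k(t)={promotion_strength(t, s):.3f}")
```

Under these assumptions, changing the target resolution only changes `s`; `alpha`, `lo`, and `hi` stay fixed, which mirrors the resolution-agnostic claim above. The strength starts at `s` during layout formation and decays to 1 during detail recovery, so the lowest-frequency RoPE pairs are compressed most at the start and not at all at the end.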
Experimental Results
Research Questions
- RQ1: Can a domain-specific diffusion prior for RS improve text-to-image generation quality at native resolutions and under large-scale extrapolation?
- RQ2: Does a spectrum-aware, time-varying RoPE extrapolation strategy better preserve high-frequency RS details during diffusion-based generation than static methods?
- RQ3: Is SHARP capable of robust multi-scale RS synthesis from a single hyperparameter set across diverse resolutions?
- RQ4: What is the qualitative and quantitative impact of RS-specific fine-tuning combined with dynamic RoPE timing on RS realism and layout fidelity?
Key Findings
- RS-FLUX (RS-specific fine-tuned prior) outperforms vanilla FLUX at native resolution (1024×1024) in CLIP, Aesthetic, and HPSv2 scores.
- SHARP consistently outperforms training-free baselines across six resolutions in CLIP, Aesthetic, and HPSv2, with larger gains at higher extrapolation factors.
- Ablation shows combining RS-FLUX with SHARP yields the best results, and SHARP alone provides substantial gains over the base model.
- SHARP maintains multi-scale consistency: the same prompt generates coherent layouts across diverse resolutions from 1024×1024 to 3756×2560, with finer details emerging at higher resolutions.
- SHARP adds negligible computational overhead (≤1.5% increase in inference time).
- The scheduling form (rational decay) and hyperparameters (αs, α, β) are shown to be robust and near-optimal across ablations.
This explainer was generated by AI and reviewed by a human editor.