QUICK REVIEW

[Paper Review] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

Bingxuan Zhao, Qing Zhou | arXiv (Cornell University) | Mar 23, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

One-Sentence Summary

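TL;DR: SHARP introduces a spectrum-aware, training-free dynamic RoPE schedule that promotes resolution in large-scale remote sensing (RS) text-to-image synthesis, applied on top of a diffusion prior fine-tuned for RS. It produces better multi-scale RS outputs while preserving both high-frequency detail and global layout.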

ABSTRACT

Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.

Research Motivation and Objectives

Motivate large-scale RS text-to-image generation and address the lack of a domain-specialized diffusion prior.
Develop a training-free resolution-promotion method that preserves high-frequency RS content during denoising.
Propose a spectrum-aware dynamic RoPE scheduler to align extrapolation strength with diffusion denoising stages.
Demonstrate robust multi-scale RS synthesis from a single hyperparameter set across square and rectangular resolutions.

Proposed Method

Fine-tune FLUX on a large RS image corpus to obtain a domain-specialized prior (RS-FLUX).
Introduce SHARP, a training-free method that dynamically adapts RoPE frequencies during denoising via a Rational Decay Scheduler (RDS) and a time-dependent frequency ramp.
The RDS maps denoising time to a decay factor that governs promotion strength across steps.
A dynamic ramp function allocates promotion across RoPE dimensions based on a time-varying frequency ratio r(d).
SHARP is resolution-agnostic: a single configuration works for multiple target sizes by updating the promotion factor s only.
Provide empirical and analytical evidence showing static RoPE extrapolation harms RS due to its dense high-frequency content.
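
The idea of a time-dependent promotion strength can be illustrated with a minimal sketch. This is not the paper's formula: the summary does not give the exact form of k_rs(t), so the rational function below, the `decay_rate` parameter, and all function names are hypothetical. The sketch also assumes a flow-matching time convention in which t = 1 is pure noise and t = 0 is the clean image, and it rescales every RoPE dimension uniformly (position-interpolation style) rather than using SHARP's per-dimension ramp.

```python
def rational_decay(t, decay_rate=2.0):
    """Hypothetical rational schedule: 1.0 at t=1 (noise), 0.0 at t=0 (clean).

    decay_rate is an illustrative knob, not one of the paper's hyperparameters.
    """
    t = min(max(t, 0.0), 1.0)  # clamp to the valid time range
    return t / (t + decay_rate * (1.0 - t))


def promotion_strength(t, s, decay_rate=2.0):
    """Interpolate between no rescaling (1.0) and full promotion (s)."""
    return 1.0 + (s - 1.0) * rational_decay(t, decay_rate)


def rescaled_rope_freqs(dim, t, s, base=10000.0, decay_rate=2.0):
    """Standard RoPE inverse frequencies divided by the time-dependent factor.

    Uniform scaling across dimensions is an assumption made for brevity.
    """
    k = promotion_strength(t, s, decay_rate)
    return [base ** (-i / dim) / k for i in range(0, dim, 2)]


if __name__ == "__main__":
    s = 2.0  # promotion factor, e.g. 2048 / 1024 along one axis
    for t in (1.0, 0.5, 0.0):
        print(t, promotion_strength(t, s), rescaled_rope_freqs(8, t, s))
```

Under these assumptions the factor equals s at the start of denoising, when layout forms, and relaxes to 1 at the end, when fine detail is recovered. Changing the target size only changes `s`, which mirrors the resolution-agnostic claim above.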

Experimental Results

Research Questions

RQ1: Can a domain-specific diffusion prior for RS improve text-to-image generation quality at native resolutions and under large-scale extrapolation?
RQ2: Does a spectrum-aware, time-varying RoPE extrapolation strategy better preserve high-frequency RS details during diffusion-based generation than static methods?
RQ3: Is SHARP capable of robust multi-scale RS synthesis from a single hyperparameter set across diverse resolutions?
RQ4: What is the qualitative and quantitative impact of RS-specific fine-tuning combined with dynamic RoPE timing on RS realism and layout fidelity?

Key Findings

RS-FLUX (RS-specific fine-tuned prior) outperforms vanilla FLUX at native resolution (1024×1024) in CLIP, Aesthetic, and HPSv2 scores.
SHARP consistently outperforms training-free baselines across six resolutions in CLIP, Aesthetic, and HPSv2, with larger gains at higher extrapolation factors.
Ablation shows combining RS-FLUX with SHARP yields the best results, and SHARP alone provides substantial gains over the base model.
SHARP maintains multi-scale consistency: the same prompt generates coherent layouts across diverse resolutions from 1024×1024 to 3756×2560, with finer details emerging at higher resolutions.
SHARP adds negligible computational overhead (≤1.5% increase in inference time).
The scheduling form (rational decay) and hyperparameters (αs, α, β) are shown to be robust and near-optimal across ablations.


This review was generated by AI and checked by a human editor.