QUICK REVIEW

[Paper Review] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

Bingxuan Zhao, Qing Zhou | arXiv (Cornell University) | Mar 23, 2026
Generative Adversarial Networks and Image Synthesis | Cited by 0
One-Sentence Summary

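TL;DR: SHARP introduces a spectrum-aware, training-free dynamic RoPE schedule that promotes resolution in large-scale remote sensing (RS) text-to-image synthesis, applied on top of a diffusion prior fine-tuned for RS. It produces better multi-scale RS outputs while preserving both high-frequency detail and global layout.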

ABSTRACT

Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.

Motivation and Objectives

  • Motivate large-scale RS text-to-image generation and address the lack of a domain-specialized diffusion prior.
  • Develop a training-free resolution-promotion method that preserves high-frequency RS content during denoising.
  • Propose a spectrum-aware dynamic RoPE scheduler to align extrapolation strength with diffusion denoising stages.
  • Demonstrate robust multi-scale RS synthesis from a single hyperparameter set across square and rectangular resolutions.

Proposed Method

  • Fine-tune FLUX on a large RS image corpus to obtain a domain-specialized prior (RS-FLUX).
  • Introduce SHARP, a training-free method that dynamically adapts RoPE frequencies during denoising via a Rational Decay Scheduler (RDS) and a time-dependent frequency ramp.
  • RDS maps denoising time to a decay factor that governs promotion strength across steps.
  • Dynamic ramp function allocates promotion across RoPE dimensions based on a time-varying frequency ratio r(d).
  • SHARP is resolution-agnostic: a single configuration works for multiple target sizes by updating the promotion factor s only.
  • Provide empirical and analytical evidence showing static RoPE extrapolation harms RS due to its dense high-frequency content.
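The scheduling idea in the list above can be illustrated with a short Python sketch. This summary does not give the exact form of k_rs(t) or of the ramp, so everything below is an assumption made for illustration: the rational form t / (t + α(1 − t)), the convention that t = 1 is pure noise and t = 0 is the clean image, the YaRN-style linear ramp over the ratio r(d), and the names `rational_decay`, `promotion_factor`, `scaled_rope_freqs`, `lo`, and `hi` are all hypothetical and are not taken from the SHARP code base.

```python
import math


def rational_decay(t: float, alpha: float = 2.0) -> float:
    """Hypothetical rational decay, NOT the paper's exact k_rs(t).

    Maps denoising time t in [0, 1] (assumed: 1 = pure noise, 0 = clean
    image) to a strength in [0, 1] that is 1 at t = 1 and 0 at t = 0.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive")
    return t / (t + alpha * (1.0 - t))


def promotion_factor(t: float, s: float, alpha: float = 2.0) -> float:
    """Time-dependent scale: the full factor s early, relaxing to 1 late."""
    return 1.0 + (s - 1.0) * rational_decay(t, alpha)


def scaled_rope_freqs(t, s, dim=64, base=10000.0, train_len=64,
                      lo=1.0, hi=8.0, alpha=2.0):
    """Rescale per-dimension RoPE frequencies for one denoising step.

    Assumed YaRN-style ramp: r(d) counts how many full rotations a
    dimension completes over the training length. Dimensions with
    r >= hi keep their frequency, dimensions with r <= lo are divided by
    the current promotion factor, and the rest are blended linearly.
    """
    k = promotion_factor(t, s, alpha)
    freqs = []
    for i in range(dim // 2):
        theta = base ** (-2.0 * i / dim)           # standard RoPE frequency
        r = train_len * theta / (2.0 * math.pi)    # rotations over train_len
        gamma = min(1.0, max(0.0, (r - lo) / (hi - lo)))
        freqs.append(gamma * theta + (1.0 - gamma) * theta / k)
    return freqs


if __name__ == "__main__":
    # s = 2 corresponds to doubling the side length, e.g. 1024 -> 2048.
    for step_t in (1.0, 0.75, 0.5, 0.25, 0.0):
        print(step_t, promotion_factor(step_t, s=2.0))
```

Under these assumptions, only the promotion factor `s` changes when the target resolution changes, which mirrors the resolution-agnostic property described above. A static method would correspond to using a constant factor `s` at every step instead of `promotion_factor(t, s)`.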

Experimental Results

Research Questions

  • RQ1: Can a domain-specific diffusion prior for RS improve text-to-image generation quality at native resolutions and under large-scale extrapolation?
  • RQ2: Does a spectrum-aware, time-varying RoPE extrapolation strategy better preserve high-frequency RS details during diffusion-based generation than static methods?
  • RQ3: Is SHARP capable of robust multi-scale RS synthesis from a single hyperparameter set across diverse resolutions?
  • RQ4: What is the qualitative and quantitative impact of RS-specific fine-tuning combined with dynamic RoPE timing on RS realism and layout fidelity?

Key Findings

  • RS-FLUX (RS-specific fine-tuned prior) outperforms vanilla FLUX at native resolution (1024×1024) in CLIP, Aesthetic, and HPSv2 scores.
  • SHARP consistently outperforms training-free baselines across six resolutions in CLIP, Aesthetic, and HPSv2, with larger gains at higher extrapolation factors.
  • Ablation shows combining RS-FLUX with SHARP yields the best results, and SHARP alone provides substantial gains over the base model.
  • SHARP maintains multi-scale consistency: the same prompt generates coherent layouts across diverse resolutions from 1024×1024 to 3756×2560, with finer details emerging at higher resolutions.
  • SHARP adds negligible computational overhead (≤1.5% increase in inference time).
  • The scheduling form (rational decay) and hyperparameters (αs, α, β) are shown to be robust and near-optimal across ablations.


This review was generated by AI and reviewed by a human editor.