QUICK REVIEW

[Paper Review] SHARP: Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion in Remote Sensing Synthesis

Bingxuan Zhao, Qing Zhou|arXiv (Cornell University)|2026. 03. 23.

Generative Adversarial Networks and Image Synthesis | Citations: 0

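One-Line Summary

SHARP introduces spectrum-aware, training-free dynamic RoPE scheduling, applied on top of an RS-specific fine-tuned diffusion prior, to promote resolution in large-scale remote sensing (RS) text-to-image synthesis. It improves the quality of multi-scale RS outputs while preserving high-frequency detail and global layout.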

ABSTRACT

Text-to-image generation powered by Diffusion Transformers (DiTs) has made remarkable strides, yet remote sensing (RS) synthesis lags behind due to two barriers: the absence of a domain-specialized DiT prior and the prohibitive cost of training at the large resolutions that RS applications demand. Training-free resolution promotion via Rotary Position Embedding (RoPE) rescaling offers a practical remedy, but every existing method applies a static positional scaling rule throughout the denoising process. This uniform compression is particularly harmful for RS imagery, whose substantially denser medium- and high-frequency energy encodes the fine structures critical for aerial-scene realism, such as vehicles, building contours, and road markings. Addressing both challenges requires a domain-specialized generative prior coupled with a denoising-aware positional adaptation strategy. To this end, we fine-tune FLUX on over 100,000 curated RS images to build a strong domain prior (RS-FLUX), and propose Spectrum-aware Highly-dynamic Adaptation for Resolution Promotion (SHARP), a training-free method that introduces a rational fractional time schedule k_rs(t) into RoPE. SHARP applies strong positional promotion during the early layout-formation stage and progressively relaxes it during detail recovery, aligning extrapolation strength with the frequency-progressive nature of diffusion denoising. Its resolution-agnostic formulation further enables robust multi-scale generation from a single set of hyperparameters. Extensive experiments across six square and rectangular resolutions show that SHARP consistently outperforms all training-free baselines on CLIP Score, Aesthetic Score, and HPSv2, with widening margins at more aggressive extrapolation factors and negligible computational overhead. Code and weights are available at https://github.com/bxuanz/SHARP.

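Motivation and Objectives

Motivate large-scale RS text-to-image generation and address the absence of a domain-specialized diffusion prior.
Develop a training-free resolution promotion method that preserves the high-frequency content of RS imagery during denoising.
Propose a spectrum-aware dynamic RoPE scheduler that aligns extrapolation strength with the stage of diffusion denoising.
Demonstrate robust multi-scale RS synthesis at square and rectangular resolutions with a single set of hyperparameters.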

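Proposed Method

Fine-tune FLUX on a large corpus of RS images to obtain a domain-specialized prior (RS-FLUX).
Introduce SHARP, a training-free method that dynamically adjusts the RoPE frequencies during denoising through a rational decay scheduler (RDS) and a time-dependent frequency ramp.
The RDS maps the denoising time to a positive decay coefficient that sets the promotion strength at each step.
A dynamic ramp function distributes the promotion across the RoPE dimensions according to a time-varying frequency ratio r(d).
SHARP is resolution-agnostic: a single configuration works across target sizes, and only the promotion factor s is updated when the target size changes.
Provide empirical and analytical evidence that static RoPE extrapolation is harmful for RS imagery because of its dense high-frequency content.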
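
The schedule-plus-ramp idea above can be illustrated with a small sketch. This review does not give the actual form of k_rs(t) or r(d), so everything below is an assumption made for illustration: the rational function, the linear ramp, the rule for computing s, and all function and parameter names (`promotion_factor`, `rational_decay`, `ramp`, `shape`, `lo`, `hi`) are hypothetical and are not taken from the paper or its code. The time convention (t = 1 is pure noise, t = 0 is the clean image) is also assumed.

```python
# Illustrative sketch only: NOT the paper's formulas or hyperparameters.

def promotion_factor(height, width, base=1024):
    """Hypothetical rule: promotion factor s is the longer side over the native size."""
    return max(1.0, max(height, width) / base)

def rational_decay(t, shape=4.0):
    """Hypothetical rational schedule on t in [0, 1].

    Returns 1.0 at t = 1 (early, layout formation) and 0.0 at t = 0
    (late, detail recovery), increasing monotonically in t.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return (1.0 + shape) * t / (1.0 + shape * t)

def rope_frequencies(dim, theta=10000.0):
    """Standard RoPE base frequencies for an even head dimension."""
    return [theta ** (-2.0 * i / dim) for i in range(dim // 2)]

def ramp(i, half, lo=0.25, hi=0.75):
    """Hypothetical linear ramp: 0 for high-frequency dims, 1 for low-frequency dims."""
    x = i / (half - 1)  # requires dim >= 4
    return min(1.0, max(0.0, (x - lo) / (hi - lo)))

def scaled_frequencies(dim, t, s, shape=4.0):
    """Time-dependent RoPE frequencies: strong rescaling early, none at t = 0."""
    k = 1.0 + (s - 1.0) * rational_decay(t, shape)  # effective scale at time t
    half = dim // 2
    return [f * (1.0 - ramp(i, half) * (1.0 - 1.0 / k))
            for i, f in enumerate(rope_frequencies(dim))]

if __name__ == "__main__":
    s = promotion_factor(2048, 2048)
    for t in (1.0, 0.5, 0.0):
        lowest = scaled_frequencies(64, t, s)[-1]
        print(f"t={t:.1f}  lowest-frequency dim -> {lowest:.3e}")
```

In this sketch the high-frequency dimensions are left untouched at every step, while the low-frequency dimensions are compressed by up to s early in denoising and return to their original values as t approaches 0. Changing the target size only changes s, which mirrors the resolution-agnostic property described above.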

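Experimental Results

Research Questions

RQ1: Can a domain-specialized diffusion prior for RS improve text-to-image generation quality both at the native resolution and under large-scale extrapolation?
RQ2: Does a spectrum-aware, time-varying RoPE extrapolation strategy preserve the high-frequency detail of RS imagery during diffusion-based generation better than static methods?
RQ3: Can SHARP perform robust multi-scale RS synthesis across diverse resolutions with a single set of hyperparameters?
RQ4: What are the qualitative and quantitative effects of combining RS-specific fine-tuning with dynamic RoPE timing on RS realism and layout fidelity?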

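Key Findings

RS-FLUX (the RS-specific fine-tuned prior) outperforms vanilla FLUX on CLIP Score, Aesthetic Score, and HPSv2 at the native resolution (1024×1024).
SHARP consistently outperforms the training-free baselines across six resolutions on CLIP Score, Aesthetic Score, and HPSv2, with larger gains at larger extrapolation factors.
Ablations show that combining RS-FLUX with SHARP gives the best results, and that SHARP alone still yields substantial gains over the base model.
SHARP maintains multi-scale consistency: for the same prompt it produces coherent layouts at resolutions ranging from 1024×1024 to 3756×2560, and details become sharper as the resolution increases.
SHARP adds almost no computational overhead (inference time increases by at most 1.5%).
Ablations indicate that the schedule form (rational decay) and the hyperparameters (αs, α, β) are robust and near-optimal.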

This review was generated by AI and reviewed by a human editor.