QUICK REVIEW

[Paper Review] Breaking Self-Attention Failure: Rethinking Query Initialization for Infrared Small Target Detection

Y. J. Liu, Duanni Meng|arXiv (Cornell University)|Jan 6, 2026
Infrared Target Detection Methodologies | Cited by 0
One-Sentence Summary

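SEF-DETR introduces frequency-guided patch screening, dynamic embedding enhancement, and reliability-consistency-aware fusion to address the embedding dilution problem in DETR-based IRSTD, achieving state-of-the-art results on three IRSTD datasets.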

ABSTRACT

Infrared small target detection (IRSTD) faces significant challenges due to the low signal-to-noise ratio (SNR), small target size, and complex cluttered backgrounds. Although recent DETR-based detectors benefit from global context modeling, they exhibit notable performance degradation on IRSTD. We revisit this phenomenon and reveal that the target-relevant embeddings of IRST are inevitably overwhelmed by dominant background features due to the self-attention mechanism, leading to unreliable query initialization and inaccurate target localization. To address this issue, we propose SEF-DETR, a novel framework that refines query initialization for IRSTD. Specifically, SEF-DETR consists of three components: Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF). The FPS module leverages the Fourier spectrum of local patches to construct a target-relevant density map, suppressing background-dominated features. DEE strengthens multi-scale representations in a target-aware manner, while RCF further refines object queries by enforcing spatial-frequency consistency and reliability. Extensive experiments on three public IRSTD datasets demonstrate that SEF-DETR achieves superior detection performance compared to state-of-the-art methods, delivering a robust and efficient solution for infrared small target detection task.

Research Motivation and Objectives

  • Motivate and analyze why self-attention dilutes target-relevant embeddings in infrared small target detection (IRSTD).
  • Propose a DETR-based framework (SEF-DETR) that leverages frequency-domain priors to initialize and refine object queries.
  • Show that FPS, DEE, and RCF components jointly improve detection of very small infrared targets across multiple datasets.
  • Demonstrate state-of-the-art performance on IRSTD benchmarks and analyze model complexity.
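
One way to picture the dilution effect named in the first objective is a toy single-head self-attention over one target token and many identical background tokens. This is a minimal sketch, not the paper's analysis: the 2-D embeddings, the identity query/key/value projections, and the names `self_attention` and `target_self_weight` are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Scaled dot-product self-attention with identity Q/K/V projections
    # (an assumption that keeps the toy example small).
    d = tokens.shape[1]
    weights = softmax(tokens @ tokens.T / np.sqrt(d), axis=-1)
    return weights @ tokens, weights

def target_self_weight(n_background, scale=2.0):
    # Hypothetical embeddings: one target token, n_background identical
    # background tokens pointing in an orthogonal direction.
    target = np.array([[scale, 0.0]])
    background = np.tile(np.array([[0.0, scale]]), (n_background, 1))
    tokens = np.vstack([target, background])
    _, weights = self_attention(tokens)
    # Attention mass the target query keeps on the target token itself.
    return float(weights[0, 0])

for n in (16, 256, 4096):
    print(n, target_self_weight(n))
```

In this toy setting the weight the target keeps on itself shrinks as the number of background tokens grows, so the updated target embedding drifts toward the background direction. For an infrared frame where the target covers only a few pixels, the ratio of background to target tokens is of this order.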

Proposed Method

  • Introduce Frequency-guided Patch Screening (FPS) to build a target-relevant density map from patch Fourier spectra.
  • Develop Dynamic Embedding Enhancement (DEE) to reinforce multi-scale embeddings guided by the target density map.
  • Design Reliability-Consistency-aware Fusion (RCF) to select and refine object queries using spatial-frequency consistency and reliability.
  • Integrate FPS, DEE, and RCF into a DETR-based architecture (SEF-DETR) with a Hungarian loss and a patch-frequency loss.
  • Train with a combined objective: L = L_hungarian + lambda * L_freq (lambda=2).
  • Evaluate on IRSTD-1k, NUAA-SIRST, and NUDT-SIRST, reporting precision (P), recall (R), and F1 for comparison with CNN-based methods and AI-TOD-style AP metrics for comparison with DETR-like detectors.
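
The FPS idea and the combined objective listed above can be sketched roughly as follows. This is a hand-crafted stand-in, not the authors' module: scoring each patch by its non-DC Fourier energy, the patch size of 8, and the function names `patch_density_map` and `total_loss` are assumptions for illustration. The summary does not say how the paper turns patch spectra into a density map, and it reports that the full spectrum works best there.

```python
import numpy as np

def patch_density_map(image, patch=8):
    """Score non-overlapping patches by non-DC spectral energy (heuristic)."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    # Split the cropped image into a (gh, gw, patch, patch) grid of patches.
    patches = (image[:gh * patch, :gw * patch]
               .reshape(gh, patch, gw, patch)
               .transpose(0, 2, 1, 3))
    # Power spectrum of every patch via a 2-D FFT over the last two axes.
    power = np.abs(np.fft.fft2(patches, axes=(-2, -1))) ** 2
    # Drop the DC term so flat background patches score close to zero.
    power[..., 0, 0] = 0.0
    energy = power.sum(axis=(-2, -1))
    peak = energy.max()
    # Normalize to [0, 1]; an all-flat image yields an all-zero map.
    return energy / peak if peak > 0 else np.zeros_like(energy)

def total_loss(l_hungarian, l_freq, lam=2.0):
    # Combined objective from the summary: L = L_hungarian + lambda * L_freq.
    return l_hungarian + lam * l_freq

# Synthetic frame: flat background with one 2x2 bright spot.
frame = np.full((32, 32), 0.2)
frame[17:19, 9:11] = 1.0
density = patch_density_map(frame)
print(np.unravel_index(density.argmax(), density.shape))
```

On this synthetic frame the only patch with non-trivial spectral energy is the one containing the bright spot, which is the kind of signal a target-relevant density map is meant to capture. Real infrared clutter (cloud edges, buildings) also carries high-frequency energy, which is presumably why the paper trains this stage with a dedicated patch-frequency loss instead of a fixed rule.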

Experimental Results

Research Questions

  • RQ1: Why does self-attention dilute target-relevant embeddings in DETR-based IRSTD models?
  • RQ2: Can frequency-domain priors improve target-focused query initialization and mitigate background contamination in IRSTD?
  • RQ3: Do FPS, DEE, and RCF provide complementary benefits to enhance detection of very tiny infrared targets?
  • RQ4: How does SEF-DETR perform against state-of-the-art CNN-based and DETR-like IRSTD methods across standard benchmarks?

Key Findings

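| Method | Type | P (IRSTD-1k) | R (IRSTD-1k) | F1 (IRSTD-1k) | P (NUAA-SIRST) | R (NUAA-SIRST) | F1 (NUAA-SIRST) | P (NUDT-SIRST) | R (NUDT-SIRST) | F1 (NUDT-SIRST) |
|---|---|---|---|---|---|---|---|---|---|---|
| SEF-DETR (Ours) | DETR-based | 92.4 | 85.9 | 89.0 | 94.8 | 97.3 | 96.1 | 100.0 | 96.3 | 98.1 |
| (Other CNN-based methods shown) | - | - | - | - | - | - | - | - | - | - |
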
  • SEF-DETR achieves superior results on IRSTD-1k, NUAA-SIRST, and NUDT-SIRST compared with CNN-based methods (e.g., SEF-DETR: IRSTD-1k P=92.4, R=85.9, F1=89.0; NUAA-SIRST P=94.8, R=97.3, F1=96.1; NUDT-SIRST P=100.0, R=96.3, F1=98.1).
  • Compared to DETR-like baselines, SEF-DETR shows strong improvements on AP metrics, particularly for very tiny targets (AP_vt).
  • Ablation studies confirm that FPS, DEE, and RCF each contribute to performance gains, with their combination yielding the best results.
  • Both high- and low-frequency bands contribute to performance; using the full spectrum gives the best results.
  • A learnable threshold in DEE outperforms fixed thresholds, and combining the reliability (R) and consistency (C) terms in RCF outperforms simple fusion.
  • SEF-DETR introduces only a small increase in parameters and FLOPs (+0.27M params, +0.08G FLOPs) yet delivers substantial accuracy gains.
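
As a quick sanity check on the reported numbers, F1 can be recomputed from the listed precision and recall using F1 = 2PR / (P + R). The sketch below does only that; the dictionary `reported` copies the values from the findings above, and the helper name `f1_score` is an assumption for illustration.

```python
# Reported (P, R, F1) values for SEF-DETR, in percent, copied from the summary.
reported = {
    "IRSTD-1k": (92.4, 85.9, 89.0),
    "NUAA-SIRST": (94.8, 97.3, 96.1),
    "NUDT-SIRST": (100.0, 96.3, 98.1),
}

def f1_score(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

recomputed = {
    name: round(f1_score(p, r), 1) for name, (p, r, _) in reported.items()
}

for name, (_, _, f1) in reported.items():
    print(f"{name}: reported F1={f1}, recomputed F1={recomputed[name]}")
```

The recomputed values agree with the reported F1 scores to within 0.1 on all three datasets. The small gap on NUAA-SIRST is consistent with P and R having been rounded before being listed.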

This review was generated by AI and reviewed by human editors.