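[Paper Interpretation] Breaking Self-Attention Failure: Rethinking Query Initialization for Infrared Small Target Detection
SEF-DETR introduces frequency-guided patch screening, dynamic embedding enhancement, and reliability-consistency-aware fusion to address embedding dilution in DETR-based IRSTD, achieving state-of-the-art results on three IRSTD datasets.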
Infrared small target detection (IRSTD) faces significant challenges due to the low signal-to-noise ratio (SNR), small target size, and complex cluttered backgrounds. Although recent DETR-based detectors benefit from global context modeling, they exhibit notable performance degradation on IRSTD. We revisit this phenomenon and reveal that the target-relevant embeddings of infrared small targets are inevitably overwhelmed by dominant background features due to the self-attention mechanism, leading to unreliable query initialization and inaccurate target localization. To address this issue, we propose SEF-DETR, a novel framework that refines query initialization for IRSTD. Specifically, SEF-DETR consists of three components: Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF). The FPS module leverages the Fourier spectrum of local patches to construct a target-relevant density map, suppressing background-dominated features. DEE strengthens multi-scale representations in a target-aware manner, while RCF further refines object queries by enforcing spatial-frequency consistency and reliability. Extensive experiments on three public IRSTD datasets demonstrate that SEF-DETR achieves superior detection performance compared to state-of-the-art methods, delivering a robust and efficient solution for infrared small target detection.
Research Motivation and Objectives
- Motivate and analyze why self-attention dilutes target-relevant embeddings in infrared small target detection (IRSTD).
- Propose a DETR-based framework (SEF-DETR) that leverages frequency-domain priors to initialize and refine object queries.
- Show that FPS, DEE, and RCF components jointly improve detection of very small infrared targets across multiple datasets.
- Demonstrate state-of-the-art performance on IRSTD benchmarks and analyze model complexity.
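The dilution effect named in the first objective can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes a single query whose attention logits are higher for target tokens than for background tokens (the logit values 2.0 and 0.0 are arbitrary assumptions), and it measures how much softmax attention mass still lands on the target as the number of background tokens grows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def target_attention_mass(num_background, num_target=1,
                          target_logit=2.0, background_logit=0.0):
    """Fraction of one query's attention that lands on target tokens.

    The logits are illustrative assumptions, not values from the paper.
    """
    scores = [target_logit] * num_target + [background_logit] * num_background
    weights = softmax(scores)
    return sum(weights[:num_target])


# Even with a clearly higher logit, the target's share shrinks as the
# background token count grows.
for n in (10, 100, 1000, 10000):
    print(n, target_attention_mass(n))
```

With one target token among thousands of background tokens, the attended output is dominated by the background even though each individual background weight is small, which is the intuition behind initializing queries from screened, target-relevant regions.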
Proposed Method
- Introduce Frequency-guided Patch Screening (FPS) to build a target-relevant density map from patch Fourier spectra.
- Develop Dynamic Embedding Enhancement (DEE) to reinforce multi-scale embeddings guided by the target density map.
- Design Reliability-Consistency-aware Fusion (RCF) to select and refine object queries using spatial-frequency consistency and reliability.
- Integrate FPS, DEE, and RCF into a DETR-based architecture (SEF-DETR) with a Hungarian loss and a patch-frequency loss.
- Train with a combined objective: L = L_hungarian + lambda * L_freq (lambda=2).
- Evaluate on IRSTD-1k, NUAA-SIRST, and NUDT-SIRST, using precision, recall, and F1 (P, R, F1) for comparison with CNN-based methods and AI-TOD-style AP metrics for comparison with DETR-like methods.
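The idea of building a density map from patch Fourier spectra can be sketched as follows. This is a minimal illustration under stated assumptions, not the FPS module itself: the function name `patch_frequency_density`, the 8-pixel patch size, and the choice to score each patch by its spectral energy outside the DC term are all hypothetical. The summary above notes that the paper finds the full spectrum works best, so the DC exclusion here only serves to keep uniform brightness from dominating the toy score.

```python
import numpy as np


def patch_frequency_density(image, patch=8, low_cut=1):
    """Score non-overlapping patches by spectral energy outside the DC term.

    Hypothetical stand-in for a frequency-guided screening step; returns a
    (H // patch, W // patch) map normalized to [0, 1].
    """
    h, w = image.shape
    gh, gw = h // patch, w // patch
    img = image[:gh * patch, :gw * patch].astype(np.float64)
    # Split into a grid of patches with shape (gh, gw, patch, patch).
    patches = img.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    # Power spectrum of each patch, with the DC term shifted to the center.
    power = np.abs(np.fft.fft2(patches, axes=(-2, -1))) ** 2
    power = np.fft.fftshift(power, axes=(-2, -1))
    # Radial mask that drops frequencies closer than low_cut to DC.
    yy, xx = np.mgrid[:patch, :patch]
    radius = np.hypot(yy - patch // 2, xx - patch // 2)
    energy = (power * (radius >= low_cut)).sum(axis=(-2, -1))
    return energy / (energy.max() + 1e-12)


# Toy input: flat background with one small bright target.
image = np.full((32, 32), 0.5)
image[10:12, 10:12] = 1.0
density = patch_frequency_density(image)
peak = tuple(int(i) for i in np.unravel_index(density.argmax(), density.shape))
print(density.shape, peak)
```

On this toy image the only patch with non-constant content is the one containing the bright spot, so it receives the highest score. A real detector would have to separate targets from textured clutter, which is where the learned components described above come in.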
Experimental Results
Research Questions
- RQ1: Why does self-attention dilute target-relevant embeddings in DETR-based IRSTD models?
- RQ2: Can frequency-domain priors improve target-focused query initialization and mitigate background contamination in IRSTD?
- RQ3: Do FPS, DEE, and RCF provide complementary benefits to enhance detection of very tiny infrared targets?
- RQ4: How does SEF-DETR perform against state-of-the-art CNN-based and DETR-like IRSTD methods across standard benchmarks?
Key Findings
| Method | Type | P (IRSTD-1k) | R (IRSTD-1k) | F1 (IRSTD-1k) | P (NUAA-SIRST) | R (NUAA-SIRST) | F1 (NUAA-SIRST) | P (NUDT-SIRST) | R (NUDT-SIRST) | F1 (NUDT-SIRST) |
|---|---|---|---|---|---|---|---|---|---|---|
| SEF-DETR (Ours) | DETR-based | 92.4 | 85.9 | 89.0 | 94.8 | 97.3 | 96.1 | 100.0 | 96.3 | 98.1 |
| (Other CNN-based methods shown) | - | - | - | - | - | - | - | - | - | - |
- SEF-DETR achieves superior results on IRSTD-1k, NUAA-SIRST, and NUDT-SIRST compared with CNN-based methods (e.g., SEF-DETR: IRSTD-1k P=92.4, R=85.9, F1=89.0; NUAA-SIRST P=94.8, R=97.3, F1=96.1; NUDT-SIRST P=100.0, R=96.3, F1=98.1).
- Compared with DETR-like baselines, SEF-DETR shows strong improvements on AP metrics, particularly for very tiny targets (AP_vt).
- Ablation studies confirm that FPS, DEE, and RCF each contribute to performance gains, with their combination yielding the best results.
- Frequency bands from both high- and low-frequency components benefit performance; using the full spectrum provides the best results.
- A learnable threshold in DEE outperforms fixed thresholds, and combining reliability (R) and consistency (C) in RCF outperforms simple fusion.
- SEF-DETR introduces only a small increase in parameters and FLOPs (+0.27M params, +0.08G FLOPs) yet delivers substantial accuracy gains.
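As a sanity check on the reported numbers, F1 can be recomputed from the listed precision and recall. The sketch below assumes F1 is the usual harmonic mean of P and R. Since the inputs are already rounded to one decimal, small differences from the reported F1 are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) for SEF-DETR, copied from the findings above.
reported = {
    "IRSTD-1k": (92.4, 85.9, 89.0),
    "NUAA-SIRST": (94.8, 97.3, 96.1),
    "NUDT-SIRST": (100.0, 96.3, 98.1),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, (p, r, f1) in reported.items():
    print(f"{name}: reported F1={f1}, recomputed F1={recomputed[name]:.2f}")
```

All three recomputed values agree with the reported F1 to within 0.1, which is consistent with rounding of the underlying precision and recall.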
This interpretation was generated by AI and reviewed by human editors.