QUICK REVIEW

[Paper Review] Cross-Layer Feature Pyramid Transformer for Small Object Detection in Aerial Images

Zewen Du, Zhenjiang Hu|arXiv (Cornell University)|Jul 29, 2024

Infrared Target Detection Methodologies | Cited by 7

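One-line summary

Introduces the Cross-Layer Feature Pyramid Transformer (CFPT), an upsampler-free feature pyramid network for small object detection. It combines Cross-layer Channel-wise Attention (CCA) and Cross-layer Spatial-wise Attention (CSA) with Cross-layer Consistent Relative Positional Encoding (CCPE), bringing small-object detection on VisDrone2019-DET and TinyPerson close to state-of-the-art levels while keeping computational cost low.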

ABSTRACT

Object detection in aerial images has always been a challenging task due to the generally small size of the objects. Most current detectors prioritize the development of new detection frameworks, often overlooking research on fundamental components such as feature pyramid networks. In this paper, we introduce the Cross-Layer Feature Pyramid Transformer (CFPT), a novel upsampler-free feature pyramid network designed specifically for small object detection in aerial images. CFPT incorporates two meticulously designed attention blocks with linear computational complexity: Cross-Layer Channel-Wise Attention (CCA) and Cross-Layer Spatial-Wise Attention (CSA). CCA achieves cross-layer interaction by dividing channel-wise token groups to perceive cross-layer global information along the spatial dimension, while CSA enables cross-layer interaction by dividing spatial-wise token groups to perceive cross-layer global information along the channel dimension. By integrating these modules, CFPT enables efficient cross-layer interaction in a single step, thereby avoiding the semantic gap and information loss associated with element-wise summation and layer-by-layer transmission. In addition, CFPT incorporates global contextual information, which improves detection performance for small objects. To further enhance location awareness during cross-layer interaction, we propose the Cross-Layer Consistent Relative Positional Encoding (CCPE) based on inter-layer mutual receptive fields. We evaluate the effectiveness of CFPT on three challenging object detection datasets in aerial images: VisDrone2019-DET, TinyPerson, and xView. Extensive experiments demonstrate that CFPT outperforms state-of-the-art feature pyramid networks while incurring lower computational costs. The code is available at https://github.com/duzw9311/CFPT.

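Motivation and objectives

Address the challenge that objects in aerial images are densely packed and vary widely in scale.
Develop a feature pyramid network that enables cross-layer interaction without upsampling, reducing computational cost.
Exploit global context and cross-layer neighborhood interaction to preserve information across scales.
Introduce a position-aware mechanism (CCPE) to improve localization accuracy.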

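Proposed method

Proposes CFPT, which contains two attention blocks: CCA and CSA.
CCA aligns multi-scale features through Channel Reconstruction and Overlapped Channel-wise Patch Partition, then strengthens cross-layer interaction along the channel dimension through cross-layer multi-head attention followed by a Reverse Overlapped Patch operation.
CSA applies Overlapped Spatial Patch Partition and cross-layer multi-head attention along the spatial dimension, with no channel alignment required, and then applies Reverse Overlapped Spatial Patch Partition.
Introduces Cross-layer Consistent Relative Positional Encoding (CCPE), which uses a learnable codebook to inject cross-layer relative position information into the attention maps.
CFPT operates without feature upsampling, transfers cross-layer information in a single step, and preserves the shallow features that are rich in small-object information.
Provides a complexity analysis showing that the cost of the proposed blocks scales linearly with input resolution.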
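The linear-scaling point above can be illustrated with a small counting sketch. It is not the authors' implementation: it assumes non-overlapping groups (the paper uses overlapped partitions), a pyramid in which each finer level doubles the resolution, and hypothetical helper names (`group_token_coords`, `grouped_attention_pairs`, `global_attention_pairs`). Each group collects the tokens from every level that cover the same region as one coarse-level position, and attention is computed only inside a group.

```python
def group_token_coords(row, col, num_levels):
    """Tokens (level, r, c) covering the same region as coarse position (row, col).

    Sketch convention: level 0 is the coarsest map; level i has 2**i times
    the resolution, so it contributes a 2**i x 2**i patch to the group.
    """
    coords = []
    for level in range(num_levels):
        scale = 2 ** level
        for r in range(row * scale, (row + 1) * scale):
            for c in range(col * scale, (col + 1) * scale):
                coords.append((level, r, c))
    return coords


def grouped_attention_pairs(coarse_h, coarse_w, num_levels):
    """Query-key pairs when attention is restricted to each cross-layer group."""
    group_size = len(group_token_coords(0, 0, num_levels))
    return coarse_h * coarse_w * group_size ** 2


def global_attention_pairs(coarse_h, coarse_w, num_levels):
    """Query-key pairs if every token attended to every token on all levels."""
    group_size = len(group_token_coords(0, 0, num_levels))
    total_tokens = coarse_h * coarse_w * group_size
    return total_tokens ** 2


# Doubling the side length quadruples the number of coarse positions.
for side in (10, 20, 40):
    print(side,
          grouped_attention_pairs(side, side, 3),
          global_attention_pairs(side, side, 3))
```

Under these assumptions the grouped count grows in proportion to the number of positions, while the global count grows with its square, which is the intuition behind attending within local cross-layer groups rather than across all tokens.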

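Experimental results

Research questions

RQ1: How can cross-layer interaction be achieved efficiently, without upsampling, to improve small object detection in aerial images?
RQ2: Can a transformer neck equipped with cross-layer attention blocks and positional encoding outperform conventional FPNs on aerial datasets?
RQ3: Does explicitly modeling cross-layer relative position information improve localization accuracy for small objects?

Key findings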

Method	Backbone	AP (%)	AP0.5 (%)	AP0.75 (%)	AP-small (%)	AP-medium (%)	AP-large (%)	Params (M)	FLOPs (G)
RetinaNet	ResNet-18	15.8	28.4	15.6	7.5	24.4	33.2	18.0	151.1
FPN	ResNet-18	18.1	32.7	17.8	9.2	28.7	33.6	19.8	164.0
PAFPN	ResNet-18	18.2	32.5	18.2	8.9	28.9	36.3	22.2	170.1
AugFPN	ResNet-18	18.6	33.2	18.4	9.0	29.5	37.3	20.4	164.2
DRFPN	ResNet-18	18.9	33.4	18.8	9.1	30.2	38.5	24.6	176.0
FPG	ResNet-18	18.6	33.2	18.4	9.5	29.6	36.0	58.1	290.5
FPT	ResNet-18	17.5	30.7	17.5	8.3	28.0	37.9	40.1	275.2
RCFPN	ResNet-18	18.3	32.4	18.1	8.6	29.3	36.3	23.0	157.5
SSFPN	ResNet-18	19.2	33.7	19.1	10.0	31.2	35.8	24.3	221.4
AFPN	ResNet-18	16.5	30.0	16.5	8.2	26.0	32.3	17.9	153.2
CFPT (ours)	ResNet-18	20.0	35.3	20.0	10.1	31.7	37.2	20.8	165.9
RetinaNet	ResNet-50	18.1	31.1	18.3	8.8	28.5	38.0	34.5	203.7
FPN	ResNet-50	21.0	36.4	21.4	10.9	34.3	40.1	36.3	216.6
PAFPN	ResNet-50	21.2	36.5	21.6	10.9	34.6	41.1	38.7	222.7
AugFPN	ResNet-50	21.7	37.1	22.2	11.1	35.4	40.4	38.1	216.8
DRFPN	ResNet-50	21.5	36.7	22.0	11.0	35.3	39.5	41.1	228.5
FPG	ResNet-50	21.7	37.3	22.2	11.5	35.2	38.7	71.0	346.1
FPT	ResNet-50	19.3	33.3	19.2	9.4	30.0	38.9	56.6	331.8
RCFPN	ResNet-50	21.0	36.0	21.3	10.5	34.8	38.1	36.0	209.2
SSFPN	ResNet-50	21.7	37.3	22.2	11.5	35.3	39.8	40.8	274.0
AFPN	ResNet-50	20.7	36.0	21.2	10.7	33.4	36.9	58.0	250.0
CFPT (ours)	ResNet-50	22.2	38.0	22.4	11.9	35.2	41.7	37.3	218.5
RetinaNet	ResNet-101	18.0	31.0	18.3	8.8	28.5	38.0	53.5	282.8
FPN	ResNet-101	21.6	37.3	21.8	11.2	34.9	41.9	55.3	295.7
PAFPN	ResNet-101	21.9	37.4	22.2	11.6	35.4	42.5	57.6	301.8
AugFPN	ResNet-101	22.0	37.8	22.4	11.3	36.0	43.2	57.1	296.0
DRFPN	ResNet-101	22.0	37.8	22.4	11.5	36.0	41.1	60.1	307.7
FPG	ResNet-101	22.0	37.9	22.4	11.5	35.7	38.7	90.0	431.3
FPT	ResNet-101	19.5	33.5	19.4	9.4	30.5	39.8	75.6	417.0
RCFPN	ResNet-101	21.4	36.8	21.7	11.1	35.2	40.0	55.0	288.3
SSFPN	ResNet-101	22.2	38.3	22.6	11.9	35.8	43.3	59.8	353.1
AFPN	ResNet-101	21.0	36.7	21.6	11.2	33.7	36.7	77.0	329.1
CFPT (ours)	ResNet-101	22.6	38.4	23.1	12.1	36.2	43.8	56.3	297.6

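On VisDrone2019-DET, CFPT outperforms the compared state-of-the-art FPN variants across all three backbones (ResNet-18/50/101), improving both overall AP and small-object AP.
With ResNet-18 on VisDrone2019-DET, CFPT reaches AP 20.0 and AP-small 10.1, the highest among the compared methods, with a competitive parameter count (20.8M) and FLOPs (165.9G).
With ResNet-50, CFPT reaches AP 22.2 and AP-small 11.9, again ahead of the compared methods (Params 37.3M, FLOPs 218.5G).
With ResNet-101, CFPT reaches AP 22.6 (AP0.5 38.4, AP0.75 23.1, AP-small 12.1) and AP-large 43.8 (Params 56.3M, FLOPs 297.6G).
CFPT keeps AP-small consistently high and stays competitive across object scales, while reducing the risk of the semantic gap and information loss inherent in layer-by-layer or upsampling-based necks.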
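To make the accuracy-versus-cost trade-off concrete, the following sketch computes the difference between CFPT and the plain FPN baseline for each backbone. The numbers are copied from the table above; the dictionary layout and the `deltas` helper are illustrative names, not part of the paper's code.

```python
# (AP, AP-small, params in M, FLOPs in G), copied from the results table.
RESULTS = {
    "ResNet-18": {"FPN": (18.1, 9.2, 19.8, 164.0), "CFPT": (20.0, 10.1, 20.8, 165.9)},
    "ResNet-50": {"FPN": (21.0, 10.9, 36.3, 216.6), "CFPT": (22.2, 11.9, 37.3, 218.5)},
    "ResNet-101": {"FPN": (21.6, 11.2, 55.3, 295.7), "CFPT": (22.6, 12.1, 56.3, 297.6)},
}


def deltas(backbone):
    """CFPT minus FPN for each metric, rounded to the table's one decimal place."""
    fpn = RESULTS[backbone]["FPN"]
    cfpt = RESULTS[backbone]["CFPT"]
    return tuple(round(c - f, 1) for f, c in zip(fpn, cfpt))


for backbone in RESULTS:
    d_ap, d_small, d_params, d_flops = deltas(backbone)
    print(f"{backbone}: AP +{d_ap}, AP-small +{d_small}, "
          f"params +{d_params}M, FLOPs +{d_flops}G")
```

In this table, CFPT adds about 1.0M parameters and 1.9 GFLOPs over FPN for every backbone, in exchange for 1.0 to 1.9 points of AP and roughly 1 point of AP-small.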

This review was generated by AI and checked by a human editor.