QUICK REVIEW

[Paper Review] ThunderNet: Towards Real-time Generic Object Detection

Zheng Qin, Zeming Li|arXiv (Cornell University)|Mar 28, 2019

Advanced Neural Network Applications|References: 31|Citations: 43

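In a nutshell

ThunderNet is a lightweight two-stage detector for real-time generic object detection on mobile devices. It pairs a custom lightweight backbone (SNet) with an efficient detection head that includes a Context Enhancement Module and a Spatial Attention Module, reaching real-time speed on ARM and competitive accuracy at low FLOPs.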

ABSTRACT

Real-time generic object detection on mobile platforms is a crucial but challenging computer vision task. However, previous CNN-based detectors suffer from enormous computational cost, which hinders them from real-time inference in computation-constrained scenarios. In this paper, we investigate the effectiveness of two-stage detectors in real-time generic detection and propose a lightweight two-stage detector named ThunderNet. In the backbone part, we analyze the drawbacks in previous lightweight backbones and present a lightweight backbone designed for object detection. In the detection part, we exploit an extremely efficient RPN and detection head design. To generate more discriminative feature representation, we design two efficient architecture blocks, Context Enhancement Module and Spatial Attention Module. At last, we investigate the balance between the input resolution, the backbone, and the detection head. Compared with lightweight one-stage detectors, ThunderNet achieves superior performance with only 40% of the computational cost on PASCAL VOC and COCO benchmarks. Without bells and whistles, our model runs at 24.1 fps on an ARM-based device. To the best of our knowledge, this is the first real-time detector reported on ARM platforms. Our code and models are available at https://github.com/qinzheng93/ThunderNet.

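Motivation and objectives

Investigate whether a two-stage detector can run in real time on mobile devices.
Design a lightweight backbone built specifically for object detection, rather than transferring one from image classification.
Develop efficient detection head components that balance accuracy against computational cost.
Balance input resolution, backbone capacity, and detection head design in pursuit of the best real-time performance.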

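Proposed method

Propose SNet, a lightweight backbone that modifies ShuffleNetV2 with 5×5 depthwise convolutions to enlarge the receptive field.
Compress the RPN and RoI head components to cut computation while preserving accuracy (e.g., 5×5 depthwise and 1×1 convolutions in the RPN, and a smaller R-CNN fc layer).
Introduce the Context Enhancement Module (CEM), which fuses multi-scale local and global context through 1×1 projections followed by upsampling or broadcasting.
Introduce the Spatial Attention Module (SAM), which reweights the CEM features using the foreground signal from the RPN.
Study the balance among input resolution, backbone, and detection head to maximize speed and accuracy on mobile hardware.
Train end-to-end with synchronized SGD, multi-scale training, Cross-GPU Batch Normalization, and Soft-NMS.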
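
The CEM and SAM steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the tiny channel counts, the nearest-neighbour upsampling, and the sigmoid gate without normalization are all assumptions chosen to keep the example short. It only shows the data flow the review describes, namely 1×1 projections of a fine map, a coarse map, and a globally pooled vector that are summed, followed by a per-position reweighting driven by RPN features.

```python
import numpy as np

def conv1x1(x, w):
    """Pointwise (1x1) convolution: x is (C_in, H, W), w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

def upsample2x(x):
    """Nearest-neighbour 2x upsampling over the two spatial axes (assumed mode)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def cem(c4, c5, w4, w5, wg):
    """Hypothetical CEM: sum of local, upsampled coarse, and global context."""
    local = conv1x1(c4, w4)                                   # (C, H, W)
    coarse = upsample2x(conv1x1(c5, w5))                      # (C, H, W)
    pooled = c5.mean(axis=(1, 2), keepdims=True)              # (C5, 1, 1)
    glb = conv1x1(pooled, wg)                                 # (C, 1, 1)
    return local + coarse + glb                               # glb is broadcast

def sam(f_cem, f_rpn, w_theta):
    """Hypothetical SAM: gate CEM features with a sigmoid of projected RPN features."""
    gate = 1.0 / (1.0 + np.exp(-conv1x1(f_rpn, w_theta)))     # values in (0, 1)
    return f_cem * gate

# Toy shapes (illustrative only): fine map 4x4, coarse map 2x2.
rng = np.random.default_rng(0)
c4 = rng.normal(size=(8, 4, 4))
c5 = rng.normal(size=(16, 2, 2))
f_rpn = rng.normal(size=(5, 4, 4))
w4, w5, wg = (rng.normal(size=s) for s in [(6, 8), (6, 16), (6, 16)])
w_theta = rng.normal(size=(6, 5))

f_cem = cem(c4, c5, w4, w5, wg)
f_sam = sam(f_cem, f_rpn, w_theta)
print(f_cem.shape, f_sam.shape)
```

Because the gate lies strictly between 0 and 1, the SAM output can only attenuate CEM responses, which matches the idea of suppressing background positions while leaving foreground ones mostly intact.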

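Experimental results

Research questions

RQ1: Can a two-stage detector outperform lightweight one-stage detectors in both speed and accuracy on mobile hardware?
RQ2: Which backbone and detection head design choices give the best accuracy-efficiency trade-off for real-time mobile detection?
RQ3: How do context and spatial attention mechanisms affect feature representation and localization?
RQ4: What is the best balance among input resolution, backbone capacity, and detection head complexity on ARM platforms?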

Key findings

Model	Backbone	Input	MFLOPs	AP	AP50	AP75
ThunderNet (ours)	SNet49	320×320	262	19.2	33.7	19.7
ThunderNet (ours)	SNet146	320×320	473	23.7	40.3	24.6
ThunderNet (ours)	SNet535	320×320	1300	28.1	46.2	29.6

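ThunderNet with SNet49 reaches accuracy comparable to MobileNet-SSD at roughly 22% of the FLOPs.
ThunderNet with SNet146 surpasses prior lightweight detectors at roughly 40% of the FLOPs.
ThunderNet with SNet535 is competitive with large detectors at a small fraction of their FLOPs.
On COCO test-dev, ThunderNet with SNet146 achieves AP 23.7, AP50 40.3, and AP75 24.6; with SNet535 it achieves AP 28.1, AP50 46.2, and AP75 29.6.
On ARM, ThunderNet runs at 24.1 fps with SNet49 and 13.8 fps with SNet146; on GPU, every variant exceeds 200 fps.
At comparable FLOPs, a large backbone with a small head outperforms a small backbone with a large head, which underscores the importance of matching the backbone to the head.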
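
To make the accuracy-versus-compute trade-off in the table above concrete, the short script below computes how much AP each larger backbone buys per extra 100 MFLOPs. The numbers are copied from the table; the helper name `marginal_gains` and the "AP per 100 MFLOPs" metric are illustrative choices of this sketch, not something reported in the paper.

```python
# Rows copied from the results table: (backbone, MFLOPs, AP).
rows = [("SNet49", 262, 19.2), ("SNet146", 473, 23.7), ("SNet535", 1300, 28.1)]

def marginal_gains(rows):
    """AP gained per extra 100 MFLOPs between consecutive backbones."""
    out = []
    for (b0, f0, ap0), (b1, f1, ap1) in zip(rows, rows[1:]):
        gain = (ap1 - ap0) / (f1 - f0) * 100
        out.append((b0, b1, round(gain, 2)))
    return out

gains = marginal_gains(rows)
for small, large, g in gains:
    print(f"{small} -> {large}: {g} AP per 100 MFLOPs")
```

Moving from SNet49 to SNet146 yields about 2.1 AP per extra 100 MFLOPs, while moving from SNet146 to SNet535 yields about 0.5, so the returns from scaling the backbone diminish quickly at this input resolution.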

This review was generated by AI and checked by a human editor.