QUICK REVIEW

[Paper Review] TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Xingkui Zhu, Shuchang Lyu|arXiv (Cornell University)|Aug 26, 2021

Advanced Neural Network Applications | References: 56 | Cited by: 122

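One-Sentence Summary

TPH-YOLOv5 extends YOLOv5 with an extra prediction head for tiny objects, Transformer Prediction Heads, and CBAM, plus data augmentation and ensembling tricks, reaching 39.18% AP on the VisDrone2021 DET-test-challenge set, 1.81% above the previous SOTA (DPNetV3).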

ABSTRACT

Object detection on drone-captured scenarios is a recent popular task. As drones always navigate in different altitudes, the object scale varies violently, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight bring in the motion blur on the densely packed objects, which leads to great challenge of object distinction. To solve the two issues mentioned above, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect different-scale objects. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with self-attention mechanism. We also integrate convolutional block attention model (CBAM) to find attention region on scenarios with dense objects. To achieve more improvement of our proposed TPH-YOLOv5, we provide bags of useful strategies such as data augmentation, multiscale testing, multi-model integration and utilizing extra classifier. Extensive experiments on dataset VisDrone2021 show that TPH-YOLOv5 have good performance with impressive interpretability on drone-captured scenarios. On DET-test-challenge dataset, the AP result of TPH-YOLOv5 are 39.18%, which is better than previous SOTA method (DPNetV3) by 1.81%. On VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and achieves well-matched results with 1st place model (AP 39.43%). Compared to baseline model (YOLOv5), TPH-YOLOv5 improves about 7%, which is encouraging and competitive.

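Motivation and Objectives

Address the challenges of object detection in drone-captured imagery: extreme scale variation, high object density, and wide scene coverage.
Add to YOLOv5 a head dedicated to tiny objects and Transformer-based prediction heads, to improve localization and handle dense scenes.
Combine attention mechanisms with training and inference tricks to improve accuracy and robustness on drone datasets.

Proposed Method

Add a fourth prediction head to YOLOv5, dedicated to very small objects, to cope with extreme scale variance.
Replace the original prediction heads with Transformer Prediction Heads (TPH), using self-attention for better localization in crowded scenes.
Integrate the Convolutional Block Attention Module (CBAM) to focus on regions of interest in dense, cluttered scenes.
Apply a set of tricks, including data augmentation (MixUp, Mosaic), multi-scale testing, and model ensembling, to raise accuracy.
Run a self-trained ResNet18 classifier on cropped object patches to correct misclassified or easily confused categories and refine the final results.
Perform multi-scale testing by scaling and flipping the input, and fuse the resulting predictions with Weighted Boxes Fusion (WBF) when ensembling.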
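The fusion step in the last item can be illustrated with a minimal sketch of Weighted Boxes Fusion for a single class. This is a simplified illustration, not the authors' implementation: the function names (`iou`, `unflip_horizontal`, `fuse_boxes`), the normalized `(x1, y1, x2, y2)` box layout, and the IoU threshold of 0.55 are assumptions made for the example. Each "view" stands for one model or one test-time variant (for instance a flipped input whose boxes have been mapped back to the original frame).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]
Detection = Tuple[Box, float]            # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def unflip_horizontal(box: Box) -> Box:
    """Map a box predicted on a left-right flipped image back to the original frame."""
    x1, y1, x2, y2 = box
    return (1.0 - x2, y1, 1.0 - x1, y2)


def fuse_boxes(views: List[List[Detection]], iou_thr: float = 0.55) -> List[Detection]:
    """Simplified single-class WBF over several views (assumed threshold, not from the paper)."""
    n_views = len(views)
    # Pool all detections and visit them from most to least confident.
    pooled = sorted((d for view in views for d in view), key=lambda d: d[1], reverse=True)
    clusters: List[List[Detection]] = []  # detections grouped per object
    fused: List[Box] = []                 # current fused box of each cluster
    for box, score in pooled:
        # Find the cluster whose fused box overlaps this detection the most.
        best, best_iou = -1, iou_thr
        for i, fused_box in enumerate(fused):
            overlap = iou(box, fused_box)
            if overlap > best_iou:
                best, best_iou = i, overlap
        if best < 0:
            clusters.append([(box, score)])
            fused.append(box)
        else:
            clusters[best].append((box, score))
            total = sum(s for _, s in clusters[best])
            # Fused coordinates are the confidence-weighted mean of the members.
            fused[best] = tuple(
                sum(b[k] * s for b, s in clusters[best]) / total for k in range(4)
            )
    results = []
    for members, fused_box in zip(clusters, fused):
        mean_score = sum(s for _, s in members) / len(members)
        # Down-weight boxes that only some of the views agree on.
        results.append((fused_box, mean_score * min(len(members), n_views) / n_views))
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    original_view = [((0.10, 0.10, 0.30, 0.30), 0.9), ((0.60, 0.60, 0.80, 0.80), 0.5)]
    flipped_view = [(unflip_horizontal((0.68, 0.12, 0.88, 0.32)), 0.7)]
    for fused_box, confidence in fuse_boxes([original_view, flipped_view]):
        print(fused_box, confidence)
```

Unlike non-maximum suppression, which keeps one box per group and discards the rest, this scheme averages overlapping boxes, so every view contributes to the final coordinates. A box found by only one of the two views keeps its position but has its confidence halved.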

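Experimental Results

Research Questions

RQ1: How do Transformer-based prediction heads improve object localization in drone-captured images across different scales?
RQ2: In dense, cluttered drone scenes, how do the added tiny-object head and CBAM affect detection performance?
RQ3: Do data augmentation, multi-scale testing, and model ensembling significantly improve performance on VisDrone2021, and by how much?
RQ4: Does a self-trained classifier on cropped patches improve classification accuracy for confusable categories?

Key Findings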

Method	mAP (%)	AP50 (%)
RetinaNet	11.81	21.37
RefineDet	14.90	28.76
DetNet59	15.26	29.23
Cascade-RCNN	16.09	31.91
FPN	16.51	32.20
Light-RCNN	16.53	32.78
CornerNet	17.41	34.12
RRNet (2019)	29.13	55.82
DPNet-ensemble (2019)	29.62	54.00
SMPNet (2020)	35.88	59.53
DPNetV3 (2020)	37.37	62.05
TPH-YOLOv5 ensemble	39.18	N/A

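On the VisDrone2021 DET test set, TPH-YOLOv5 achieves higher mAP than the YOLOv5 baseline and the earlier ablation variants (about 7% over the baseline, according to the abstract).
Adding the tiny-object head (P2) yields a clear AP gain, at the cost of higher GFLOPs.
The Transformer encoder blocks raise mAP while reducing network size and GFLOPs, which helps with dense-object detection.
A model ensemble combined with multi-scale testing and WBF reaches higher mAP than any single model.
The self-trained classifier adds roughly 0.8–1.0% AP to the final result.
On the VisDrone2021 test challenge, the TPH-YOLOv5 ensemble reaches 39.18% AP, ahead of the previous SOTA DPNetV3 by 1.81% (Table 1).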
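The margins quoted above can be checked directly from the reported AP values. The short sketch below does only that arithmetic; the dictionary keys and the `margin` helper are illustrative names, and the first-place AP of 39.43% comes from the abstract. Note that the 1.81% figure is an absolute difference in AP points, not a relative improvement.

```python
# AP values (%) as reported in the table and the abstract.
ap = {
    "TPH-YOLOv5 ensemble": 39.18,
    "DPNetV3 (2020)": 37.37,
    "VisDrone 2021 1st place": 39.43,
}


def margin(a: str, b: str) -> float:
    """Absolute AP difference (percentage points) between two entries."""
    return round(ap[a] - ap[b], 2)


gain_over_prev_sota = margin("TPH-YOLOv5 ensemble", "DPNetV3 (2020)")
gap_to_first = margin("VisDrone 2021 1st place", "TPH-YOLOv5 ensemble")
# Relative gain over DPNetV3, for comparison with the absolute figure.
relative_gain = round(100 * (ap["TPH-YOLOv5 ensemble"] - ap["DPNetV3 (2020)"]) / ap["DPNetV3 (2020)"], 2)

print(gain_over_prev_sota)  # 1.81
print(gap_to_first)         # 0.25
print(relative_gain)        # 4.84
```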


This review was generated by AI and reviewed by a human editor.