QUICK REVIEW

[Paper Review] TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios

Xingkui Zhu, Shuchang Lyu | arXiv (Cornell University) | 2021-08-26

Advanced Neural Network Applications | References: 56 | Citations: 122

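One-Line Summary

TPH-YOLOv5 augments YOLOv5 with an additional small-object prediction head, Transformer Prediction Heads (TPH), and CBAM, and applies data augmentation and ensemble tricks to reach 39.18% AP on the VisDrone2021 test-challenge set, 1.81 points above the previous SOTA (DPNetV3).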

ABSTRACT

Object detection on drone-captured scenarios is a recent popular task. As drones always navigate in different altitudes, the object scale varies violently, which burdens the optimization of networks. Moreover, high-speed and low-altitude flight bring in the motion blur on the densely packed objects, which leads to great challenge of object distinction. To solve the two issues mentioned above, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect different-scale objects. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential with self-attention mechanism. We also integrate convolutional block attention model (CBAM) to find attention region on scenarios with dense objects. To achieve more improvement of our proposed TPH-YOLOv5, we provide bags of useful strategies such as data augmentation, multiscale testing, multi-model integration and utilizing extra classifier. Extensive experiments on dataset VisDrone2021 show that TPH-YOLOv5 have good performance with impressive interpretability on drone-captured scenarios. On DET-test-challenge dataset, the AP result of TPH-YOLOv5 are 39.18%, which is better than previous SOTA method (DPNetV3) by 1.81%. On VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and achieves well-matched results with 1st place model (AP 39.43%). Compared to baseline model (YOLOv5), TPH-YOLOv5 improves about 7%, which is encouraging and competitive.

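Research Motivation and Goals

Address the challenges of drone-captured object detection: extreme scale variation, high object density, and wide scene coverage.
Improve localization and dense-scene handling through a dedicated small-object head and Transformer-based prediction heads.
Introduce attention mechanisms and training/inference tricks to raise accuracy and robustness on drone datasets.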

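Proposed Method

A fourth prediction head is added to YOLOv5 to handle small objects.
The original prediction heads are replaced with Transformer Prediction Heads (TPH) to improve localization in crowded scenes.
The Convolutional Block Attention Module (CBAM) is integrated to focus on regions of interest in dense, cluttered scenes.
Several tricks are applied to raise accuracy: data augmentation (MixUp, Mosaic), multi-scale testing, and model ensembling.
To correct misclassified or easily confused categories, a self-trained ResNet18 classifier is applied to cropped object patches and used to refine the final predictions.
During ensembling, predictions from ms-testing (input rescaling and flipping) are fused with Weighted Boxes Fusion (WBF).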
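The fusion step in the last item can be illustrated with a minimal sketch. This is a simplified, single-class version of Weighted Boxes Fusion written for this review, not the authors' code: the names `unflip_horizontal`, `weighted_average`, and `weighted_boxes_fusion`, the `(x1, y1, x2, y2, score)` box layout, and the 0.55 IoU threshold are all assumptions made for illustration. Boxes from flipped inputs are first mapped back to the original frame, then overlapping boxes are merged into a score-weighted average instead of being suppressed.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float, float]  # (x1, y1, x2, y2, score), hypothetical layout


def unflip_horizontal(box: Box, image_width: float) -> Box:
    """Map a box predicted on a horizontally flipped image back to the original frame."""
    x1, y1, x2, y2, score = box
    return (image_width - x2, y1, image_width - x1, y2, score)


def iou(a: Sequence[float], b: Sequence[float]) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_average(cluster: Sequence[Box]) -> Box:
    """Score-weighted mean of the coordinates; the score is the plain mean (scores must be > 0)."""
    total = sum(b[4] for b in cluster)
    coords = [sum(b[i] * b[4] for b in cluster) / total for i in range(4)]
    return (coords[0], coords[1], coords[2], coords[3], total / len(cluster))


def weighted_boxes_fusion(per_model_boxes: Sequence[Sequence[Box]],
                          iou_thr: float = 0.55) -> List[Box]:
    """Fuse single-class boxes from several models (or test-time views) into one list."""
    n_models = len(per_model_boxes)
    ordered = sorted((b for boxes in per_model_boxes for b in boxes),
                     key=lambda b: b[4], reverse=True)
    clusters: List[List[Box]] = []  # raw boxes grouped by overlap
    fused: List[Box] = []           # current fused box of each cluster
    for box in ordered:
        # Find the fused box that overlaps this one the most, above the threshold.
        best_idx, best_iou = -1, iou_thr
        for idx, candidate in enumerate(fused):
            overlap = iou(candidate, box)
            if overlap > best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx < 0:
            clusters.append([box])
            fused.append(box)
        else:
            clusters[best_idx].append(box)
            fused[best_idx] = weighted_average(clusters[best_idx])
    # Down-weight boxes that only a few of the models agreed on.
    results = []
    for cluster, box in zip(clusters, fused):
        scale = min(len(cluster), n_models) / n_models
        results.append(box[:4] + (box[4] * scale,))
    return sorted(results, key=lambda b: b[4], reverse=True)
```

Unlike non-maximum suppression, which keeps only the top-scoring box of each overlapping group, this scheme lets every agreeing prediction contribute to the final coordinates, which is presumably why it suits an ensemble of models and test-time views.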

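Experimental Results

Research Questions

RQ1: How can Transformer-based prediction heads improve object localization in drone-captured imagery with widely varying object scales?
RQ2: How does adding a small-object prediction head and CBAM affect detection performance in dense, crowded drone scenes?
RQ3: Do data augmentation, multi-scale testing, and model ensembling substantially improve VisDrone2021 performance, and by how much?
RQ4: Can a self-trained classifier on cropped patches improve classification accuracy for confused categories?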

Key Results

Method	mAP (%)	AP50 (%)
RetinaNet	11.81	21.37
RefineDet	14.90	28.76
DetNet59	15.26	29.23
Cascade-RCNN	16.09	31.91
FPN	16.51	32.20
Light-RCNN	16.53	32.78
CornerNet	17.41	34.12
RRNet (2019)	29.13	55.82
DPNet-ensemble (2019)	29.62	54.00
SMPNet (2020)	35.98	59.53
DPNetV3 (2020)	37.37	62.05
TPH-YOLOv5 ensemble	39.18	N/A
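As a quick sanity check on the table, the short sketch below recomputes the margin between the TPH-YOLOv5 ensemble and the strongest prior entry. The mAP values are copied from the table and the first-place figure (39.43%) from the abstract; the names `PRIOR_AP`, `best_prior`, and `margin` are hypothetical helpers introduced only for this illustration.

```python
# mAP (%) values copied from the comparison table above.
PRIOR_AP = {
    "RetinaNet": 11.81,
    "RefineDet": 14.90,
    "DetNet59": 15.26,
    "Cascade-RCNN": 16.09,
    "FPN": 16.51,
    "Light-RCNN": 16.53,
    "CornerNet": 17.41,
    "RRNet (2019)": 29.13,
    "DPNet-ensemble (2019)": 29.62,
    "SMPNet (2020)": 35.98,
    "DPNetV3 (2020)": 37.37,
}
TPH_YOLOV5_AP = 39.18   # TPH-YOLOv5 ensemble, from the table
FIRST_PLACE_AP = 39.43  # 1st-place model in VisDrone Challenge 2021, from the abstract


def best_prior(results):
    """Return the (name, mAP) pair of the strongest earlier method."""
    return max(results.items(), key=lambda item: item[1])


def margin(ap_a, ap_b):
    """Difference in AP percentage points, rounded to two decimals."""
    return round(ap_a - ap_b, 2)


if __name__ == "__main__":
    name, ap = best_prior(PRIOR_AP)
    print(f"Margin over {name}: {margin(TPH_YOLOV5_AP, ap)} points")
    print(f"Gap to 1st place: {margin(FIRST_PLACE_AP, TPH_YOLOV5_AP)} points")
```

Both differences are absolute percentage points of AP, not relative improvements.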

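On VisDrone2021 DET test-dev, TPH-YOLOv5 improves mAP over the YOLOv5 baseline and over the intermediate ablation variants.
Adding the small-object head (P2) yields a clear AP gain, although it increases GFLOPs.
The Transformer encoder blocks raise mAP while reducing network size and GFLOPs, which helps dense-object detection.
Multi-model ensembling with ms-testing and WBF achieves higher mAP than any single model.
The self-trained classifier adds roughly 0.8–1.0% AP to the final result.
On VisDrone2021 test-challenge, the TPH-YOLOv5 ensemble reaches 39.18% AP, 1.81 percentage points ahead of the previous SOTA, DPNetV3 (Table 1).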
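The extra cost of the small-object head (P2) noted above can be made concrete by counting the grid cells each head predicts on. The sketch below assumes the usual feature-pyramid convention that level Pk has stride 2^k (so P2 through P5 have strides 4, 8, 16, and 32) and uses a 1536-pixel square input purely as an illustrative value; `head_grids` is a hypothetical helper, and none of these numbers are taken from this review.

```python
def head_grids(input_size, levels=(2, 3, 4, 5)):
    """Return {head name: (stride, grid side, grid cells)} for a square input.

    Assumes pyramid level Pk downsamples the input by a factor of 2**k.
    """
    grids = {}
    for level in levels:
        stride = 2 ** level
        side = input_size // stride
        grids[f"P{level}"] = (stride, side, side * side)
    return grids


if __name__ == "__main__":
    grids = head_grids(1536)  # illustrative input size, an assumption
    total_cells = sum(cells for _, _, cells in grids.values())
    for name, (stride, side, cells) in grids.items():
        share = cells / total_cells
        print(f"{name}: stride {stride}, grid {side}x{side}, {share:.1%} of all cells")
```

Under these assumptions the P2 grid alone holds about three quarters of all prediction cells, which is consistent with the reported trade-off: finer localization for tiny objects at a higher compute cost.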

This review was generated by AI and reviewed by a human editor.