Paper: https://arxiv.org/pdf/2108.11539.pdf
Project: https://github.com/cv516Buaa/tph-yolov5
Object detection in scenes captured by drones is a popular recent task. Since drones navigate at varying altitudes, object scale changes drastically, which burdens network optimization. In addition, high-speed, low-altitude flight causes motion blur on densely packed objects, posing a great challenge for recognition. To address these two issues, the authors propose TPH-YOLOv5. Based on YOLOv5, they add a prediction head to detect objects at an additional scale, then replace the original prediction heads with Transformer Prediction Heads (TPH) to exploit the prediction potential of self-attention. They also integrate the Convolutional Block Attention Module (CBAM) to locate attention regions in scenes with dense objects. To further improve TPH-YOLOv5, they apply several useful strategies such as data augmentation, multi-scale testing, multi-model ensembling, and an additional classifier. Extensive experiments on VisDrone2021 show that TPH-YOLOv5 performs well, with impressive interpretability, on drone-captured scenes. On the DET-test-challenge dataset, TPH-YOLOv5 reaches 39.18% AP, 1.81% better than the previous SOTA method (DPNetV3). In the VisDrone Challenge 2021, TPH-YOLOv5 placed 5th, close to the 1st-place model (AP 39.43%). Compared with the YOLOv5 baseline, TPH-YOLOv5 improves by about 7%, which is encouraging and competitive.
Problems Solved
TPH-YOLOv5 aims to solve two problems in drone imagery:
- As drones fly at different altitudes, the scale of objects changes drastically.
- High-speed and low-altitude flight brings motion blur to densely packed objects.
Major Improvements
TPH-YOLOv5 makes the following improvements on the basis of YOLOv5:
- A new detection head is added to detect objects at a smaller scale.
- The original prediction heads are replaced with Transformer Prediction Heads (TPH).
- CBAM is integrated into YOLOv5 to help the network find attention regions in images with large-area coverage.
- A series of other small tricks.
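With the extra head, the model predicts at four scales. As a quick illustration (my own sketch, not the authors' code), here are the grid sizes each head produces for a 640×640 input, assuming strides 8/16/32/64 (P3–P6, matching the reproduced configuration below) and 3 anchors per grid cell:

```python
# Grid sizes per prediction head for a 640x640 input, assuming the four
# strides (P3-P6) and 3 anchors per grid cell used in the reproduced config.
strides = {"P3": 8, "P4": 16, "P5": 32, "P6": 64}
img_size = 640
grids = {name: img_size // s for name, s in strides.items()}
print(grids)  # {'P3': 80, 'P4': 40, 'P5': 20, 'P6': 10}

# Total number of box predictions across all heads.
total = sum(3 * g * g for g in grids.values())
print(total)  # 25500
```

The finest head (stride 8, 80×80 grid) contributes most of the predictions, which is why it helps with the many small objects typical of drone imagery.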
The TPH-YOLOv5 network structure is as follows:
TPH module
The authors use a Transformer encoder to replace some of the convolution and CSP structures; applying Transformers to vision tasks is the current mainstream trend. The Transformer's self-attention mechanism gives it a global receptive field, and it performs better than the original blocks.
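The core of the Transformer encoder is scaled dot-product self-attention. A minimal NumPy sketch (using identity Q/K/V projections for brevity — a real encoder block learns these projections and adds a feed-forward sublayer, residual connections, and layer normalization):

```python
import numpy as np

def self_attention(x, d_k):
    # x: (N, d) token embeddings. Identity projections stand in for
    # the learned Q/K/V linear layers of a real Transformer encoder.
    q, k, v = x, x, x
    scores = q @ k.T / np.sqrt(d_k)              # (N, N) pairwise similarities
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)        # row-wise softmax
    return w @ v, w                              # weighted sum of values

# Each of the 5 "tokens" attends to every other token.
x = np.random.default_rng(0).standard_normal((5, 8))
out, attn = self_attention(x, d_k=8)
print(out.shape, attn.shape)  # (5, 8) (5, 5)
```

Every output token is a convex combination of all input tokens (each attention row sums to 1), which is what gives the C3TR blocks in the config below their global context, in contrast to a convolution's local window.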
CBAM module
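CBAM refines a feature map in two sequential steps: channel attention (which channels matter) followed by spatial attention (where to look). A minimal NumPy sketch — the randomly initialized MLP weights and the box filter standing in for CBAM's learned 7×7 convolution are my own hypothetical stand-ins, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, r=4):
    # x: (C, H, W). Shared MLP applied to avg- and max-pooled descriptors.
    C = x.shape[0]
    rng = np.random.default_rng(0)
    w1 = rng.standard_normal((C, C // r)) * 0.1  # hypothetical weights
    w2 = rng.standard_normal((C // r, C)) * 0.1  # (learned in real CBAM)
    mlp = lambda v: np.maximum(v @ w1, 0) @ w2   # bottleneck MLP with ReLU
    avg, mx = x.mean(axis=(1, 2)), x.max(axis=(1, 2))
    return sigmoid(mlp(avg) + mlp(mx))           # (C,) per-channel weights

def spatial_attention(x, k=7):
    # Channel-wise avg & max maps, mixed by a k x k box filter that
    # stands in for CBAM's learned 7x7 convolution.
    m = (x.mean(axis=0) + x.max(axis=0)) / 2
    p = k // 2
    padded = np.pad(m, p, mode="edge")
    H, W = m.shape
    out = np.empty_like(m)
    for i in range(H):
        for j in range(W):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return sigmoid(out)                          # (H, W) spatial weights

def cbam(x):
    x = x * channel_attention(x)[:, None, None]  # channel refinement
    x = x * spatial_attention(x)[None, :, :]     # spatial refinement
    return x

x = np.random.default_rng(1).standard_normal((8, 16, 16))
y = cbam(x)
print(y.shape)  # (8, 16, 16) - same shape, reweighted features
```

Because both attention maps are element-wise multipliers in (0, 1), CBAM keeps the tensor shape unchanged, which is why it can be dropped between existing layers in the config below without touching the rest of the network.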
I found that the code the authors published differs from the structure in the figure, so I reproduced the model according to the figure above; except for the detection head, I follow the original paper exactly. You can refer to this structure to improve your own model. Since all of these modules already exist in our files, we only need to change the configuration file:
```yaml
# YOLOv5 🚀 by Ultralytics, GPL-3.0 license
# Diffie Hellman https://blog.csdn.net/weixin_43694096?spm=1000.2115.3001.5343

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [19,27, 44,40, 38,94]  # P3/8
  - [96,68, 86,152, 180,137]  # P4/16
  - [140,301, 303,264, 238,542]  # P5/32
  - [436,615, 739,380, 925,792]  # P6/64

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [768, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [3, 5, 7]]],
   [-1, 3, C3TR, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [768, 1, 1]],  # 10
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 11
   [[-1, 6], 1, Concat, [1]],  # 12 cat backbone P5
   [-1, 3, C3, [768, False]],  # 13
   [-1, 1, CBAM, [768]],  # 14
   [-1, 1, Conv, [512, 1, 1]],  # 15
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 16
   [[-1, 4], 1, Concat, [1]],  # 17 cat backbone P4
   [-1, 3, C3, [512, False]],  # 18
   [-1, 1, CBAM, [512]],  # 19
   [-1, 1, Conv, [256, 1, 1]],  # 20
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  # 21
   [[-1, 2], 1, Concat, [1]],  # 22 cat backbone P3
   [-1, 3, C3TR, [256, False]],  # 23 (P3/8-small)
   [-1, 1, CBAM, [256]],  # 24
   [-1, 1, Conv, [256, 3, 2]],  # 25
   [[-1, 20], 1, Concat, [1]],  # 26 cat head P4
   [-1, 3, C3TR, [512, False]],  # 27 (P4/16-medium)
   [-1, 1, CBAM, [512]],  # 28
   [-1, 1, Conv, [512, 3, 2]],  # 29
   [[-1, 15], 1, Concat, [1]],  # 30 cat head P5
   [-1, 3, C3TR, [768, False]],  # 31 (P5/32-large)
   [-1, 1, CBAM, [768]],  # 32
   [-1, 1, Conv, [768, 3, 2]],  # 33
   [[-1, 10], 1, Concat, [1]],  # 34 cat head P6
   [-1, 3, C3TR, [1024, False]],  # 35 (P6/64-xlarge)
   [[23, 27, 31, 35], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5, P6)
  ]
```
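A quick sanity check on the anchor list above — YOLOv5 expects every detection level to define the same number of (width, height) anchor pairs:

```python
# Anchors copied from the config above; each inner list holds flattened
# (w, h) pairs for one detection level.
anchors = [
    [19, 27, 44, 40, 38, 94],        # P3/8
    [96, 68, 86, 152, 180, 137],     # P4/16
    [140, 301, 303, 264, 238, 542],  # P5/32
    [436, 615, 739, 380, 925, 792],  # P6/64
]
per_level = [len(a) // 2 for a in anchors]
print(per_level)  # [3, 3, 3, 3] - three anchor boxes per grid cell, per level
```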
Model | Parameters | GFLOPs
---|---|---
TPH-YOLOv5 | 10,009,510 | 34.8
More of my hands-on YOLOv5 content 🍀

- Hands-on YOLOv5 (v6.2) tuning (inference) 🌟 Highly recommended
- Hands-on YOLOv5 (v6.2) tuning (training) 🚀
- Adding attention mechanisms to YOLOv5 (v6.2), Part 1 (with diagrams of 30+ attention modules) 🌟 Highly recommended
- YOLOv5 with the lightweight universal upsampling operator CARAFE
- Spatial pyramid pooling improvements: SPP/SPPF/SimSPPF/ASPP/RFB/SPPCSPC/SPPFCSPC 🚀
- SPD-Conv module for low-resolution images and small objects 🍀
- GSConv + Slim-neck: reduce model complexity and improve accuracy 🍀
- Decoupled heads | Adding the YOLOX decoupled head to YOLOv5, a powerful accuracy booster 🍀