TPH-YOLOv5 | Transformer-based YOLOv5 small-object detector | Four heads plus attention


Paper address: https://arxiv.org/pdf/2108.11539.pdf
Project address: https://github.com/cv516Buaa/tph-yolov5

Object detection in drone-captured scenes has recently become a popular task. Since drones always navigate at different altitudes, the object scale varies drastically, which burdens network optimization. In addition, high-speed and low-altitude flight brings motion blur to densely packed objects, posing a great challenge to object recognition. To address these two issues, we propose TPH-YOLOv5. Based on YOLOv5, we add one more prediction head to detect objects at different scales. Then we replace the original prediction heads with Transformer Prediction Heads (TPH) to explore the prediction potential of the self-attention mechanism. We also integrate the Convolutional Block Attention Module (CBAM) to find attention regions in scenes with dense objects. To further improve TPH-YOLOv5, we employ many useful strategies such as data augmentation, multi-scale testing, multi-model ensembling, and an additional classifier. Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 performs well, with impressive interpretability, on drone-captured scenes. On the DET-test-challenge dataset, the AP of TPH-YOLOv5 is 39.18%, which is 1.81% better than the previous SOTA method (DPNetV3). In the VisDrone Challenge 2021, TPH-YOLOv5 won 5th place and is closely matched with the 1st-place model (AP 39.43%). Compared with the baseline YOLOv5, TPH-YOLOv5 improves by about 7%, which is encouraging and competitive.

Problems Solved

TPH-YOLOv5 aims to solve two problems in drone imagery:

  • As drones fly at different altitudes, the scale of objects changes drastically.
  • High-speed and low-altitude flight brings motion blur to densely packed objects.

Major Improvements

TPH-YOLOv5 makes the following improvements on top of YOLOv5:

  • An additional detection head is added to detect smaller-scale objects.
  • The original prediction heads are replaced with Transformer Prediction Heads (TPH).
  • CBAM is integrated into YOLOv5 to help the network find regions of interest in images that cover large areas.
  • A series of other small tricks.

The TPH-YOLOv5 network structure is as follows:

TPH module

The authors use Transformer encoder blocks to replace some of the convolution and CSP bottleneck structures; applying Transformers to vision tasks is also a current mainstream trend. The Transformer's self-attention mechanism captures global context, and it gives better results than the original blocks.
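
In YOLOv5 terms, the TPH branches are built from C3TR modules, whose core is a transformer encoder layer: multi-head self-attention followed by a feed-forward MLP, each wrapped in a residual connection. Below is a minimal PyTorch sketch in the spirit of YOLOv5's TransformerLayer (which notably omits LayerNorm); the class name, head count, and toy shapes are illustrative and not the exact code from the repository.

import torch
import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    # Simplified transformer encoder block: linear q/k/v projections,
    # multi-head self-attention, and a two-layer MLP, each with a residual add.
    def __init__(self, c, num_heads=4):
        super().__init__()
        self.q = nn.Linear(c, c, bias=False)
        self.k = nn.Linear(c, c, bias=False)
        self.v = nn.Linear(c, c, bias=False)
        self.ma = nn.MultiheadAttention(embed_dim=c, num_heads=num_heads)
        self.fc1 = nn.Linear(c, c, bias=False)
        self.fc2 = nn.Linear(c, c, bias=False)

    def forward(self, x):  # x: (seq_len, batch, c)
        x = self.ma(self.q(x), self.k(x), self.v(x))[0] + x  # self-attention + residual
        return self.fc2(self.fc1(x)) + x                     # feed-forward + residual

if __name__ == "__main__":
    feat = torch.randn(1, 256, 8, 8)        # toy (B, C, H, W) feature map
    seq = feat.flatten(2).permute(2, 0, 1)  # -> (H*W, B, C) token sequence
    out = TransformerLayerSketch(256)(seq)
    print(out.shape)                        # torch.Size([64, 1, 256])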

CBAM module
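
CBAM (Convolutional Block Attention Module) refines a feature map in two sequential steps: channel attention (global average- and max-pooled descriptors passed through a shared MLP) followed by spatial attention (a 7×7 convolution over the channel-wise mean and max maps). The sketch below is a minimal PyTorch implementation following the original CBAM paper; the module actually registered in models/common.py for this reproduction may differ in details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    # Channel attention: shared MLP over global avg- and max-pooled descriptors.
    def __init__(self, c, ratio=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c // ratio, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(c // ratio, c, 1, bias=False),
        )

    def forward(self, x):
        return torch.sigmoid(self.mlp(F.adaptive_avg_pool2d(x, 1)) +
                             self.mlp(F.adaptive_max_pool2d(x, 1)))

class SpatialAttention(nn.Module):
    # Spatial attention: 7x7 conv over the channel-wise mean and max maps.
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    # Channel attention first, then spatial attention, both applied multiplicatively.
    def __init__(self, c, ratio=16, k=7):
        super().__init__()
        self.ca = ChannelAttention(c, ratio)
        self.sa = SpatialAttention(k)

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)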

I found that the code released by the authors differs from the structure shown in the figure, so I reproduced the model according to the figure above. Apart from the detection heads, it follows the original paper exactly. You can refer to the structure described in this article to improve your own model. Since these modules already exist in our code, we only need to modify the configuration file.

# YOLOv5 🚀 by Ultralytics, GPL-3.0 license
# Diffie Hellman https://blog.csdn.net/weixin_43694096?spm=1000.2115.3001.5343

# Parameters
nc: 80  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
anchors:
  - [19,27,  44,40,  38,94]  # P2/4
  - [96,68,  86,152,  180,137]  # P3/8
  - [140,301,  303,264,  238,542]  # P4/16
  - [436,615,  739,380,  925,792]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],       # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],    # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],    # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],    # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [768, 3, 2]],    # 7-P5/32
   [-1, 1, SPP, [1024, [3, 5, 7]]],
   [-1, 3, C3TR, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [768, 1, 1]],  # 10
   [-1, 1, nn.Upsample, [None, 2, 'nearest']], #11
   [[-1, 6], 1, Concat, [1]],   # 12 cat backbone P4
   [-1, 3, C3, [768, False]],   # 13
   [-1, 1, CBAM, [768]],        # 14

   [-1, 1, Conv, [512, 1, 1]],  # 15
   [-1, 1, nn.Upsample, [None, 2, 'nearest']], #16
   [[-1, 4], 1, Concat, [1]],   # 17 cat backbone P3
   [-1, 3, C3, [512, False]],   #  18
   [-1, 1, CBAM, [512]],        # 19

   [-1, 1, Conv, [256, 1, 1]],  # 20
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  #21
   [[-1, 2], 1, Concat, [1]],   # 22 cat backbone P2
   [-1, 3, C3TR, [256, False]], # 23 (P2/4-xsmall)
   [-1, 1, CBAM, [256]],        # 24

   [-1, 1, Conv, [256, 3, 2]],  # 25
   [[-1, 20], 1, Concat, [1]],  # 26 cat head P3
   [-1, 3, C3TR, [512, False]], # 27 (P3/8-small)
   [-1, 1, CBAM, [512]],        # 28

   [-1, 1, Conv, [512, 3, 2]],  # 29
   [[-1, 15], 1, Concat, [1]],  # 30 cat head P4
   [-1, 3, C3TR, [768, False]], # 31 (P4/16-medium)
   [-1, 1, CBAM, [768]],        # 32

   [-1, 1, Conv, [768, 3, 2]],  # 33
   [[-1, 10], 1, Concat, [1]],  # 34 cat head P5
   [-1, 3, C3TR, [1024, False]],# 35 (P5/32-large)

   [[23, 27, 31, 35], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4, P5)
  ]
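
To sanity-check the reproduced configuration, the model can be built directly with YOLOv5's Model class. The snippet below assumes the YAML above is saved under a hypothetical name (models/yolov5-tph.yaml) and that CBAM is already defined in models/common.py and registered in parse_model in models/yolo.py; otherwise model construction will fail on the unknown module.

import torch
from models.yolo import Model  # YOLOv5 repository

# 'models/yolov5-tph.yaml' is a placeholder name for the config shown above
model = Model('models/yolov5-tph.yaml', ch=3, nc=80)
model.info(verbose=False)  # layer count, parameter count, and GFLOPs (if thop is installed)

model.train()
preds = model(torch.zeros(1, 3, 640, 640))
for p in preds:            # one output tensor per detection head
    print(p.shape)

The printed parameter count and GFLOPs can be compared with the numbers below:
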
Model         Parameters    GFLOPs
TPH-YOLOv5    10,009,510    34.8

More of my hands-on YOLOv5 articles 🍀

  1. Hands-on guide to tuning YOLOv5 (v6.2): inference 🌟 Highly recommended

  2. Hands-on guide to tuning YOLOv5 (v6.2): training 🚀

  3. Hands-on guide to tuning YOLOv5 (v6.2): validation

  4. How to quickly train a YOLOv5 model on your own dataset

  5. Hands-on guide to adding attention mechanisms to YOLOv5 (v6.2), part 1 (with schematics of 30+ attention modules) 🌟 Highly recommended

  6. Hands-on guide to adding attention mechanisms to YOLOv5 (v6.2), part 2 (adding attention to the C3 module)

  7. How to replace the activation function in YOLOv5

  8. How to swap in BiFPN in YOLOv5

  9. Analysis of YOLOv5 (v6.2) data augmentation methods

  10. Replacing the upsampling method in YOLOv5 (nearest neighbor / bilinear / bicubic / trilinear / transposed convolution)

  11. How to replace the IoU loss with EIoU / alpha-IoU / SIoU in YOLOv5

  12. Replacing the YOLOv5 backbone with Megvii's lightweight ShuffleNetV2 🍀

  13. Applying the lightweight universal upsampling operator CARAFE to YOLOv5

  14. Spatial pyramid pooling improvements: SPP / SPPF / SimSPPF / ASPP / RFB / SPPCSPC / SPPFCSPC 🚀

  15. SPD-Conv: a module for low-resolution images and small objects 🍀

  16. GSConv + Slim-neck: reducing model complexity while improving accuracy 🍀

  17. Head decoupling | Adding the YOLOX decoupled head to YOLOv5 | a killer trick for boosting accuracy 🍀
