Reference: Introduction to RepVGG Network, Sunflower's Little Mung Bean Blog (CSDN)
Reference: Detailed explanation of RepVGG and its implementation (PyTorch), Programmer Sought
Zhihu post by the author of the paper: RepVGG: Minimalist architecture, SOTA performance, making VGG-style models great again (CVPR 2021) bazyd
Official open source code: https://github.com/DingXiaoH/RepVGG
Parts of this article draw on blogs by other authors. If anything here infringes your rights, please contact me and I will remove it.
Although many complex convolutional neural networks outperform simple ones, they also have significant disadvantages:
- Complex multi-branch designs (such as ResNet's residual module or the Inception network) make the model harder to implement, slow down inference, and increase GPU memory usage
- Some lightweight operations, such as the channel shuffle used in ShuffleNet and the depthwise separable convolution used in MobileNet, reduce the number of parameters but increase the number of memory accesses, and they are poorly supported on some devices (3x3 convolution is usually the best optimized and best supported operation)
It is inaccurate to judge a model's processing efficiency and inference speed solely by its parameter count and its floating-point operations (FLOPs). For example, MobileNet uses depthwise separable convolution, which greatly reduces parameters and FLOPs but increases the number of memory accesses, so the gain in inference speed falls short of the ideal. Likewise, in the figure below, EfficientNet has fewer FLOPs and parameters, yet it is not necessarily faster.
In academic papers, parameter count and FLOPs are the usual measures of a model's size and speed, but the authors of the RepVGG paper argue that neither reflects the real inference speed. Two other important factors are:
- Memory access cost (MAC), the number of times the model accesses memory. Every branch of a multi-branch model must access memory and keep its feature map. Even when a branch is cheap in parameters and computation (for example a 1x1 convolution branch, an identity branch, or a group convolution), it still adds memory accesses and memory footprint
- The degree of parallelism. The branches of a multi-branch model run at different speeds, and the faster branches must wait for the slower ones before the results can be merged, which wastes compute and lowers parallelism
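To make the parameter and FLOPs argument concrete, the sketch below counts parameters and multiply-adds for one standard 3x3 convolution and for a depthwise separable replacement. The layer size (128 channels, 56x56 feature map) and the function names are hypothetical, chosen only for illustration. The counts cover arithmetic only and say nothing about memory access cost or parallelism, which is exactly the gap described above.

```python
def dense_conv_cost(k, c_in, c_out, h, w):
    """Parameters and multiply-adds of a stride-1, bias-free k x k convolution
    whose output feature map is h x w (assumes 'same' padding)."""
    params = k * k * c_in * c_out
    return params, params * h * w


def depthwise_separable_cost(k, c_in, c_out, h, w):
    """Same quantities for a k x k depthwise conv followed by a 1x1 pointwise conv."""
    depthwise_params = k * k * c_in   # one k x k filter per input channel
    pointwise_params = c_in * c_out   # 1x1 convolution that mixes channels
    params = depthwise_params + pointwise_params
    return params, params * h * w


# Hypothetical layer: 128 -> 128 channels on a 56 x 56 feature map.
dense = dense_conv_cost(3, 128, 128, 56, 56)
separable = depthwise_separable_cost(3, 128, 128, 56, 56)

print("dense     params / multiply-adds:", dense)
print("separable params / multiply-adds:", separable)
print("parameter reduction: {:.2f}x".format(dense[0] / separable[0]))
```

For this layer the dense convolution has 147,456 parameters and the separable version 17,536, roughly an 8.4x reduction, yet the separable version runs two operators and writes an extra intermediate feature map.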
Advantages of RepVGG:
- In the inference stage, the RepVGG model is a flat network like VGG, without any branch structure. Such a model occupies less GPU memory, needs fewer memory accesses, and has higher parallelism, so it computes more efficiently
- In the inference stage, the RepVGG model contains only 3x3 convolution and ReLU activation, and 3x3 convolution is very efficient
- The RepVGG architecture needs no special design techniques such as neural architecture search (NAS)
Why do multi-branch models work well?
- From the perspective of feature fusion: different branches learn different representations, and the fused representation is stronger
- From the perspective of feature and gradient reuse: in models such as ResNeXt and DenseNet, features and gradients can be reused across branches
- From the perspective of ensemble learning: at every shortcut connection in ResNet the signal can take one of two paths, so a model with N shortcuts contains 2^N possible paths from input to output and behaves like an ensemble of 2^N models
Why is the single-branch flat model fast?
- With a single branch, no feature map has to be kept for other branches, so less GPU memory is used
- With no other branches reading features, memory is accessed less often
- With no parallel branches, there is no waiting for other branches to finish
- A flat model uses fewer operator types. RepVGG, for example, has only 3x3 convolution and ReLU, which execute more efficiently
Here is the answer the author of the paper gave on Zhihu:
Multi-branch models train well but are slow at inference, while single-branch flat models train poorly but are fast at inference. RepVGG combines the two: it trains a multi-branch model to get better accuracy, then converts the trained multi-branch model into an equivalent single-branch flat model for inference. The inference-time network contains only two operations, 3x3 convolution and ReLU activation. The core question is how to convert a multi-branch model into a single-branch model. RepVGG calls this technique structural re-parameterization.
1. Conv3x3 + BN -> Conv3x3:
At inference time, BatchNorm uses 4 sets of parameters: the running mean, the running variance, the learned scale (gamma), and the learned shift (beta). Each set has as many entries as the feature dimension, which for the output of a 2D convolution is the number of channels of the output feature map. Because BatchNorm at inference is a fixed per-channel scale and shift, Conv3x3 + BN can be merged into a single Conv3x3 by rewriting the weight and bias of the convolution.
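The merged weight and bias can be written out explicitly. The derivation below is a sketch that assumes the original convolution has no bias, as in the code at the end of this article. The symbols are introduced here for illustration: $W_c$ is the kernel of output channel $c$, $*$ is convolution, and $\gamma_c$, $\beta_c$, $\mu_c$, $\sigma_c^2$, $\epsilon$ correspond to `gamma`, `beta`, `running_mean`, `running_var` and `eps`.

```latex
\begin{aligned}
\mathrm{BN}(y)_c &= \gamma_c \,\frac{y_c - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c,
\qquad y_c = W_c * x \\[4pt]
\mathrm{BN}(W * x)_c
  &= \underbrace{\left(\frac{\gamma_c}{\sqrt{\sigma_c^2 + \epsilon}}\, W_c\right)}_{W'_c} * x
   \;+\; \underbrace{\left(\beta_c - \frac{\gamma_c\,\mu_c}{\sqrt{\sigma_c^2 + \epsilon}}\right)}_{b'_c}
\end{aligned}
```

So the fused convolution uses the kernel $W'_c$ (each output channel's kernel scaled by $\gamma_c / \sqrt{\sigma_c^2 + \epsilon}$) and the bias $b'_c$.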
2. Conv1x1 + BN -> Conv3x3:
- Zero-pad the 1x1 convolution kernel into a 3x3 kernel, with the original weight at the center. To keep the output feature map the same size, the 3x3 convolution pads the input feature map by 1 on every side. This turns Conv1x1 + BN into Conv3x3 + BN
- Fuse the result with the Conv3x3 + BN -> Conv3x3 method from step 1, giving Conv1x1 + BN -> Conv3x3 + BN -> Conv3x3
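The zero-padding step can be checked numerically. The following is a minimal NumPy sketch, not the official implementation: `conv2d` is a hypothetical helper written here for illustration (stride 1, no groups, no bias), and the tensor sizes are arbitrary.

```python
import numpy as np


def conv2d(x, w, pad):
    """Naive stride-1 convolution (cross-correlation, as in deep learning).
    x: [C_in, H, W], w: [C_out, C_in, k, k]; returns [C_out, H', W']."""
    k = w.shape[-1]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h = xp.shape[1] - k + 1
    wd = xp.shape[2] - k + 1
    out = np.zeros((w.shape[0], h, wd))
    # Accumulate the contribution of each kernel tap over the shifted input.
    for di in range(k):
        for dj in range(k):
            out += np.einsum("oc,chw->ohw", w[:, :, di, dj],
                             xp[:, di:di + h, dj:dj + wd])
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 5, 5))       # input feature map
w1 = rng.standard_normal((6, 4, 1, 1))   # 1x1 kernel

# Surround the 1x1 weight with zeros so it sits at the center of a 3x3 kernel.
w3 = np.pad(w1, ((0, 0), (0, 0), (1, 1), (1, 1)))

y1 = conv2d(x, w1, pad=0)   # original 1x1 convolution, no padding
y3 = conv2d(x, w3, pad=1)   # padded 3x3 kernel, feature map padded by 1

print(y1.shape, y3.shape)
print(np.allclose(y1, y3))
```

Only the center tap of the padded kernel is non-zero, so each output pixel still depends on the same single input pixel as before.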
3. BN -> Conv3x3:
- This branch has only a BN and no convolution, so first construct a convolution that performs the identity mapping: a 1x1 kernel in which the nth channel of the nth filter has weight 1 and all other channels have weight 0
- Then, as in Conv1x1 + BN -> Conv3x3, zero-pad the 1x1 kernel into a 3x3 kernel, which turns Conv1x1 + BN into Conv3x3 + BN
- Finally, fuse the result with the Conv3x3 + BN -> Conv3x3 method from step 1, giving BN -> Conv1x1 + BN -> Conv3x3 + BN -> Conv3x3
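The three sub-steps can be verified together. The NumPy sketch below assumes an ordinary (non-grouped) convolution and uses made-up BN statistics; `conv3x3` is a hypothetical helper defined only for this check. It builds the identity kernel directly in 3x3 form, folds BN into it, and compares the result with BN applied to the input.

```python
import numpy as np


def conv3x3(x, w, b):
    """Stride-1, padding-1 3x3 convolution. x: [C, H, W], w: [O, C, 3, 3], b: [O]."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for di in range(3):
        for dj in range(3):
            out += np.einsum("oc,chw->ohw", w[:, :, di, dj],
                             xp[:, di:di + h, dj:dj + wd])
    return out + b[:, None, None]


rng = np.random.default_rng(0)
C = 4
x = rng.standard_normal((C, 5, 5))

# Made-up BN parameters (in a real model these come from training).
gamma = rng.standard_normal(C)
beta = rng.standard_normal(C)
mean = rng.standard_normal(C)
var = rng.uniform(0.5, 2.0, C)
eps = 1e-5
std = np.sqrt(var + eps)

# Reference: BN in inference mode applied directly to the input.
bn_out = (gamma / std)[:, None, None] * (x - mean[:, None, None]) + beta[:, None, None]

# Identity kernel: filter n has weight 1 at the center of channel n, 0 elsewhere.
identity = np.zeros((C, C, 3, 3))
identity[np.arange(C), np.arange(C), 1, 1] = 1.0

# Fold BN into the identity kernel (same rule as Conv3x3 + BN -> Conv3x3).
w = identity * (gamma / std)[:, None, None, None]
b = beta - mean * gamma / std

conv_out = conv3x3(x, w, b)
print(np.allclose(bn_out, conv_out))
```

With group convolution the identity kernel has to be built per group, which is what the index `j = i % (in_channels // g)` in the code at the end of this article does.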
4. Multi-branch Conv3x3 is fused into one Conv3x3
All three branches have now been converted into Conv3x3 operations whose output feature maps have the same shape. Because convolution is linear, adding the outputs of several convolutions applied to the same input equals applying one convolution whose weight is the sum of the branch weights and whose bias is the sum of the branch biases. Summing the weights and biases therefore yields a single new convolution, which is applied once to the input feature map, compressing the three convolutions into one.
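In symbols (introduced here for illustration), let $W_i$ and $b_i$ be the fused 3x3 weight and bias of branch $i$, and let $*$ denote convolution with identical stride, padding and groups in every branch. The merge is then:

```latex
\sum_{i=1}^{3} \left( W_i * x + b_i \right)
  = \left( W_1 + W_2 + W_3 \right) * x + \left( b_1 + b_2 + b_3 \right)
```

The identity holds only because all three kernels have the same shape and are applied to the same input with the same settings, which is why the 1x1 and identity branches are first rewritten as 3x3 convolutions.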
Remarks: RepVGG is an efficient model designed for GPUs and dedicated hardware. It aims for high speed and low memory usage and pays less attention to parameter count and theoretical computation (FLOPs). On devices with little computing power it may be less suitable than the MobileNet and ShuffleNet series.
# coding:utf-8
from collections import OrderedDict

import numpy as np
import torch
import torch.nn as nn


def randomize_bn(bn):
    """Give BN non-trivial parameters. A freshly constructed BN has mean=0,
    var=1, gamma=1, beta=0, which would make the checks below almost trivial."""
    with torch.no_grad():
        bn.running_mean.normal_()
        bn.running_var.uniform_(0.5, 2.0)
        bn.weight.normal_()
        bn.bias.normal_()


def fold_bn(bn):
    """Return the scale and bias that replace an inference-mode BN.

    BN uses 4 sets of per-channel parameters: running_mean, running_var
    (buffers), weight (gamma) and bias (beta, both learned). eps prevents
    division by zero.
    """
    std = (bn.running_var + bn.eps).sqrt()
    # Scaling factor for the convolution weights: [ch] -> [ch, 1, 1, 1]
    t = (bn.weight / std).reshape(-1, 1, 1, 1)
    # Bias of the fused convolution
    bias = bn.bias - bn.running_mean * bn.weight / std
    return t.detach(), bias.detach()


def outputs_match(out1, out2, tol):
    """Print and return whether two outputs agree within the tolerance."""
    same = np.allclose(out1.cpu().numpy(), out2.cpu().numpy(), rtol=tol, atol=tol)
    print(same)
    return same


def Conv3x3BNToConv3x3(g=2, in_channels=4, out_channels=4, tol=1e-4):
    """Conv3x3 + BN -> Conv3x3.

    Fuse Conv3x3 + BN into a single Conv3x3 by rewriting its weight and bias.
    """
    torch.random.manual_seed(0)
    f1 = torch.randn(1, in_channels, 3, 3)
    module = nn.Sequential(OrderedDict(
        # The original convolution does not use a bias
        conv=nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1,
                       padding=1, bias=False, groups=g),
        bn=nn.BatchNorm2d(num_features=out_channels),
    ))
    randomize_bn(module.bn)

    # Scale the original kernel and compute the fused bias
    t, bias = fold_bn(module.bn)
    kernel = module.conv.weight.detach() * t

    fused_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1,
                           padding=1, bias=True, groups=g)
    # Assign the computed weight and bias to the new convolution
    fused_conv.load_state_dict(OrderedDict(weight=kernel, bias=bias))

    module.eval()
    fused_conv.eval()
    with torch.no_grad():
        return outputs_match(module(f1), fused_conv(f1), tol)


def Conv1x1BNToConv3x3(g=2, in_channels=128, out_channels=128, tol=1e-4):
    """Conv1x1 + BN -> Conv3x3 + BN -> Conv3x3.

    1. Zero-pad the 1x1 kernel into a 3x3 kernel; the 3x3 convolution uses
       padding=1 so the output feature map keeps its size.
    2. Fuse Conv3x3 + BN as in Conv3x3BNToConv3x3.
    """
    torch.random.manual_seed(0)
    f1 = torch.randn(1, in_channels, 3, 3)
    module = nn.Sequential(OrderedDict(
        # The original convolution does not use a bias
        conv=nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1,
                       padding=0, bias=False, groups=g),
        bn=nn.BatchNorm2d(num_features=out_channels),
    ))
    randomize_bn(module.bn)

    # With group convolution, each filter has in_channels // g channels
    weight = torch.zeros(out_channels, in_channels // g, 3, 3, dtype=torch.float)
    # Put the 1x1 kernel at the center of the all-zero 3x3 kernel
    weight[:, :, 1:2, 1:2] = module.conv.weight.detach()

    t, bias = fold_bn(module.bn)
    fused_conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1,
                           padding=1, bias=True, groups=g)
    fused_conv.load_state_dict(OrderedDict(weight=weight * t, bias=bias))

    module.eval()
    fused_conv.eval()
    with torch.no_grad():
        return outputs_match(module(f1), fused_conv(f1), tol)


def BNToConv3x3Group(g=2, in_channels=4, out_channels=4, tol=1e-4):
    """BN -> Conv1x1 + BN -> Conv3x3 + BN -> Conv3x3.

    1. Build an identity convolution: the nth channel of the nth filter is 1,
       everything else is 0.
    2. Zero-pad it to 3x3 (done directly here by writing the center element).
    3. Fuse Conv3x3 + BN as in Conv3x3BNToConv3x3.
    """
    # An identity mapping requires matching input and output channels
    assert in_channels == out_channels
    torch.random.manual_seed(0)
    f1 = torch.randn(1, in_channels, 3, 3)
    bn = nn.BatchNorm2d(num_features=out_channels)
    randomize_bn(bn)

    # With group convolution, each filter has in_channels // g channels, so
    # within its group filter i must select channel i % (in_channels // g)
    weight = torch.zeros(out_channels, in_channels // g, 3, 3, dtype=torch.float)
    for i in range(out_channels):
        j = i % (in_channels // g)
        weight[i, j, 1, 1] = 1

    t, bias = fold_bn(bn)
    conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1,
                     padding=1, bias=True, groups=g)
    conv.load_state_dict(OrderedDict(weight=weight * t, bias=bias))

    bn.eval()
    conv.eval()
    with torch.no_grad():
        return outputs_match(bn(f1), conv(f1), tol)


def FuseConv3x3(g=2, in_channels=4, out_channels=4, tol=1e-4):
    """Merge several parallel Conv3x3 into one Conv3x3.

    Adding the outputs of parallel Conv3x3 layers equals one convolution whose
    weight and bias are the sums of the individual weights and biases.
    """
    torch.random.manual_seed(0)
    f1 = torch.randn(1, in_channels, 3, 3)

    def make_conv():
        return nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1,
                         padding=1, bias=True, groups=g)

    conv1, conv2, conv3, fuse_conv = make_conv(), make_conv(), make_conv(), make_conv()
    kernel = (conv1.weight + conv2.weight + conv3.weight).detach()
    bias = (conv1.bias + conv2.bias + conv3.bias).detach()
    fuse_conv.load_state_dict(OrderedDict(weight=kernel, bias=bias))

    for m in (conv1, conv2, conv3, fuse_conv):
        m.eval()
    with torch.no_grad():
        return outputs_match(conv1(f1) + conv2(f1) + conv3(f1), fuse_conv(f1), tol)


if __name__ == '__main__':
    # Conv3x3BNToConv3x3(g=128, in_channels=128, out_channels=128, tol=1e-6)
    # Conv1x1BNToConv3x3(g=1, in_channels=128, out_channels=128, tol=1e-6)
    # BNToConv3x3Group(g=128, in_channels=128, out_channels=128, tol=1e-6)
    FuseConv3x3(g=128, in_channels=128, out_channels=128, tol=1e-5)