Convolution is one of the most important concepts in deep learning. Let's learn and review the basic concepts of convolution.

# Theoretical Part

## From Fully Connected Layers to Convolutional Layers

Let's start with an example: suppose I want to classify cats and dogs.

Suppose the pictures are taken with a 12-megapixel phone camera. Each picture is an RGB image with 3 channels, so it contains 36 million elements. An MLP with a single hidden layer of 100 units would then have 3.6 billion parameters, which is far more than the total number of cats and dogs in the world. With that many parameters, the model might as well memorize every cat and dog in the world.

This problem has to be addressed whenever an MLP is used to process relatively large images.

Let's review the MLP with a single hidden layer.

As shown in the figure above, feeding 36 million input elements into a hidden layer with 100 neurons requires 36 million times 100 weights, which is 3.6 billion elements. Stored as 32-bit floats, these weights take about 14 GB, and that is only the space needed for a single layer. With multiple layers the storage cost grows even further, which is clearly impractical.
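The arithmetic above can be checked in a few lines of Python. This is only a back-of-the-envelope sketch; the 4 bytes per weight is an assumption (32-bit floats) that is consistent with the 14 GB figure.

```python
# Back-of-the-envelope check of the parameter count for the MLP example.
pixels = 12_000_000        # 12-megapixel photo
channels = 3               # RGB
hidden_units = 100         # size of the single hidden layer

inputs = pixels * channels             # number of input elements
weights = inputs * hidden_units        # weights between input and hidden layer
bytes_per_weight = 4                   # assumption: 32-bit floats
gigabytes = weights * bytes_per_weight / 1e9

print(f"inputs:  {inputs:,}")
print(f"weights: {weights:,}")
print(f"memory:  {gigabytes:.1f} GB")
```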

Let's play a game: find "Waldo", the character shown on the left of the picture, in the crowd.

There are two principles to be followed in the search process:

1. Translation invariance

Translation invariance means that the same classifier should recognize an object accurately no matter where the object appears in the picture. That is to say, the recognizer does not change with the position of the object in the picture.

2. Locality

The search does not need to look at the whole image at once. A local region is enough, as long as it is large enough to contain the object.

Changes from the fully connected layer to the convolutional layer at the mathematical level:

The convolution operation is shown in the figure above.

That is to say, in w_{i,j,k,l}, the indices i, j give the position of an output element in the output matrix, and k, l give the position of an input element in the input image (or matrix). The weight tensor records the influence (that is, the weight) of every input element on every output element. For example, suppose the input image is 4x4 and the output image is 2x2. For output element (1,1), I need to record the influence of all input elements (1,1), (1,2), ..., (2,1), ..., (4,4). In the same way, I need to record the influence of all these input elements on output elements (1,2), (2,1), and (2,2). Each weight is therefore identified by four indices: the row and column in the input, and the row and column in the output. So a 4D tensor is needed to record all the weights.
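The 4x4 to 2x2 example can be sketched with a 4D weight tensor in PyTorch. This is a minimal illustration with random weights; the names `W`, `X`, and `H` are chosen here for the example. It also checks that the 4D form is the same computation as an ordinary fully connected layer on the flattened image.

```python
import torch

torch.manual_seed(0)

X = torch.arange(16.0).reshape(4, 4)   # 4x4 input
# W[i, j, k, l] is the weight of input element (k, l) on output element (i, j).
W = torch.randn(2, 2, 4, 4)            # 2 * 2 * 4 * 4 = 64 weights

# H[i, j] = sum over k, l of W[i, j, k, l] * X[k, l]
H = torch.einsum('ijkl,kl->ij', W, X)

# The same computation as a plain matrix-vector product on the flattened input.
H_flat = (W.reshape(4, 16) @ X.reshape(16)).reshape(2, 2)

print(W.numel())                  # number of weights
print(torch.allclose(H, H_flat))  # both forms agree
```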

Let's see how to apply the first principle - translation invariance.

Here it is required that the weights v do not change no matter how i and j change. That is to say, the weight matrix used in the convolution calculation, namely the convolution kernel, does not depend on the position (i,j).
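A small experiment illustrates this. The sketch below applies one shared kernel to two inputs that contain the same pattern at different positions. The 2x2 all-ones kernel and the input sizes are hypothetical choices for illustration. The peak response is the same in both cases, and it simply moves with the pattern.

```python
import torch

def corr2d(X, K):
    """Two-dimensional cross-correlation with one shared kernel K."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

K = torch.ones(2, 2)            # hypothetical 2x2 "detector" kernel

X1 = torch.zeros(5, 6)
X1[0:2, 0:2] = 1.0              # 2x2 pattern in the top-left corner
X2 = torch.zeros(5, 6)
X2[2:4, 3:5] = 1.0              # same pattern moved to rows 2-3, columns 3-4

Y1, Y2 = corr2d(X1, K), corr2d(X2, K)
# Convert the flat argmax index back to a (row, column) position.
pos1 = divmod(int(Y1.argmax()), Y1.shape[1])
pos2 = divmod(int(Y2.argmax()), Y2.shape[1])
print(Y1.max().item(), pos1)
print(Y2.max().item(), pos2)
```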

Principle 2 means that when computing the output h_{i,j}, the convolution kernel should not be too large. It only needs to look at the elements near x_{i,j}.
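The two principles can be written out as a short derivation. This is a sketch of the standard argument using the symbols from the text (w, v, h, x). The offsets a, b and the window radius \Delta are notation introduced here as assumptions, and the bias term is omitted.

```latex
% Fully connected layer on a 2D input, with a 4D weight tensor.
% Re-index the input position as k = i + a, l = j + b.
h_{i,j} = \sum_{k,l} w_{i,j,k,l}\, x_{k,l}
        = \sum_{a,b} v_{i,j,a,b}\, x_{i+a,\,j+b},
\qquad v_{i,j,a,b} = w_{i,j,\,i+a,\,j+b}

% Principle 1 (translation invariance): v does not depend on (i, j).
h_{i,j} = \sum_{a,b} v_{a,b}\, x_{i+a,\,j+b}

% Principle 2 (locality): only offsets with |a|, |b| \le \Delta contribute.
h_{i,j} = \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} v_{a,b}\, x_{i+a,\,j+b}
```

After the second step the number of weights no longer depends on the image size, only on the size of the window.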

## Convolutional Layer

The specific calculation process of convolution is shown in the figure above. Each element of the output matrix is obtained by multiplying the convolution kernel element-wise with the input window it currently covers and summing the products. After each output element is computed, the kernel moves one position to the right relative to the input matrix. When it reaches the right edge, it moves down one row, returns to the far left, and continues.
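The sliding order described above can be traced step by step. The following is a minimal sketch with a 3x3 input and a 2x2 kernel; it prints each window position together with the value it produces.

```python
import torch

X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
h, w = K.shape

# Visit the windows left to right, then move down one row and
# start again from the far left.
outputs = []
for i in range(X.shape[0] - h + 1):
    for j in range(X.shape[1] - w + 1):
        window = X[i:i + h, j:j + w]          # input region under the kernel
        value = (window * K).sum().item()     # element-wise product, then sum
        outputs.append(((i, j), value))
        print(f"window at {(i, j)}: {value}")
```

For example, the first window gives 0*0 + 1*1 + 3*2 + 4*3 = 19.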

Examples of convolutions:

Different convolution kernels can bring different effects. The neural network can learn these kernels to achieve the image effect we want.

There is a difference between two-dimensional cross-correlation and two-dimensional convolution. The convolutional layer in a neural network actually applies the formula of two-dimensional cross-correlation. Geometrically, the two differ only in that the kernel is flipped up-down and left-right, so the convolutional layer is not exactly the two-dimensional convolution defined in mathematics. Because the kernel is learned, this difference does not matter in practice.

For one-dimensional cross-correlation, the weight is a vector, which makes it suitable for one-dimensional data such as text and language. For two-dimensional cross-correlation, the weight is two-dimensional, which is suitable for images. For three-dimensional cross-correlation, the weight is three-dimensional, which is suitable for three-dimensional data such as video and medical images. The idea is the same in all three cases.
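The flip relationship can be checked directly. In this minimal sketch, `conv2d_flipped` is a helper name introduced here, not a library function. It computes a true two-dimensional convolution by flipping the kernel along both axes and then applying cross-correlation.

```python
import torch

def corr2d(X, K):
    """Two-dimensional cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def conv2d_flipped(X, K):
    """True 2D convolution: flip the kernel up-down and left-right first."""
    return corr2d(X, torch.flip(K, dims=[0, 1]))

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])

corr = corr2d(X, K)
conv = conv2d_flipped(X, K)
print(corr)
print(conv)

# For a kernel that is symmetric under the flip, both operations agree.
K_sym = torch.ones(2, 2)
print(torch.equal(corr2d(X, K_sym), conv2d_flipped(X, K_sym)))
```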
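In PyTorch the three cases correspond to `nn.Conv1d`, `nn.Conv2d`, and `nn.Conv3d`. The sketch below only inspects the weight shapes; the single input and output channel and the kernel size of 3 are arbitrary choices for illustration.

```python
from torch import nn

# Weight shape is (out_channels, in_channels, *kernel_size).
conv1d = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3)
conv2d = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3)
conv3d = nn.Conv3d(in_channels=1, out_channels=1, kernel_size=3)

shape1d = tuple(conv1d.weight.shape)   # kernel is a vector
shape2d = tuple(conv2d.weight.shape)   # kernel is a matrix
shape3d = tuple(conv3d.weight.shape)   # kernel is a 3D block
print(shape1d, shape2d, shape3d)
```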

The main advantage of the convolutional layer over the fully connected layer is that it has far fewer parameters.

Every node in a fully connected layer is connected to every node in the next layer, and each connection has its own parameter. The number of parameters in a convolutional layer, by contrast, depends only on the size of the convolution kernel and the numbers of input and output channels, not on the size of the image.

Through weight sharing and sparse connections, the convolutional layer keeps the number of trainable parameters in a single layer small.
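The difference in parameter counts can be made concrete. The sizes below (a 3x32x32 input, 100 hidden units or output channels, and a 3x3 kernel) are hypothetical values chosen for illustration, and `count_parameters` is a helper defined here.

```python
from torch import nn

def count_parameters(module):
    """Total number of trainable values (weights and biases) in a module."""
    return sum(p.numel() for p in module.parameters())

# Fully connected: every input element connects to every hidden unit.
fc = nn.Linear(3 * 32 * 32, 100)
# Convolutional: one 3x3 kernel per (input channel, output channel) pair.
conv = nn.Conv2d(in_channels=3, out_channels=100, kernel_size=3)

fc_params = count_parameters(fc)        # 3072 * 100 weights + 100 biases
conv_params = count_parameters(conv)    # 100 * 3 * 3 * 3 weights + 100 biases
print(fc_params, conv_params)
```

Doubling the image width and height would quadruple `fc_params` but leave `conv_params` unchanged.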

# Practical Part

code:

    # Image convolution (cross-correlation)
    import torch
    from torch import nn


    def corr2d(X, K):
        """Computes a two-dimensional cross-correlation operation."""
        # h is the number of rows of the kernel, w is the number of columns
        h, w = K.shape
        # Initialize the output with zeros; its height and width are
        # X.shape[0] - h + 1 and X.shape[1] - w + 1
        Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
        for i in range(Y.shape[0]):
            for j in range(Y.shape[1]):
                # Take the h x w window of the input whose top-left corner is (i, j),
                # multiply it element-wise with the kernel, and sum the products.
                # The window always has the same shape as the kernel.
                Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
        return Y


    # Verify the output of the two-dimensional cross-correlation operation
    X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
    K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
    print(corr2d(X, K))
    print("###########################################################################")


    # Implementing a 2D convolutional layer
    class Conv2D(nn.Module):
        def __init__(self, kernel_size):
            super().__init__()
            # Randomly initialize a weight of shape kernel_size
            self.weight = nn.Parameter(torch.rand(kernel_size))
            # The bias is initialized to 0
            self.bias = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            # Forward pass: cross-correlation of the input with the weight, plus the bias
            return corr2d(x, self.weight) + self.bias


    # A simple application of convolutional layers: detecting edges in an image
    X = torch.ones((6, 8))
    X[:, 2:6] = 0
    print("input matrix")
    # The input has two vertical edges: white (1) to black (0) on the left,
    # and black to white on the right. The goal is to detect both.
    print(X)
    K = torch.tensor([[1.0, -1.0]])
    # In the output Y, 1 marks an edge from white to black,
    # and -1 marks an edge from black to white
    Y = corr2d(X, K)
    print("Edge detection result")
    print(Y)
    # The kernel K only detects vertical edges. If the input is transposed,
    # its edges become horizontal and K no longer detects them.
    # To detect horizontal edges, transpose the kernel as well.
    print(corr2d(X.t(), K))
    print("###########################################################################")

    # Learn the kernel that generates Y from X: the input and output are known,
    # and the kernel is learned by gradient descent.
    # 1 input channel (1 for grayscale, 3 for color), 1 output channel, 1x2 kernel
    conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
    X = X.reshape((1, 1, 6, 8))  # input: batch size 1, 1 channel, 6x8
    Y = Y.reshape((1, 1, 6, 7))  # target output: batch size 1, 1 channel, 6x7
    print(X)
    print(Y)
    print("###########################################################################")
    for i in range(10):  # Iterate 10 times
        Y_hat = conv2d(X)  # Predicted output for input X
        l = (Y_hat - Y) ** 2  # Element-wise squared error
        conv2d.zero_grad()  # Reset the gradients
        l.sum().backward()  # Sum the loss, then backpropagate to get the gradient
        # Updated weight = current weight - learning rate (3e-2) * gradient
        conv2d.weight.data[:] -= 3e-2 * conv2d.weight.grad
        if (i + 1) % 2 == 0:  # Print the loss every two iterations
            print(f'batch {i+1}, loss {l.sum():.3f}')
    print("###########################################################################")
    # The weight tensor of the learned convolution kernel
    print(conv2d.weight.data.reshape((1, 2)))

output:

    tensor([[19., 25.],
            [37., 43.]])
    ###########################################################################
    input matrix
    tensor([[1., 1., 0., 0., 0., 0., 1., 1.],
            [1., 1., 0., 0., 0., 0., 1., 1.],
            [1., 1., 0., 0., 0., 0., 1., 1.],
            [1., 1., 0., 0., 0., 0., 1., 1.],
            [1., 1., 0., 0., 0., 0., 1., 1.],
            [1., 1., 0., 0., 0., 0., 1., 1.]])
    Edge detection result
    tensor([[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
            [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
            [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
            [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
            [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
            [ 0.,  1.,  0.,  0.,  0., -1.,  0.]])
    tensor([[0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.],
            [0., 0., 0., 0., 0.]])
    ###########################################################################
    tensor([[[[1., 1., 0., 0., 0., 0., 1., 1.],
              [1., 1., 0., 0., 0., 0., 1., 1.],
              [1., 1., 0., 0., 0., 0., 1., 1.],
              [1., 1., 0., 0., 0., 0., 1., 1.],
              [1., 1., 0., 0., 0., 0., 1., 1.],
              [1., 1., 0., 0., 0., 0., 1., 1.]]]])
    tensor([[[[ 0.,  1.,  0.,  0.,  0., -1.,  0.],
              [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
              [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
              [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
              [ 0.,  1.,  0.,  0.,  0., -1.,  0.],
              [ 0.,  1.,  0.,  0.,  0., -1.,  0.]]]])
    ###########################################################################
    batch 2, loss 8.089
    batch 4, loss 2.323
    batch 6, loss 0.785
    batch 8, loss 0.294
    batch 10, loss 0.116
    ###########################################################################
    tensor([[ 1.0241, -0.9548]])
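The learned kernel is close to the kernel [1, -1] that generated Y. As a variation, the sketch below runs the same fit with `torch.optim.SGD` instead of the manual weight update. The 100 iterations and the fixed seed are assumptions made here for illustration; with more iterations the learned weights should move closer to [1, -1].

```python
import torch
from torch import nn

torch.manual_seed(0)  # assumption: fixed seed so the run is repeatable

# Same input as above: white (1) borders with a black (0) band in the middle.
X = torch.ones((6, 8))
X[:, 2:6] = 0
K = torch.tensor([[1.0, -1.0]])

# Target: cross-correlation of X with the 1x2 kernel K, written out directly.
Y = X[:, :-1] * K[0, 0] + X[:, 1:] * K[0, 1]

conv2d = nn.Conv2d(1, 1, kernel_size=(1, 2), bias=False)
X4 = X.reshape((1, 1, 6, 8))
Y4 = Y.reshape((1, 1, 6, 7))

optimizer = torch.optim.SGD(conv2d.parameters(), lr=3e-2)
for _ in range(100):
    optimizer.zero_grad()
    loss = ((conv2d(X4) - Y4) ** 2).sum()   # summed squared error
    loss.backward()
    optimizer.step()

learned = conv2d.weight.detach().reshape((1, 2))
print(learned)
```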