Convolutional neural network (LeNet) - [torch learning notes]


Quoted and translated from: Dive into Deep Learning

We are now ready to put all of the tools together and deploy a first fully functional convolutional neural network. When we first encountered image data, we applied a multilayer perceptron to the pictures of clothing in the Fashion-MNIST dataset. Each picture in Fashion-MNIST is a 28 × 28 two-dimensional grid of pixels. To make these data fit a multilayer perceptron, which expects its input as a one-dimensional fixed-length vector, we first flattened each picture into a vector of length 784 and then processed it with a series of fully connected layers.
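
As a reminder of what that flattening step looks like, here is a minimal sketch. The batch of 4 random images is a stand-in chosen for illustration, not real Fashion-MNIST data.

```python
import torch

# Stand-in minibatch: 4 single-channel 28 x 28 images (random values, not real data).
images = torch.randn(4, 1, 28, 28)

# Keep the batch dimension and merge the rest: 1 * 28 * 28 = 784 features per image.
flat = images.reshape(images.shape[0], -1)

print(flat.shape)  # torch.Size([4, 784])
```

After this step every pixel is just one entry in a long vector, so the network no longer knows which pixels were neighbours in the image.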

Now that we have introduced convolutional layers, we can keep the image in its original spatial layout and process it with a series of successive convolutional layers. Moreover, because a convolutional layer reuses the same small kernel at every position, we also enjoy considerable savings in the number of parameters required.
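
To get a rough feel for those savings, the sketch below counts weights and biases for one fully connected layer and one convolutional layer. The helper names dense_params and conv_params are made up for this illustration, and the layer sizes (784 inputs to 120 outputs, 1 input channel to 6 output channels with a 5 × 5 kernel) are borrowed from elsewhere in this section as an assumption about what a fair comparison looks like.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def conv_params(c_in, c_out, k):
    """Weights plus biases of a 2D convolution with a k x k kernel."""
    return c_out * c_in * k * k + c_out


dense = dense_params(28 * 28, 120)  # one dense layer on the flattened image
conv = conv_params(1, 6, 5)         # one 5 x 5 convolution, 1 -> 6 channels

print(dense, conv)  # 94200 156
```

The two layers do not compute the same thing, so this is only an order-of-magnitude comparison, but it shows why convolution is cheap in parameters.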

In this section, we introduce one of the first published convolutional neural networks, LeNet-5. Its benefits were first demonstrated by Yann LeCun (then a researcher at AT&T Bell Laboratories) on the task of recognizing handwritten digits in images. In the 1990s, the experiments that LeCun and his colleagues ran with LeNet gave the first compelling evidence that convolutional neural networks can be trained by backpropagation. Their model achieved outstanding results at the time (matched only by support vector machines) and was adopted to recognize the digits on deposits in ATM machines. Some ATMs still run the code that Yann and his colleague Léon Bottou wrote in the 1990s.

1, LeNet

Roughly speaking, LeNet consists of two parts: (i) a block of convolutional layers; and (ii) a block of fully connected layers. Before diving into the details, let's briefly review the overall model.


Figure: Data flow in LeNet-5. The input is a handwritten digit, and the output is a probability over 10 possible outcomes.


The basic unit of the convolutional block is a convolutional layer followed by an average pooling layer (note that max pooling works better, but it had not been invented in the 1990s). The convolutional layer recognizes spatial patterns in the image, such as lines and parts of objects, and the subsequent average pooling layer reduces the dimensionality. The convolutional block consists of repeated stacks of these two basic units. Each convolutional layer uses a 5 × 5 kernel and applies a sigmoid activation function to each output (again, ReLU is now known to work more reliably, but it had not been invented at the time). The first convolutional layer has 6 output channels, and the second increases the channel depth further to 16.

However, this increase in the number of channels coincides with a substantial reduction in height and width. Increasing the number of output channels therefore makes the parameter sizes of the two convolutional layers similar. Both average pooling layers use a 2 × 2 window with a stride of 2 (so the windows do not overlap). In other words, each pooling layer downsamples the representation to exactly one quarter of its size before pooling.
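
The one-quarter claim is easy to check on a tiny input. The sketch below applies nn.AvgPool2d with the same window and stride to a hand-made 4 × 4 map; the input values are chosen only so the averages are easy to verify by hand.

```python
import torch
import torch.nn as nn

pool = nn.AvgPool2d(kernel_size=2, stride=2)

# One example, one channel, a 4 x 4 map holding the values 0..15 row by row.
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)
y = pool(x)

print(y.shape)  # torch.Size([1, 1, 2, 2])
print(y)        # each entry is the mean of one non-overlapping 2 x 2 window
```

The top-left window holds 0, 1, 4 and 5, so its average is 2.5. The output has 4 entries where the input had 16.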

The convolutional block emits an output whose shape is (batch size, channels, height, width). Before we can pass this output to the fully connected block, we must flatten each example in the minibatch. In other words, we transform this 4D input into the 2D input expected by fully connected layers: as a reminder, the first dimension indexes the examples in the minibatch, and the second gives the flat vector representation of each example. LeNet's fully connected block has three fully connected layers, with 120, 84, and 10 outputs respectively. Because we are still performing classification, the 10-dimensional output layer corresponds to the number of possible output classes.
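
The hand-off between the two blocks can be sketched in isolation. This snippet uses PyTorch's built-in nn.Flatten rather than a hand-written module, and the random tensor is only a stand-in with the shape the convolutional block would produce (16 channels of 5 × 5 maps).

```python
import torch
import torch.nn as nn

# Stand-in for the convolutional block's output: 2 examples, 16 channels, 5 x 5 maps.
conv_out = torch.randn(2, 16, 5, 5)

flatten = nn.Flatten()              # keeps dim 0 (the batch) and merges the rest
dense = nn.Linear(16 * 5 * 5, 120)  # first fully connected layer

flat = flatten(conv_out)            # shape (2, 400)
hidden = dense(flat)                # shape (2, 120)

print(flat.shape, hidden.shape)
```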

Although it may take some work to truly understand what is going on inside LeNet, you can see below that implementing it in a modern deep learning library is remarkably simple. Again, we rely on the Sequential class.

import sys
sys.path.insert(0, '..')  # make the local d2l package importable
import time

import torch
import torch.nn as nn
import torch.optim as optim

import d2l


class Flatten(torch.nn.Module):
    """Flatten (batch, channel, height, width) into (batch, features)."""
    def forward(self, x):
        return x.view(x.shape[0], -1)


class Reshape(torch.nn.Module):
    """Reshape the input into a batch of single-channel 28 x 28 images."""
    def forward(self, x):
        return x.view(-1, 1, 28, 28)


net = torch.nn.Sequential(
    Reshape(),
    nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, padding=2),
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5),
    nn.Sigmoid(),
    nn.AvgPool2d(kernel_size=2, stride=2),
    Flatten(),
    nn.Linear(in_features=16*5*5, out_features=120),
    nn.Sigmoid(),
    nn.Linear(120, 84),
    nn.Sigmoid(),
    nn.Linear(84, 10)
)

Compared with the original network, we took the liberty of replacing the Gaussian activation in the last layer with an ordinary linear layer, which is often considerably more convenient to train. Apart from that, this network matches the historical definition of LeNet-5. Next, we feed a single-channel 28 × 28 example through the network and run the forward computation layer by layer, printing the output shape at each layer to make sure we understand what is happening.

X = torch.randn(size=(1,1,28,28), dtype = torch.float32)
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape: \t',X.shape)
Reshape output shape: 	 torch.Size([1, 1, 28, 28])
Conv2d output shape: 	 torch.Size([1, 6, 28, 28])
Sigmoid output shape: 	 torch.Size([1, 6, 28, 28])
AvgPool2d output shape: 	 torch.Size([1, 6, 14, 14])
Conv2d output shape: 	 torch.Size([1, 16, 10, 10])
Sigmoid output shape: 	 torch.Size([1, 16, 10, 10])
AvgPool2d output shape: 	 torch.Size([1, 16, 5, 5])
Flatten output shape: 	 torch.Size([1, 400])
Linear output shape: 	 torch.Size([1, 120])
Sigmoid output shape: 	 torch.Size([1, 120])
Linear output shape: 	 torch.Size([1, 84])
Sigmoid output shape: 	 torch.Size([1, 84])
Linear output shape: 	 torch.Size([1, 10])

Note that throughout the convolutional block, the height and width of the representation never grow from one layer to the next. Each convolutional layer uses a kernel with a height and width of 5. The first convolutional layer pads the input with 2 pixels on every side, which exactly offsets the kernel, so height and width stay at 28; the second uses no padding, so height and width shrink by 4 pixels (from 14 to 10). In addition, each pooling layer halves the height and width. Meanwhile, as we go up the stack of layers, the number of channels increases, from 1 in the input to 6 after the first convolutional layer and 16 after the second. The fully connected layers then reduce the dimensionality step by step until they emit an output whose size matches the number of image classes.
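
The heights and widths in the printout can be reproduced with the usual output-size rule for convolution and pooling, floor((n + 2p - k) / s) + 1, where n is the input size, k the kernel size, p the padding and s the stride. The helper name out_size below is made up for this sketch.

```python
def out_size(n, k, p=0, s=1):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * p - k) // s + 1


h = 28
h = out_size(h, k=5, p=2)   # first convolution:  28 -> 28
h = out_size(h, k=2, s=2)   # first pooling:      28 -> 14
h = out_size(h, k=5)        # second convolution: 14 -> 10
h = out_size(h, k=2, s=2)   # second pooling:     10 -> 5

features = 16 * h * h       # 16 channels of 5 x 5 maps
print(h, features)          # 5 400
```

The final value, 400, is the in_features of the first fully connected layer.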


2, Data acquisition and training

Now that we have implemented the model, we might as well run some experiments to see what LeNet can accomplish. While training LeNet on the original MNIST OCR dataset might serve a nostalgic purpose, that dataset has become too easy: even an MLP exceeds 98% accuracy on it, so it would be hard to see the benefit of convolutional networks. We therefore stick with Fashion-MNIST as our dataset: it has the same shape (28 × 28 images) but is noticeably more challenging.

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)

Although convolutional networks may have few parameters, they can still be considerably more expensive to compute than a similarly deep multilayer perceptron. If you have access to a GPU, this is a good time to put it to use to speed up training.

Here is a simple function that detects whether a GPU is available. It calls torch.cuda.is_available(): if a GPU is available, it returns the device cuda:0; otherwise it falls back to the CPU.

def try_gpu():
    """Return torch.device('cuda:0') if a GPU is available, otherwise torch.device('cpu')."""
    if torch.cuda.is_available():
        device = torch.device('cuda:0')
    else:
        device = torch.device('cpu')
    return device

device = try_gpu()
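
Once a device has been chosen, both the model and the data have to be placed on it before they can be combined. The following is a small sketch of that pattern. It repeats the device selection inline so that it runs on its own, and the tiny linear layer is only a placeholder for a real network.

```python
import torch
import torch.nn as nn

# Same selection rule as try_gpu, written inline so the snippet is self-contained.
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

layer = nn.Linear(3, 1).to(device)  # Module.to moves the parameters and returns the module
x = torch.ones(2, 3).to(device)     # Tensor.to returns a tensor on the target device
out = layer(x)                      # works because both live on the same device

print(out.device)
```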

For evaluation, we need to make some changes to the evaluate_accuracy function that we described when implementing softmax from scratch (chapter_softmax_scratch). Since the complete dataset lives on the CPU, we need to copy each batch to the GPU before we can compute with our model. This is done with the .to(device) call described in chapter_use_gpu. Note that we accumulate the results on the device where the data ends up (in acc_sum). This avoids intermediate copy operations that could hurt performance.

# This function has been saved in the d2l package for future use. It will be improved step by step; its full implementation is discussed in the "image augmentation" section.
def evaluate_accuracy(data_iter, net, device=torch.device('cpu')):
    """Evaluate accuracy of a model on the given data set."""
    # Accumulate the number of correct predictions on the target device.
    acc_sum, n = torch.tensor([0], dtype=torch.float32, device=device), 0
    for X, y in data_iter:
        # Copy the minibatch to the device the model lives on.
        X, y = X.to(device), y.to(device)
        with torch.no_grad():
            y = y.long()
            acc_sum += torch.sum((torch.argmax(net(X), dim=1) == y))
            n += y.shape[0]
    return acc_sum.item() / n
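
The core of the accuracy computation is the argmax comparison. The sketch below runs it on four hand-written score vectors, so no model or dataset is needed; the numbers are invented for illustration.

```python
import torch

# Made-up scores for 4 examples and 3 classes, with made-up true labels.
scores = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.9, 0.2, 0.4],
                       [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 1, 2, 2])

pred = torch.argmax(scores, dim=1)   # predicted class per example: 0, 1, 0, 2
correct = (pred == labels).sum()     # number of matches, still a tensor
acc = correct.item() / labels.shape[0]

print(acc)  # 0.75
```

Calling .item() only once at the end, as evaluate_accuracy does, keeps the running count on the device until the final result is needed.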

We also need to update our training function to deal with GPUs. Unlike train_ch3 defined in chapter_softmax_scratch, we now need to move each batch of data to our designated device (hopefully the GPU) before running the forward and backward passes.

# This function has been saved in the d2l package for future use
def train_ch5(net, train_iter, test_iter, criterion, num_epochs, batch_size, device, lr=None):
    """Train and evaluate a model with CPU or GPU."""
    print('training on', device)
    optimizer = optim.SGD(net.parameters(), lr=lr)
    for epoch in range(num_epochs):
        train_l_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        train_acc_sum = torch.tensor([0.0], dtype=torch.float32, device=device)
        n, start = 0, time.time()
        for X, y in train_iter:
            # Move the minibatch to the device before the forward pass.
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = net(X)
            loss = criterion(y_hat, y)
            # Backpropagate and take one SGD step.
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                y = y.long()
                # criterion returns the batch mean, so train_l_sum / n is smaller
                # than the per-example loss by roughly a factor of batch_size.
                train_l_sum += loss.float()
                train_acc_sum += (torch.sum((torch.argmax(y_hat, dim=1) == y))).float()
                n += y.shape[0]
        test_acc = evaluate_accuracy(test_iter, net, device)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, '
              'time %.1f sec'
              % (epoch + 1, train_l_sum.item() / n, train_acc_sum.item() / n,
                 test_acc, time.time() - start))

We initialize the model parameters on the device indicated by device, this time using the Xavier initializer. The loss function and training algorithm are still the cross-entropy loss and minibatch stochastic gradient descent.

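lr, num_epochs = 0.9, 5

def init_weights(m):
    # Xavier initialization for linear and convolutional layers
    # (the uniform variant is assumed here).
    if type(m) == nn.Linear or type(m) == nn.Conv2d:
        torch.nn.init.xavier_uniform_(m.weight)

net.apply(init_weights)
net = net.to(device)

criterion = nn.CrossEntropyLoss()
train_ch5(net, train_iter, test_iter, criterion, num_epochs, batch_size, device, lr)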
training on cpu
epoch 1, loss 0.0091, train acc 0.103, test acc 0.100, time 60.6 sec
epoch 2, loss 0.0055, train acc 0.446, test acc 0.637, time 59.9 sec
epoch 3, loss 0.0032, train acc 0.677, test acc 0.714, time 56.4 sec
epoch 4, loss 0.0026, train acc 0.734, test acc 0.756, time 57.6 sec
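
The loss values in this log look small for a 10-class problem. One likely explanation, worth checking against your own run, is that nn.CrossEntropyLoss averages over the batch by default, while train_ch5 adds up these batch means and then divides by the number of examples. The sketch below shows the default behaviour on a made-up batch in which every class gets the same score, which is roughly the situation at chance-level accuracy.

```python
import math

import torch
import torch.nn as nn

# Made-up batch: 4 examples with identical scores for all 10 classes.
scores = torch.zeros(4, 10)
labels = torch.tensor([0, 3, 5, 9])

mean_loss = nn.CrossEntropyLoss()(scores, labels)                # default reduction='mean'
sum_loss = nn.CrossEntropyLoss(reduction='sum')(scores, labels)  # total over the batch

print(mean_loss.item(), sum_loss.item(), math.log(10))
```

With uniform predictions the per-example loss is ln(10), about 2.30. Dividing that by a batch size of 256 gives about 0.009, which is close to the value reported for the first epoch.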

3, Summary

  • A convolutional neural network (ConvNet) is a network that uses convolutional layers.
  • In a ConvNet, we interleave convolutions, nonlinearities, and (often) pooling operations.
  • The spatial resolution is reduced step by step before the output is emitted through one (or more) dense layers.
  • LeNet was the first successful deployment of such a network.

4, Exercises

1. Replace average pooling with max pooling. What happens?
2. Try to build a more complex network based on LeNet to improve its accuracy.

  • Resize the convolution window.
  • Adjust the number of output channels.
  • Adjust the activation function (ReLU?).
  • Adjust the number of convolutions.
  • Adjust the number of fully connected layers.
  • Adjust learning rates and other training details (initialization, epochs, etc.).

3. Try the improved network on the original MNIST dataset.

4. Display the activations of the first and second layers of LeNet for different inputs (e.g. sweaters, coats).
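
As a possible starting point for exercises 1 and 2, the sketch below swaps average pooling for max pooling and sigmoid for ReLU while keeping the layer sizes from this section. It is one variant among many and not a tuned solution; the name variant is made up here, and the snippet only checks that the shapes still line up.

```python
import torch
import torch.nn as nn

variant = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),
)

# Two random single-channel 28 x 28 inputs, just to verify the output shape.
out = variant(torch.randn(2, 1, 28, 28))
print(out.shape)  # torch.Size([2, 10])
```

The learning rate used earlier in this section was chosen for the sigmoid network, so a variant like this may need a different value.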


Posted by lawnninja on Wed, 01 Jun 2022 15:37:39 +0530