Limitations of linear neural networks
Without nonlinear activation functions, a neural network with any number of hidden layers is no different from a single-layer network: the composition of linear maps is still linear, and the problems a linear model can solve are limited.
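This can be checked directly. Below is a minimal PyTorch sketch (layer sizes chosen arbitrarily) showing that two stacked linear layers with no activation in between can be merged into a single equivalent linear layer.

import torch
import torch.nn as nn

x = torch.randn(4, 10)                 # a batch of 4 samples with 10 features

# Two stacked linear layers WITHOUT an activation in between
fc1 = nn.Linear(10, 20)
fc2 = nn.Linear(20, 5)
y_two_layers = fc2(fc1(x))

# One equivalent single linear layer: W = W2 @ W1, b = W2 @ b1 + b2
merged = nn.Linear(10, 5)
with torch.no_grad():
    merged.weight.copy_(fc2.weight @ fc1.weight)
    merged.bias.copy_(fc2.weight @ fc1.bias + fc2.bias)

print(torch.allclose(y_two_layers, merged(x), atol=1e-5))  # True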
Types of neural networks
- Basic neural networks: linear neural network, BP neural network, Hopfield neural network, etc.
- Advanced neural networks: Boltzmann machine, restricted Boltzmann machine, recursive neural network, etc.
- Deep neural networks: deep belief network, convolutional neural network, recurrent neural network, LSTM network, etc.
Convolutional neural network
A traditional multi-layer neural network has only an input layer, hidden layers, and an output layer, and the number of hidden layers is chosen as needed; there is no clear theoretical result saying how many layers are appropriate. A convolutional neural network (CNN) adds a more effective feature-learning part on top of the original multi-layer network: partially connected convolution layers and pooling layers are inserted in front of the original fully connected layers. The emergence of CNNs made it possible to deepen networks and realize deep learning. Generally speaking, "deep learning" refers to these new structures such as CNNs together with new methods (such as the ReLU activation function) that solve some of the difficult problems of traditional multi-layer neural networks.
Three structures of convolutional neural network
The basic composition of a neural network includes the input layer, hidden layers, and output layer. The characteristic of a convolutional neural network is that the hidden part is divided into convolution layers, pooling layers (also called down-sampling layers), and activation layers. The role of each layer:
- Convolution layer: extracts features by sliding over the original image
- Activation layer: adds nonlinear separation capability
- Pooling layer: reduces the number of learned parameters and the network complexity by down-sampling the features (max pooling and average pooling)
To achieve classification, there is also a fully connected (FC) layer, i.e. the final output layer, used for loss calculation and classification.
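To make this structure concrete, here is a minimal sketch (channel counts and sizes are arbitrary, not taken from the text) that chains a convolution layer, an activation layer, a pooling layer, and a fully connected output layer:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),   # convolution layer: extracts features
    nn.ReLU(),                        # activation layer: adds nonlinearity
    nn.MaxPool2d(2),                  # pooling layer: down-samples the feature map
    nn.Flatten(),                     # flatten before the fully connected layer
    nn.Linear(8 * 15 * 15, 10)        # fully connected (FC) output layer
)

x = torch.randn(1, 1, 32, 32)         # one 32x32 single-channel image
print(model(x).shape)                 # torch.Size([1, 10])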
Convolution layer
Each convolution layer in a convolutional neural network consists of several convolution units (convolution kernels), and the parameters of each unit are optimized by the back-propagation algorithm.
The purpose of the convolution operation is to extract different features of the input. The first convolution layer may only extract low-level features such as edges, lines, and corners; networks with more layers can iteratively extract more complex features from these low-level features.
Four elements of a convolution kernel
- Number of convolution kernels
- Convolution kernel size
- Convolution kernel stride
- Convolution kernel zero-padding size
Next we explain through a worked example, assuming the image is grayscale (only one channel) and is given as a table of pixel values.
How convolution is calculated - size
A convolution kernel can be understood as an observer: it looks at a region of the image with several weights plus a bias and performs a weighted feature operation, with the bias added to the weighted sum.
Common convolution kernel sizes are 1*1, 3*3 and 5*5; researchers have found these sizes to work well in practice. After one observation the kernel produces a single result.
So what if this observer wants to see every pixel of the image? That is where the stride comes in.
How convolution is calculated - stride
The convolution kernel needs to move across the image in order to observe all of it; the parameter that controls this movement is the stride.
Assuming the kernel moves one pixel at a time, the final observation result is as follows:
For a 5x5 image, a 3x3 convolution kernel with stride 1 yields a 3x3 observation result.
If it moves with a stride of 2, this is the result:
For a 5x5 image, a 3x3 convolution kernel with stride 2 yields a 2x2 observation result.
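These two results are easy to verify in PyTorch; the sketch below only checks output shapes, so the random kernel weights do not matter:

import torch
import torch.nn as nn

img = torch.randn(1, 1, 5, 5)                        # one 5x5 single-channel image

conv_s1 = nn.Conv2d(1, 1, kernel_size=3, stride=1)
conv_s2 = nn.Conv2d(1, 1, kernel_size=3, stride=2)

print(conv_s1(img).shape)   # torch.Size([1, 1, 3, 3]) -> 3x3 observation result
print(conv_s2(img).shape)   # torch.Size([1, 1, 2, 2]) -> 2x2 observation result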
How convolution is calculated - number of convolution kernels
If more than one observer (convolution kernel) looks at a given layer, multiple observation results are obtained.
Different convolution kernels carry different weights and biases, i.e. randomly initialized parameters.
So far we have concluded that the output size depends on the kernel size and the stride, but is that all? There is also zero padding: for some combinations of window size and stride, the Filter observation window would move beyond the pixel width of the image!
How convolution is calculated - zero-padding size
Zero padding fills a border of pixels with value 0 around the image.
There are two modes, SAME and VALID:
- SAME: sampling may cross the edge, and the output has the same pixel width as the input image.
- VALID: sampling does not cross the edge, and the output is smaller than the pixel width of the input image.
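As a sketch of the two modes: recent PyTorch versions (1.9+) let nn.Conv2d take the strings 'same' and 'valid' as the padding argument (for 'same' the stride must be 1), which mirrors the SAME/VALID terminology above.

import torch
import torch.nn as nn

img = torch.randn(1, 1, 5, 5)

# SAME: pad so the output keeps the input's spatial size (stride must be 1 in PyTorch)
conv_same = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding='same')
# VALID: no padding, so the output shrinks
conv_valid = nn.Conv2d(1, 1, kernel_size=3, stride=1, padding='valid')

print(conv_same(img).shape)    # torch.Size([1, 1, 5, 5])
print(conv_valid(img).shape)   # torch.Size([1, 1, 3, 3])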
Output size calculation formula
How much zero padding is needed in the end? We do not need to worry about that directly; we use the known conditions and the output-size formula to solve for it.
If the input volume is H1 x W1 x D1 and there are K filters of size F x F with stride S and zero padding P, the output volume H2 x W2 x D2 is:
H2 = (H1 - F + 2P)/S + 1
W2 = (W1 - F + 2P)/S + 1
D2 = K
Let us understand this formula through two examples.
Calculation cases:
1. Known conditions: input image 32*32*1, 50 Filters of size 5*5, stride 1, zero padding 1. What is the output size?
H2 = (H1 - F + 2P)/S + 1 = (32 - 5 + 2*1)/1 + 1 = 30
W2 = (W1 - F + 2P)/S + 1 = (32 - 5 + 2*1)/1 + 1 = 30
D2 = K = 50
So the output size is [30, 30, 50].
2. Known conditions: input image 32*32*1, 50 Filters of size 3*3, stride 1, zero padding unknown, output size 32*32. What is the zero padding?
H2 = (H1 - F + 2P)/S + 1 = (32 - 3 + 2P)/1 + 1 = 32
W2 = (W1 - F + 2P)/S + 1 = (32 - 3 + 2P)/1 + 1 = 32
So the zero padding size is 1*1.
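The same arithmetic can be checked with a small helper function (conv_output_size is a hypothetical name used only for illustration):

def conv_output_size(H1, F, P, S):
    """H2 = (H1 - F + 2P) / S + 1"""
    return (H1 - F + 2 * P) // S + 1

# Case 1: 32x32x1 input, 50 filters of size 5x5, stride 1, zero padding 1
print(conv_output_size(32, 5, 1, 1))   # 30 -> output is 30x30x50

# Case 2: 32x32x1 input, 3x3 filters, stride 1, output must stay 32x32
# (32 - 3 + 2P)/1 + 1 = 32  =>  P = 1
for P in range(3):
    if conv_output_size(32, 3, P, 1) == 32:
        print("required zero padding:", P)   # required zero padding: 1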
How to observe multi-channel images
If the image is a color image, there are three channel tables: R, G and B. Originally each observer needed a single 3x3 (or other size) convolution kernel; now it needs three 3x3 weight tables plus one bias, 27 weights in total. In the end each observer still produces a single result.
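This can be confirmed from the shape of a convolution layer's parameters; the sketch below uses one output channel so that it corresponds to a single observer:

import torch.nn as nn

# One 3x3 kernel over a 3-channel (RGB) input: 3 x 3 x 3 = 27 weights plus 1 bias
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=3)

print(conv.weight.shape)    # torch.Size([1, 3, 3, 3])
print(conv.weight.numel())  # 27 weights
print(conv.bias.shape)      # torch.Size([1]) -> one bias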
Activation function
Convolutional networks adopt activation functions. As networks developed, it was found that the original sigmoid activation function could no longer achieve good results, so a new activation function was adopted.
- ReLU: f(x) = max(0, x)
What is the effect?
ReLU benefits
- Effectively alleviates the vanishing gradient problem
- Computation is very fast: it only needs to check whether the input is greater than 0, and SGD (stochastic gradient descent) converges much faster than with sigmoid or tanh
sigmoid disadvantages
- Functions such as sigmoid require a relatively large amount of computation (exponentials), while the ReLU activation function saves a great deal of computation across the whole network
- In deep networks, sigmoid tends to cause vanishing gradients during back-propagation
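A small sketch of the gradient behaviour behind these points (the three input values are arbitrary): sigmoid gradients are at most 0.25 and shrink toward 0 for large inputs, while ReLU passes the gradient through unchanged wherever the input is positive.

import torch

x = torch.tensor([-5.0, 0.5, 5.0], requires_grad=True)
torch.sigmoid(x).sum().backward()
print(x.grad)    # roughly [0.0066, 0.2350, 0.0066] -> small, and at most 0.25

y = torch.tensor([-5.0, 0.5, 5.0], requires_grad=True)
torch.relu(y).sum().backward()
print(y.grad)    # [0., 1., 1.] -> exactly 1 wherever the input is positive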
Pooling layer
The main function of the pooling layer is down-sampling: it further reduces the number of parameters by removing unimportant samples from the feature map. There are many pooling methods; max pooling is the most commonly used.
- max_pooling: take the maximum value in the pooling window
- avg_pooling: take the average value in the pooling window
Pooling layer calculation
The pooling layer also has a window size and a stride. How is the output size computed? The formula is the same as for convolution.
Example: input 224x224x64, window 2, stride 2. What is the output?
H2 = (224 - 2 + 2*0)/2 + 1 = 112
W2 = (224 - 2 + 2*0)/2 + 1 = 112
The depth is unchanged, so the output is 112x112x64.
In general, the pooling layer uses a 2x2 window with a stride of 2.
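A quick shape check of the 224x224x64 example with the usual 2x2 window and stride 2 (random values, only the shapes matter):

import torch
import torch.nn as nn

feature_map = torch.randn(1, 64, 224, 224)       # a 224x224x64 feature map

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(max_pool(feature_map).shape)               # torch.Size([1, 64, 112, 112])

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(avg_pool(feature_map).shape)               # torch.Size([1, 64, 112, 112])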
BN layer
Objective: improve the generalization ability of the network and prevent over-fitting.
BN (Batch Normalization) is also a layer of the network, also known as the normalization layer. Benefits of using BN include being able to use a larger learning rate, and it has become a standard component of CNNs.
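A minimal sketch of a BN layer in PyTorch (the channel count and batch size here are arbitrary):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)            # one BN layer for a 16-channel feature map
x = torch.randn(8, 16, 28, 28)     # a mini-batch of 8 feature maps
out = bn(x)

# After BN, each channel of the batch has (approximately) zero mean and unit variance
print(out.mean(dim=(0, 2, 3)))     # values close to 0
print(out.var(dim=(0, 2, 3)))      # values close to 1
print(out.shape)                   # shape is unchanged: torch.Size([8, 16, 28, 28])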
Fully connected layer
The preceding convolution and pooling layers are equivalent to feature engineering; the final fully connected layer acts as the "classifier" of the whole convolutional neural network.
Different optimized versions of gradient descent
The simplest optimization algorithm is SGD. Gradient descent is effective, but it also has some problems: it is difficult to choose a reasonable learning rate, and it easily gets stuck at suboptimal local extreme points.
Extended content (for understanding):
- SGD with Momentum. Gradient update rule: momentum adds inertia to the gradient descent process, so that updates are faster along dimensions where the gradient direction stays the same and slower along dimensions where it keeps changing, which accelerates convergence and reduces oscillation.
- RMSProp. Gradient update rule: it addresses the sharply decaying learning rate of Adagrad by changing the way the second-order momentum is computed, using an exponentially weighted moving average over a sliding window.
- Adam. Gradient update rule: Adam = Adaptive + Momentum, as the name suggests; Adam combines the first-order momentum of SGD with Momentum and the second-order momentum of RMSProp.
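For reference, these optimizers are all available in torch.optim; the sketch below only shows how they are instantiated (the model and learning rates are placeholders):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)    # any model's parameters would do here

# SGD with Momentum: adds inertia to plain SGD
opt_momentum = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# RMSProp: second-order momentum via an exponentially weighted moving average (alpha is the decay)
opt_rmsprop = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99)

# Adam: combines first-order momentum and RMSProp-style second-order momentum (betas control both)
opt_adam = optim.Adam(model.parameters(), lr=0.01, betas=(0.9, 0.999))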
Development history of convolutional neural networks
Convolutional networks for other purposes
Image object detection:
- YOLO: GoogLeNet + bounding boxes
- SSD: VGG + region proposals
Simple CNN construction
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        # Define the neural network structure; input data is 1x32x32
        super(Net, self).__init__()
        # First layer (convolution layer)
        self.conv1 = nn.Conv2d(1, 6, 3)       # input channels 1, output channels 6, 3x3 convolution
        # Second layer (convolution layer)
        self.conv2 = nn.Conv2d(6, 16, 3)      # input channels 6, output channels 16, 3x3 convolution
        # Third layer (fully connected layer)
        self.fc1 = nn.Linear(16*28*28, 512)   # input dimension 16x28x28=12544, output dimension 512
        # Fourth layer (fully connected layer)
        self.fc2 = nn.Linear(512, 64)         # input dimension 512, output dimension 64
        # Fifth layer (fully connected layer)
        self.fc3 = nn.Linear(64, 2)           # input dimension 64, output dimension 2

    def forward(self, x):
        # Define the data flow
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = x.view(-1, 16*28*28)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x
net = Net()
print(net)
Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=12544, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=2, bias=True)
)
# Generate random input
input_data = torch.randn(1, 1, 32, 32)
print(input_data)
print(input_data.size())
tensor([[[[ 0.3055, -0.8828,  0.1044,  ..., -0.4833,  1.1879, -0.0727],
          [ 0.2718, -1.5784, -1.0362,  ..., -0.5160,  0.4685, -0.5401],
          [ 2.4876,  0.1718,  1.2377,  ..., -0.6047, -0.7236,  0.3888],
          ...,
          [-0.8249, -0.3313, -0.3513,  ...,  0.2470, -0.6509, -0.9969],
          [ 1.0528,  0.0348,  0.6416,  ..., -0.4129, -0.1997,  0.1648],
          [ 1.5184,  0.0120, -2.3959,  ..., -1.3124, -0.4289, -0.2882]]]])
torch.Size([1, 1, 32, 32])
# Run the neural network
out = net(input_data)
print(out)
print(out.size())
tensor([[-0.0375, -0.0235]], grad_fn=<AddmmBackward>)
torch.Size([1, 2])
# Randomly generate a target (ground-truth) value
target = torch.randn(2)
target = target.view(1, -1)
print(target)
tensor([[-2.1838, -0.4858]])
criterion = nn.L1Loss()        # Define the loss function
loss = criterion(out, target)  # Calculate the loss
print(loss)
tensor(1.3043, grad_fn=<L1LossBackward>)
# Backward pass
net.zero_grad()   # Clear the gradients
loss.backward()   # Automatically compute gradients and back-propagate
import torch.optim as optim
optimizer = optim.SGD(net.parameters(), lr=0.01)
optimizer.step()
out = net(input_data)
print(out)
print(out.size())
tensor([[-0.0946, -0.0601]], grad_fn=<AddmmBackward>)
torch.Size([1, 2])
- After this weight update, the second forward pass produces a smaller loss than the first one.
criterion = nn.L1Loss()        # Define the loss function (MAE)
loss = criterion(out, target)  # Calculate the loss
print(loss)
tensor(1.2574, grad_fn=<L1LossBackward>)
PyTorch + CNN handwritten digit recognition
- Import packages and load the MNIST dataset
import torch
import torchvision.datasets as dataset
import torchvision.transforms as transforms
import torch.utils.data as data_utils

# data
train_data = dataset.MNIST(root="mnist",
                           train=True,
                           transform=transforms.ToTensor(),
                           download=True)

test_data = dataset.MNIST(root="mnist",
                          train=False,
                          transform=transforms.ToTensor(),
                          download=False)

# batchsize
train_loader = data_utils.DataLoader(dataset=train_data,
                                     batch_size=64,
                                     shuffle=True)

test_loader = data_utils.DataLoader(dataset=test_data,
                                    batch_size=64,
                                    shuffle=True)
- Build the CNN
class CNN(torch.nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, kernel_size=5, padding=2),
            torch.nn.BatchNorm2d(32),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2)
        )
        self.fc = torch.nn.Linear(14 * 14 * 32, 10)

    def forward(self, x):
        out = self.conv(x)
        out = out.view(out.size()[0], -1)
        out = self.fc(out)
        return out
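To see where the 14 * 14 * 32 input size of self.fc comes from, here is a quick shape check (a sketch, assuming the CNN class above has just been defined): a 5x5 kernel with padding 2 keeps the 28x28 MNIST image size, and MaxPool2d(2) halves it to 14x14 with 32 channels.

import torch

cnn = CNN()
x = torch.randn(1, 1, 28, 28)    # one MNIST-sized image
feat = cnn.conv(x)
print(feat.shape)                # torch.Size([1, 32, 14, 14]) -> 14 * 14 * 32 features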
- Train and save the model, then load it and test
class CNN(torch.nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, kernel_size=5, padding=2),
            torch.nn.BatchNorm2d(32),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(2)
        )
        self.fc = torch.nn.Linear(14 * 14 * 32, 10)

    def forward(self, x):
        out = self.conv(x)
        out = out.view(out.size()[0], -1)
        out = self.fc(out)
        return out

cnn = CNN()
# cnn = cnn.cuda()

# loss
loss_func = torch.nn.CrossEntropyLoss()

# optimizer
optimizer = torch.optim.Adam(cnn.parameters(), lr=0.01)

# training
for epoch in range(10):
    for i, (images, labels) in enumerate(train_loader):
        # images = images.cuda()
        # labels = labels.cuda()
        outputs = cnn(images)
        loss = loss_func(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        print("epoch is {}, ite is "
              "{}/{}, loss is {}".format(epoch + 1, i,
                                         len(train_data) // 64, loss.item()))

    # eval/test
    loss_test = 0
    accuracy = 0
    for i, (images, labels) in enumerate(test_loader):
        # images = images.cuda()
        # labels = labels.cuda()
        outputs = cnn(images)
        # outputs: batchsize * cls_num, labels: [batchsize]
        loss_test += loss_func(outputs, labels)
        _, pred = outputs.max(1)
        accuracy += (pred == labels).sum().item()

    accuracy = accuracy / len(test_data)
    loss_test = loss_test / (len(test_data) // 64)

    print("epoch is {}, accuracy is {}, "
          "loss test is {}".format(epoch + 1, accuracy, loss_test.item()))

torch.save(cnn, "mnist_model.pkl")
cnn = torch.load("mnist_model.pkl")
# cnn = cnn.cuda()

# eval/test
loss_test = 0
accuracy = 0

import cv2  # pip install opencv-python -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com

for i, (images, labels) in enumerate(test_loader):
    # images = images.cuda()
    # labels = labels.cuda()
    outputs = cnn(images)
    _, pred = outputs.max(1)
    accuracy += (pred == labels).sum().item()

    images = images.cpu().numpy()
    labels = labels.cpu().numpy()
    pred = pred.cpu().numpy()

    # batchsize * 1 * 28 * 28
    for idx in range(images.shape[0]):
        im_data = images[idx]
        im_label = labels[idx]
        im_pred = pred[idx]
        im_data = im_data.transpose(1, 2, 0)

accuracy = accuracy / len(test_data)
print(accuracy)
0.9824
Building a CIFAR-10 image classifier with a CNN
import torch
import torchvision
import torchvision.transforms as transforms
from tqdm import tqdm
# (0.5, 0.5, 0.5), (0.5, 0.5, 0.5): the first tuple is the per-channel RGB mean and the second is the
# per-channel RGB standard deviation; all three channels are set to 0.5
# Standardization / normalization
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
     ]
)

# Training dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=16,
                                          shuffle=True, num_workers=2)

# Test dataset
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=16,
                                         shuffle=False, num_workers=2)
Files already downloaded and verified
Files already downloaded and verified
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

def imshow(img):
    # Input data: torch.tensor [c, h, w]
    img = img / 2 + 0.5
    nping = img.numpy()
    nping = np.transpose(nping, (1, 2, 0))  # [h, w, c]
    plt.imshow(nping)

dataiter = iter(trainloader)      # Load one mini-batch at random
images, labels = next(dataiter)   # use dataiter.next() on older PyTorch versions
imshow(torchvision.utils.make_grid(images))
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        # Define the neural network structure; input data is 3x32x32
        super(Net, self).__init__()
        # First layer (convolution layer)
        self.conv1 = nn.Conv2d(3, 6, 3)       # input channels 3, output channels 6, 3x3 convolution
        # Second layer (convolution layer)
        self.conv2 = nn.Conv2d(6, 16, 3)      # input channels 6, output channels 16, 3x3 convolution
        # Third layer (fully connected layer)
        self.fc1 = nn.Linear(16*28*28, 512)   # input dimension 16x28x28=12544, output dimension 512
        # Fourth layer (fully connected layer)
        self.fc2 = nn.Linear(512, 64)         # input dimension 512, output dimension 64
        # Fifth layer (fully connected layer)
        self.fc3 = nn.Linear(64, 10)          # input dimension 64, output dimension 10

    def forward(self, x):
        # Define the data flow
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = x.view(-1, 16*28*28)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = self.fc3(x)
        return x
net = Net()
print(net)
Net(
  (conv1): Conv2d(3, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=12544, out_features=512, bias=True)
  (fc2): Linear(in_features=512, out_features=64, bias=True)
  (fc3): Linear(in_features=64, out_features=10, bias=True)
)
import torch.optim as optim

criterion = nn.CrossEntropyLoss()  # Cross-entropy loss

# optimizer
optimizer = optim.SGD(net.parameters(), lr=0.0001, momentum=0.9)
# Momentum gradient descent computes an exponentially weighted average of the gradients and uses it to
# update the weights; it almost always runs faster than standard gradient descent.
train_loss_hist = []
test_loss_hist = []

for epoch in range(2):
    for i, data in enumerate(trainloader):
        images, labels = data
        outputs = net(images)
        loss = criterion(outputs, labels)  # Calculate the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if i % 1000 == 0:
            print("Epoch: {} step: {} Loss: {}".format(epoch, i, loss.item()))
Epoch: 0 step: 0 Loss: 2.327638864517212
Epoch: 0 step: 1000 Loss: 2.2910702228546143
Epoch: 0 step: 2000 Loss: 2.303840160369873
Epoch: 0 step: 3000 Loss: 2.252164363861084
Epoch: 1 step: 0 Loss: 2.2408382892608643
Epoch: 1 step: 1000 Loss: 2.0526092052459717
Epoch: 1 step: 2000 Loss: 2.0468878746032715
Epoch: 1 step: 3000 Loss: 2.1996114253997803
train_loss_hist = []
test_loss_hist = []

for epoch in tqdm(range(20)):
    # train
    net.train()
    running_loss = 0.0
    for i, data in enumerate(trainloader):
        images, labels = data
        outputs = net(images)
        loss = criterion(outputs, labels)  # Calculate the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 250 == 0:    # Test every 250 mini-batches
            correct = 0.0
            total = 0.0
            net.eval()
            with torch.no_grad():
                for test_data in testloader:
                    test_images, test_labels = test_data
                    test_outputs = net(test_images)
                    test_loss = criterion(test_outputs, test_labels)
            train_loss_hist.append(running_loss / 250)
            test_loss_hist.append(test_loss.item())
            running_loss = 0.0
100%|██████████| 20/20 [50:48<00:00, 148.08s/it]
plt.figure()
plt.plot(train_loss_hist)
plt.plot(test_loss_hist)
plt.legend(('train loss', 'test loss'))
plt.title('Train/Test Loss')
plt.xlabel('# mini batch *250')
plt.ylabel('Loss')
Text(0,0.5,'Loss')