# Table of contents

1. Process derivation - understand the principle of BP

2. Numerical calculation - manual calculation, grasp the details

3. Code implementation - numpy hand push + pytorch automatic

1. Compare [numpy] and [pytorch] programs, summarize and state.

3. Change the activation function Sigmoid to Relu, observe, summarize and state.

5. Change the loss function MSE to cross entropy, observe, summarize and state.

6. Change the step size, the number of training sessions, observe, summarize and state.

8. The initial value of weights w1-w8 is changed to 0, observe, summarize and state.

# Foreword

I hope the epidemic will pass soon. For the past few days a classmate and I have been collecting all the grade tables and compiling statistics. I really have had no time to write, and I am really tired.

Below are the derivation of the BP algorithm, the numerical calculation, and the code implementation. I have written these up more than once before, but this time I went through them in detail and really got something out of it. The derivation is done step by step, so it may look a little messy, but that is unavoidable.

Finally, I hope the teacher and everyone reading will give me more pointers.

# Question

- Process Derivation - Understanding BP Principles
- Numerical Calculations - Calculate manually, master the details
- Code implementation - numpy hand push + pytorch automatic

For the process derivation and the numerical calculation, any one of the following three forms may be used:

- Write directly on the blog with the editor
- Handwriting on electronic device, screenshot
- Write on paper, take pictures and post pictures

# 1. Process derivation - understand the principle of BP

First, here are my notes and part of the derivation I wrote earlier on the watermelon book; the detailed derivation follows below. You can match the figure in the dandelion book against this one, and I list the correspondence between the variables below. My derivation includes the bias terms (the network above has none), because I expect most networks we meet later to have biases.

I also expect that later networks will have an arbitrary number of nodes per layer, so what matters most is the general-term formula. I therefore wrote every result in general-term form below. It may be a bit sketchy, but the derivation process cannot be skipped.

Now the correspondence between the variables: in the figure above d=2, q=2, l=2, which matches our network. As for the weights, w1, w2, w3, w4 are the input-to-hidden weights, w5, w6, w7, w8 are the hidden-to-output weights, and O1, O2 correspond to y1, y2.

Okay, here is the derivation. It may be a little smudged, but pushing through it step by step really cannot be avoided.
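As a compact reference for the chain rule used in this derivation, here is a sketch for the 2-2-2 network without bias, written with the variable names of the code in section 3. The symbols $\delta_{o_j}$ and $\delta_{h_k}$ are my own shorthand, not notation taken from either book.

$$
E = \tfrac{1}{2}\sum_{j=1}^{2}\left(out_{o_j} - y_j\right)^2,
\qquad
\delta_{o_j} = \frac{\partial E}{\partial in_{o_j}} = \left(out_{o_j} - y_j\right)\, out_{o_j}\left(1 - out_{o_j}\right)
$$

$$
\frac{\partial E}{\partial w_5} = \delta_{o_1}\, out_{h_1},\quad
\frac{\partial E}{\partial w_7} = \delta_{o_1}\, out_{h_2},\quad
\frac{\partial E}{\partial w_6} = \delta_{o_2}\, out_{h_1},\quad
\frac{\partial E}{\partial w_8} = \delta_{o_2}\, out_{h_2}
$$

$$
\delta_{h_1} = \left(\delta_{o_1} w_5 + \delta_{o_2} w_6\right) out_{h_1}\left(1 - out_{h_1}\right),
\qquad
\delta_{h_2} = \left(\delta_{o_1} w_7 + \delta_{o_2} w_8\right) out_{h_2}\left(1 - out_{h_2}\right)
$$

$$
\frac{\partial E}{\partial w_1} = \delta_{h_1} x_1,\quad
\frac{\partial E}{\partial w_3} = \delta_{h_1} x_2,\quad
\frac{\partial E}{\partial w_2} = \delta_{h_2} x_1,\quad
\frac{\partial E}{\partial w_4} = \delta_{h_2} x_2
$$

The factor $out(1 - out)$ is the derivative of the sigmoid expressed through its own output, which is why no extra exponential appears in the backward pass.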

If you feel that the above is not clear enough, take a look at the Pumpkin Book

# 2. Numerical calculation - manual calculation, grasp the details

The numerical calculation is really just practice with the formulas derived above: the data is substituted into them.

The teacher said all eight weights have to be worked out by hand, so I calculated them carefully one by one and wrote them down below.

At the end of each formula I expanded it fully, so the numbers only need to be plugged in, but the arithmetic is really tedious (woohoo).
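To check the hand calculation, the forward pass can be reproduced in a few lines of plain Python. This is only a sketch: the inputs, targets and initial weights are the ones used by the programs in section 3, and the variable names follow that code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Inputs, targets and initial weights as used by the programs in section 3
x1, x2 = 0.5, 0.3
y1, y2 = 0.23, -0.07
w1, w2, w3, w4, w5, w6, w7, w8 = 0.2, -0.4, 0.5, 0.6, 0.1, -0.5, -0.3, 0.8

# Hidden layer
in_h1 = w1 * x1 + w3 * x2
in_h2 = w2 * x1 + w4 * x2
out_h1, out_h2 = sigmoid(in_h1), sigmoid(in_h2)

# Output layer
in_o1 = w5 * out_h1 + w7 * out_h2
in_o2 = w6 * out_h1 + w8 * out_h2
out_o1, out_o2 = sigmoid(in_o1), sigmoid(in_o2)

# Squared-error loss with the factor 1/2
loss = 0.5 * (out_o1 - y1) ** 2 + 0.5 * (out_o2 - y2) ** 2

print(f"in_h1={in_h1:.4f} in_h2={in_h2:.4f}")
print(f"out_h1={out_h1:.4f} out_h2={out_h2:.4f}")
print(f"out_o1={out_o1:.4f} out_o2={out_o2:.4f} loss={loss:.4f}")
```

Printing the intermediate values makes it easy to see at which step a hand calculation went wrong.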

# 3. Code implementation - numpy hand push + pytorch automatic

### 1. Compare [numpy] and [pytorch] programs, summarize and state.

The numpy version (somewhat similar to the numpy implementation of neural network classification iris that I posted before):

```python
import numpy as np
import matplotlib.pyplot as plt


def sigmoid(z):
    return 1 / (1 + np.exp(-z))


def forward_propagate(x1, x2, y1, y2, w1, w2, w3, w4, w5, w6, w7, w8):
    # Forward propagation
    in_h1 = w1 * x1 + w3 * x2
    out_h1 = sigmoid(in_h1)
    in_h2 = w2 * x1 + w4 * x2
    out_h2 = sigmoid(in_h2)

    in_o1 = w5 * out_h1 + w7 * out_h2
    out_o1 = sigmoid(in_o1)
    in_o2 = w6 * out_h1 + w8 * out_h2
    out_o2 = sigmoid(in_o2)

    error = (1 / 2) * (out_o1 - y1) ** 2 + (1 / 2) * (out_o2 - y2) ** 2
    return out_o1, out_o2, out_h1, out_h2, error


def back_propagate(x1, x2, y1, y2, out_o1, out_o2, out_h1, out_h2, w5, w6, w7, w8):
    # Backpropagation. Output deltas: dE/d(in_o) = (out_o - y) * sigmoid'(in_o)
    d_o1 = (out_o1 - y1) * out_o1 * (1 - out_o1)
    d_o2 = (out_o2 - y2) * out_o2 * (1 - out_o2)

    d_w5 = d_o1 * out_h1
    d_w7 = d_o1 * out_h2
    d_w6 = d_o2 * out_h1
    d_w8 = d_o2 * out_h2

    # Hidden deltas: each output delta travels back through the weight
    # that connects the hidden unit to that output
    d_h1 = (d_o1 * w5 + d_o2 * w6) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w7 + d_o2 * w8) * out_h2 * (1 - out_h2)

    d_w1 = d_h1 * x1
    d_w3 = d_h1 * x2
    d_w2 = d_h2 * x1
    d_w4 = d_h2 * x2
    return d_w1, d_w2, d_w3, d_w4, d_w5, d_w6, d_w7, d_w8


def update_w(step, weights, grads):
    # Gradient descent: move every weight against its gradient
    return [w - step * d_w for w, d_w in zip(weights, grads)]


if __name__ == "__main__":
    # Fixed values are used to match the PPT; random values also work
    w1, w2, w3, w4, w5, w6, w7, w8 = 0.2, -0.4, 0.5, 0.6, 0.1, -0.5, -0.3, 0.8
    x1, x2 = 0.5, 0.3  # input values
    # A positive target can be fit exactly, a negative one cannot:
    # the sigmoid output always lies in (0, 1).
    y1, y2 = 0.23, -0.07
    N = 10  # number of iterations
    step = 10  # step size

    print("input values x1, x2:", x1, x2)
    print("target values y1, y2:", y1, y2)
    eli = []  # round index
    lli = []  # loss of each round
    for i in range(N):
        print("===== Round " + str(i) + " =====")
        # Forward propagation
        out_o1, out_o2, out_h1, out_h2, error = forward_propagate(
            x1, x2, y1, y2, w1, w2, w3, w4, w5, w6, w7, w8)
        print("Forward propagation:", round(out_o1, 5), round(out_o2, 5))
        print("Loss function:", round(error, 2))
        # Backpropagation
        grads = back_propagate(x1, x2, y1, y2, out_o1, out_o2, out_h1, out_h2,
                               w5, w6, w7, w8)
        # Gradient descent, update weights
        w1, w2, w3, w4, w5, w6, w7, w8 = update_w(
            step, (w1, w2, w3, w4, w5, w6, w7, w8), grads)
        eli.append(i)
        lli.append(error)

    plt.plot(eli, lli)
    plt.ylabel('Loss')
    plt.xlabel('Round')
    plt.show()
```
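A quick way to confirm hand-written gradients like these is a finite-difference check: nudge each weight slightly, see how the loss changes, and compare with the chain-rule result. The sketch below is self-contained and uses plain Python; the helper names `forward`, `loss_fn`, `analytic_grad` and `numeric_grad` are my own and do not appear in the program above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x):
    """Forward pass of the 2-2-2 network; w is the list [w1, ..., w8]."""
    out_h1 = sigmoid(w[0] * x[0] + w[2] * x[1])
    out_h2 = sigmoid(w[1] * x[0] + w[3] * x[1])
    out_o1 = sigmoid(w[4] * out_h1 + w[6] * out_h2)
    out_o2 = sigmoid(w[5] * out_h1 + w[7] * out_h2)
    return out_h1, out_h2, out_o1, out_o2


def loss_fn(w, x, y):
    _, _, out_o1, out_o2 = forward(w, x)
    return 0.5 * (out_o1 - y[0]) ** 2 + 0.5 * (out_o2 - y[1]) ** 2


def analytic_grad(w, x, y):
    """Gradient of the loss with respect to w1..w8 from the chain rule."""
    out_h1, out_h2, out_o1, out_o2 = forward(w, x)
    d_o1 = (out_o1 - y[0]) * out_o1 * (1 - out_o1)
    d_o2 = (out_o2 - y[1]) * out_o2 * (1 - out_o2)
    d_h1 = (d_o1 * w[4] + d_o2 * w[5]) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w[6] + d_o2 * w[7]) * out_h2 * (1 - out_h2)
    return [d_h1 * x[0], d_h2 * x[0], d_h1 * x[1], d_h2 * x[1],
            d_o1 * out_h1, d_o2 * out_h1, d_o1 * out_h2, d_o2 * out_h2]


def numeric_grad(w, x, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grads.append((loss_fn(w_plus, x, y) - loss_fn(w_minus, x, y)) / (2 * eps))
    return grads


w0 = [0.2, -0.4, 0.5, 0.6, 0.1, -0.5, -0.3, 0.8]
x, y = (0.5, 0.3), (0.23, -0.07)
max_diff = max(abs(a - n) for a, n in zip(analytic_grad(w0, x, y), numeric_grad(w0, x, y)))
print("largest difference between analytic and numeric gradient:", max_diff)
```

If the two gradients disagree by more than rounding error, the backward pass contains a mistake, most often in the hidden-layer terms.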

Implemented by pytorch:

```python
import torch

x = [0.5, 0.3]  # x0, x1 = 0.5, 0.3
y = [0.23, -0.07]  # y0, y1 = 0.23, -0.07
print("input values x0, x1:", x[0], x[1])
print("target values y0, y1:", y[0], y[1])

# Initial weight values
w = [torch.Tensor([0.2]), torch.Tensor([-0.4]), torch.Tensor([0.5]), torch.Tensor([0.6]),
     torch.Tensor([0.1]), torch.Tensor([-0.5]), torch.Tensor([-0.3]), torch.Tensor([0.8])]
for i in range(0, 8):
    w[i].requires_grad = True
print("weights w0-w7:")
for i in range(0, 8):
    print(w[i].data, end="  ")


def forward_propagate(x):  # builds the computational graph
    in_h1 = w[0] * x[0] + w[2] * x[1]
    out_h1 = torch.sigmoid(in_h1)
    in_h2 = w[1] * x[0] + w[3] * x[1]
    out_h2 = torch.sigmoid(in_h2)

    in_o1 = w[4] * out_h1 + w[6] * out_h2
    out_o1 = torch.sigmoid(in_o1)
    in_o2 = w[5] * out_h1 + w[7] * out_h2
    out_o2 = torch.sigmoid(in_o2)

    print("Forward calculation, hidden layers h1, h2: ", end="")
    print(out_h1.data, out_h2.data)
    print("Forward calculation, predicted values o1, o2: ", end="")
    print(out_o1.data, out_o2.data)
    return out_o1, out_o2


def loss(x, y):  # loss function
    y_pre = forward_propagate(x)  # forward propagation
    # torch.nn.MSELoss() could be considered here instead
    loss_mse = (1 / 2) * (y_pre[0] - y[0]) ** 2 + (1 / 2) * (y_pre[1] - y[1]) ** 2
    print("Loss function (mean squared error): ", loss_mse.item())
    return loss_mse


if __name__ == "__main__":
    for k in range(1):
        print("\n===== Round " + str(k + 1) + " =====")
        l = loss(x, y)  # forward propagation: compute the loss and build the computational graph
        l.backward()  # backpropagation: autograd computes all gradients in the graph and stores them in w
        print("Gradient of w: ", end="  ")
        for i in range(0, 8):
            print(round(w[i].grad.item(), 2), end="  ")  # view the gradients
        step = 1  # step size
        for i in range(0, 8):
            w[i].data = w[i].data - step * w[i].grad.data  # update weights
            w[i].grad.data.zero_()  # note: clear the gradients stored in w
        print("\nUpdated weights w:")
        for i in range(0, 8):
            print(w[i].data, end="  ")
```

I have posted a numpy implementation before. That one was more involved, while this one is less systematic. The pytorch implementation feels a bit like building blocks: you assemble the pieces you need, so there is less code. But if you only learn the framework and not the principles, then, as the teacher said, what distinguishes an artificial intelligence major who does not understand the principles from someone in any other major? I think that is exactly right.

Let's look at the results below (note that the round counter in the numpy program starts from 0).

The result in numpy is:

10 rounds

===== Round 10 =====

Forward propagation: 0.26348 0.11236

Loss function: 0.02

100 rounds

===== Round 100 =====

Forward propagation: 0.23242 0.04219

Loss function: 0.01

1000 rounds

=====1000th round =====

Forward propagation: 0.23038 0.00954

Loss function: 0.0

The result of pytorch is:

10 rounds

===== Round 10 =====

Forward calculation, hidden layers h1, h2: tensor([0.5809]) tensor([0.4857])

Forward calculation, predicted values o1 ,o2: tensor([0.4109]) tensor([0.3647])

Loss function (mean squared error): 0.11082295328378677

Gradient of w: -0.02 0.0 -0.01 0.0 0.03 0.06 0.02 0.05

The updated weight w:

tensor([0.3273]) tensor([-0.4547]) tensor([0.5764]) tensor([0.5672]) tensor([-0.1985]) tensor([-1.2127]) tensor([-0.5561]) tensor([0.1883])

100 rounds

=====Round 100======

Forward calculation, hidden layers h1, h2: tensor([0.6863]) tensor([0.5281])

Forward calculation, predicted values o1 ,o2: tensor([0.2378]) tensor([0.0736])

Loss function (mean squared error): 0.010342842899262905

Gradient of w: -0.0 -0.0 -0.0 -0.0 0.0 0.01 0.0 0.01

The updated weight w:

tensor([0.9865]) tensor([-0.2037]) tensor([0.9719]) tensor([0.7178]) tensor([-0.8628]) tensor([-2.8459]) tensor([-1.0866]) tensor([-1.1112])

1000 rounds

=====1000th round =====

Forward calculation, hidden layers h1, h2: tensor([0.7750]) tensor([0.5920])

Forward calculation, predicted values o1 ,o2: tensor([0.2296]) tensor([0.0098])

Loss function (mean squared error): 0.003185197012498975

Gradient of w: -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0

The updated weight w:

tensor([1.6515]) tensor([0.1770]) tensor([1.3709]) tensor([0.9462]) tensor([-0.7798]) tensor([-4.2741]) tensor([-1.0236]) tensor([-2.1999])

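Both versions drive the loss down steadily. It cannot reach exactly zero, because the target y2 = -0.07 lies outside the sigmoid's output range (0, 1): the best the network can do is push o2 toward 0, which leaves a loss of about 0.5 × 0.07² = 0.00245. Based on what I discussed with the teacher yesterday, this should be the global optimum.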

### 2. The activation function Sigmoid uses PyTorch's own function torch.sigmoid() to observe, summarize and state.

Let me first say a word about this function, which I have compared before. torch.sigmoid does the same thing as torch.nn.functional.sigmoid; the latter has been marked as deprecated, so torch.sigmoid is the one to use.

Note that many operations in pytorch are also wrapped as classes (for example torch.nn.Sigmoid), which differs from our own function that only does the numerical calculation. The results are essentially the same, but the difference in usage is worth paying attention to.

Below is the official pytorch documentation (note that it is listed as an alias of the function shown underneath).

I really learned something here: I did not expect it to go by this name. It pays to read the official documentation more.

The results are:

10 rounds

===== Round 10 =====

Forward calculation, hidden layers h1, h2: tensor([0.5809]) tensor([0.4857])

Forward calculation, predicted values o1 ,o2: tensor([0.4109]) tensor([0.3647])

Loss function (mean squared error): 0.11082295328378677

Gradient of w: -0.02 0.0 -0.01 0.0 0.03 0.06 0.02 0.05

The updated weight w:

tensor([0.3273]) tensor([-0.4547]) tensor([0.5764]) tensor([0.5672]) tensor([-0.1985]) tensor([-1.2127]) tensor([-0.5561]) tensor([0.1883])

100 rounds

=====Round 100======

Forward calculation, hidden layers h1, h2: tensor([0.6863]) tensor([0.5281])

Forward calculation, predicted values o1 ,o2: tensor([0.2378]) tensor([0.0736])

Loss function (mean squared error): 0.010342842899262905

Gradient of w: -0.0 -0.0 -0.0 -0.0 0.0 0.01 0.0 0.01

The updated weight w:

tensor([0.9865]) tensor([-0.2037]) tensor([0.9719]) tensor([0.7178]) tensor([-0.8628]) tensor([-2.8459]) tensor([-1.0866]) tensor([-1.1112])

1000 rounds

=====1000th round =====

Forward calculation, hidden layers h1, h2: tensor([0.7750]) tensor([0.5920])

Forward calculation, predicted values o1 ,o2: tensor([0.2296]) tensor([0.0098])

Loss function (mean squared error): 0.003185197012498975

Gradient of w: -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0

The updated weight w:

tensor([1.6515]) tensor([0.1770]) tensor([1.3709]) tensor([0.9462]) tensor([-0.7798]) tensor([-4.2741]) tensor([-1.0236]) tensor([-2.1999])

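The results are unchanged, as expected, since torch.sigmoid computes the same function as the hand-written version. Calling the built-in should, if anything, be more efficient, although the difference is unlikely to be noticeable on a network this small.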

### 3. Change the activation function Sigmoid to Relu, observe, summarize and state.

Just look at the official documentation. There is nothing better than the official documentation.

Take a look at the output:

10 rounds

===== Round 10 =====

Forward calculation, hidden layers h1, h2: tensor([0.5652]) tensor([0.4948])

Forward calculation, predicted values o1 ,o2: tensor([0.4106]) tensor([0.])

Loss function (mean squared error): 0.018752072006464005

Gradient of w: -0.0 -0.0 -0.0 -0.0 0.02 0.0 0.02 0.0

The updated weight w:

tensor([0.2193]) tensor([-0.3982]) tensor([0.5116]) tensor([0.6011]) tensor([-0.1951]) tensor([-0.6480]) tensor([-0.5578]) tensor([0.6700])

100 rounds

===== Round 100 =====

Forward calculation, hidden layers h1, h2: tensor([0.5785]) tensor([0.5173])

Forward calculation, predicted values o1 ,o2: tensor([0.2466]) tensor([0.])

Loss function (mean squared error): 0.0025875307619571686

Gradient of w: -0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 0.0

The updated weight w:

tensor([0.2985]) tensor([-0.2684]) tensor([0.5591]) tensor([0.6790]) tensor([-0.8873]) tensor([-0.6480]) tensor([-1.1704]) tensor([0.6700])

1000 rounds

=====1000th round =====

Forward calculation, hidden layers h1, h2: tensor([0.5812]) tensor([0.5209])

Forward calculation, predicted values o1 ,o2: tensor([0.2300]) tensor([0.])

Loss function (mean squared error): 0.0024500000290572643

Gradient of w: -0.0 -0.0 -0.0 -0.0 0.0 0.0 0.0 0.0

The updated weight w:

tensor([0.3140]) tensor([-0.2478]) tensor([0.5684]) tensor([0.6913]) tensor([-0.9666]) tensor([-0.6480]) tensor([-1.2413]) tensor([0.6700])

The results show that, on this problem, relu approaches its final loss much faster than sigmoid: the loss is already 0.0188 at round 10 and 0.0026 at round 100. It settles at 0.00245, which is exactly 0.5 × 0.07²: relu can output o2 = 0, the closest reachable value to the target -0.07, whereas sigmoid only approaches 0 slowly. Note also that w6 and w8 stop changing after round 10 (they stay at -0.6480 and 0.6700). Once the input to o2 is negative, the gradient of relu there is 0, so those two weights receive no further updates.

Below is some material I found afterwards, merged from several sources.

First, with sigmoid and similar functions, computing the activation requires an exponential, which is expensive, and computing the derivative during backpropagation adds further cost. Using relu saves a lot of computation over the whole process.

Second, in a deep network the gradient of sigmoid vanishes easily during backpropagation: near the saturation regions the function changes very slowly and its derivative tends to 0, so information is lost and the deep network cannot be trained.

Third, relu sets the output of some neurons to 0, which makes the network sparse and reduces the interdependence of parameters, alleviating overfitting (some people also give a biological explanation for this).

In addition, the derivative of sigmoid is only reasonably large near 0; in the positive and negative saturation regions it is close to 0, which causes vanishing gradients. The gradient of relu is constant for inputs greater than 0, so it does not vanish there. Second, the derivative of relu on the negative half is 0, so once a neuron's input goes negative its gradient is 0 and it stops training, which is the source of the so-called sparsity. Third, relu and its derivative are cheaper to compute: the implementation is essentially an if-else statement, while sigmoid needs an exponential. In summary, relu is a very good activation function.
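The difference between the two derivatives is easy to see numerically. The following plain-Python sketch (the function names are my own) evaluates both at a few points, from the saturation regions to the centre:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 for z = 0, shrinks toward 0 on both sides


def relu(z):
    return max(0.0, z)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0  # constant on the positive half, zero on the negative half


for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  sigmoid'={sigmoid_grad(z):.6f}  relu'={relu_grad(z):.1f}")
```

In a deep network these factors are multiplied layer by layer, so many factors of at most 0.25 shrink the gradient quickly, while factors of exactly 1 do not.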

### 4. The loss function MSE is replaced by PyTorch's own function torch.nn.MSELoss(), observe, summarize and state.

First, a point I mentioned in the last blog: the things in torch.nn are classes, so torch.nn.MSELoss() creates a loss object that you then call (in the last blog I described it as something like an iterator). If you just swap it in for the hand-written formula and pass the data directly, you get an error.

If you look at the official documentation, you will see that the constructor takes no input data. You have to create the object first and then call it with the prediction and the target. The official documentation is as follows:

So make the following change:

```python
def loss_fuction(x1, x2, y1, y2):
    y1_pred, y2_pred = forward_propagate(x1, x2)
    lossfuction = torch.nn.MSELoss()  # create the loss object first
    loss1 = lossfuction(y1_pred, y1)  # then call it on (prediction, target)
    loss2 = lossfuction(y2_pred, y2)
    loss = loss1 + loss2
    print("Loss function (mean squared error):", loss.item())
    return loss
```

At the same time, the input variables and the labels also have to become tensors, so make the following modifications:

```python
x1, x2 = torch.Tensor([0.5]), torch.Tensor([0.3])
y1, y2 = torch.Tensor([0.23]), torch.Tensor([-0.07])
print("===== input values: x1, x2; true output values: y1, y2 =====")
print(x1, x2, y1, y2)

# Initial weight values
w1, w2, w3, w4, w5, w6, w7, w8 = torch.Tensor([0.2]), torch.Tensor([-0.4]), torch.Tensor([0.5]), torch.Tensor(
    [0.6]), torch.Tensor([0.1]), torch.Tensor([-0.5]), torch.Tensor([-0.3]), torch.Tensor([0.8])
for w in (w1, w2, w3, w4, w5, w6, w7, w8):
    w.requires_grad = True
```

The running result is:

10 rounds

===== Round 10 =====

Forward calculation: o1 ,o2

tensor([0.3609]) tensor([0.2613])

Loss function (mean squared error): 0.1268753707408905

grad W: -0.03 -0.01 -0.02 -0.0 0.04 0.08 0.03 0.06

updated weights

tensor([0.4696]) tensor([-0.4351]) tensor([0.6618]) tensor([0.5790]) tensor([-0.4145]) tensor([-1.6882]) tensor([-0.7343]) tensor([-0.2040])

100 rounds

===== Round 100 =====

Forward calculation: o1 ,o2

tensor([0.2280]) tensor([0.0412])

Loss function (mean squared error): 0.012363419868052006

grad W: -0.0 -0.0 -0.0 -0.0 -0.0 0.01 -0.0 0.0

updated weights

tensor([1.1885]) tensor([-0.1073]) tensor([1.0931]) tensor([0.7756]) tensor([-0.8715]) tensor([-3.3002]) tensor([-1.0941]) tensor([-1.4604])

1000 rounds

=====1000th round =====

Forward calculation: o1 ,o2

tensor([0.2298]) tensor([0.0050])

Loss function (mean squared error): 0.005628134589642286

grad W: -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0

updated weights

tensor([1.8441]) tensor([0.3147]) tensor([1.4865]) tensor([1.0288]) tensor([-0.7469]) tensor([-4.6932]) tensor([-0.9992]) tensor([-2.5217])

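At first glance the hand-written loss looks better than the library one (0.0032 against 0.0056 at round 1000), but the two numbers are not on the same scale. The hand-written loss carries a factor of 1/2, while torch.nn.MSELoss() with its default mean reduction, called on a one-element tensor, returns the plain squared error. The library loss, and with it every gradient, is therefore exactly twice as large, which acts like a doubled step size. Comparing the predictions instead, o2 is 0.0050 here against 0.0098 before, so after the same number of rounds the library version is actually slightly closer to the target. The accuracy did not go down; only the scale of the reported number changed.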
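The factor of two can be checked directly from the printed predictions at round 1000, without PyTorch. In this sketch the two helper functions are my own; `summed_squared_error` stands in for calling MSELoss once per one-element output and adding the results, which is an assumption based on its default mean reduction.

```python
def half_squared_error(preds, targets):
    # Hand-written loss of the first program: sum of 1/2 * (o - y)^2
    return sum(0.5 * (o - y) ** 2 for o, y in zip(preds, targets))


def summed_squared_error(preds, targets):
    # Mean over a single element is just (o - y)^2, added up over the two outputs
    return sum((o - y) ** 2 for o, y in zip(preds, targets))


targets = (0.23, -0.07)
handwritten_run = (0.2296, 0.0098)  # o1, o2 at round 1000 with the hand-written loss
library_run = (0.2298, 0.0050)      # o1, o2 at round 1000 with torch.nn.MSELoss()

print("hand-written run, hand-written scale:", half_squared_error(handwritten_run, targets))
print("library run, library scale:          ", summed_squared_error(library_run, targets))
print("library run, hand-written scale:     ", half_squared_error(library_run, targets))
```

Put on the same scale, the run that used the library loss ends with the smaller error of the two.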

### 5. Change the loss function MSE to cross entropy, observe, summarize and state.

The same issue applies here as in the last blog post: everything in torch.nn is a class, so the loss object has to be created first and only then called on the data.

Let's go straight to the official documentation. The more I use it, the more impressed I am by it.

The loss function here is:

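```python
def loss_fuction(x1, x2, y1, y2):
    y1_pred, y2_pred = forward_propagate(x1, x2)
    lossfuction = torch.nn.CrossEntropyLoss()
    # Stack the two outputs (and the two targets) into shape (1, 2): one sample, two classes
    y_pred = torch.stack([y1_pred, y2_pred], dim=1)
    y = torch.stack([y1, y2], dim=1)
    loss = lossfuction(y_pred, y)  # the loss needs both the prediction and the target
    print("Loss function (CrossEntropyLoss):", loss.item())
    return loss
```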

It must be written in this form. One thing to watch out for: if you just swap the loss in directly, an error is reported.

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)

Why? After a lot of searching I found that the cause is a dimension mismatch: each output is a 1-D tensor of shape (1,), while CrossEntropyLoss looks for the class scores along dimension 1, so the input needs the shape (batch, classes). torch.stack, which I described in the previous blog, fixes this: it concatenates a sequence of tensors along a new dimension, and all tensors in the sequence must have the same shape. Stacking the two outputs with dim=1 gives a tensor of shape (1, 2).
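The effect of stacking along a new dimension is easiest to see from the shapes. The sketch below uses NumPy's `np.stack` only so that it runs without PyTorch; for this simple case `torch.stack(..., dim=...)` behaves the same way as `np.stack(..., axis=...)`. The example values are made up for illustration.

```python
import numpy as np

o1 = np.array([0.52])  # shape (1,), like one network output
o2 = np.array([0.48])

stacked_axis0 = np.stack([o1, o2], axis=0)  # new axis in front: shape (2, 1)
stacked_axis1 = np.stack([o1, o2], axis=1)  # new axis after the first: shape (1, 2)

print(o1.shape, stacked_axis0.shape, stacked_axis1.shape)
```

Shape (1, 2) reads as one sample with two class scores, which is the layout the loss expects.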

The running result is:

10 rounds

===== Round 10 =====

Forward calculation: o1 ,o2

tensor([0.5249]) tensor([0.4794])

Loss function (cross entropy): 0.1041153073310852

grad W: -0.0 0.0 -0.0 0.0 -0.02 0.02 -0.02 0.02

updated weights

tensor([0.2364]) tensor([-0.4437]) tensor([0.5218]) tensor([0.5738]) tensor([0.3117]) tensor([-0.7117]) tensor([-0.1158]) tensor([0.6158])

100 rounds

===== Round 100 =====

Forward calculation: o1 ,o2

tensor([0.8455]) tensor([0.1529])

Loss function (cross entropy): 0.016428470611572266

grad W: -0.01 -0.0 -0.0 -0.0 -0.01 0.01 -0.01 0.01

updated weights

tensor([0.9026]) tensor([-0.3255]) tensor([0.9216]) tensor([0.6447]) tensor([1.7590]) tensor([-2.1559]) tensor([1.0382]) tensor([-0.5358])

1000 rounds

=====1000th round =====

Forward calculation: o1 ,o2

tensor([0.9929]) tensor([0.0072])

Loss function (cross entropy): -0.018253758549690247

grad W: -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0

updated weights

tensor([2.2809]) tensor([0.6580]) tensor([1.7485]) tensor([1.2348]) tensor([3.8104]) tensor([-4.2013]) tensor([2.5933]) tensor([-2.0866])

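First, about the negative loss values. torch.nn.CrossEntropyLoss applies log-softmax to the two outputs and, when given floating-point targets of the same shape, treats them as class probabilities: the loss is the negative sum over classes of target × log-probability. Every log-probability is negative, so valid probability targets always give a non-negative loss. The target -0.07 flips the sign of its term, which is how the total can drop below zero. This shows that cross entropy is designed for classification, where the targets are probabilities or class labels, and not for regression targets like the ones used here.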
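The negative value can be reproduced by hand from the outputs printed at round 1000. This is a plain-Python sketch under the assumption stated above, namely that the loss is computed as log-softmax of the outputs weighted by the floating-point targets; `soft_target_cross_entropy` is my own helper, not a PyTorch function.

```python
import math


def soft_target_cross_entropy(scores, targets):
    """Return (-sum_c targets[c] * log_softmax(scores)[c], log_softmax(scores))."""
    m = max(scores)  # subtract the maximum so that exp() cannot overflow
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    log_probs = [s - log_sum_exp for s in scores]
    loss = -sum(t * lp for t, lp in zip(targets, log_probs))
    return loss, log_probs


outputs = (0.9929, 0.0072)  # o1, o2 printed at round 1000
targets = (0.23, -0.07)

loss, log_probs = soft_target_cross_entropy(outputs, targets)
per_class = [-t * lp for t, lp in zip(targets, log_probs)]

print("loss:", loss)
print("contribution of each class:", per_class)
```

The first class contributes a positive amount and the second, with its negative target, a larger negative amount, so the sum ends up slightly below zero, in line with the printed loss.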

### 6. Change the step size, the number of training sessions, observe, summarize and state.

This time the parameters have to be tuned carefully. If I had not asked the teacher beforehand, I would never have dared to vary them so freely.

From the results: with step=5 the optimal value appears after about 10 rounds, and it agrees with the optimum reached with the other step sizes. With step=1 the optimal value likewise appears at around round 10, again with the same value. With step=0.01, however, the optimum has still not been reached after 1000 iterations.

I tried 0.01 anyway, unwilling to give up, and the result confirmed that it does not work well. This bears out what the teacher said: the right step depends heavily on the data set. So the step cannot simply be taken from experience; it has to be found by trying, little by little.

The conclusion is that a suitable step does not change which optimum is reached, but it reduces the number of iterations and so improves time efficiency.
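The role of the step size shows up even on the simplest possible problem. The sketch below is a toy example, not the network above: it runs gradient descent on f(w) = w², whose minimum is at w = 0, with three step sizes that I picked for illustration.

```python
def descend(step, rounds=100, w0=1.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(rounds):
        w = w - step * 2 * w
    return w


for step in (0.01, 0.4, 1.1):
    print(f"step={step}: w after 100 rounds = {descend(step):.6g}")
```

With step=0.01 the weight is still far from 0 after 100 rounds, with step=0.4 it is essentially 0, and with step=1.1 every update overshoots by more than it corrects, so the weight grows without bound. A step that is too large is therefore not just slow but can prevent convergence altogether.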

### 7. Change the initial value of weights w1-w8 to random numbers, compare the results of "specified weights", observe, summarize and state.

The code is:

Just replace the fixed initial values with calls to torch.randn

```python
import torch

x1, x2 = torch.Tensor([0.5]), torch.Tensor([0.3])
y1, y2 = torch.Tensor([0.23]), torch.Tensor([-0.07])
print("===== input values: x1, x2; true output values: y1, y2 =====")
print(x1, x2, y1, y2)

# Random initial weights, shape (1,) so that they match the shape of the inputs and targets
w1, w2, w3, w4, w5, w6, w7, w8 = (torch.randn(1) for _ in range(8))
for w in (w1, w2, w3, w4, w5, w6, w7, w8):
    w.requires_grad = True


def sigmoid(z):
    return 1 / (1 + torch.exp(-z))


def forward_propagate(x1, x2):
    in_h1 = w1 * x1 + w3 * x2
    out_h1 = sigmoid(in_h1)  # out_h1 = torch.sigmoid(in_h1)
    in_h2 = w2 * x1 + w4 * x2
    out_h2 = sigmoid(in_h2)  # out_h2 = torch.sigmoid(in_h2)

    in_o1 = w5 * out_h1 + w7 * out_h2
    out_o1 = sigmoid(in_o1)  # out_o1 = torch.sigmoid(in_o1)
    in_o2 = w6 * out_h1 + w8 * out_h2
    out_o2 = sigmoid(in_o2)  # out_o2 = torch.sigmoid(in_o2)

    print("Forward calculation: o1, o2")
    print(out_o1.data, out_o2.data)
    return out_o1, out_o2


def loss_fuction(x1, x2, y1, y2):
    y1_pred, y2_pred = forward_propagate(x1, x2)
    lossfuction = torch.nn.CrossEntropyLoss()
    y_pred = torch.stack([y1_pred, y2_pred], dim=1)
    y = torch.stack([y1, y2], dim=1)
    loss = lossfuction(y_pred, y)
    print("Loss function (CrossEntropyLoss):", loss.item())
    return loss


def update_w(w1, w2, w3, w4, w5, w6, w7, w8):
    step = 1  # step size
    for w in (w1, w2, w3, w4, w5, w6, w7, w8):
        w.data = w.data - step * w.grad.data
        w.grad.data.zero_()  # note: clear the gradient stored in w
    return w1, w2, w3, w4, w5, w6, w7, w8


if __name__ == "__main__":
    print("===== weights before update =====")
    print(w1.data, w2.data, w3.data, w4.data, w5.data, w6.data, w7.data, w8.data)

    for i in range(10):
        print("===== Round " + str(i + 1) + " =====")
        L = loss_fuction(x1, x2, y1, y2)  # forward propagation: compute the loss and build the computational graph
        L.backward()  # backpropagation: autograd computes all gradients and stores them in w
        print("\tgrad W: ", round(w1.grad.item(), 2), round(w2.grad.item(), 2), round(w3.grad.item(), 2),
              round(w4.grad.item(), 2), round(w5.grad.item(), 2), round(w6.grad.item(), 2),
              round(w7.grad.item(), 2), round(w8.grad.item(), 2))
        w1, w2, w3, w4, w5, w6, w7, w8 = update_w(w1, w2, w3, w4, w5, w6, w7, w8)

    print("updated weights")
    print(w1.data, w2.data, w3.data, w4.data, w5.data, w6.data, w7.data, w8.data)
```

The running result is:

10 rounds

===== Round 10 =====

Forward calculation: o1 ,o2

tensor([5.1180]) tensor([0.])

Loss function (CrossEntropyLoss): -0.3573055565357208

grad W: -0.08 -0.07 -0.05 -0.04 -0.09 0.0 -0.08 0.0

100 rounds

===== Round 100 =====

Forward calculation: o1 ,o2

tensor([6998288.5000]) tensor([0.])

Loss function (CrossEntropyLoss): -489880.1875

grad W: -93.3 -77.44 -55.98 -46.46 -108.81 0.0 -90.31 0.0

500 rounds

=====500th round =====

Forward calculation: o1 ,o2

tensor([1.2854e+34]) tensor([0.])

Loss function (CrossEntropyLoss): -8.997903731118773e+32

grad W: -3998669581844480.0 -3318942688870400.0 -2399201749106688.0 -1991365747539968.0 -4663210812637184.0 0.0 -3870519032020992.0 0.0

updated weights

tensor([1.0596e+17]) tensor([8.7951e+16]) tensor([6.3578e+16]) tensor([5.2771e+16]) tensor([1.2357e+17]) tensor([-0.0035]) tensor([1.0257e+17]) tensor([-0.1931])

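This run clearly did not converge, and random initialization does add variability: a good starting point converges faster, a bad one slower, and in extreme cases not at all. Two details in the output above point to a second cause, though. The outputs are unbounded (5.1180, then 1.2854e+34) and o2 is exactly 0, which a sigmoid output layer cannot produce, so this run must have used an unbounded activation such as relu at the output. With the negative target -0.07, the cross-entropy loss is then roughly -0.07 × o1 (-0.07 × 5.118 ≈ -0.358, matching the printed loss), so it can be decreased without limit by making o1 larger, and the weights blow up. For the initial values themselves, one could consider something in the spirit of k-means++, which chooses its starting points with a deliberate bias instead of purely at random.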
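Because each random run starts somewhere else, it helps to fix the seed when comparing settings, so that a good or bad starting point can be reproduced. The sketch below uses Python's standard `random` module and a hypothetical helper `init_weights`; in PyTorch the usual counterpart is calling `torch.manual_seed(seed)` before `torch.randn`.

```python
import random


def init_weights(seed, n=8, std=1.0):
    """Draw n weights from a normal distribution with a private, seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, std) for _ in range(n)]


a = init_weights(seed=0)
b = init_weights(seed=0)
c = init_weights(seed=1)

print("same seed, same weights:     ", a == b)
print("different seed, same weights:", a == c)
```

With the seed recorded next to each result, a run that diverged can be repeated exactly and then re-run with a different step size or activation.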

### 8. The initial value of weights w1-w8 is changed to 0, observe, summarize and state.

The code is:

Just replace all the numbers in the tensor with 0

```python
import torch

x1, x2 = torch.Tensor([0.5]), torch.Tensor([0.3])
y1, y2 = torch.Tensor([0.23]), torch.Tensor([-0.07])
print("===== input values: x1, x2; true output values: y1, y2 =====")
print(x1, x2, y1, y2)

# All initial weights are 0
w1, w2, w3, w4, w5, w6, w7, w8 = (torch.Tensor([0]) for _ in range(8))
for w in (w1, w2, w3, w4, w5, w6, w7, w8):
    w.requires_grad = True


def sigmoid(z):
    return 1 / (1 + torch.exp(-z))


def forward_propagate(x1, x2):
    in_h1 = w1 * x1 + w3 * x2
    out_h1 = sigmoid(in_h1)  # out_h1 = torch.sigmoid(in_h1)
    in_h2 = w2 * x1 + w4 * x2
    out_h2 = sigmoid(in_h2)  # out_h2 = torch.sigmoid(in_h2)

    in_o1 = w5 * out_h1 + w7 * out_h2
    out_o1 = sigmoid(in_o1)  # out_o1 = torch.sigmoid(in_o1)
    in_o2 = w6 * out_h1 + w8 * out_h2
    out_o2 = sigmoid(in_o2)  # out_o2 = torch.sigmoid(in_o2)

    print("Forward calculation: o1, o2")
    print(out_o1.data, out_o2.data)
    return out_o1, out_o2


def loss_fuction(x1, x2, y1, y2):
    y1_pred, y2_pred = forward_propagate(x1, x2)
    lossfuction = torch.nn.CrossEntropyLoss()
    y_pred = torch.stack([y1_pred, y2_pred], dim=1)
    y = torch.stack([y1, y2], dim=1)
    loss = lossfuction(y_pred, y)
    print("Loss function (CrossEntropyLoss):", loss.item())
    return loss


def update_w(w1, w2, w3, w4, w5, w6, w7, w8):
    step = 1  # step size
    for w in (w1, w2, w3, w4, w5, w6, w7, w8):
        w.data = w.data - step * w.grad.data
        w.grad.data.zero_()  # note: clear the gradient stored in w
    return w1, w2, w3, w4, w5, w6, w7, w8


if __name__ == "__main__":
    print("===== weights before update =====")
    print(w1.data, w2.data, w3.data, w4.data, w5.data, w6.data, w7.data, w8.data)

    for i in range(1000):
        print("===== Round " + str(i + 1) + " =====")
        L = loss_fuction(x1, x2, y1, y2)  # forward propagation: compute the loss and build the computational graph
        L.backward()  # backpropagation: autograd computes all gradients and stores them in w
        print("\tgrad W: ", round(w1.grad.item(), 2), round(w2.grad.item(), 2), round(w3.grad.item(), 2),
              round(w4.grad.item(), 2), round(w5.grad.item(), 2), round(w6.grad.item(), 2),
              round(w7.grad.item(), 2), round(w8.grad.item(), 2))
        w1, w2, w3, w4, w5, w6, w7, w8 = update_w(w1, w2, w3, w4, w5, w6, w7, w8)

    print("updated weights")
    print(w1.data, w2.data, w3.data, w4.data, w5.data, w6.data, w7.data, w8.data)
```

The result of running is:

10 rounds

===== Round 10 =====

Forward calculation: o1 ,o2

tensor([0.5417]) tensor([0.4583])

Loss function (cross entropy): 0.0985325276851654

grad W: -0.0 -0.0 -0.0 -0.0 -0.02 0.02 -0.02 0.02

100 rounds

===== Round 100 =====

Forward calculation: o1 ,o2

tensor([0.8406]) tensor([0.1594])

Loss function (cross entropy): 0.01783166080713272

grad W: -0.01 -0.01 -0.0 -0.0 -0.01 0.01 -0.01 0.01

1000 rounds

=====1000th round =====

Forward calculation: o1 ,o2

tensor([0.9932]) tensor([0.0068])

Loss function (cross entropy): -0.018344268202781677

grad W: -0.0 -0.0 -0.0 -0.0 -0.0 0.0 -0.0 0.0

updated weights

tensor([1.7782]) tensor([1.7782]) tensor([1.0669]) tensor([1.0669]) tensor([3.2392]) tensor([-3.2392]) tensor([3.2392]) tensor([-3.2392])

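In this experiment, starting from zero turns out better than the random run above, which did not converge at all. Compared with the specified weights of section 5, the loss is slightly higher at round 100 (0.0178 against 0.0164), but it is practically the same at round 1000 (about -0.0183 in both cases). The final weights also show a clear pattern: w1 = w2, w3 = w4, w5 = w7 and w6 = w8. Because both hidden units start from identical weights, they receive identical gradients in every round and can never become different from each other.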
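The symmetry in the final weights (for example w1 = w2 = 1.7782 above) is not an accident of this run. The plain-Python sketch below trains the same 2-2-2 network from all-zero weights and compares the weights of the two hidden units afterwards. For brevity it uses the squared-error loss of section 1 instead of cross entropy, which is my own simplification; the symmetry argument does not depend on the loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_from_zero(rounds=100, step=1.0):
    x1, x2 = 0.5, 0.3
    y1, y2 = 0.23, -0.07
    w = [0.0] * 8  # w1..w8 all start at zero
    for _ in range(rounds):
        # Forward pass
        out_h1 = sigmoid(w[0] * x1 + w[2] * x2)
        out_h2 = sigmoid(w[1] * x1 + w[3] * x2)
        out_o1 = sigmoid(w[4] * out_h1 + w[6] * out_h2)
        out_o2 = sigmoid(w[5] * out_h1 + w[7] * out_h2)
        # Backward pass (chain rule)
        d_o1 = (out_o1 - y1) * out_o1 * (1 - out_o1)
        d_o2 = (out_o2 - y2) * out_o2 * (1 - out_o2)
        d_h1 = (d_o1 * w[4] + d_o2 * w[5]) * out_h1 * (1 - out_h1)
        d_h2 = (d_o1 * w[6] + d_o2 * w[7]) * out_h2 * (1 - out_h2)
        grads = [d_h1 * x1, d_h2 * x1, d_h1 * x2, d_h2 * x2,
                 d_o1 * out_h1, d_o2 * out_h1, d_o1 * out_h2, d_o2 * out_h2]
        # Gradient descent update
        w = [wi - step * g for wi, g in zip(w, grads)]
    return w


w = train_from_zero()
print("w1, w2:", w[0], w[1])
print("w3, w4:", w[2], w[3])
print("w5, w7:", w[4], w[6])
print("w6, w8:", w[5], w[7])
```

The two hidden units stay exact copies of each other, so the network effectively has only one hidden unit. This is the usual reason for preferring small random initial values over zeros.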

### 9. Comprehensively summarize the principle and coding implementation of backpropagation, and write down the experience carefully.

First, all the careful writing I did before has finally paid off. When using the loss functions in torch.nn, many people pass the numbers in directly and run into problems. As I said in the last blog, these are classes whose objects must be created first and then called.

Second, about tuning: I did not settle on a step size before experimenting, and only tried 0.01 at the very end. Now I really understand what the teacher said about tuning depending on the data set. I learned a lot from all the trials, and the final run with 0.01 was only there to confirm that 0.01 does not work well here.

Third, I understand neural networks much better. I had derived the BP algorithm before, but this time I did the numerical calculation with all 8 weights. I now feel really familiar with the process, almost as if I had memorized it, although a few points are still uncertain. The hand calculation is really useful, and I strongly recommend working out all 8.

Fourth, I understand the framework better. My use of it used to be rather rigid and built on other people's scaffolding. Now I am a little more proficient.

Fifth, I learned the properties of many functions. I had seen some of them in online courses, but never explained in such detail, so I only had a vague impression. After this assignment I understand them much better.

Sixth, I am really tired. Deriving the formulas step by step and calculating several rounds by hand is laborious, as were the code changes afterwards, but I believe it was worth it.

Finally, I hope the epidemic will pass soon. A classmate and I have been collecting everyone's grades and compiling statistics these days, and we are really tired and short of time. Fortunately I had started this homework earlier, otherwise I would not have been able to hand it in.

Finally, of course, I would like to thank Mr. Wei for his care in study and life. I hope everyone will take precautions and look forward to the spring.