lesson8
 Video address: https://course19.fast.ai/videos/?lesson=8
preface
This second part is very different from the 2018 version. The course is called "Deep Learning from the Foundations". We will learn to implement many of the things inside fastai and PyTorch; in effect, we will learn what is needed to build our own deep learning library. Along the way we will learn how to implement papers, which is an important skill when building state-of-the-art models.
"From the foundations" basically means starting from scratch: we will study basic matrix calculus, and create a training loop, an optimizer, and many different layers and architectures from scratch. The aim is not a toy library that is useless for anything, but something built from scratch that can train cutting-edge, world-class models. This is a goal we have never attempted before, and we think no one else has either, so I don't know exactly how far we will get. It is an ongoing journey, and we will see how it goes.
Along the way we will have to read and correctly implement papers, because the fastai library is full of implemented papers; you cannot do this without reading and implementing them. We will also implement much of PyTorch. As you will see, we will also tackle some applications that are not fully integrated into the fastai library and so need a lot of custom work, such as object detection, sequence-to-sequence with attention, Transformer, Transformer-XL, CycleGAN, and audio. We will also look more deeply at performance considerations, such as distributed multi-GPU training, the new just-in-time compiler (which we will just call JIT from now on), and C++. That is the first five lessons.
The last two lessons implement some of these applications in Swift, which you could call "impractical" deep learning. This part is the opposite of part 1.
 Part 1 was top-down: understanding how to use deep learning in practice and how to get results
 Part 2 will be bottom-up: you see the connections between the various pieces. Then you can customize your algorithm for your own problem and make it do what you need.
We changed things this year for several reasons:
 ① So many papers are being published, and the literature in this field is growing so fast, that I can't pick out the 12 papers you really need to know in the next seven weeks. There are too many, and it would be a little pointless, because once you get into it you realize that almost all papers describe small changes to the same things. Instead, I want to show you the foundations, so that you can read the twelve papers you care about, recognize each one as a small tweak, and have all the tools you need to implement, test, and experiment with it. That is a key reason for moving in this direction. It has also become increasingly clear what the "cutting edge", as in what we used to call Cutting Edge Deep Learning for Coders, really is.
 The frontier of deep learning is actually engineering, not just papers. What separates effective people in deep learning from the rest is that they can build useful things in code, and there are few such people. So part 2 is about practicing deeply and getting things working in code.
 Part 2 is therefore implementation work, the kind that a library usually does for you so that you normally don't have to.
 Part 1 was top-down, so you already understand the context. Part 2 is bottom-up: when you build everything from the bottom, you can see the connections between all the different pieces and see that they are variants of the same thing. Then you can customize algorithm A or algorithm B, or create your own algorithm to solve your own problem, doing only what you need it to do. You can then make sure it performs well, debug it, and maintain it, because you understand all of its parts.
What are we going to do?
 The second part is very different from last year
 We will implement the fastai library from the foundation (from scratch)
 Basic matrix calculus
 Training loop
 Custom optimizers
 Custom annealing
 Something that can actually train world-class models
 Reading and implementing papers
 Tackle applications that are not fully supported in the fastai library
 Finally, implement it in Swift
 There are too many papers now
 Most of them describe small changes to the same thing
 Show the foundations so that you can pick the 12 papers you care about
 Cutting-edge deep learning is really about engineering, not about papers
 It is about who can build those things in code
 Part 2 will be more bottom-up (with code)
 Create your own algorithm to solve your own problem.
 Today we will implement matrix multiplication from scratch in Python
why swift?

Chris Lattner is unique. He created LLVM, the world's most widely used compiler framework

He also created Clang, the default C and C++ compiler on the Mac

He created Swift, perhaps the world's fastest-growing computer language, and now focuses on deep learning
 When you look inside something like TensorFlow, it looks as if it was designed by deep learning people, not compiler people.
 So I have always wanted a good numerical programming language built by people who really understand programming languages, but it never happened
 Python was not built to be good at data analysis. It was not built by people with a deep understanding of compilers, and it certainly was not built for today's highly parallel processors. Swift was. So we have a unique situation: for the first time, a really widely used language, very well designed from the ground up, is being targeted at numerical programming and deep learning. I can't miss that boat, and I don't want you to miss it either.
 Another language, Julia, has great potential, but it has about 10 times fewer users than Swift. On the other hand, Swift doesn't have the same level of community as Julia, but it is still exciting. And Julia goes further in simplicity.

Previous languages were not designed for multiprocessor use, nor were they designed by compiler experts

Another option for numerical programming is Julia, which goes further; so now there are two options.

Jeremy played with the Swift libraries over the Christmas holiday.
 I was glad to find that I could build code from scratch that was comparable to the fastest hand-tuned linear algebra libraries.
 Swift is simple and efficient
 Chris Lattner himself will join the last two lessons to teach Swift
 Swift for TensorFlow (S4TF) has advantages and disadvantages, as does PyTorch. The two are opposites
Advantages and disadvantages of Swift and PyTorch
 PyTorch has a good ecosystem and excellent documentation and tutorials. You can get work done, practice, and solve problems quickly
 S4TF does not. It has little ecosystem and little documentation. People think of Swift as an iPhone programming language, but it is actually a well-designed and powerful language.
 LLVM is a compiler. Swift talks directly to the LLVM compiler and is a thin layer on top of it. When you write something in Swift, it is easy for LLVM to compile it into very fast optimized code.
 With Python, what you write is translated into something else before it runs, and the impedance mismatch between what I am trying to write and what actually runs makes it hard to explore at the depth we will be working
what do we mean by from the foundations
Rewrite fastai and much of PyTorch: matrix multiplication, torch.nn, torch.optim, and the top-level Dataset and DataLoader
We can use Python and parts of the Python standard library
Rules
 We can use pure Python
 The Python standard library
 Non-data-science modules
 PyTorch: arrays and RNG (random number generation) only
 fastai.datasets (for source material)
 matplotlib
 Once we recreate a function, we can use the real version downstream
why?
 We need to really understand what is going on in the model and what really happens during training. You will see this in the experiments we do over the next few lessons
 If you can create something from scratch and understand it, you will actually come up with new ideas
 Once you have created something from scratch and really understand it, you can tweak it. You will suddenly realize that object detection systems, architectures, and optimizers are not as perfect as the library versions suggest. They are a pile of semi-arbitrary specific choices, and your particular problem probably needs a different set of knobs and choices.
 If you want to contribute to fastai as open source, you will learn how fastai is built and which parts work well, so you will know how to contribute tests, documentation, or new features, or how to create your own libraries
 If you are interested in going further into research, you will implement papers, which means you will be able to connect the code you are writing with the papers you are reading.
There are many opportunities in this course
 The homework is cutting-edge
 You will actually run experiments that people have not run before. Few deep learning practitioners know what you now know, and we are looking at things that others have not seen before
 So please try lots of experiments, especially in your own field
 Consider writing a blog. It does not have to be perfect; just write it down.
 Don't wait for perfection before you start communicating. Write for the person you were six months ago. That is your audience.
 If you don't have a blog, try medium.com
part1 review
I am assuming you remember part 1. In practice you are unlikely to remember all of it, because nobody is perfect, so here is what I want you to do: whenever you think "I don't know what he's talking about", go back and watch the video that covers it. Don't just push on, because I am assuming you already know part 1. In particular, if you are less confident about the second half of part 1, where we looked more deeply at what an activation really is, what a parameter really is, and exactly how SGD works, go back and rewatch those videos, because today's lesson assumes you really understand those things. Go back to things like SGD from scratch and take your time.
I designed this course to keep most people busy until the next course, so please take the time to dig deeper.
 For topics you have not mastered, go back to the earlier lessons.
① Overfit ② Reduce overfitting ③ There is no step 3
 The most important thing is to make sure we can train a good model
 There are three steps to training a really good model

First, we try to create something with more capacity than we need (a complex model)
 No regularization
 Overfit
 Overfitting means that your training loss is lower than the validation loss ✘

Overfitting does not mean that the training loss is lower than the validation loss
 A well-fitted model almost always has a training loss lower than its validation loss
 The sign of overfitting is when you actually see the validation loss get worse
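That criterion can be sketched as a small check over recorded losses. This is only an illustration: the function name `overfit_epoch` and the loss values are made up for the example and do not come from the course.

```python
def overfit_epoch(valid_losses):
    """Return the index of the first epoch whose validation loss is worse
    than the best seen so far, or None if it never gets worse."""
    best = float("inf")
    for epoch, loss in enumerate(valid_losses):
        if loss > best:
            return epoch
        best = loss
    return None

# Hypothetical numbers: training loss keeps falling, validation loss turns upward.
train_losses = [0.90, 0.60, 0.40, 0.30, 0.20]
valid_losses = [0.95, 0.70, 0.55, 0.58, 0.66]

# Training loss is below validation loss at every epoch, which alone says nothing.
# The signal is the epoch where validation loss starts rising.
print(overfit_epoch(valid_losses))
```

In practice losses are noisy, so a real check would probably smooth them or wait a few epochs before deciding.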

Visualize inputs and outputs:
 See what is actually happening
 The first step is usually easy, but the second step is usually hard.
Five steps to avoid over fitting
Five things can reduce overfitting:
 More data
 Data augmentation
 Generalizable architectures
 Regularization
 Reduce architecture complexity

Most beginners start with 5, but this should be the last one
 Unless the model is too slow

This is not that difficult. These are the five things you can do, in order of priority. If you can get more data, do that first. If you can do more data augmentation, do that. If a more generalizable architecture is available, use it. Once all of those are done, you can start adding regularization, such as dropout or weight decay, but remember that at this point you are reducing the effective capacity of your model, so regularization is not as good as the first three. Only then, finally, reduce the complexity of the architecture. Most people, beginners especially, start by reducing the complexity of the architecture, but it should be the last thing you try, unless your architecture is so complex that it is too slow for your problem. That is a summary of what we learned in part 1 and of what we want to be able to do.
It's time to start reading papers
We will now read the papers that we skipped in part 1. Reading a paper can be very daunting. What is a simple calculation in Excel can turn into a mass of symbols in a paper.

Even familiar things look complex in a paper!
 Overcome the fear of the Greek alphabet

Papers are important for going beyond the basics in deep learning, but they are hard to read

Search for a blog post that describes the paper
 Papers are not selected for the clarity of their communication
 Blog posts usually explain things better than the papers do

Tip 1: learn how to pronounce the Greek letters; it makes the equations more approachable.

Tip 2: learn the mathematical symbols. Check Wikipedia, or use Detexify, which uses machine learning to identify the symbol you draw. Its advantage is that it also gives you the LaTeX name.

https://en.wikipedia.org/wiki/List_of_mathematical_symbols
Syllabus
Steps to a basic modern CNN model
Over the next few lessons we will build a competent CNN model.
 Matrix multiplication
 ReLU / initialization
 Fully connected layer, forward pass
 Fully connected layer, backward pass
 Training loop
 Conv
 Optim
 Batch normalization
 ResNet
 We already covered this in the last lesson of Part 1
The goal of today's class
 From matrix multiplication to the backward pass
In the last course we already had the layers for creating a ResNet and got good results, so we just need to do all these things to get from matrix multiplication to there. That is just the next few lessons; after that we will go further.
 Today we will work until the backward pass of the fully connected layer is computed correctly
 We will build a model that takes an input array, as a simple fully connected network with one hidden layer, so starting from some inputs we need matrix multiplication: matmul -> relu -> matmul -> loss
 Input -> matrix multiplication -> relu -> matrix multiplication -> loss: the forward pass computes the loss
 Then compute the gradients with respect to the weights, and use gradient descent to update the parameters
 Repeat the above process several times
I'm going to show you how I build our library in Jupyter notebooks. Many very smart people have assured me that it is impossible to develop an effective library in Jupyter notebooks, which is a shame, because I have built one that way. People will often tell you that things are impossible. My own view is this: I have been programming for more than 30 years, and in the two or three years that I have been developing in Jupyter notebooks, I would guess my productivity has increased by about two to three times. In those two or three years I have built more useful things than I did before. So I'll show you how we need to go about it.
We can't just build our entire library as one huge notebook. We need some way to pull out the little gems, the cells where we think "this is code worth keeping", into a package that we can reuse. To do that, we have to tell our system which cells we want to keep and reuse.
I put a special comment, #export, at the top of such a cell. A program called notebook2script then goes through the notebook, finds those cells, and puts them into a Python module, converting the .ipynb file into a .py file.
Start from 00_exports.ipynb.
Lesson 00.ipynb
How to extract some code from jupyter into a package

How to build applications in Jupyter notebooks more productively
 Use the special comment #export to tell the system which cells you want to keep and reuse.
 Then run notebook2script.py, which goes through the notebook file, finds the cells marked with #export, and puts them into a Python module.
 Path.stem.split("_") is used to build the output file name, so the output name uses the first part of the stem, before the underscore. If there is no underscore, the full stem is used.
 The exported module goes into a package named exp

Then we can import the exported module with from exp.nb_00 import *
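The export step described above can be sketched as follows. This is not the course's notebook2script.py; it is a minimal illustration that assumes the usual .ipynb layout (a JSON object with a "cells" list, where each cell has a "cell_type" and a "source" list of lines). The helper names `is_export`, `export_cells` and `output_name` are hypothetical.

```python
import json, re
from pathlib import Path

def is_export(cell):
    """True for a code cell whose first line is the special `#export` comment."""
    if cell["cell_type"] != "code": return False
    src = cell["source"]
    return bool(src) and re.match(r"^\s*#\s*export\s*$", src[0], re.IGNORECASE) is not None

def export_cells(nb):
    """Join the bodies of all exported cells (the `#export` line itself is dropped)."""
    return "\n\n".join("".join(c["source"][1:]) for c in nb["cells"] if is_export(c))

def output_name(fname):
    """`01_matmul.ipynb` -> `nb_01.py`: keep the part of the stem before the first underscore."""
    return f"nb_{Path(fname).stem.split('_')[0]}.py"

# A tiny in-memory notebook standing in for a real .ipynb file (which is plain JSON).
nb = json.loads(json.dumps({"cells": [
    {"cell_type": "code", "source": ["#export\n", "TEST = 'test'"]},
    {"cell_type": "code", "source": ["print('scratch work, not exported')"]},
    {"cell_type": "markdown", "source": ["#export\n", "not code"]},
]}))

print(export_cells(nb))
print(output_name("01_matmul.ipynb"))
```

A real script would read the file with json.load and write the joined source into the exp package; those steps are left out here to keep the sketch self-contained.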

Create test framework
 test and test_eq are built on assert

run_notebook.py runs the tests outside of the Jupyter notebook
 python run_notebook.py 01_matmul.ipynb runs the notebook's tests from the terminal
 Any assertion error shows up in the terminal when it runs
 The function is exposed as a command-line interface

Now we have an automated unit testing framework for Jupyter notebooks

The fire library
 fire is a very neat library that takes any function and automatically converts it into a command-line interface
 The function's parameters become arguments on the command line

Notebooks are JSON files.
 We can load a Jupyter notebook file as JSON and read its cells
 Loading the JSON is the easiest way to work with notebooks, which is how I built my Jupyter notebook infrastructure inside Jupyter notebooks.
 This is a very good environment for automating things and running scripts over them. That is all there is to our development infrastructure

The benefit of using notebooks for unit testing is that you have context. If a test fails, you can inspect each input and output, which is a very good way to fix failing tests.
Notebook 01 matrix multiplication (file) 01_matmul.ipynb)

Only parts of the standard library are used; numpy is not allowed
 %load_ext autoreload and %autoreload 2: imported modules and files are reloaded automatically
 The reload happens each time a cell is executed
 %matplotlib inline: plots are drawn in the notebook
 matplotlib (mpl) is set to use grayscale because we will use MNIST
get data
 Download MNIST and unpack it into training and validation sets (x and y) as numpy arrays
 Convert the numpy arrays to tensors (numpy itself is not allowed)
 tensor was imported from PyTorch earlier
 Get the number of rows and columns from the training data
 Some visualization and statistics
 Run some obvious tests on the values above
%load_ext autoreload
%autoreload 2
%matplotlib inline

#export
# standard libraries
from pathlib import Path
from IPython.core.debugger import set_trace
import pickle, gzip, math, torch, matplotlib as mpl
import matplotlib.pyplot as plt

# datasets
from fastai import datasets

# basic pytorch
from torch import tensor

MNIST_URL='http://deeplearning.net/data/mnist/mnist.pkl'

# show images in grayscale (set after matplotlib has been imported as mpl)
mpl.rcParams['image.cmap'] = 'gray'
Download the MNIST dataset and load it
path = datasets.download_data(MNIST_URL, ext='.gz'); path

# unzip the download and unpickle it
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin1')
numpy is not allowed, so we use Python's map to convert each loaded array into a PyTorch tensor.
# map the pytorch tensor function over each
# of the loaded arrays to make pytorch versions of them
x_train,y_train,x_valid,y_valid = map(tensor, (x_train,y_train,x_valid,y_valid))

# store the shape of the training data
# n = rows
# c = columns
n,c = x_train.shape

# take a look at the values and the shapes
x_train, x_train.shape, y_train, y_train.shape, y_train.min(), y_train.max()
(tensor([[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]]), torch.Size([50000, 784]), tensor([5, 0, 4, ..., 8, 4, 8]), torch.Size([50000]), tensor(0), tensor(9))
Let's test our input data
 Row check: check that the number of rows in x_train matches the length of y_train, and that it is 50000
 Column check: check that the number of columns is 28*28, the total number of pixels in the flattened image
 Class check: check that y_train contains the 10 classes 0 to 9
assert n==y_train.shape[0]==50000
test_eq(c,28*28)
test_eq(y_train.min(),0)
test_eq(y_train.max(),9)
Peek at one of the pictures
img = x_train[0]
img.view(28,28).type()
'torch.FloatTensor'
# note that a single flat vector is reshaped into the square format
plt.imshow(img.view((28,28)));
initial model
We will first try the linear model:
Y=W^T X+b will be the first model we will try. We will need the following:
 w: Weight
 b: Bias
weights = torch.randn(784,10)
bias = torch.zeros(10)
Matrix multiplication
We will do this often, so it's good to be familiar with this. There is a great website matrixmultiplication.xyz, which explains how matrix multiplication works.
Matrix multiplication function: the following function multiplies two matrices element by element, using three nested loops
def matmul(a,b):
    # get the shapes of the input arrays
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    # check that the inner dimensions are the same
    assert ac==br
    # initialize the result array
    c = torch.zeros(ar, bc)
    # loop over the rows of a
    for i in range(ar):
        # loop over the columns of b
        for j in range(bc):
            # accumulate the products along the shared dimension
            for k in range(ac): # or br
                c[i,j] += a[i,k] * b[k,j]
    return c
Let's do a quick example
We take the first 5 images of the validation data and multiply them by the weight matrix
m1 = x_valid[:5]
m2 = weights
m1.shape, m2.shape
(torch.Size([5, 784]), torch.Size([784, 10]))
Time the operation
%time t1=matmul(m1, m2)
CPU times: user 605 ms, sys: 2.21 ms, total: 607 ms Wall time: 606 ms
t1.shape
torch.Size([5, 10])
How can we do this faster?
We can replace the innermost loop with element-wise operations, which PyTorch runs much faster than a Python loop. With PyTorch tensors, the operators (+, -, *, /, >, <, ==) usually work element by element. Examples of element-wise operations:
a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]]); a, b, m
(tensor([10., 6., -4.]), tensor([2., 8., 7.]), tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]]))
# addition
print(a + b)
# comparisons
print(a < b)
# comparisons can be summarized
print((a < b).float().mean())
# frobenius norm calculation
print((m*m).sum().sqrt())
tensor([12., 14., 3.])
tensor([0, 1, 1], dtype=torch.uint8)
tensor(0.6667)
tensor(16.8819)
In matmul, the innermost loop
for k in range(ac): # or br
    c[i,j] += a[i,k] * b[k,j]
can be replaced by
c[i,j] = (a[i,:] * b[:,j]).sum()
def matmul(a,b):
    # get the shapes of the input arrays
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    # check that the inner dimensions are the same
    assert ac==br
    # initialize the result array
    c = torch.zeros(ar, bc)
    # loop over the rows of a
    for i in range(ar):
        # loop over the columns of b
        for j in range(bc):
            # element-wise product of row i and column j, then sum
            c[i,j] = (a[i,:] * b[:,j]).sum()
    return c
After this change, the multiplication is much faster
%time t1=matmul(m1, m2)
CPU times: user 1.57 ms, sys: 864 µs, total: 2.44 ms Wall time: 1.57 ms
To test it, we write another function to compare matrices. Because of rounding errors in floating-point arithmetic, two results may not be exactly identical, so we want a function that checks that "a is equal to b within some tolerance".
#export
def near(a,b): return torch.allclose(a, b, rtol=1e-3, atol=1e-5)
def test_near(a,b): test(a,b,near)
test_near(t1, matmul(m1, m2))
Broadcasting
Broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term broadcasting was first used by Numpy.
How can we compute a > 0? The scalar 0 is broadcast to have the same shape as a.
For example, you can use broadcasting to normalize our dataset by subtracting the mean (one scalar) from the entire dataset (a matrix) and dividing by the standard deviation (another scalar).
Example: broadcasting a vector to a matrix
You can index with the special value None, or use unsqueeze(), to convert a one-dimensional array into a two-dimensional one (where one of the dimensions has size 1). This matters later when using matrix multiplication in modeling.
Broadcasting does not really copy the data. It looks as if the values were copied, but the broadcast dimension actually has stride 0.
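The following sketch illustrates both points with small, made-up tensors (the names `c` and `m` are chosen for the example). It uses `expand_as` to make the broadcast explicit and `stride()` to show that no data is copied, then adds a unit axis with `None` and `unsqueeze`.

```python
import torch

c = torch.tensor([10., 20., 30.])
m = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

# expand_as shows what broadcasting does conceptually: c now looks like a 3x3 matrix,
# but the stride along the first axis is 0, so every "row" reads the same memory.
t = c.expand_as(m)
print(t.shape, t.stride())

# Indexing with None (or calling unsqueeze) adds a unit axis, turning the vector
# into a row (1x3) or a column (3x1).
row, col = c[None, :], c[:, None]
print(row.shape, col.shape, c.unsqueeze(1).shape)

print(m + row)   # c is added to every row of m
print(m + col)   # c is added to every column of m
```

Whether the vector is treated as a row or a column decides which axis it is broadcast along, which is why the unit axis matters.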
Back to our function
Let's take advantage of broadcasting to remove another loop from the matmul function:
 a[i,:] is a rank-1 tensor (row i of a)
 .unsqueeze(1) makes it 2d, a column; unsqueeze(-1) would do the same, since -1 means the last dimension
 * b broadcasts that column across the columns of b
 .sum(dim=0) sums along the first axis
def matmul(a,b):
    # get the shapes of the input arrays
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    # check that the inner dimensions are the same
    assert ac==br
    # initialize the result array
    c = torch.zeros(ar, bc)
    # loop over the rows of a
    for i in range(ar):
        # broadcast row i (as a column) against b, then sum over the shared axis
        c[i] = (a[i].unsqueeze(1) * b).sum(dim=0)
    return c
%time t1=matmul(m1, m2)
CPU times: user 440 µs, sys: 283 µs, total: 723 µs Wall time: 421 µs
test_near(t1, matmul(m1, m2))
Broadcasting rules
Since broadcasting over several dimensions can get complicated, it is important to follow some rules
When operating on two arrays / tensors, Numpy/PyTorch compares their shapes element by element. It starts with the trailing dimension and works its way forward. Two dimensions are compatible when
 They are equal, or
 One of them is 1, in which case the dimension is broadcast to make it the same size
Arrays do not need to have the same number of dimensions. For example, if you have a 256 x 256 x 3 array of RGB values and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes of these arrays according to the broadcast rules shows that they are compatible: the image has shape 256 x 256 x 3, the scale has shape 3, and the result has shape 256 x 256 x 3.
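The rules above can be written out as a short pure-Python function that works on shapes only. This is an illustrative sketch, not how Numpy or PyTorch implement broadcasting; the name `broadcast_shape` is made up for the example.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Return the broadcast result shape of shapes a and b, or raise ValueError."""
    out = []
    # Walk both shapes from the trailing dimension; a missing dimension counts as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x == y or y == 1: out.append(x)
        elif x == 1: out.append(y)
        else: raise ValueError(f"incompatible dimensions {x} and {y}")
    return tuple(reversed(out))

print(broadcast_shape((256, 256, 3), (3,)))   # image times per-channel scale
print(broadcast_shape((5, 1), (1, 784)))      # column against row
```

Shapes such as (3,) and (4,) fail the check, because neither trailing dimension is 1 and they are not equal.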
Einstein's summation
Einstein summation (einsum) is a compact notation for combining products and sums in a general way. From the numpy documentation:
"The subscript string is a comma-separated list of subscript labels, where each label refers to a dimension of the corresponding operand. Whenever a label is repeated, it is summed, so np.einsum('i,i', a, b) is equivalent to np.inner(a,b). If a label appears only once, it is not summed, so np.einsum('i', a) produces a view of a with no changes."
c[i,j] += a[i,k] * b[k,j]
c[i,j] = (a[i,:] * b[:,j]).sum()
Consider some rearranging: move the target to the right and drop the names
a[i,k] * b[k,j] -> c[i,j]
[i,k] * [k,j] -> [i,j]
ik,kj -> ij
# c[i,j] += a[i,k] * b[k,j]
# c[i,j] = (a[i,:] * b[:,j]).sum()
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
%timeit -n 10 _=matmul(m1, m2)
47.7 µs ± 4.04 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance considerations
Unfortunately, einsum hides another very high-performance language inside a string. There is currently a lot of interest and development in high-performance languages of this kind; one example is the work on a language called Halide.
PyTorch op
We have increased the speed, but PyTorch's own operation is optimized even further. Even with vectorization, there are slow and fast ways to handle memory. Unfortunately, most programmers cannot get at these optimizations except by using the functions provided in BLAS (Basic Linear Algebra Subprograms) libraries.
Topic to look up: tensor comprehensions
%timeit -n 10 t2 = m1.matmul(m2)
14 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
t2 = m1@m2
!python notebook2script.py lesson82.ipynb
Converted lesson82.ipynb to nb_lesson82.py
| Method | CPU matrix multiplication time |
| --- | --- |
| Three nested loops | 330 ms |
| PyTorch element-wise multiplication | 709 µs |
| PyTorch broadcasting | 289 µs |
| Einstein summation | 16.6 µs |
Lesson 8 making Relu / initializing
%load_ext autoreload %autoreload 2 %matplotlib inline
The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
%reload_ext autoreload

#export
from exp.nb_lesson81 import *
from exp.nb_lesson82 import *

def test(a,b,cmp,cname=None):
    if cname is None: cname=cmp.__name__
    assert cmp(a,b),f"{cname}:\n{a}\n{b}"

def near(a,b): return torch.allclose(a, b, rtol=1e-3, atol=1e-5)
def test_near(a,b): test(a,b,near)

def get_data():
    """Loads the MNIST data from before."""
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, m, s):
    """Normalizes an input array: subtract the mean and divide by the
    standard deviation, so the result has mean 0 and std 1."""
    return (x-m)/s

def test_near_zero(a,tol=1e-3): assert a.abs()<tol, f"Near zero: {a}"
Load MNIST data and normalize
Forward and backward passes
 x_train, y_train, x_valid, y_valid come from get_data
 Compute the mean and standard deviation of the training set
 Normalize using that mean and standard deviation
 Note that the mean and standard deviation of the training set are also used to normalize the validation set
 After normalization the mean is close to 0 and the standard deviation close to 1
 Test that the data really is normalized
 n, m: the shape of x_train
 c: the output size (number of classes)
# load the data
x_train, y_train, x_valid, y_valid = get_data()

# calculate the mean and standard deviation
train_mean,train_std = x_train.mean(),x_train.std()
print("original mean and std:", train_mean,train_std)

# normalize the values
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

# check the updated values
train_mean,train_std = x_train.mean(),x_train.std()
print("normalized mean and std:", train_mean, train_std)
original mean and std: tensor(0.1304) tensor(0.3073)
normalized mean and std: tensor(0.0001) tensor(1.)
# check that the mean is near zero
test_near_zero(x_train.mean())
# check that the std is near one
test_near_zero(1-x_train.std())
Look at the training data
Note the size of the training set
n,m = x_train.shape
c = y_train.max()+1
n,m,c
(50000, 784, tensor(10))
Our first model
Our first model will have 50 hidden units in a single hidden layer, so it needs two weight matrices:
 The first layer (w1): has shape input size x hidden units
 The second layer (w2): has shape hidden units x 1
# our linear layer definition
def lin(x, w, b): return x@w + b

# number of hidden units
nh = 50

# initialize our weights and bias
# simplified kaiming init / he init
w1 = torch.randn(m,nh)/math.sqrt(m)
b1 = torch.zeros(nh)
w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)
Define model
 The model has one hidden layer
 This is the basic version
 Basic architecture
 The number of hidden units, nh, is 50
 The two layers need two weight matrices and two bias vectors
 w1 is random values divided by the square root of m
 b1 and b2 are zeros
 w2 is random values of shape (nh,1) divided by the square root of nh
 t is the linear layer applied to three tensors: the input, w1 and b1
 Dividing by the square root of m gives the weights smaller values
 This is a simplified Kaiming initialization; there is a paper about it
 Test that the mean is close to 0 and the standard deviation close to 1
 Initialization is really important in training
 [1] Fixup initialization: https://arxiv.org/abs/1901.09321
 That paper trains a 10000-layer network just by initializing carefully
 How you initialize really matters
 We will spend a lot of time on it
 The first layer is followed by relu
 relu takes the data and clamps the minimum to zero (replacing negative numbers with zero)
 Try to use a built-in PyTorch function where one exists
 Unfortunately, the output then no longer has mean 0 and standard deviation 1
 To see why:
 Picture the distribution of the data
 relu takes everything below zero and removes it
 So the mean and standard deviation are obviously different
Get standardized weights
We want the output of the layer to have a mean of 0 and a standard deviation of 1, like its input. To get that, we divide the random weights by these factors (the square roots of m and nh). This is usually done with Kaiming normal initialization, but here we approximate it with the division.
t = lin(x_valid, w1, b1) print(t.mean(), t.std())
tensor(0.0155) tensor(1.0006)
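The effect of the 1/sqrt(m) factor can be checked on synthetic data. The following is only a minimal sketch: it assumes standard-normal inputs as a stand-in for the normalized MNIST images, and the names `x`, `w_raw` and `w_scaled` are hypothetical, not part of the lesson notebook.

```python
import math
import torch

torch.manual_seed(0)

m, nh = 784, 50                # input size and hidden units, as in the notes
x = torch.randn(2000, m)       # assumption: stand-in for normalized inputs

w_raw = torch.randn(m, nh)                    # no scaling
w_scaled = torch.randn(m, nh) / math.sqrt(m)  # simplified Kaiming-style scaling

std_raw = (x @ w_raw).std().item()
std_scaled = (x @ w_scaled).std().item()

# Each output sums m products of unit-variance terms, so its std is about
# sqrt(m) without scaling and about 1 with the 1/sqrt(m) factor.
print(f"unscaled std: {std_raw:.2f} (sqrt(m) = {math.sqrt(m):.2f})")
print(f"scaled std:   {std_scaled:.2f}")
```

Without the scaling, the activations would grow by a factor of roughly sqrt(m) at every layer, which is why the division matters even in a two-layer model.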
Initializing weights is important. Example: the Fixup initialization paper (https://arxiv.org/abs/1901.09321) trains very large networks using a very specific initialization. It turns out that even in one-cycle training, those first iterations are very important. We'll come back to this.
This may seem like a very small problem, but as we will see in the next few classes, it is one of the most important things in training neural networks. In the past few months people have really noticed how important it is, for example with Fixup initialization: the authors trained a 10000-layer deep neural network without normalization layers, basically through careful initialization. So people now spend a lot of time thinking about how to initialize. Successes such as one-cycle training and super-convergence depend on what happens in the first iterations, and that turns out to be closely tied to initialization, so we will spend a lot of time studying this in depth.
Our ReLU (rectified linear unit)
def relu(x):
    """Return x unchanged where it is above 0, otherwise 0."""
    return x.clamp_min(0.)
Check mean 0 std 1
This will not hold after the ReLU: all negative values become 0, so the mean will no longer be zero and the std will change.
ReLU changes the mean and variance of the hidden layer activations because the nonlinearity truncates them.
This code could be written in many ways, but if you can implement it with a single PyTorch function it is almost always faster, because that function is usually written in C.
t = relu(lin(x_valid, w1, b1))
print(t.mean(), t.std())
tensor(0.3896) tensor(0.5860)
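The values printed above are close to what one would expect from a ReLU applied to a standard normal variable. As a rough check, assuming the pre-activations are exactly N(0, 1) (which is only approximately true for the real data), the closed-form mean and std can be compared with a quick simulation using only the standard library:

```python
import math
import random

# For z ~ N(0, 1): E[max(0, z)] = 1/sqrt(2*pi) and E[max(0, z)^2] = 1/2.
mean_theory = 1 / math.sqrt(2 * math.pi)
std_theory = math.sqrt(0.5 - mean_theory ** 2)

# Monte Carlo check with the standard library only.
random.seed(0)
samples = [max(0.0, random.gauss(0.0, 1.0)) for _ in range(200_000)]
mean_mc = sum(samples) / len(samples)
std_mc = math.sqrt(sum((s - mean_mc) ** 2 for s in samples) / (len(samples) - 1))

print(f"theory: mean={mean_theory:.4f} std={std_theory:.4f}")
print(f"sample: mean={mean_mc:.4f} std={std_mc:.4f}")
```

The theoretical values are about 0.399 for the mean and 0.584 for the std, which is consistent with the 0.3896 and 0.5860 measured on the validation set.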
 Since ReLU removes the values < 0, the activations after ReLU no longer have a mean of 0 and a standard deviation of 1.
 This leads to one of the best insights, and one of the most extraordinary papers, of the past few years: the paper of the 2015 ImageNet winners, led by Kaiming He, whom we mentioned earlier.
 It is full of great ideas. Reading competition winners' papers is a very good idea: ordinary papers spend page after page trying to justify why one small tweak should be accepted into NeurIPS, whereas competition winners have 20 good ideas and can only mention each one in passing.
How to handle ReLU: getting back to (mean 0, std 1)

Imagenet winner's paper

The winner's paper has many good ideas. Here we meet ReLU, ResNet and Kaiming initialization

Section 2.2 notes that ReLU networks are easy to train, but that networks with more than 8 layers have difficulty converging:
"Rectifier networks are easier to train"
"Very deep models > 8 conv layers have difficulties to converge"
You may have seen Glorot initialization (2010). That paper is very good and very influential; in the next few lessons we will re-implement most of it. It describes a suggestion for how to initialize neural networks.
But with that initialization the gradients vanish as the network gets deeper, because the denominator is too large. So the ImageNet authors made a small change: where Glorot's formula has a 6, theirs has a 2.
# kaiming init / he init for relu
w1 = torch.randn(m,nh)*math.sqrt(2/m)
w1.mean(),w1.std()
(tensor(0.0003), tensor(0.0506))
t = relu(lin(x_valid, w1, b1))
t.mean(),t.std()
(tensor(0.5896), tensor(0.8658))
Now the std is much closer to 1, although the mean is still about 0.5 rather than 0.
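One way to read the factor of 2: ReLU zeroes roughly half of a symmetric pre-activation, which halves its mean of squares, and doubling the weight variance compensates for that. The sketch below illustrates this under the assumption of standard-normal inputs; `w_simple` and `w_kaiming` are hypothetical names used only here.

```python
import math
import torch

torch.manual_seed(0)

m, nh = 784, 50
x = torch.randn(2000, m)   # assumption: stand-in for normalized inputs

def relu(t):
    return t.clamp_min(0.)

w_simple = torch.randn(m, nh) / math.sqrt(m)       # weight variance 1/m
w_kaiming = torch.randn(m, nh) * math.sqrt(2 / m)  # weight variance 2/m

# Mean of squares (second moment) of the activations after ReLU.
sq_simple = relu(x @ w_simple).pow(2).mean().item()
sq_kaiming = relu(x @ w_kaiming).pow(2).mean().item()

print(f"1/m init: mean of squares after ReLU = {sq_simple:.3f}")   # about 0.5
print(f"2/m init: mean of squares after ReLU = {sq_kaiming:.3f}")  # about 1.0
```

The mean of squares, not the std, is the quantity that comes back to 1, which is why the std printed above lands near 0.87 while the mean stays near 0.59.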
This paper is worth studying in depth. Another interesting point they address is that a conv layer is very similar to a matrix multiplication.
The bias b may not be very important here.
They then take you step by step through how the variance changes throughout the network.
 The forward pass is a matrix multiplication, and the backward pass is a matrix multiplication with the transpose. They end up recommending sqrt(2 / number of activations). Now that we know how to normalize the weights and compute Kaiming normal initialization, let's use the PyTorch version.
 The std becomes close to 1, but the mean is still about 0.5 because ReLU removes the activations below 0.
 I haven't seen anyone talk about this in the literature; it is something I just tried last week, and it is a fairly obvious thing: instead of max(0, x), I use max(0, x) - 0.5.
 In my short experiments this seems to help, so it is another thing you can try, to see whether it really helps or whether I am imagining things. It certainly brings the mean back to about the right value.
fan_in retains the magnitude of the weight variance in forward propagation
fan_out preserves amplitude in back propagation
#export
from torch.nn import init

w1 = torch.zeros(m,nh)
init.kaiming_normal_(w1, mode='fan_out')
t = relu(lin(x_valid, w1, b1))
 Basically, it means you divide either by root m or by root nh. If you divide by root m, as you will see in that part of the paper (which I suggest you read), the variance stays at 1 during the forward pass; if you divide by root nh, you get the right unit variance in the backward pass.
 So why are we using fan_out here, when we want to divide by root m rather than root nh? Because our weight shape is 784 x 50, while PyTorch stores it the other way round (50 x 784). How does this work?
import torch.nn
torch.nn.Linear(m,nh).weight.shape
torch.Size([50, 784])
Looking at torch.nn.Linear.forward?? (in PyTorch, F always refers to torch.nn.functional):
...
# Source:
@weak_script_method
def forward(self, input):
    return F.linear(input, self.weight, self.bias)
...
Looking at torch.nn.functional.linear??, we see that the source transposes the weights with weight.t(); this is the reason the dimensions are reversed.
@torch._jit_internal.weak_script
def linear(input, weight, bias=None):
    # type: (Tensor, Tensor, Optional[Tensor]) -> Tensor
    r"""
    Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.

    Shape:
        - Input: :math:`(N, *, in\_features)` where `*` means any number of additional dimensions
        - Weight: :math:`(out\_features, in\_features)`
        - Bias: :math:`(out\_features)`
        - Output: :math:`(N, *, out\_features)`
    """
    if input.dim() == 2 and bias is not None:
        # fused op is marginally faster
        ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
    else:
        output = input.matmul(weight.t())  # transposed here: PyTorch stores the weight as (out, in)
        if bias is not None:
            output += torch.jit._unwrap_optional(bias)
        ret = output
    return ret
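The layout question can be checked directly. The sketch below uses made-up tensors (`w_pt`, `w_ours` are hypothetical names) to illustrate two points: `F.linear` multiplies by the transpose of an (out, in) weight, and, for a 2-D tensor, the init functions take the second dimension as fan_in and the first as fan_out. That is why `mode='fan_out'` gives the sqrt(2/m) scaling for a weight stored as (m, nh).

```python
import math
import torch
import torch.nn.functional as F
from torch.nn import init

torch.manual_seed(0)

m, nh = 784, 50
x = torch.randn(8, m)

# F.linear expects the weight stored as (out_features, in_features).
w_pt = torch.randn(nh, m)
b = torch.zeros(nh)
same = torch.allclose(F.linear(x, w_pt, b), x @ w_pt.t() + b, atol=1e-4)

# For a 2-D tensor, init treats dim 1 as fan_in and dim 0 as fan_out,
# so for our (m, nh) layout mode='fan_out' scales by sqrt(2/m).
w_ours = torch.empty(m, nh)
init.kaiming_normal_(w_ours, mode='fan_out')
std_ours = w_ours.std().item()

print(same)
print(f"std={std_ours:.4f}, sqrt(2/m)={math.sqrt(2 / m):.4f}")
```

With a weight stored in PyTorch's own (nh, m) layout, the default `mode='fan_in'` would give the same scaling.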
How does pytorch initialize linear and convolution layers?
torch.nn.Conv2d??
torch.nn.modules.conv._ConvNd.reset_parameters??
So this is the initialization of the convolution layer. Note that it passes a=math.sqrt(5), and the result is not very good.
kaiming_uniform_ is used, which is basically the same as the ordinary kaiming_normal_, but with $\sqrt{5}$. This $\sqrt{5}$ does not seem to be documented anywhere, and it seems to work pretty badly, so it is very useful to look at the source code.
# Source:
def reset_parameters(self):
    n = self.in_channels
    # Note the a=math.sqrt(5); the result is not very good.
    init.kaiming_uniform_(self.weight, a=math.sqrt(5))  # kaiming_uniform_ is used for initialization
    if self.bias is not None:
        fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        init.uniform_(self.bias, -bound, bound)
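To get a feel for what a=sqrt(5) does, the gain can be computed with `init.calculate_gain`. The sketch below calls the init functions directly on a made-up conv-shaped tensor (the shape is an assumption chosen for illustration) rather than relying on the defaults of any particular PyTorch version. With a=sqrt(5) the gain is sqrt(2 / (1 + 5)) = sqrt(1/3), so the weight std is about sqrt(6) times smaller than the sqrt(2 / fan_in) that Kaiming initialization uses for a plain ReLU.

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)

# Hypothetical conv weight: 32 output channels, 16 input channels, 3x3 kernel.
w = torch.empty(32, 16, 3, 3)
fan_in = 16 * 3 * 3

# Gain that kaiming_uniform_ uses for a leaky ReLU with negative slope a.
gain_sqrt5 = init.calculate_gain('leaky_relu', math.sqrt(5))  # sqrt(2 / (1 + 5))
gain_relu = init.calculate_gain('relu')                       # sqrt(2)

init.kaiming_uniform_(w, a=math.sqrt(5))
std_sqrt5 = w.std().item()
init.kaiming_uniform_(w, a=0)   # a=0 corresponds to a plain ReLU
std_relu = w.std().item()

print(f"gain: a=sqrt(5) -> {gain_sqrt5:.4f}, relu -> {gain_relu:.4f}")
print(f"std:  a=sqrt(5) -> {std_sqrt5:.4f}, a=0 -> {std_relu:.4f}")
print(f"sqrt(2 / fan_in) = {math.sqrt(2 / fan_in):.4f}")
```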
Back to the activation function
Now we see that the mean is zero and the standard deviation is close to 1
 Let's try ReLU - 0.5, i.e. the ReLU shifted down by 0.5.
 This amounts to defining a new activation function.
So let's try subtracting 0.5 from our ReLU. That's pretty cool: we have just designed our own new activation function. Is it any good? I don't know. But when people write papers, this is often the level of tweak they are about: a small change to one line of code. It may be interesting to see how much it helps.
 The mean() becomes roughly 0. It seems to help the variance as well.
 Both of these are reasons why I think we will see better results.
def relu(x): return x.clamp_min(0.) - 0.5  # redefine relu as our new activation function

for i in range(10):
    # kaiming init / he init for relu
    w1 = torch.randn(m,nh)*math.sqrt(2./m)
    t1 = relu(lin(x_valid, w1, b1))
    print(t1.mean(), t1.std(), ' ')
tensor(0.0482) tensor(0.7982)  tensor(0.0316) tensor(0.8060)  tensor(0.1588) tensor(0.9367)  tensor(0.0863) tensor(0.8403)  tensor(0.0310) tensor(0.7310)  tensor(0.0467) tensor(0.7965)  tensor(0.1252) tensor(0.8700)  tensor(0.0610) tensor(0.7189)  tensor(0.0264) tensor(0.7755)  tensor(0.1081) tensor(0.8605) 
 With init, ReLU and matrix multiplication, we can do a forward propagation
Our first model
In PyTorch, a model can also be a plain function.
def relu(x): return x.clamp_min(0.) - 0.5
def lin(x, w, b): return x@w + b

def model(xb):
    l1 = lin(xb, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3
So this is our model. It is just a function that runs a linear layer, a ReLU and another linear layer. Let's try running it: it takes a few milliseconds to run the model on the validation set, so it is fast enough to train.
# timing it on the validation set
%timeit -n 10 _=model(x_valid)
6.71 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
assert model(x_valid).shape==torch.Size([x_valid.shape[0],1])
Loss function: MSE
The next thing we need is a loss function. As I said, we will simplify things for now by using mean squared error, although this is obviously a poor choice here. Our model returns something of size 10000, but with a trailing unit dimension.
We need squeeze() to remove the trailing (, 1) in order to use MSE. (Of course, MSE is not a suitable loss function for multi-class classification; we will use a better loss function soon. For simplicity, we use MSE for now.)
model(x_valid).shape
torch.Size([10000, 1])
A lazy output.squeeze() with no argument is risky: many reports of broken code on the fastai forum come from a batch of size 1, where calling squeeze() turns the output into a scalar and things then fail. So it is best to specify the dimension when using squeeze.
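The batch-size-1 pitfall is easy to reproduce. A minimal sketch with made-up zero tensors standing in for model outputs:

```python
import torch

batch_of_many = torch.zeros(4, 1)   # model output for a batch of 4
batch_of_one = torch.zeros(1, 1)    # model output for a batch of 1

# squeeze() with no argument removes every axis of size 1.
print(batch_of_many.squeeze().shape)    # torch.Size([4])
print(batch_of_one.squeeze().shape)     # torch.Size([]), the batch axis is gone

# Naming the axis removes only the trailing unit axis.
print(batch_of_many.squeeze(-1).shape)  # torch.Size([4])
print(batch_of_one.squeeze(-1).shape)   # torch.Size([1])
```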
#export
def mse(output, targ):
    # drop the trailing dimension of size 1 before comparing with the targets
    return (output.squeeze(-1) - targ).pow(2).mean()
# convert the targets to float
y_train, y_valid = y_train.float(), y_valid.float()

# make our predictions (forward pass)
preds = model(x_train)
print(preds.shape)

# check our mse (compute the loss)
print(mse(preds, y_train))
torch.Size([50000, 1])
tensor(22.1963)
Gradient and back propagation
How much matrix calculus should you know? It's up to you, but there is a good reference article: The Matrix Calculus You Need For Deep Learning
 The one thing you should definitely learn is the chain rule.
MSE
Taking the derivative and applying it in code, we use $(x^2)' = 2x$
# MSE
(output.squeeze(-1) - targ).pow(2).mean()
# MSE grad
inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
# The gradient is stored on the layer's input, because in the chain rule the gradients get multiplied together.
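A gradient formula like this can be sanity-checked numerically. The sketch below uses plain Python lists with made-up values (the helper names `mse` and `mse_grad` here are local to the example and take lists, not tensors) and compares the analytic gradient 2 * (p_i - t_i) / n with central finite differences.

```python
def mse(preds, targs):
    n = len(preds)
    return sum((p - t) ** 2 for p, t in zip(preds, targs)) / n

def mse_grad(preds, targs):
    # d/dp_i of mean((p - t)^2) is 2 * (p_i - t_i) / n
    n = len(preds)
    return [2.0 * (p - t) / n for p, t in zip(preds, targs)]

preds = [0.5, -1.0, 2.0, 3.5]   # made-up values for illustration
targs = [1.0, 0.0, 2.0, 3.0]

analytic = mse_grad(preds, targs)

# Central finite differences: nudge one prediction at a time.
eps = 1e-6
numeric = []
for i in range(len(preds)):
    up = preds[:i] + [preds[i] + eps] + preds[i + 1:]
    down = preds[:i] + [preds[i] - eps] + preds[i + 1:]
    numeric.append((mse(up, targs) - mse(down, targs)) / (2 * eps))

print(analytic)
print(numeric)
```

The division by n comes from the mean; it is the same factor as the `/ inp.shape[0]` in the tensor version.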
Gradient of ReLU
 True (or 1) where the input is > 0
 False (or 0) where the input is <= 0
 Multiply by out.g (the gradient coming from the next layer)
This is back propagation
We save the intermediate calculations so that we don't have to compute them twice. Note that the loss value itself is not needed to compute the backward pass.
def mse_grad(inp, targ):
    # grad of loss with respect to output of previous layer
    # the derivative of the squared term x^2 => 2x
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]

def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g

# gradient of the linear layer
def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)

def forward_and_backward(inp, targ):
    # forward pass:
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    # we don't actually need the loss in backward!
    loss = mse(out, targ)

    # backward pass:
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)
Back propagation is just the chain rule, with the intermediate calculations saved so that they do not have to be recomputed each time.
The loss value basically does not appear in the gradients and is not used in back propagation.
Test and compare with pytorch version
forward_and_backward(x_train, y_train)

# Save for testing against later
w1g = w1.g.clone()
w2g = w2.g.clone()
b1g = b1.g.clone()
b2g = b2.g.clone()
ig = x_train.g.clone()

# =========================================
# PyTorch version for checking
# =========================================
# check against pytorch's version
xt2 = x_train.clone().requires_grad_(True)
w12 = w1.clone().requires_grad_(True)
w22 = w2.clone().requires_grad_(True)
b12 = b1.clone().requires_grad_(True)
b22 = b2.clone().requires_grad_(True)

def forward(inp, targ):
    # forward pass:
    l1 = inp @ w12 + b12
    l2 = relu(l1)
    out = l2 @ w22 + b22
    # we don't actually need the loss in backward!
    return mse(out, targ)
Comparing our results with those of PyTorch, we find that they are almost the same.
loss = forward(xt2, y_train)
loss.backward()
test_near(w22.grad, w2g)
test_near(b22.grad, b2g)
test_near(w12.grad, w1g)
test_near(b12.grad, b1g)
test_near(xt2.grad, ig)
Refactoring
Let's do some interesting refactoring:
This is very similar to the PyTorch API. For each of these functions, we combine the forward and backward functions in one class. Relu will have its own forward and backward functions.
__call__ lets an instance of a class be called like a function.
We make every layer a class that implements both forward and backward propagation. Defining __call__ means an instance of the class can be used as a function.
class Relu():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.) - 0.5
        return self.out

    def backward(self):
        self.inp.g = (self.inp>0).float() * self.out.g
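The `__call__` mechanism is plain Python and does not depend on PyTorch. A minimal sketch with a hypothetical toy layer, `Scale`, which is not part of the lesson code:

```python
class Scale:
    """Hypothetical toy layer that multiplies its input by a fixed factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, inp):
        # Runs when the instance is called like a function; the input and
        # output are stored so a later backward() could reuse them.
        self.inp = inp
        self.out = inp * self.factor
        return self.out

double = Scale(2)
result = double(3)          # same as double.__call__(3)
print(result, double.inp, double.out)
```

Storing `inp` and `out` on the instance is what lets `backward()` take no arguments in the classes above.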
As a reminder, in the linear layer Lin we need the gradient of the output with respect to the weights and the gradient of the output with respect to the bias.
class Lin():
    def __init__(self, w, b): self.w,self.b = w,b

    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out

    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        # Creating a giant outer product, just to sum it, is inefficient!
        self.w.g = (self.inp.unsqueeze(-1) * self.out.g.unsqueeze(1)).sum(0)
        self.b.g = self.out.g.sum(0)

class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out

    def backward(self):
        self.inp.g = 2. * (self.inp.squeeze() - self.targ).unsqueeze(-1) / self.targ.shape[0]
Let's also make our model a class. There is no pytorch function or utility in this class
class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        # iterate through the layers in reverse order
        for l in reversed(self.layers): l.backward()
Let's train
# initialize the gradients
w1.g, b1.g, w2.g, b2.g = [None]*4
# create the model
model = Model(w1, b1, w2, b2)
The forward pass takes quite a while:
%time loss = model(x_train, y_train)
CPU times: user 274 ms, sys: 44.9 ms, total: 319 ms
Wall time: 59.6 ms
Design around general classes with general functions
Let's try to reduce the amount of duplicated code. We design a generic Module base class, and each layer then extends this base class.
 einsum will also replace the earlier broadcasting trick to speed up the linear layer
 Passing the arguments around by hand is too complex and confusing
 So we create a new Module class
# ============================================
# Base class
# ============================================
class Module():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out

    def forward(self):
        """Implemented by each subclass."""
        raise Exception('not implemented')

    def backward(self): self.bwd(self.out, *self.args)

# ============================================
# Relu extended from Module
# ============================================
class Relu(Module):
    def forward(self, inp): return inp.clamp_min(0.) - 0.5
    def bwd(self, out, inp): inp.g = (inp>0).float() * out.g

# ============================================
# Linear layer extended from Module
# ============================================
class Lin(Module):
    def __init__(self, w, b): self.w,self.b = w,b
    def forward(self, inp): return inp@self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        # Einstein summation replaces the broadcast-and-sum outer product
        self.w.g = torch.einsum("bi,bj->ij", inp, out.g)
        self.b.g = out.g.sum(0)

# ============================================
# MSE extended from Module
# ============================================
class Mse(Module):
    def forward(self, inp, targ): return (inp.squeeze() - targ).pow(2).mean()
    def bwd(self, out, inp, targ): inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]

# ============================================
# Remake the model
# ============================================
class Model():
    def __init__(self):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()

    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)

    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()
Let's time it again
w1.g,b1.g,w2.g,b2.g = [None]*4
model = Model()
%time loss = model(x_train, y_train)
CPU times: user 294 ms, sys: 11.2 ms, total: 306 ms
Wall time: 44.3 ms
%time model.backward()
CPU times: user 454 ms, sys: 92.4 ms, total: 547 ms
Wall time: 174 ms
Without einsum
class Lin(Module):
    def __init__(self, w, b): self.w,self.b = w,b
    def forward(self, inp): return inp@self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        # the weight gradient as a single matrix multiplication
        self.w.g = inp.t() @ out.g
        self.b.g = out.g.sum(0)
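The three ways of writing the weight gradient used in these notes (broadcast and sum, einsum, and a plain matrix multiplication) compute the same thing. A small check on made-up random tensors, where `out_g` is a hypothetical stand-in for `out.g`:

```python
import torch

torch.manual_seed(0)

inp = torch.randn(64, 20)     # (batch, in_features), made-up sizes
out_g = torch.randn(64, 5)    # (batch, out_features), stand-in for out.g

# 1. Broadcast a per-sample outer product, then sum over the batch.
g_broadcast = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)
# 2. The same contraction written with Einstein summation.
g_einsum = torch.einsum("bi,bj->ij", inp, out_g)
# 3. A single matrix multiplication.
g_matmul = inp.t() @ out_g

print(g_broadcast.shape)
print(torch.allclose(g_broadcast, g_einsum, atol=1e-4))
print(torch.allclose(g_einsum, g_matmul, atol=1e-4))
```

The broadcast version materializes a (batch, in, out) tensor before summing, which is why it is the slow one on the full training set.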
w1.g,b1.g,w2.g,b2.g = [None]*4
model = Model()
%time loss = model(x_train, y_train)
CPU times: user 280 ms, sys: 33.7 ms, total: 314 ms
Wall time: 45.8 ms
%time model.backward()
CPU times: user 442 ms, sys: 70.9 ms, total: 513 ms
Wall time: 158 ms
PyTorch version: nn.Module with nn.Linear and nn.ReLU
from torch import nn

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out)]
        self.loss = mse

    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x.squeeze(), targ)
model = Model(m, nh, 1)
%time loss = model(x_train, y_train)
CPU times: user 280 ms, sys: 36.7 ms, total: 316 ms
Wall time: 40.5 ms
%time loss.backward()
CPU times: user 183 ms, sys: 6.87 ms, total: 190 ms
Wall time: 33.8 ms
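One detail worth knowing about the class above: it keeps its layers in a plain Python list. That is enough for timing the forward and backward pass, because autograd tracks the tensors themselves, but `nn.Module` does not look inside a plain list, so `model.parameters()` would not see the layers. The sketch below is not the lesson's code; it uses hypothetical class names to illustrate the difference when `nn.ModuleList` is used instead.

```python
import torch
from torch import nn

class PlainListModel(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        # A plain Python list is not inspected by nn.Module.
        self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]

class RegisteredModel(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        # nn.ModuleList registers each layer, so its parameters are visible.
        self.layers = nn.ModuleList([nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)])

    def forward(self, x):
        for l in self.layers:
            x = l(x)
        return x

n_plain = len(list(PlainListModel(784, 50, 1).parameters()))
n_registered = len(list(RegisteredModel(784, 50, 1).parameters()))
print(n_plain, n_registered)   # 0 4

out = RegisteredModel(784, 50, 1)(torch.randn(3, 784))
print(out.shape)               # torch.Size([3, 1])
```

Defining `forward` rather than overriding `__call__` is the usual convention for `nn.Module` subclasses, since the built-in `__call__` also runs any registered hooks.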