fastai 2019 lesson 8 notes


  • Video address:


Part 2 this year is very different from the 2018 version. The course is called "Deep Learning from the Foundations". We will learn to implement much of what is in fastai and PyTorch, which is basically what you need to build your own deep learning library. Along the way we will learn how to implement papers, an important skill to master when building state-of-the-art models.

"From the foundations" basically means from scratch: we will study basic matrix calculus, write a training loop from scratch, and write an optimizer and many different layers and architectures from scratch. The aim is not a toy library that is useless for anything, but something built from scratch that can train cutting-edge, world-class models. This is a goal we have never attempted before, and we don't think anyone else has either, so I don't know exactly how far we will get. It is an ongoing journey, and we will see how it goes.

In the process we will have to read and correctly implement papers, because the fastai library is full of implemented papers; you cannot do this without reading and implementing them. We will also implement much of PyTorch. We will tackle some applications that are not fully integrated into the fastai library and so need a lot of custom work, such as object detection, sequence-to-sequence (seq2seq) with attention, Transformer and Transformer-XL, CycleGAN, audio, and so on. We will also look more deeply at performance considerations, such as distributed multi-GPU training, the new just-in-time compiler (called the JIT from now on), and C++. That is the first five lessons.

The last two lessons implement some of these things in Swift. This could be called impractical deep learning for coders: it is the opposite of part 1.

  • Part 1 is top-down, understanding how to use deep learning in practice, how to use it, and how to get results
  • Part 2 will be bottom-up: let you see the connections between various things. Then you can customize your algorithm for your own problems and do what you need it to do.

We have changed this year for several reasons:

  • ① So many papers are being published, and the literature is growing so fast, that I can't pick out the 12 papers you really need to know over the next seven weeks. There are too many, and it would be a little pointless anyway: once you get into it, you realize that almost all papers describe small changes to the same things. So instead I want to show you the foundations, so that you can read the twelve papers you care about, recognize each one as a small adjustment to something you already know, and have all the tools needed to implement it, test it and experiment with it. That is a key reason for moving in this direction. It has also become increasingly clear what the material we used to call "Cutting Edge Deep Learning for Coders" is really about.
  • The frontier of deep learning is really engineering, not papers. What separates effective deep learning practitioners from the rest is that they can build useful things in code, and there are few such people. So part 2 is about practicing deeply and getting things working in code.
  • Part 2 is therefore "impractical" deep learning: it covers work that the library usually does for you and that you normally would not do yourself.
  • Part 1 was top-down, so you already understand the context. Part 2 is bottom-up. When you build everything from the bottom, you can see the connections between all the different things and recognize them as variants of something you already know. Then you can customize algorithm A or algorithm B, or create your own algorithm that solves your own problem and does only what you need it to do. You can then make sure it performs well, debug it and maintain it, because you understand all of its parts.


What are we going to do?

  • Part 2 is very different from last year
  • We will implement the fastai library from the foundations (from scratch)
    • Basic matrix calculus
    • Training loop
    • Custom optimizers
    • Custom annealing
    • Something that can actually train world-class models
  • Reading and implementing papers
  • Tackling applications that are not fully supported in the fastai library
  • Finally, implementing some of it in Swift
  • There are too many papers now
    • And most of them describe small changes to the same things
  • Show the foundations so that you can pick your own 12 papers
  • The cutting edge is really about engineering, not about papers
    • What matters is who can actually build those things in code
  • Part 2 will be more bottom-up (with code)
  • Create your own algorithms to solve your own problems.
  • Today we will implement matrix multiplication from scratch in Python

Why Swift?

  • Chris Lattner is unique. He created LLVM, the world's most widely used compiler framework

  • He also created clang, the default C and C++ compiler on the Mac

  • He created Swift, perhaps the world's fastest-growing computer language, and now focuses on deep learning

    • When you look at the internals of something like TensorFlow, it looks as if it was designed by deep learning people, not compiler people.
    • I have always wanted a good numerical programming language built by people who really understand programming languages, but it never happened
    • Python was not built to be good at data analysis, it was not built by people with a deep understanding of compilers, and it was certainly not built for modern, highly parallel processors. Swift was. So we are in a unique situation: for the first time, a really widely used language, very well designed from the ground up, is being targeted at numerical programming and deep learning. I can't miss that boat, and I don't want you to miss it either.
    • Julia is another language with great potential, but it has about ten times fewer users than Swift, so it does not have the same level of community. It is still exciting, and in some ways Julia is further along.
  • Earlier languages were not designed for multiprocessor operation, nor were they designed by compiler experts

  • Julia is the other option for numerical programming, and it is further along; so now there are two options.

  • Jeremy played with Swift libraries over the Christmas holiday.

    • I was delighted to find that I could write code from scratch that was competitive with the fastest hand-tuned linear algebra libraries.
    • Swift is simple and efficient
    • Chris Lattner himself will join the last two lessons to teach Swift
    • Swift for TensorFlow (S4TF) has advantages and disadvantages, as does PyTorch, and the two are opposites

Advantages and disadvantages of Swift and PyTorch

  • PyTorch has a good ecosystem and excellent documentation and tutorials. It lets you get work done, practice and solve problems quickly
  • S4TF does not. It has very little ecosystem and very little documentation. People think of Swift as the iPhone programming language, but it is actually a well-designed and powerful language.
  • LLVM is a compiler, and Swift talks to LLVM directly: Swift is a thin layer on top of it. What you write in Swift is easy for LLVM to compile into very fast, optimized code.
  • With Python, what you write is translated into something else before it runs, and the impedance mismatch between what I am trying to write and what actually runs makes it hard to explore things at the depth we will go to


what do we mean by from the foundations

We will rewrite much of the functionality of fastai and PyTorch: matrix multiplication, torch.nn, torch.optim, and the Dataset and DataLoader classes for loading data

We are allowed to use Python and some of the Python standard library:


  • Pure Python
  • The Python standard library
  • Non-data-science modules
  • PyTorch, for array creation and the random number generator only
  • fastai.datasets (for source material)
  • matplotlib
  • Once we have recreated a function ourselves, we can use the real version from then on


  • We need to really understand what is going on in the model and what really happens during training. You will see this in the experiments we do over the next few lessons
  • If you can create something from scratch and understand it, you will actually come up with new ideas
  • Once you have created something from scratch and really understand it, you can tweak it. You suddenly realize that the object detection system, the architecture or the optimizer in the library is not a perfect finished thing but a pile of semi-arbitrary choices, and that your particular problem probably needs a different set of knobs and choices.
  • For those who want to contribute to fastai as open source, you will learn how fastai is built and which parts work well, so you will know how to contribute tests, documentation or new features, or how to create your own libraries
  • For those interested in going further into research, you will be implementing papers, which means you will be able to connect the code you are writing to the paper you are reading.

There are many opportunities in this course

  • The homework is cutting-edge
  • You will actually run experiments that nobody has done before. Few deep learning practitioners know what you know now, and we are looking at things others have not looked at
  • So please run lots of experiments, especially in your own domain
  • And consider writing a blog. It does not have to be perfect; write it down anyway.
    • Don't wait for perfection before you start communicating. Write for the person you were six months ago. That is your audience.
    • If you don't have a blog, try starting one

part1 review

I am going to assume that you remember part 1. In practice you are unlikely to remember all of it, because nobody is perfect, so what I want you to do is this: whenever you think "I don't know what he's talking about", go back and watch the video that covers it. Don't just push on, because I am assuming you already know part 1. In particular, if you are not confident about the second half of part 1, where we dug deeper into what an activation really is, what a parameter really is and exactly how SGD works, go back and watch those videos again. Today's lesson especially assumes you really understand those things, so revisit material such as SGD from scratch and take your time.

I designed this course to keep most people busy until the next course, so feel free to take the time to dig deeper.

  • For topics you have not mastered, go back to the earlier lessons.

① Overfit ② Reduce overfitting ③ There is no step 3

  • The most important thing is to make sure we can train a good model
    • There are three steps to training a really good model
  1. First, create something with more capacity than we need (a model that is too complex)

    • No regularization
    • Overfit
    • "Overfitting means that your training loss is lower than the validation loss" ✘
  2. Reduce overfitting. Overfitting does not mean that the training loss is lower than the validation loss

    • For a well-fitted model, the training loss is almost always lower than the validation loss
    • The sign of overfitting is that the validation loss actually gets worse
  3. Visualize inputs and outputs:

    • See what is going on
  • The first step is usually easy, but the second step is usually hard.

Five steps to avoid overfitting

Five things can reduce overfitting, in order of priority:

  1. More data
  2. Data augmentation
  3. Generalizable architectures
  4. Regularization
  5. Reduce architecture complexity
  • Most beginners start with 5, but it should be the last resort

    • Unless the model is too slow
  • None of this is that difficult. These are five things you can do, in order of priority. If you can get more data, do that first. If you can do more data augmentation, do that. If you can use a more generalizable architecture, do that too. Once all of these are done, you can start adding regularization, such as dropout or weight decay, but remember that at that point you are reducing the effective capacity of your model, so regularization is not as good as the first three. Finally, reduce the complexity of the architecture. Most people, beginners especially, start by reducing architecture complexity, but it should be the last thing you try, unless your architecture is so complex that it is too slow for your problem. That summarizes what we learned in part 1 and what we want to be able to do.

It's time to start reading papers

So we will now read the papers that we skipped in part 1. Reading a paper can be very daunting: what is a simple calculation in Excel can turn into a mass of symbols in a paper.

  • Even familiar things look complex in a paper!

    • Overcome the fear of the Greek alphabet
  • Papers are important for going beyond the basics of deep learning, but they are hard to read

  • Search Google for a blog post that describes the paper

    • Papers are not selected for how clearly they communicate
    • Blog posts often explain the idea better than the paper does
  • Tip 1: learn how to pronounce the Greek letters; it makes equations more approachable.

  • Tip 2: learn the mathematical symbols. Check Wikipedia, or use Detexify, which uses machine learning to recognize a symbol you draw. A benefit of Detexify is that it gives you the LaTeX name.


Steps to a basic modern CNN model

Over the next few lessons we will build a competent CNN model.

  • Matrix multiplication
  • ReLU / initialization
  • Fully connected layer, forward pass
  • Fully connected layer, backward pass
  • Training loop
  • Conv
  • Optim
  • Batch normalization
  • ResNet
    • We already covered this in the last lesson of part 1

The goal of today's class

  • From matrix multiplication to the backward pass

In the last lesson of part 1 we already had the layers for creating a ResNet, and we got good results with them, so we just need to build all of these things to get from here to there. That is only the next few lessons, and then we will go further.

  • Today we will keep going until the backward pass of a fully connected (FC) network is computed correctly
  • We will build a model that takes an input array, and make it a simple fully connected network with one hidden layer
  • Forward pass: input -> matrix multiplication -> relu -> matrix multiplication -> loss
  • Then compute the gradients with respect to the weights and update the parameters with gradient descent
  • Repeat the above several times
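
The steps above can be sketched in a few lines of PyTorch. This is only an illustration under assumptions: the data is random rather than MNIST, the sizes and the learning rate are arbitrary, and the gradients come from PyTorch autograd instead of the hand-written backward pass that the lesson builds.

```python
import torch

torch.manual_seed(0)

# Stand-in data (an assumption): 64 samples with 20 features, one target each
x = torch.randn(64, 20)
y = torch.randn(64)

nh = 10  # hidden units, arbitrary for this sketch
w1 = (torch.randn(20, nh) / 20 ** 0.5).requires_grad_()
b1 = torch.zeros(nh, requires_grad=True)
w2 = (torch.randn(nh, 1) / nh ** 0.5).requires_grad_()
b2 = torch.zeros(1, requires_grad=True)

def model(xb):
    # input -> matmul -> relu -> matmul
    return (xb @ w1 + b1).clamp_min(0.) @ w2 + b2

def mse(out, targ):
    return (out.squeeze(-1) - targ).pow(2).mean()

lr = 0.05
losses = []
for step in range(20):
    loss = mse(model(x), y)      # forward pass and loss
    loss.backward()              # gradients (autograd stands in for the manual backward pass)
    with torch.no_grad():
        for p in (w1, b1, w2, b2):
            p -= lr * p.grad     # gradient descent update
            p.grad.zero_()
    losses.append(loss.item())
```

The list `losses` holds the loss before each update, so comparing its first and last entries shows whether the loop is learning anything.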

I am going to show you how I build our library in Jupyter notebooks. Many very smart people have assured me that it is impossible to develop an effective library in Jupyter notebooks, which is a shame, because I have built one. People will often tell you that things are impossible, so here is my perspective: I have been programming for more than 30 years, and since I started developing in Jupyter notebooks I would guess my productivity has gone up two to three times. In the last two or three years I have built far more useful things than I did before. So I will show you how we need to do this.

We can't just make one huge notebook containing our entire library. We need some way to pull out the little gems, the cells where we think "this is code we want to keep", into a package we can reuse. To do that, we have to tell the system which cells to keep and reuse.

I put a special comment, #export, at the top of such a cell, and I have a program called notebook2script that goes through the notebook, finds those cells and puts them into a Python module, converting the .ipynb file into a .py file.

We start with 00_exports.ipynb.

Lesson 00.ipynb

How to extract code from Jupyter into a package

  • How to build applications in Jupyter notebooks and be more productive in them

    • Use the special comment #export to tell the system which cells you want to keep and reuse.
    • Then a program (notebook2script) goes through the notebook file, finds the cells marked with the #export comment and puts them into a Python module.
    • Path.stem.split("_") is used to build the output file name, so the output name uses the first part of the file name, before the underscore. If there is no underscore, it is the full name.
    • The exported modules go into a package named exp
  • Then we can import the exported module with from exp.nb_00 import *

  • Create a test framework

    • test and test_eq are built on assert
  • Running tests outside of Jupyter notebook

    • A notebook such as 01_matmul.ipynb can be run from the terminal, outside Jupyter
    • Any assert error shows up in the terminal when it runs
    • The function that runs the notebook is turned into a command-line interface
  • Now we have an automated unit testing framework in Jupyter notebooks

  • Running a function from the command line

    • fire is a very neat library that takes any function and automatically converts it into a command-line interface
    • The parameters of the function become arguments on the command line
  • Notebooks are JSON files.

    • We can load a Jupyter notebook file as JSON and read its cells
    • Loading the JSON is the easiest approach, especially because I built my Jupyter notebook infrastructure in Jupyter notebooks.
    • It is a very nice environment for automating things and running scripts, and that is the whole of our development infrastructure
  • The benefit of using notebooks for unit testing is that you have context. If a test fails, you can inspect every input and output, which is a very good way to fix failing tests.
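
The export idea can be sketched with the standard library alone. This is a minimal illustration, not the course's notebook2script: the function names `exported_source` and `output_name` are hypothetical, and the tiny notebook is hand-made for the example. It only relies on a notebook being a JSON file with a "cells" list.

```python
import json
import tempfile
from pathlib import Path

def exported_source(nb_path):
    """Return the source of every code cell whose first line is '#export'."""
    nb = json.loads(Path(nb_path).read_text(encoding="utf-8"))
    kept = []
    for cell in nb["cells"]:
        src = cell["source"]
        # a cell's source is stored as a list of lines or as one string
        text = "".join(src) if isinstance(src, list) else src
        lines = text.split("\n")
        if cell["cell_type"] == "code" and lines[0].strip() == "#export":
            kept.append("\n".join(lines[1:]).strip())
    return "\n\n".join(kept) + "\n"

def output_name(nb_path):
    """'00_exports.ipynb' -> 'nb_00.py' (the part of the stem before the first underscore)."""
    return f"nb_{Path(nb_path).stem.split('_')[0]}.py"

# A tiny hand-made notebook to try it on (hypothetical content)
demo = {"cells": [
    {"cell_type": "code", "source": ["#export\n", "TEST = 'test'"]},
    {"cell_type": "code", "source": ["TEST + ' scratch'"]},
    {"cell_type": "markdown", "source": ["#export is ignored in markdown"]},
]}

with tempfile.TemporaryDirectory() as d:
    nb_file = Path(d) / "00_exports.ipynb"
    nb_file.write_text(json.dumps(demo), encoding="utf-8")
    module_text = exported_source(nb_file)
    module_name = output_name(nb_file)
```

Only the first cell is kept, because it is the only code cell that starts with the #export comment.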

Notebook 01: matrix multiplication (file 01_matmul.ipynb)

  • We may use some parts of the standard library, but numpy is not allowed

    • %load_ext autoreload loads the autoreload extension
    • %autoreload 2 automatically reloads external modules and files
    • It does so every time before code is executed
    • %matplotlib inline draws plots in the notebook
    • matplotlib (mpl) is set to use a grayscale colormap because we will use MNIST

    Get data

    • Download MNIST and extract it into x_train, y_train, x_valid and y_valid, which arrive as numpy arrays
    • Convert the numpy arrays to tensors (numpy is not allowed)
    • tensor was imported from PyTorch earlier
    • Get the number of rows and columns from the training data
    • Some visualization and statistics
    • Some obvious tests on the above
%load_ext autoreload
%autoreload 2
%matplotlib inline
# standard libraries
from pathlib import Path
from IPython.core.debugger import set_trace
import pickle, gzip, math, torch, matplotlib as mpl
import matplotlib.pyplot as plt
# datasets
from fastai import datasets
# basic pytorch
from torch import tensor
# grayscale colormap for MNIST (set after mpl has been imported)
mpl.rcParams['image.cmap'] = 'gray'

Download the MNIST dataset and load it

# URL as defined in the course notebook
MNIST_URL = 'http://deeplearning.net/data/mnist/mnist.pkl'
path = datasets.download_data(MNIST_URL, ext='.gz'); path

# open the gzipped pickle and unpack the train / valid splits
with gzip.open(path, 'rb') as f:
    ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')

numpy is not allowed, so we use Python's map to convert each array to a PyTorch tensor. PyTorch tensors are a good replacement.

# maps the pytorch tensor function against each
# of the loaded arrays to make pytorch versions of them
x_train,y_train,x_valid,y_valid = map(tensor, (x_train,y_train,x_valid,y_valid))

# store the number of 
# n = rows
# c = columns
n,c = x_train.shape

# take a look at the values and the shapes
x_train, x_train.shape, y_train, y_train.shape, y_train.min(), y_train.max()
(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 torch.Size([50000, 784]),
 tensor([5, 0, 4,  ..., 8, 4, 8]),
 torch.Size([50000]),
 tensor(0),
 tensor(9))

Let's test our input data

  1. Row check: the number of rows in x_train should match the length of y_train, and both should be 50000
  2. Column check: the number of columns should be 28 * 28, because that is the number of pixels in a flattened image
  3. Class check: y_train should contain the 10 classes 0 - 9
assert n==y_train.shape[0]==50000
assert c==28*28 and y_train.min()==0 and y_train.max()==9

Peek at one of the images

img = x_train[0]
# each image is stored as a single vector of 784 values, so reshape it to 28x28 to display it
plt.imshow(img.view(28, 28))

initial model

We will first try the linear model:

Y=W^T X+b is the first model we will try. For it we need:

  • w: weights
  • b: bias
weights = torch.randn(784,10)
bias = torch.zeros(10)

Matrix multiplication

We will do this a lot, so it is worth getting familiar with it. There is a great website that explains how matrix multiplication works.

Matrix multiplication function: the function below multiplies two matrices element by element, using three nested loops

def matmul(a,b):
    # gets the shapes of the input arrays
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    # checks to make sure that the
    # inner dimensions are the same
    assert ac==br
    # initializes the new array
    c = torch.zeros(ar, bc)
    # loops by row in A
    for i in range(ar):
        # loops by col in B
        for j in range(bc):
            # for each value
            for k in range(ac): # or br
                c[i,j] += a[i,k] * b[k,j]
    return c

Let's do a quick example

We take the first 5 images of the validation data and multiply them by the weight matrix

m1 = x_valid[:5]
m2 = weights
m1.shape, m2.shape
(torch.Size([5, 784]), torch.Size([784, 10]))

Time the operation

%time t1=matmul(m1, m2)
CPU times: user 605 ms, sys: 2.21 ms, total: 607 ms 
Wall time: 606 ms
t1.shape
torch.Size([5, 10])

How can we do this faster?

We can do this with elementwise operations. We will use PyTorch tensors to illustrate. On PyTorch tensors, the operators (+, -, *, /, >, <, ==) are usually elementwise. Examples of elementwise operations:

a = tensor([10., 6, -4])
b = tensor([2., 8, 7])
m = tensor([[1., 2, 3], [4,5,6], [7,8,9]]); 
a, b, m
(tensor([10.,  6., -4.]), tensor([2., 8., 7.]), tensor([[1., 2., 3.],
         [4., 5., 6.],
         [7., 8., 9.]]))
# addition
print(a + b)
# comparison
print(a < b)
# a comparison can be summarized, e.g. the fraction of elements where a < b
print((a < b).float().mean())
# frobenius norm calculation
print((m*m).sum().sqrt())
tensor([12., 14.,  3.])
tensor([0, 1, 1], dtype=torch.uint8)
tensor(0.6667)
tensor(16.8819)

In matmul, the innermost loop

for k in range(ac): # or br
    c[i,j] += a[i,k] * b[k,j]

can be replaced with

c[i,j] = (a[i,:] * b[:,j]).sum()
def matmul(a,b):   
    # gets the shapes of the input arrays
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape 
    # checks to make sure that the
    # inner dimensions are the same
    assert ac==br
    # initializes the new array
    c = torch.zeros(ar, bc) 
    # loops by row in A
    for i in range(ar):  
        # loops by col in B
        for j in range(bc): 
            c[i,j] = (a[i,:] * b[:,j]).sum()
    return c

With this change, the multiplication is much faster

%time t1=matmul(m1, m2)
CPU times: user 1.57 ms, sys: 864 µs, total: 2.44 ms
Wall time: 1.57 ms

To test it, we write another function to compare matrices. Because of floating-point rounding error, two results may not be exactly identical, so we want a function that checks whether "a is equal to b within some tolerance"

def near(a,b): 
    return torch.allclose(a, b, rtol=1e-3, atol=1e-5)

# test comes from the exported test framework: it asserts that cmp(a,b) holds
def test_near(a,b): test(a,b,near)

test_near(t1, matmul(m1, m2))
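
The tolerance rule behind this kind of check can be written out in pure Python. This is a sketch for flat lists of floats, using the rule that torch.allclose documents, |a - b| <= atol + rtol * |b|; the sample values are made up for illustration.

```python
def near(a, b, rtol=1e-3, atol=1e-5):
    """True when every pair satisfies |x - y| <= atol + rtol * |y|."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))

exact = [0.1 + 0.2, 1.0, 100.0]
rounded = [0.3, 1.0 + 1e-6, 100.05]

strict_equal = exact == rounded        # exact comparison is too strict for floats
close_enough = near(exact, rounded)    # within tolerance
too_far = near([1.0], [1.01])          # a 1% difference exceeds rtol=1e-3
```

Here `strict_equal` is False while `close_enough` is True, which is the reason for comparing within a tolerance.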

Broadcasting

Broadcasting describes how arrays with different shapes are treated during arithmetic operations. The term was first used by Numpy.

How can we compute a > 0? The 0 is broadcast to have the same dimensions as a.

For example, you can use broadcasting to normalize a dataset by subtracting the mean (a scalar) from the entire dataset (a matrix) and dividing by the standard deviation (another scalar).

Example: broadcasting a vector to a matrix

You can index with the special value [None], or use unsqueeze(), to convert a one-dimensional array into a two-dimensional one (in which one of the dimensions has size 1). This matters later when we use matrix multiplication in modeling

Nothing is really copied. It looks as if the vector was copied, but the broadcast dimension actually has stride 0.
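
A small sketch of these points, assuming PyTorch is installed; the vector `c` and matrix `m` are made-up values for illustration.

```python
import torch
from torch import tensor

c = tensor([10., 20., 30.])
m = tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])

t = c.expand_as(m)            # looks like c copied into 3 rows
strides = t.stride()          # stride 0 along the rows: no data is copied

row_shape = c[None].shape     # adds a leading axis of size 1
col_shape = c[:, None].shape  # adds a trailing axis of size 1, like c.unsqueeze(1)

by_row = m + c                # c is added to every row of m
by_col = m + c[:, None]       # c is added to every column of m
```

Indexing with None only changes the shape; which axis gets the size-1 dimension decides whether the vector is broadcast across rows or across columns.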

Back to our function

Let's take advantage of broadcasting to remove another loop from the matmul function:

  • a[i,:] is a rank 1 tensor (row i of a)

  • .unsqueeze(-1) makes it 2d; -1 means the last dimension

  • * b broadcasts it across the columns of b

  • .sum(dim=0) sums along the first axis

def matmul(a,b):
    # gets the shapes of the input arrays
    ar,ac = a.shape # n_rows * n_cols
    br,bc = b.shape
    # checks to make sure that the
    # inner dimensions are the same
    assert ac==br  
    # initializes the new array
    c = torch.zeros(ar, bc) 
    # loops by row in A
    for i in range(ar):
        c[i] = (a[i].unsqueeze(-1) * b).sum(dim=0)
    return c
%time t1=matmul(m1, m2)
CPU times: user 440 µs, sys: 283 µs, total: 723 µs
Wall time: 421 µs
test_near(t1, matmul(m1, m2))
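
To see why the broadcast version works, it can help to trace the shapes for a single row. A small sketch, assuming PyTorch is installed and using random matrices with the same shapes as m1 and m2:

```python
import torch

torch.manual_seed(0)
a = torch.randn(5, 784)
b = torch.randn(784, 10)

col = a[0].unsqueeze(-1)   # row 0 of a as a column: shape (784, 1)
prod = col * b             # broadcast across the 10 columns of b: shape (784, 10)
row0 = prod.sum(dim=0)     # sum over the shared dimension: shape (10,)

reference = (a @ b)[0]     # the same row from PyTorch's own matmul
```

Each column of `prod` holds the elementwise products of row 0 of `a` with one column of `b`, so summing over the first axis gives one row of the result.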

Broadcasting rules

Because broadcasting across several dimensions can get complicated, it helps to follow a few rules

When operating on two arrays / tensors, Numpy/PyTorch compares their shapes element by element. It starts with the trailing dimensions and works its way forward. Two dimensions are compatible when

  • They are equal, or
  • One of them is 1, in which case that dimension is broadcast to make it the same size

Arrays do not need to have the same number of dimensions. For example, if you have a 256 x 256 x 3 array of RGB values and you want to scale each color in the image by a different value, you can multiply the image by a one-dimensional array with 3 values. Lining up the sizes of the trailing axes according to the broadcast rules shows that they are compatible: the image is 256 x 256 x 3, the scale is 3, and the result is 256 x 256 x 3.
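
The two rules can be written as a short pure-Python function that computes the broadcast result shape. This is an illustrative sketch; the name `broadcast_shape` is hypothetical and is not a NumPy or PyTorch function.

```python
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Apply the broadcasting rules to two shapes and return the result shape."""
    result = []
    # walk both shapes from the trailing dimension backwards;
    # a missing leading dimension is treated as size 1
    for da, db in zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"incompatible dimensions {da} and {db}")
    return tuple(reversed(result))

image_scale = broadcast_shape((256, 256, 3), (3,))   # image times per-channel scale
column_row = broadcast_shape((3, 1), (1, 4))         # column vector with row vector

try:
    broadcast_shape((2, 3), (4,))
    mismatch = None
except ValueError as e:
    mismatch = str(e)
```

The last call fails because the trailing dimensions 3 and 4 are neither equal nor 1.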

Einstein summation

Einstein summation (einsum) is a compact notation for combining products and sums in a general way. From the numpy documentation:

"The subscripts string is a comma-separated list of subscript labels, where each label refers to a dimension of the corresponding operand. Whenever a label is repeated it is summed, so np.einsum('i,i', a, b) is equivalent to np.inner(a,b). If a label appears only once, it is not summed, so np.einsum('i', a) produces a view of a with no changes."

c[i,j] += a[i,k] * b[k,j]
c[i,j] = (a[i,:] * b[:,j]).sum()

Rearrange a little: move the target to the right and drop the names

a[i,k] * b[k,j] -> c[i,j]
[i,k] * [k,j] -> [i,j]
ik,kj -> ij
# c[i,j] += a[i,k] * b[k,j]
# c[i,j] = (a[i,:] * b[:,j]).sum()
def matmul(a,b): return torch.einsum('ik,kj->ij', a, b)
%timeit -n 10 _=matmul(m1, m2)
47.7 µs ± 4.04 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
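
The same subscript notation covers more than matrix multiplication. A few variants as a sketch, assuming PyTorch is installed; the tensors are random and their sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
a = torch.randn(5, 7)
b = torch.randn(7, 3)
v = torch.randn(7)

mm = torch.einsum('ik,kj->ij', a, b)     # matrix multiplication: k is repeated, so it is summed
inner = torch.einsum('i,i->', v, v)      # inner product: i is summed, nothing remains
transposed = torch.einsum('ij->ji', a)   # no repeated label: only the axes are reordered

# batch matrix multiplication: b labels the batch axis and is kept in the output
batch = torch.einsum('bik,bkj->bij', torch.stack([a, a]), torch.stack([b, b]))
```

Any label that is missing from the right-hand side of the arrow is summed over; labels that appear there are kept as output axes.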

Performance considerations

Unfortunately, einsum hides yet another small language inside a Python string. There is currently a lot of interest and development in high-performance languages; one example is the work being done on a language called Halide.

pytorch op

We have made it faster, but PyTorch's own operation is optimized even further. Even with vectorization, there are slow and fast ways to handle memory. Unfortunately, most programmers cannot get at these optimizations except by calling the functions provided by a BLAS (Basic Linear Algebra Subprograms) library

Topic to look up: Tensor Comprehensions

%timeit -n 10 t2 = m1.matmul(m2)
14 µs ± 4.44 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
t2 = m1@m2
!python lesson82.ipynb
Converted lesson82.ipynb to
CPU matrix multiplication timings
Three nested loops: 330 ms
PyTorch elementwise multiplication: 709 µs
PyTorch broadcast multiplication: 289 µs
Einstein summation: 16.6 µs

Lesson 8: ReLU / initialization

%load_ext autoreload
%autoreload 2

%matplotlib inline
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
from exp.nb_lesson81 import *
from exp.nb_lesson82 import *

def test(a,b,cmp,cname=None):
    if cname is None: cname=cmp.__name__
    assert cmp(a,b),f"{cname}:\n{a}\n{b}"

def near(a,b): 
    return torch.allclose(a, b, rtol=1e-3, atol=1e-5)

def test_near(a,b): test(a,b,near)

def get_data():
    "Loads the MNIST data from before"
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train,y_train,x_valid,y_valid))

def normalize(x, m, s): 
    """Normalizes an input array:
    subtract the mean and divide by the standard deviation;
    the result should have mean 0, std 1"""
    return (x-m)/s

def test_near_zero(a,tol=1e-3): 
    assert a.abs()<tol, f"Near zero: {a}"

Load the MNIST data and normalize it

Forward and backward passes

  • Get x_train, y_train, x_valid and y_valid from get_data
  • Compute the mean and standard deviation of the training set
  • Normalize using that mean and standard deviation
  • Note that the validation set is normalized with the mean and standard deviation of the training set
  • After normalizing, the mean is close to zero and the standard deviation is close to 1
  • Test that the data really is normalized
  • n, m: the shape of x_train
  • c: the output size (number of classes)
# load the data
x_train, y_train, x_valid, y_valid = get_data()

# calculate the mean and standard deviation
train_mean,train_std = x_train.mean(),x_train.std()
print("original mean and std:", train_mean,train_std)

# normalize the values
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

# check the updated values
train_mean,train_std = x_train.mean(),x_train.std()
print("normalized mean and std:", train_mean, train_std)
original mean and std: tensor(0.1304) tensor(0.3073)
normalized mean and std: tensor(0.0001) tensor(1.)
# check to ensure that mean is near zero
test_near_zero(x_train.mean())
# check to ensure that std is near one
test_near_zero(1-x_train.std())
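
Why the validation set is normalized with the training statistics can be seen on a toy example. This sketch uses plain Python lists with made-up values instead of MNIST tensors; `statistics.stdev` is the sample standard deviation.

```python
import statistics

# Made-up "pixel" values standing in for the training and validation sets
train = [0.0, 0.0, 0.2, 0.9, 1.0, 0.0, 0.5, 0.0]
valid = [0.0, 0.1, 0.8, 0.0]

mean = statistics.fmean(train)
std = statistics.stdev(train)

def normalize(x, m, s):
    """Subtract the mean and divide by the standard deviation."""
    return [(v - m) / s for v in x]

train_n = normalize(train, mean, std)
valid_n = normalize(valid, mean, std)   # training statistics, on purpose

train_mean_after = statistics.fmean(train_n)
train_std_after = statistics.stdev(train_n)
valid_mean_after = statistics.fmean(valid_n)
```

The training set ends up with mean 0 and standard deviation 1. The validation set does not, and should not: both sets must go through the same transformation so that the model sees them on the same scale.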

Look at the training data

Note the size of the training set

n,m = x_train.shape
c = y_train.max()+1
n,m,c
(50000, 784, tensor(10))

Our first model

Our first model will have 50 hidden units. It has one hidden layer, so there are two linear layers:

  1. The first layer (w1): input size x hidden units
  2. The second layer (w2): hidden units x 1
# our linear layer definition

def lin(x, w, b):
    return x@w + b

# number of hidden units
nh = 50

# initialize our weights and bias
# simplified kaiming init / he init
w1 = torch.randn(m,nh)/math.sqrt(m)
b1 = torch.zeros(nh)

w2 = torch.randn(nh,1)/math.sqrt(nh)
b2 = torch.zeros(1)

Define model

  • The model has one hidden layer; this is the basic architecture
  • The number of hidden units, nh, is 50
  • The two layers need two weight matrices and two bias vectors
  • w1 is random values of shape (m, nh) divided by the square root of m
  • b1 and b2 are zeros
  • w2 is random values of shape (nh, 1) divided by the square root of nh
  • t is the linear layer applied to three tensors: x_valid, w1 and b1
  • Dividing by the square root of m gives the weights smaller values, so the output keeps mean 0 and standard deviation 1
  • This is a simplified Kaiming initialization; there is a paper about it
  • Test that the output has mean 0 and standard deviation 1
  • This is really important in training
  • [1] Fixup initialization: https : //
    • The paper trains a network with 10000 layers just by initializing carefully
  • How to initialize is really important
  • People spend a lot of time on it
  • The first layer is followed by a ReLU
  • relu takes the data and clamps the minimum to zero (replacing negative numbers with zero)
  • Try to find a built-in PyTorch function for this
  • Unfortunately, after the ReLU the output no longer has mean 0 and standard deviation 1
  • Demonstration
    • Take the data distribution
    • Then take everything below zero and remove it
    • Obviously, the mean and standard deviation are now different

Get standardized weights

We want the output of the layer to have a mean of 0 and a standard deviation of 1, just like the input. To get this, we divide the random weights by these factors (the square root of the number of inputs). This is usually done with Kaiming normal initialization; here we approximate it with a simple division.

t = lin(x_valid, w1, b1)
print(t.mean(), t.std())
tensor(-0.0155) tensor(1.0006)
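To see why the division matters, here is a small sketch on synthetic input (random data with mean 0 and std 1 standing in for the normalized images; the sizes are the ones used in these notes). Without scaling, the output std grows to roughly sqrt(m); with the division by sqrt(m) it stays near 1.

```python
import math
import torch

torch.manual_seed(0)
m, nh = 784, 50
x = torch.randn(2000, m)              # synthetic input with mean 0, std 1

w_raw = torch.randn(m, nh)            # unscaled weights
w_scaled = w_raw / math.sqrt(m)       # simplified Kaiming init

std_raw = (x @ w_raw).std()
std_scaled = (x @ w_scaled).std()
print(std_raw, std_scaled)            # roughly sqrt(784) = 28 versus roughly 1
```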

Initializing weights is important. For example, large networks have been trained by initializing them with very specific weights. It turns out that even in one-cycle training, those first iterations are very important. We'll come back to this.

**This may seem like a very small detail, but as we will see in the next few lessons, it is one of the most important things in training neural networks.** In fact, in the past few months people have really noticed how important it is, for example in the Fixup initialization paper. The authors trained a 10000-layer deep neural network without normalization layers, basically through careful initialization. So people now spend a lot of time thinking about how to initialize. Many successes such as one-cycle training and super-convergence are related to what happens in the first iterations, and that turns out to depend heavily on initialization, so we will spend a lot of time studying it in depth.

Our ReLU (rectified linear unit)

def relu(x):
    # returns x itself, unless it is below 0,
    # in which case it returns 0
    return x.clamp_min(0.)

Check mean 0 std 1

The mean 0 / std 1 property no longer holds after the ReLU, because all negative values become 0: the mean moves above zero and the std changes.

In other words, ReLU changes the mean and variance of the hidden-layer activations because the nonlinearity truncates them.

I could write this code in many ways, but if you can implement something with a single built-in PyTorch function, it is almost always faster, because it is usually written in C.

t = relu(lin(x_valid, w1, b1))
print(t.mean(), t.std())
tensor(0.3896) tensor(0.5860)
  • Since ReLU removes the values < 0, the activations after ReLU no longer have a mean of 0 and a standard deviation of 1.
  • This leads to one of the best insights, and one of the most extraordinary papers, of the past few years: the paper of the 2015 ImageNet winners, led by the person we mentioned, Kaiming He
  • It is full of great ideas. Reading competition winners' papers is a very good idea: an ordinary paper spends page after page trying to justify why one small tweak should be accepted at NeurIPS, whereas a competition winner has 20 good ideas and only has room to mention them in passing
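The numbers printed above (mean about 0.39, std about 0.59) are what you would expect if the pre-activations were exactly standard normal. A minimal sketch, using only the standard library and assuming z ~ N(0, 1): the closed-form mean and std of max(0, z), followed by a Monte Carlo check.

```python
import math
import random

# closed form for z ~ N(0, 1)
mean_relu = 1 / math.sqrt(2 * math.pi)             # E[max(0, z)]
std_relu = math.sqrt(0.5 - 1 / (2 * math.pi))      # sqrt(E[max(0, z)^2] - mean^2)
print(round(mean_relu, 4), round(std_relu, 4))     # 0.3989 0.5838

# Monte Carlo check of the same two quantities
random.seed(0)
samples = [max(0.0, random.gauss(0, 1)) for _ in range(200_000)]
mc_mean = sum(samples) / len(samples)
mc_std = math.sqrt(sum((s - mc_mean) ** 2 for s in samples) / len(samples))
print(mc_mean, mc_std)
```

So the shift in mean and the drop in std are not a quirk of this dataset; they follow directly from cutting a zero-centred distribution in half.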

How to handle ReLU --> (0,1)

  • ImageNet winner's paper

  • The winner's paper has many good ideas. It covers ReLU, ResNet and Kaiming initialization

  • Section 2.2 notes that ReLU networks are easier to train, but networks with more than 8 layers have difficulty converging

    "Rectifier networks are easier to train"

    "Very deep models > 8 conv layers have difficulties to converge"

You may have seen Glorot initialization (2010). That paper is very good and very influential. In the next few lessons we will actually re-implement most of it; it describes a suggestion for how to initialize neural networks.

But as the network gets deeper, the gradients vanish, because the denominator in the Glorot formula is too large for ReLU networks. So the ImageNet authors made a small change, replacing the 6 with a 2 (Glorot uses a uniform bound of sqrt(6 / (n_in + n_out)); He et al. use a standard deviation of sqrt(2 / n_in)).

# kaiming init / he init for relu
w1 = torch.randn(m,nh)*math.sqrt(2/m)

w1.mean(),w1.std()
(tensor(0.0003), tensor(0.0506))
t = relu(lin(x_valid, w1, b1))
t.mean(),t.std()
(tensor(0.5896), tensor(0.8658))

Now the standard deviation is much closer to 1, although the mean is still around 0.5.

This paper is worth studying in depth. Another interesting point they address is that a conv layer is very similar to a matrix multiplication, so the same analysis applies.

The bias b is not very important for this analysis; here it is initialized to zero.

They then take you step by step through how the variance changes throughout the network.
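How the variance changes through the network can be sketched with a toy experiment. Everything here is an illustrative assumption (random input, a width of 512, a depth of 20, no biases), not the setup from the paper: the same input is pushed through a stack of linear + ReLU layers, once with weights scaled by sqrt(1/n) and once by sqrt(2/n).

```python
import math
import torch

torch.manual_seed(0)
n, depth = 512, 20
x = torch.randn(1000, n)

def final_std(scale):
    # push x through `depth` linear+ReLU layers with weights ~ N(0, scale^2)
    a = x
    for _ in range(depth):
        w = torch.randn(n, n) * scale
        a = (a @ w).clamp_min(0.)
    return a.std().item()

std_simple = final_std(math.sqrt(1 / n))   # simplified init used earlier
std_he = final_std(math.sqrt(2 / n))       # He et al. init for ReLU
print(std_simple, std_he)
```

With sqrt(1/n), each ReLU roughly halves the second moment, so after 20 layers the activations have almost vanished. With sqrt(2/n), the factor 2 compensates for the half that ReLU throws away and the scale stays of order 1.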

  • The forward pass is a matrix multiplication, and the backward pass is a matrix multiplication with the transpose. They end up recommending sqrt(2 / number of activations). Now that we know how to scale the weights and compute Kaiming normal initialization ourselves, let's use the PyTorch version
  • std becomes close to 1, but the mean is still about 0.5, because ReLU removes the activations below 0.
  • I haven't seen anyone talk about this in the literature. It is something I just tried last week, and it is an obvious thing: instead of max(0, x), use max(0, x) - 0.5
  • In my brief experiments this seems to help, so it is another thing you can try, to see whether it really helps or whether I am just imagining things. It certainly brings the mean back to roughly zero

fan_in preserves the magnitude of the variance of the weights in the forward pass

fan_out preserves the magnitudes in the backward pass

from torch.nn import init

w1 = torch.zeros(m,nh)
init.kaiming_normal_(w1, mode='fan_out')
t = relu(lin(x_valid, w1, b1))
  • Basically, it means you divide either by root(m) or by root(nh). If you divide by root(m), as you will see in that part of the paper (which I recommend you read), the variance stays at 1 during the forward pass; if you use nh, you get unit variance in the backward pass
  • So why are we using fan_out here? Are we dividing by root(m) or by root(nh)? Our weight shape is 784 x 50, but PyTorch stores it the other way around (50 x 784). How does this work?
import torch.nn
torch.nn.Linear(m,nh).weight.shape
torch.Size([50, 784])
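A short sketch of why `mode='fan_out'` is the right choice for the (784, 50) layout used in these notes. For a 2-D tensor, `torch.nn.init` takes fan_in from the second dimension and fan_out from the first, so with the weight stored as (inputs, outputs) the two names are effectively swapped. The variable names below are illustrative only.

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)
m, nh = 784, 50

# layout used in these notes: (inputs, outputs)
w_notes = torch.empty(m, nh)
init.kaiming_normal_(w_notes, mode='fan_out')      # std = sqrt(2 / size(0)) = sqrt(2/784)

# layout used by nn.Linear: (outputs, inputs)
w_torch = torch.empty(nh, m)
init.kaiming_normal_(w_torch, mode='fan_in')       # std = sqrt(2 / size(1)) = sqrt(2/784)

print(w_notes.std(), w_torch.std(), math.sqrt(2 / m))
print(torch.nn.Linear(m, nh).weight.shape)         # torch.Size([50, 784])
```

Both calls end up dividing by root(m), the number of inputs, which is what keeps the variance stable in the forward pass.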

Look at the source with torch.nn.Linear.forward?? (in PyTorch, **F always refers to torch.nn.functional**):

# Source:   
    def forward(self, input):
        return F.linear(input, self.weight, self.bias)

Looking at torch.nn.functional.linear??, the docstring and the code show that the weight is transposed, **weight.t()**, which is the reason for the reversed dimensions.

def linear(input, weight, bias=None):
    # type: (Tensor, Tensor, Optional[Tensor]) -> Tensor
    r"""
    Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.

    Shape:

        - Input: :math:`(N, *, in\_features)` where `*` means any number of
          additional dimensions
        - Weight: :math:`(out\_features, in\_features)`
        - Bias: :math:`(out\_features)`
        - Output: :math:`(N, *, out\_features)`
    """
    if input.dim() == 2 and bias is not None:
        # fused op is marginally faster
        ret = torch.addmm(torch.jit._unwrap_optional(bias), input, weight.t())
    else:
        output = input.matmul(weight.t())	# transposed here, so PyTorch stores the weight matrix as (out, in)
        if bias is not None:
            output += torch.jit._unwrap_optional(bias)
        ret = output
    return ret
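The transposition can be checked directly. A minimal sketch with random tensors (the shapes follow the 784 → 50 layer from these notes): `F.linear` with a weight stored as (out_features, in_features) gives the same result as multiplying by the transposed weight by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 784)
weight = torch.randn(50, 784)     # (out_features, in_features), as in nn.Linear
bias = torch.randn(50)

out_f = F.linear(x, weight, bias)
out_manual = x @ weight.t() + bias

print(out_f.shape)                                   # torch.Size([4, 50])
print(torch.allclose(out_f, out_manual, atol=1e-3))  # the two agree up to float error
```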

How does pytorch initialize linear and convolution layers?


Look at the initialization of the convolution layer: note that it passes a=math.sqrt(5), and the result is not very good.

kaiming_uniform_ is used, which is basically the same as kaiming_normal_ but with a uniform distribution, and it is called with a=math.sqrt(5). This math.sqrt(5) does not seem to be documented anywhere, and it seems to work pretty badly, so it is very useful to look at the source code.

# Source:
    def reset_parameters(self):
        n = self.in_channels
        # Note a=math.sqrt(5): it is undocumented and the result is not very good.
        init.kaiming_uniform_(self.weight, a=math.sqrt(5)) # kaiming_uniform is used for initialization
        if self.bias is not None:
            fan_in, _ = init._calculate_fan_in_and_fan_out(self.weight)
            bound = 1 / math.sqrt(fan_in)
            init.uniform_(self.bias, -bound, bound)
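What `a=math.sqrt(5)` does to the scale can be worked out from the formula that `kaiming_uniform_` uses. A rough sketch on a plain 2-D tensor (a stand-in for a real layer weight; the 784/50 sizes are just the ones from these notes): the gain becomes sqrt(2 / (1 + 5)), which makes the weights noticeably smaller than the sqrt(2 / fan_in) that the He et al. paper suggests for ReLU.

```python
import math
import torch
from torch.nn import init

torch.manual_seed(0)
fan_in, fan_out = 784, 50
w = torch.empty(fan_out, fan_in)           # nn.Linear layout

init.kaiming_uniform_(w, a=math.sqrt(5))

# gain for leaky_relu with negative slope a is sqrt(2 / (1 + a^2)) = sqrt(1/3)
gain = math.sqrt(2 / (1 + 5))
bound = gain * math.sqrt(3 / fan_in)       # weights are uniform in [-bound, bound]
expected_std = bound / math.sqrt(3)        # = 1 / sqrt(3 * fan_in)

print(w.abs().max().item(), bound)         # the largest weight stays within the bound
print(w.std().item(), expected_std, math.sqrt(2 / fan_in))
```

The resulting std is about 1/sqrt(3 * fan_in), well below sqrt(2 / fan_in), which is consistent with the observation that this default does not keep the activations at unit variance.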

Back to the activation function

We want the mean to be zero and the standard deviation to be close to 1

  • Let's try ReLU - 0.5, that is, the ReLU output shifted down by 0.5.
  • This is equivalent to defining a new activation function

So let's try this and subtract 0.5 from our ReLU. That's pretty cool: we have just designed our own new activation function. Is it any good? I don't know. But when people write papers, this is the level of tweak they are often about: a small change to a single line of code. It may be interesting to see how much it helps.

  • The mean() becomes close to 0, and the standard deviation stays reasonably close to 1 (around 0.8 in the runs below).
  • Both of these points make sense, which is why I think we will see better results
def relu(x): 
    return x.clamp_min(0.) - 0.5 # Redefined our new activation function

for i in range(10):
    # kaiming init / he init for relu
    w1 = torch.randn(m,nh)*math.sqrt(2./m )
    t1 = relu(lin(x_valid, w1, b1))
    print(t1.mean(), t1.std(), '| ')
tensor(0.0482) tensor(0.7982) | 
tensor(0.0316) tensor(0.8060) | 
tensor(0.1588) tensor(0.9367) | 
tensor(0.0863) tensor(0.8403) | 
tensor(-0.0310) tensor(0.7310) | 
tensor(0.0467) tensor(0.7965) | 
tensor(0.1252) tensor(0.8700) | 
tensor(-0.0610) tensor(0.7189) | 
tensor(0.0264) tensor(0.7755) | 
tensor(0.1081) tensor(0.8605) | 
  • With init, ReLU and matrix multiplication, we can do a forward propagation

Our first model

In PyTorch, a model can also simply be a function:

def relu(x): 
    return x.clamp_min(0.) - 0.5

def lin(x, w, b):
    return x@w + b

def model(xb):
    l1 = lin(xb, w1, b1)
    l2 = relu(l1)
    l3 = lin(l2, w2, b2)
    return l3

So this is our model. It's just a function that applies a linear layer, a ReLU, and another linear layer. Let's try running it. It takes a few milliseconds (about 7 ms in the run below) to run the model on the validation set, so it's fast enough to train.

# timing it on the validation set
%timeit -n 10 _=model(x_valid)
6.71 ms ± 456 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
assert model(x_valid).shape==torch.Size([x_valid.shape[0],1])

Loss function: MSE

The next thing we need is a loss function. As I said, for now we will simplify things by using mean squared error, although this is obviously a poor choice here. Our model returns something of size [10000, 1], but the targets have size [10000].

We need **squeeze() to remove the trailing (, 1)** in order to use MSE. (Of course, MSE is not a suitable loss function for multi-class classification; we will use a better loss function soon. For simplicity, we will use MSE for now.)

model(x_valid).shape
torch.Size([10000, 1])

A warning about calling output.squeeze() lazily, without arguments: many of the broken-code reports on the fastai forum come from a batch of size 1, where squeeze() also removes the batch dimension, the tensor becomes a scalar, and the code falls over. It is therefore best to specify the dimension when using squeeze.
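The batch-size-1 problem is easy to reproduce. A tiny sketch with zero tensors standing in for model outputs:

```python
import torch

out_batch = torch.zeros(4, 1)    # model output for a batch of 4
out_single = torch.zeros(1, 1)   # model output for a batch of 1

print(out_batch.squeeze().shape)       # torch.Size([4])
print(out_single.squeeze().shape)      # torch.Size([])  (batch dimension is gone)
print(out_single.squeeze(-1).shape)    # torch.Size([1]) (batch dimension kept)
```

Passing the dimension explicitly, as in `squeeze(-1)`, removes only the trailing axis and leaves the batch axis alone.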

def mse(output, targ): 
    # we want to drop the last dimension
    return (output.squeeze(-1) - targ).pow(2).mean()
# convert the targets to float tensors
y_train, y_valid = y_train.float(), y_valid.float()

# make our predictions
preds = model(x_train)		# Forward propagation
print(preds.shape)			# Check the output shape
# check our mse
print(mse(preds, y_train))
torch.Size([50000, 1])

Gradient and back propagation

How much matrix calculus do you need to know? That's up to you, but there is a good reference article: The Matrix Calculus You Need for Deep Learning

  • The one thing you should learn is the chain rule.


Taking the derivative of the squared term gives $(x^2)' = 2x$; we apply this to the loss code:

(output.squeeze(-1) - targ).pow(2).mean()

#MSE grad
inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
# Store the gradient of the previous layer, because in the chain rule, the gradient is to be multiplied.
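The rule behind the gradient line above can be written out as a short derivation. The symbols are introduced here only for illustration: $o_i$ is the model output for example $i$, $y_i$ its target, and $n$ the batch size (`inp.shape[0]` in the code).

```latex
L = \frac{1}{n}\sum_{i=1}^{n} (o_i - y_i)^2
\qquad\Longrightarrow\qquad
\frac{\partial L}{\partial o_i} = \frac{2}{n}\,(o_i - y_i)
```

The factor 2 comes from $(x^2)' = 2x$ and the division by $n$ comes from the mean. By the chain rule, every earlier layer multiplies its own local derivative by this stored gradient.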

Gradient of ReLU

  1. True (or 1) where the input is > 0
  2. False (or 0) where the input is <= 0
  3. Multiply by out.g (the gradient coming from the next layer)
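The three steps above can be checked against autograd on a handful of values. This is just a sanity-check sketch: the input values and the `upstream` tensor (which plays the role of out.g) are made up, and exact zeros are avoided because the derivative at 0 is a matter of convention.

```python
import torch

inp = torch.tensor([-2.0, -0.5, 0.5, 3.0], requires_grad=True)
out = inp.clamp_min(0.) - 0.5                   # the shifted ReLU used in these notes
upstream = torch.tensor([1.0, 2.0, 3.0, 4.0])   # plays the role of out.g

manual = (inp > 0).float() * upstream           # hand-written rule from the list above
out.backward(upstream)                          # autograd for comparison

print(manual)      # tensor([0., 0., 3., 4.])
print(inp.grad)    # same values, computed by autograd
```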

This is back propagation

We save the intermediate results so that we don't have to compute them twice. Note that the loss value itself is not needed to compute the backward pass

def mse_grad(inp, targ): 
    # grad of loss with respect to output of previous layer loss
    # the derivative of squared output x^2 => 2x
    inp.g = 2. * (inp.squeeze() - targ).unsqueeze(-1) / inp.shape[0]
def relu_grad(inp, out):
    # grad of relu with respect to input activations
    inp.g = (inp>0).float() * out.g

# Gradient solution of linear layer:
def lin_grad(inp, out, w, b):
    # grad of matmul with respect to input
    inp.g = out.g @ w.t()
    w.g = (inp.unsqueeze(-1) * out.g.unsqueeze(1)).sum(0)
    b.g = out.g.sum(0)
# Forward and backward pass
def forward_and_backward(inp, targ):
    # forward pass:
    l1 = inp @ w1 + b1
    l2 = relu(l1)
    out = l2 @ w2 + b2
    # we don't actually need the loss in backward!
    loss = mse(out, targ)
    # backward pass:
    mse_grad(out, targ)
    lin_grad(l2, out, w2, b2)
    relu_grad(l1, l2)
    lin_grad(inp, l1, w1, b1)

Back propagation is just the chain rule with the intermediate results saved, so they do not have to be recomputed each time.

The loss value itself does not appear in the gradients and is not used in back propagation.

Test and compare with pytorch version

forward_and_backward(x_train, y_train)

# Save for testing against later
w1g = w1.g.clone()
w2g = w2.g.clone()
b1g = b1.g.clone()
b2g = b2.g.clone()
ig  = x_train.g.clone()

# =========================================
# PYTORCH version for checking
# =========================================

# check against pytorch's version
xt2 = x_train.clone().requires_grad_(True)
w12 = w1.clone().requires_grad_(True)
w22 = w2.clone().requires_grad_(True)
b12 = b1.clone().requires_grad_(True)
b22 = b2.clone().requires_grad_(True)

def forward(inp, targ):
    # forward pass:
    l1 = inp @ w12 + b12
    l2 = relu(l1)
    out = l2 @ w22 + b22
    # we don't actually need the loss in backward!
    return mse(out, targ)

Comparing our results with PyTorch's, we find that they are almost the same.

loss = forward(xt2, y_train)
loss.backward()
test_near(w22.grad, w2g)
test_near(b22.grad, b2g)
test_near(w12.grad, w1g)
test_near(b12.grad, b1g)
test_near(xt2.grad, ig )


Let's do some interesting refactoring:

This is very similar to the PyTorch API. For each of these functions, we combine the forward and backward functions in one class. Relu will have its own forward and backward functions

__call__ lets an instance of a class be called like a function

We make each layer a class that implements forward propagation and back propagation. Defining __call__ means that an instance of the class can be used as a function.
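As a quick illustration of `__call__` on its own, here is a toy class. `Scale` is a hypothetical example, not part of the lesson code; it only mirrors the pattern of storing the input and output on the object.

```python
class Scale():
    # hypothetical toy layer, only here to illustrate __call__
    def __init__(self, factor):
        self.factor = factor
    def __call__(self, x):
        self.inp = x                 # remember the input, as the layers in these notes do
        self.out = x * self.factor
        return self.out

double = Scale(2)
print(double(21))        # 42, same as double.__call__(21)
print(double.inp)        # 21
```

Storing `inp` and `out` on the object is what later lets `backward` find the values it needs without any extra arguments.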

class Relu():
    def __call__(self, inp):
        self.inp = inp
        self.out = inp.clamp_min(0.)-0.5
        return self.out
    def backward(self):
        self.inp.g = (self.inp>0).float() * self.out.g

As a reminder, in the linear layer, Lin, we need the gradient of the output with respect to the weights and with respect to the bias

class Lin():
    def __init__(self, w, b): 
        self.w,self.b = w,b
    def __call__(self, inp):
        self.inp = inp
        self.out = inp@self.w + self.b
        return self.out
    def backward(self):
        self.inp.g = self.out.g @ self.w.t()
        # Creating a giant outer product, just to sum it, is inefficient!
        self.w.g = (self.inp.unsqueeze(-1) * self.out.g.unsqueeze(1)).sum(0)
        self.b.g = self.out.g.sum(0)
class Mse():
    def __call__(self, inp, targ):
        self.inp = inp
        self.targ = targ
        self.out = (inp.squeeze() - targ).pow(2).mean()
        return self.out
    def backward(self):
        self.inp.g = 2. * (self.inp.squeeze() - self.targ).unsqueeze(-1) / self.targ.shape[0]

Let's also make our model a class. There is no pytorch function or utility in this class

class Model():
    def __init__(self, w1, b1, w2, b2):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        # iterate through the layers in reverse order
        for l in reversed(self.layers): l.backward()

Let's train

# initialize the gradients
w1.g, b1.g, w2.g, b2.g = [None]*4

# create the model
model = Model(w1, b1, w2, b2)

The time consumption is quite long

%time loss = model(x_train, y_train)
CPU times: user 274 ms, sys: 44.9 ms, total: 319 ms
Wall time: 59.6 ms

Design around general classes with general functions

Let's try to reduce the amount of duplicated code. We put the shared logic in a generic Module class, and each layer then extends this base Module.

  • einsum will also be used instead of the previous outer-product-and-sum to speed up the linear layer
  • The arguments in the previous version are too complex and confusing.
  • So we create a new Module class
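The weight gradient of the linear layer can be written in three equivalent ways, which is what makes the einsum swap safe. A small sketch with random tensors (the shapes are illustrative; `out_g` plays the role of out.g):

```python
import torch

torch.manual_seed(0)
inp = torch.randn(64, 784)     # (batch, in_features)
out_g = torch.randn(64, 50)    # (batch, out_features), plays the role of out.g

g_outer = (inp.unsqueeze(-1) * out_g.unsqueeze(1)).sum(0)   # outer product per row, then sum
g_einsum = torch.einsum("bi,bj->ij", inp, out_g)            # same contraction over b
g_matmul = inp.t() @ out_g                                  # plain matrix product

print(g_outer.shape)                                        # torch.Size([784, 50])
print(torch.allclose(g_outer, g_einsum, atol=1e-3))
print(torch.allclose(g_einsum, g_matmul, atol=1e-3))
```

The first version materializes a (batch, 784, 50) tensor just to sum it away, which is why the other two are preferred.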
# ============================================
# Base class
# ============================================

class Module():
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    def forward(self): 
        """ will be implemented when extended"""
        raise Exception('not implemented')
    def backward(self): 
        self.bwd(self.out, *self.args)
# ============================================
# Relu extended from module class
# ============================================     

class Relu(Module):
    def forward(self, inp): 
        return inp.clamp_min(0.)-0.5
    def bwd(self, out, inp): 
        inp.g = (inp>0).float() * out.g
# ============================================
# linear layer extended from module class
# ============================================
class Lin(Module):
    def __init__(self, w, b): 
        self.w,self.b = w,b
    def forward(self, inp): 
        return inp@self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        # Einstein summation replaces the outer product followed by a sum
        self.w.g = torch.einsum("bi,bj->ij", inp, out.g)
        self.b.g = out.g.sum(0)
# ============================================
# MSE extended from module
# ============================================
class Mse(Module):
    def forward (self, inp, targ):
        return (inp.squeeze() - targ).pow(2).mean()
    def bwd(self, out, inp, targ): 
        inp.g = 2*(inp.squeeze()-targ).unsqueeze(-1) / targ.shape[0]
# ============================================
# Remake the model
# ============================================
class Model():
    def __init__(self):
        self.layers = [Lin(w1,b1), Relu(), Lin(w2,b2)]
        self.loss = Mse()
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x, targ)
    def backward(self):
        self.loss.backward()
        for l in reversed(self.layers): l.backward()

Let's time it again

w1.g,b1.g,w2.g,b2.g = [None]*4
model = Model()


%time loss = model(x_train, y_train)
CPU times: user 294 ms, sys: 11.2 ms, total: 306 ms
Wall time: 44.3 ms
%time model.backward()
CPU times: user 454 ms, sys: 92.4 ms, total: 547 ms
Wall time: 174 ms

Without einsum

class Lin(Module):
    def __init__(self, w, b): self.w,self.b = w,b
    def forward(self, inp): return inp@self.w + self.b
    def bwd(self, out, inp):
        inp.g = out.g @ self.w.t()
        self.w.g = inp.t() @ out.g
        self.b.g = out.g.sum(0)


w1.g,b1.g,w2.g,b2.g = [None]*4
model = Model()


%time loss = model(x_train, y_train)

CPU times: user 280 ms, sys: 33.7 ms, total: 314 ms
Wall time: 45.8 ms
%time model.backward()
CPU times: user 442 ms, sys: 70.9 ms, total: 513 ms
Wall time: 158 ms

PyTorch version: nn.Module with nn.Linear and nn.ReLU

from torch import nn

class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in,nh), nn.ReLU(), nn.Linear(nh,n_out)]
        self.loss = mse
    def __call__(self, x, targ):
        for l in self.layers: x = l(x)
        return self.loss(x.squeeze(), targ)


model = Model(m, nh, 1)
%time loss = model(x_train, y_train)

CPU times: user 280 ms, sys: 36.7 ms, total: 316 ms
Wall time: 40.5 ms


%time loss.backward()
CPU times: user 183 ms, sys: 6.87 ms, total: 190 ms
Wall time: 33.8 ms


Tags: Deep Learning Pytorch NLP

Posted by Visualant on Wed, 29 Sep 2021 06:25:27 +0530