[Deep Learning] 100 Days of Learning PyTorch, Day 7: Model Evaluation and Selection (Part 1): Underfitting and Overfitting (with Source Code)

  • ✨ This article is part of the column "[Deep Learning]: Learning PyTorch Together in 100 Days". The column records my notes on implementing deep learning with PyTorch, and I try to update it every week.
  • 🌸 Personal home page: JoJo's Data Analysis Adventure
  • 📝 About me: I am a third-year undergraduate studying statistics, and I have been admitted to a top-3 statistics university to continue with postgraduate study in statistics

Reference: this column mainly follows Mu Shen's (Mu Li's) "Dive into Deep Learning" and records my own study notes. My ability is limited, so corrections are welcome if you find errors. Mu Shen has also published teaching videos and teaching materials that you can study from.

1. basic concepts

The task of machine learning is to discover general patterns: to learn the underlying rule from the training set so that the model is accurate on data it has never seen. But how can we tell that a model has really found such a rule rather than simply memorized the training data? We can only train on a finite sample, and when we collect more data we may find that the relationship the model learned no longer holds for the new samples. Let's introduce some basic concepts for evaluating machine learning models.

1.1 training error and generalization error

  • Training error: the error of the model on the training set
  • Generalization error: the expected error of the model on new data drawn from the same distribution as the training data

In reality, we can never compute the generalization error exactly, because the underlying distribution is unknown. In practice, we can only estimate it by applying the model to an independent test set.
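The estimate described above can be sketched as follows. This is only an illustration under assumed conditions: the toy distribution y = 2x + 1 + noise, the sample sizes, and the helper names `make_data` and `mse` are not from the original material. A model is fitted on a small training sample, and a large independent sample stands in for the unknown distribution.

```python
import torch

torch.manual_seed(0)

def make_data(n):
    """Draw n samples from y = 2x + 1 + noise (an assumed toy distribution)."""
    x = torch.randn(n, 1)
    y = 2 * x + 1 + 0.5 * torch.randn(n, 1)
    return x, y

def mse(w, b, x, y):
    """Mean squared error of the line w * x + b on the given samples."""
    return ((x * w + b - y) ** 2).mean().item()

x_train, y_train = make_data(20)      # small training set
x_test, y_test = make_data(10000)     # large independent test set

# Least-squares fit on the training set only
A = torch.cat([x_train, torch.ones_like(x_train)], dim=1)
solution = torch.linalg.lstsq(A, y_train).solution  # shape (2, 1)
w, b = solution[0].item(), solution[1].item()

train_error = mse(w, b, x_train, y_train)
test_error = mse(w, b, x_test, y_test)  # estimate of the generalization error
print(f"training error: {train_error:.3f}")
print(f"estimated generalization error: {test_error:.3f}")
```

Because the test samples play no part in the fit, the test error is an honest estimate of how the model behaves on new data, while the training error is measured on the very points the model was fitted to.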

1.2 training set, validation set and test set

  • Training set: used to train the model and obtain the model parameters
  • Validation set: used to select models and tune hyperparameters
  • Test set: used to evaluate the final model

A vivid metaphor: the training set is like the everyday exercises, the validation set is like the regular quizzes, and the test set is like the final exam. Doing well on the exercises is a precondition for doing well in the final exam. But if you peek at the answers to the exercises, your exercise scores will be high while your final exam score, where no answers are available, will be poor. The quizzes check what you have really learned, so that a high exercise score obtained by peeking does not fool you.

When training, we do not want to use the test set, because tuning the model against the test set makes it easy to overfit to it, and the resulting evaluation becomes too optimistic. We therefore split the data into a training set, a validation set and a test set. In practical applications, however, the distinction between validation set and test set is often blurred, and many times only a training set and a validation set are used. In what follows, we therefore focus on the error on the validation set.
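A three-way split can be sketched with `torch.utils.data.random_split`. The 70/15/15 proportions, the random data and the seeds below are assumptions chosen for illustration, not values from the original material.

```python
import torch
from torch.utils.data import TensorDataset, random_split

torch.manual_seed(0)
X = torch.randn(1000, 3)   # 1000 samples with 3 features (toy data)
y = torch.randn(1000)
dataset = TensorDataset(X, y)

# A fixed generator makes the split reproducible
generator = torch.Generator().manual_seed(42)
train_set, val_set, test_set = random_split(
    dataset, [700, 150, 150], generator=generator)

print(len(train_set), len(val_set), len(test_set))
```

Each returned subset can be passed to a `DataLoader`. The test subset should be touched only once, after all model selection on the validation subset is finished.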

1.3 cross validation

We have discussed the training error and the validation error. The validation error is often computed by cross-validation:

  • Leave-one-out cross-validation (LOOCV)
    In leave-one-out cross-validation, one sample at a time serves as the validation set and the remaining n-1 samples form the training set. In the first round, for example, $(x_1,y_1)$ is held out and the model is fitted on $\{(x_2,y_2),...,(x_n,y_n)\}$. As shown in the following figure:

    This amounts to training the model n times, and the validation error of a given model is estimated by the average validation error over these n fits. The validation error from the first round is $MSE_1=(y_1-\hat{y}_1)^2$. Repeating this for every sample also gives $MSE_2,...,MSE_n$. Finally, we average them to obtain the LOOCV estimate of the test MSE:

$$CV_{(n)}=\frac{1}{n}\sum_{i=1}^{n}MSE_i.$$

  • K-fold cross validation

The idea of K-fold cross-validation is to randomly split the data set into k groups of roughly equal size. In each round, one group serves as the validation set and the remaining k-1 groups form the training set; in the first round, the first group is the validation set. Leave-one-out cross-validation is the special case k=n. As in LOOCV, $MSE_1$ is the average error on the validation group in the first round. After repeating this k times, once for each group, we obtain the validation error of k-fold cross-validation:
$$CV_{(k)}=\frac{1}{k}\sum_{i=1}^{k}MSE_i.$$
The figure below shows the schematic diagram of 5-fold cross-validation:
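The procedure can be sketched in a few lines of NumPy. The function name `k_fold_mse`, the toy linear data and the least-squares fit via `np.polyfit` are assumptions made for this illustration; any model could be trained inside the loop instead.

```python
import numpy as np

def k_fold_mse(x, y, k, degree=1, seed=0):
    """Estimate the validation MSE of a polynomial least-squares fit by k-fold CV."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))      # shuffle before splitting
    folds = np.array_split(indices, k)     # k groups of (almost) equal size
    fold_mse = []
    for i in range(k):
        val_idx = folds[i]                 # group i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
        pred = np.polyval(coeffs, x[val_idx])
        fold_mse.append(np.mean((y[val_idx] - pred) ** 2))  # MSE_i
    return float(np.mean(fold_mse))        # CV_(k): average of MSE_1..MSE_k

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=50)

cv_5 = k_fold_mse(x, y, k=5)               # 5-fold cross-validation
cv_n = k_fold_mse(x, y, k=len(x))          # k = n gives leave-one-out
print(f"5-fold CV error: {cv_5:.4f}, LOOCV error: {cv_n:.4f}")
```

With k = n every fold contains a single sample, so the same function also computes the LOOCV estimate, at the cost of n model fits instead of 5.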

1.4 model complexity

After training a model, we compute its training error and validation error. Two situations often arise: overfitting and underfitting.

  • Underfitting: the model performs poorly even on the training set; it cannot fit the training data well
  • Overfitting: the model performs well on the training set but poorly on the validation or test set
  • Regularization: a technique that can be used to deal with overfitting

When the model underfits, we can consider training a more complex model. When the model overfits, we need to reduce its complexity. The relationship is shown in the following figure:

Generally speaking, when a lot of data is available, a more complex model can be used; when data is scarce, a simpler model is preferable.
Let's take polynomial regression as an example to see these quantities in practice.

2. polynomial regression

Having introduced the concepts above, let's look at a concrete example with polynomials. Polynomial regression is defined as follows:
$$y = \beta_0 + \beta_1X+\beta_2X^2+\beta_3X^3+...+\beta_nX^n$$
When $\beta_2,...,\beta_n$ are all 0, this reduces to simple univariate linear regression, so a higher-order polynomial model contains the lower-order ones as special cases. The higher the order, the more complex the model. Next, we generate data from a cubic polynomial, fit polynomial regression models of different orders, and observe the training error and validation error.
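Because lower-order polynomials are special cases of higher-order ones, the least-squares training error cannot increase when the order grows. A minimal sketch of this, assuming a toy cubic data set and a closed-form least-squares fit with `np.linalg.lstsq` rather than the SGD training used in this article (the helper `train_mse` is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# Assumed cubic ground truth plus Gaussian noise with standard deviation 0.1
y = (5.1 + 1.2 * x - 3.1 * x**2 / 2 + 5.1 * x**3 / 6
     + rng.normal(scale=0.1, size=100))

def train_mse(degree):
    """Training MSE of the least-squares polynomial fit of the given degree."""
    # Design matrix with columns 1, x, x^2, ..., x^degree
    X = np.vander(x, degree + 1, increasing=True)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ beta - y) ** 2))

errors = {d: train_mse(d) for d in (1, 2, 3, 5)}
for d, e in errors.items():
    print(f"degree {d}: training MSE = {e:.4f}")
```

The training error alone therefore always favors the most complex model, which is why the validation error is needed to choose the order.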

import math
import numpy as np
import torch
import matplotlib.pyplot as plt  # needed by the Animator class below
from torch import nn
from torch.utils import data
from IPython import display

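Generate a synthetic data set whose true relationship is a cubic polynomial. With the settings below, the labels follow y = 5.1 + 1.2x - 3.1x^2/2! + 5.1x^3/3! + ε, where ε is Gaussian noise with standard deviation 0.1.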

max_degree = 20  # Maximum order of polynomial
n_train, n_test = 100, 100  # Training and test data set size
true_w = np.zeros(max_degree)  # Set w
true_w[0:4] = np.array([5.1, 1.2, -3.1, 5.1])

features = np.random.normal(size=(n_train + n_test, 1))
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)  # gamma(i + 1) = i!, so column i holds x^i / i!
# Shape of labels: (n_train + n_test,)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)

Next, the NumPy arrays are converted to tensors

true_w, features, poly_features, labels = [torch.tensor(x, dtype=
    torch.float32) for x in [true_w, features, poly_features, labels]]

Next, we need to define some helper functions. They come from the d2l library that accompanies Mu Shen's teaching materials, and you can install and import d2l directly. However, I ran into errors when installing d2l, so if you do not want to install it, you can use the functions below. You can also put them into your own package for easy import.

# Define the data iterator function
def load_array(data_arrays, batch_size, is_train=True):
    """Construct a PyTorch data iterator"""
    dataset = data.TensorDataset(*data_arrays)  # Wrap the tensors in a dataset
    return data.DataLoader(dataset, batch_size, shuffle=is_train)
# Define a class that accumulates sums over several variables
class Accumulator:  #@save
    """Accumulate sums over n variables"""
    def __init__(self, n):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def reset(self):
        self.data = [0.0] * len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
# Define the accuracy function
def accuracy(y_hat, y):  #@save
    """Count the number of correct predictions"""
    if len(y_hat.shape) > 1 and y_hat.shape[1] > 1:
        y_hat = y_hat.argmax(axis=1)
    cmp = y_hat.type(y.dtype) == y
    return float(cmp.type(y.dtype).sum())
# Define the loss evaluation function
def evaluate_loss(net, data_iter, loss):  #@save
    """Evaluate the loss of a model on a given dataset"""
    metric = Accumulator(2)  # Sum of losses, number of samples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]
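# Training function for one epoch
def train_epoch(net, train_iter, loss, updater):
    """Train for one epoch; return the average training loss and training accuracy"""
    # Sum of training loss, number of correct predictions, number of samples
    metric = Accumulator(3)
    for X, y in train_iter:
        y_hat = net(X)
        l = loss(y_hat, y)
        if isinstance(updater, torch.optim.Optimizer):  # PyTorch built-in optimizer
            updater.zero_grad()
            l.mean().backward()
            updater.step()
        else:  # Self-defined optimizer
            l.sum().backward()
            updater(X.shape[0])
        metric.add(float(l.sum()), accuracy(y_hat, y), y.numel())
    return metric[0] / metric[2], metric[1] / metric[2]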
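# Define the axis-setting function
def set_axes(axes, xlabel, ylabel, xlim, ylim, xscale, yscale, legend):
    """Set the axes for matplotlib"""
    axes.set_xlabel(xlabel)
    axes.set_ylabel(ylabel)
    axes.set_xscale(xscale)
    axes.set_yscale(yscale)
    axes.set_xlim(xlim)
    axes.set_ylim(ylim)
    if legend:
        axes.legend(legend)
    axes.grid()
# Define the display function
def use_svg_display():  #@save
    """Use the svg format to display plots in Jupyter"""
    from matplotlib_inline import backend_inline
    backend_inline.set_matplotlib_formats('svg')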

# Define an animation class    
class Animator:  #@save
    """Draw data in animation"""
    def __init__(self, xlabel=None, ylabel=None, legend=None, xlim=None,
                 ylim=None, xscale='linear', yscale='linear',
                 fmts=('-', 'm--', 'g-.', 'r:'), nrows=1, ncols=1,
                 figsize=(3.5, 2.5)):
        # Draw multiple lines incrementally
        if legend is None:
            legend = []
        self.fig, self.axes = plt.subplots(nrows, ncols, figsize=figsize)
        if nrows * ncols == 1:
            self.axes = [self.axes, ]
        # Capturing parameters using lambda functions
        self.config_axes = lambda: set_axes(
            self.axes[0], xlabel, ylabel, xlim, ylim, xscale, yscale, legend)
        self.X, self.Y, self.fmts = None, None, fmts

    def add(self, x, y):
        # Add multiple data points to the chart
        if not hasattr(y, "__len__"):
            y = [y]
        n = len(y)
        if not hasattr(x, "__len__"):
            x = [x] * n
        if not self.X:
            self.X = [[] for _ in range(n)]
        if not self.Y:
            self.Y = [[] for _ in range(n)]
        for i, (a, b) in enumerate(zip(x, y)):
            if a is not None and b is not None:
                self.X[i].append(a)
                self.Y[i].append(b)
        self.axes[0].cla()  # Clear the axes before redrawing all lines
        for x, y, fmt in zip(self.X, self.Y, self.fmts):
            self.axes[0].plot(x, y, fmt)
        self.config_axes()
        display.display(self.fig)
        display.clear_output(wait=True)
# Define the training function
def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):  # default follows the d2l version of this function
    loss = nn.MSELoss(reduction='none')  # Set the loss function to MSE
    input_shape = train_features.shape[-1]
    # Don't add a bias: the constant term is already a column of the polynomial features
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))  # Define the linear model
    batch_size = min(10, train_labels.shape[0])  # Batch size
    train_iter = load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)  # Training set
    test_iter = load_array((test_features, test_labels.reshape(-1,1))
                               ,batch_size)  # Test set
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)  # Train the model with minibatch SGD
    animator = Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])#Drawing related settings
    for epoch in range(num_epochs):
        train_epoch(net, train_iter, loss, trainer)  
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),#Calculate training error and draw
                                     evaluate_loss(net, test_iter, loss)))#Calculate test error and draw

    print('weight:', net[0].weight.data.numpy())

2.1 cubic polynomial regression (normal fitting)

Because the data set we generated follows a cubic polynomial, fitting it with cubic polynomial regression should be very accurate

train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])
weight: [[ 5.1068187  1.2157811 -3.1099443  5.064199 ]]

It can be seen that as the number of training epochs increases, the training error and validation error keep decreasing to less than 0.01, and the training error and validation error are basically the same. The learned weights are also close to the true values [5.1, 1.2, -3.1, 5.1]

2.2 univariate linear regression (underfitting)

Next, we fit the data with univariate linear regression. Because we know that the true relationship is cubic, a linear model cannot fit it accurately. This leads to a large model bias, and to large training and validation errors

# Select the first 2 dimensions from the polynomial features, i.e. 1 and x
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])
weight: [[3.8188436 3.0646155]]

It can be seen from the above figure that the results match our expectation. Because the model is too simple, it cannot even fit the training set well, so both the training error and the validation error are large

2.3 polynomial of degree 10 (overfitting)

Next, we fit the data with a polynomial of degree 10. Because the complexity of the model is too high, the model will overfit, and the error on the validation set will first decrease and then increase as training proceeds

# Select the first 11 dimensions from the polynomial features, i.e. degrees 0 to 10
train(poly_features[:n_train, :11], poly_features[n_train:, :11],
      labels[:n_train], labels[n_train:], num_epochs=500)
weight: [[ 5.0872297   1.2546227  -2.9732502   4.719495   -0.47507587  1.4278368
  -0.05434499  0.30877623  0.28959352  0.18821514  0.06768304]]

It can be seen from the above figure that, consistent with our expectation, the validation error first decreases and then increases. If we stop the training early, before the validation error starts to rise, we can still get good results. This technique, early stopping, will be introduced in a follow-up article
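Stopping the training ahead of time can be sketched as follows. This is not the implementation from the follow-up article, only a minimal illustration under assumed settings: a toy linear data set with few training samples and many features, full-batch gradient descent, and a hypothetical `patience` of 20 epochs.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Toy data (assumed): 20 noisy training samples, 15 features, only 3 of them relevant
n_train, n_val, n_features = 20, 200, 15
true_w = torch.zeros(n_features, 1)
true_w[:3] = torch.tensor([[1.0], [-2.0], [0.5]])
X = torch.randn(n_train + n_val, n_features)
y = X @ true_w + 0.5 * torch.randn(n_train + n_val, 1)
X_train, y_train = X[:n_train], y[:n_train]
X_val, y_val = X[n_train:], y[n_train:]

net = nn.Linear(n_features, 1)
loss = nn.MSELoss()
trainer = torch.optim.SGD(net.parameters(), lr=0.05)

patience = 20  # stop after this many epochs without improvement
best_val, best_state, best_epoch, wait = float('inf'), None, 0, 0
for epoch in range(2000):
    trainer.zero_grad()
    loss(net(X_train), y_train).backward()
    trainer.step()
    with torch.no_grad():
        val = loss(net(X_val), y_val).item()
    if val < best_val:
        # New best validation error: remember the weights
        best_val, best_epoch, wait = val, epoch, 0
        best_state = copy.deepcopy(net.state_dict())
    else:
        wait += 1
        if wait >= patience:
            break

net.load_state_dict(best_state)  # restore the weights with the lowest validation error
print(f"best epoch: {best_epoch + 1}, validation MSE: {best_val:.4f}")
```

The key design choice is to keep a copy of the best weights, so the model that is finally used is the one from the epoch with the lowest validation error and not the one from the last epoch.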

3. summary

Overfitting is a common problem in machine learning and deep learning, and it can be handled by regularization. If the model underfits, we can increase the complexity of the model. In the next chapter, we will continue with some commonly used methods for dealing with overfitting.


Tags: Deep Learning Pytorch Machine Learning

Posted by raj86 on Thu, 02 Jun 2022 06:32:04 +0530