Reading notes on Chapter 3 of Python Deep Learning

Chapter 3 Introduction to neural networks

layer

A layer is the fundamental data structure of a neural network.

Different tensor formats and different types of data processing call for different layers. For example, simple vector data, stored in 2D tensors of shape (samples, features), is usually processed with a densely connected layer (also called a fully connected or Dense layer, corresponding to Keras's Dense class). Sequence data, stored in 3D tensors of shape (samples, timesteps, features), is usually processed with a recurrent layer (such as Keras's LSTM layer). Image data, stored in 4D tensors, is usually processed with a 2D convolution layer (Keras's Conv2D).
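As a rough sketch of this correspondence in Keras (the layer sizes and input shapes below are illustrative assumptions, not from the book):

from keras import layers

# Illustrative layer choices for each tensor format (shapes are assumptions for the example)
dense = layers.Dense(32, activation='relu', input_shape=(784,))        # 2D input: (samples, features)
lstm = layers.LSTM(32, input_shape=(None, 64))                         # 3D input: (samples, timesteps, features)
conv = layers.Conv2D(32, (3, 3), activation='relu',
                     input_shape=(28, 28, 1))                          # 4D input: (samples, height, width, channels)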

model

A model is a network of layers.

Choosing a network topology constrains the space of possibilities (the hypothesis space) to a specific family of tensor operations that map input data to output data. You then need to find a good set of values for the weight tensors involved in these operations.
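For instance, a minimal Sequential model (with hypothetical input/output sizes) fixes a topology of two Dense layers; training then searches for good values of their weight tensors:

from keras import models, layers

# A linear stack of two Dense layers; the 784 inputs and 10 outputs are assumptions for the example
model = models.Sequential()
model.add(layers.Dense(32, activation='relu', input_shape=(784,)))
model.add(layers.Dense(10, activation='softmax'))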

Loss function and optimizer

A neural network with multiple outputs may have multiple loss functions (one per output). However, the gradient-descent process must be based on a single scalar loss value, so for a network with multiple losses they must be combined (for instance, averaged) into one scalar.
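As a sketch of how this looks in Keras (the two-output network and its layer names are hypothetical, not from the book), the per-output losses are reduced to one scalar, here as a weighted sum:

from keras.layers import Input, Dense
from keras.models import Model

# Hypothetical network with a classification output and a regression output
inputs = Input(shape=(100,))
x = Dense(64, activation='relu')(inputs)
class_out = Dense(1, activation='sigmoid', name='class_out')(x)
value_out = Dense(1, name='value_out')(x)

model = Model(inputs, [class_out, value_out])
# Keras reduces the two losses to a single scalar (a weighted sum) for gradient descent
model.compile(optimizer='rmsprop',
              loss={'class_out': 'binary_crossentropy', 'value_out': 'mse'},
              loss_weights={'class_out': 1.0, 'value_out': 0.5})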

For a binary classification problem you can use the binary crossentropy loss function; for a multiclass classification problem, categorical crossentropy; for a regression problem, mean squared error; and for a sequence-learning problem, connectionist temporal classification (CTC).

Example 1: Binary classification of movie reviews

This section uses the IMDB dataset, which contains 50,000 highly polarized reviews from the Internet Movie Database (IMDB). The dataset is split into 25,000 reviews for training and 25,000 reviews for testing, and both the training set and the test set contain 50% positive and 50% negative reviews.

The IMDB dataset is built into the Keras library and has already been preprocessed: the reviews (sequences of words) have been converted into sequences of integers, where each integer stands for a word in a dictionary.

The input data is a vector and the labels are scalars (1s and 0s); in other words, this is a binary classification problem. A simple stack of Dense layers with relu activation performs well on this problem.

An instance of a Dense layer might look like Dense(16, activation='relu'), where the argument (16) is the number of hidden units in the layer. A hidden unit is a dimension of the layer's representation space. The dimensionality of the representation space can be understood intuitively as "how much freedom the network has when learning internal representations". More hidden units (a higher-dimensional representation space) let the network learn more complex representations, but they make the network more computationally expensive and may lead to learning unwanted patterns (patterns that improve performance on the training data but not on the test data).

For such a stack of Dense layers, two key architecture decisions have to be made: how many layers to use, and how many hidden units per layer.

In this example we use two intermediate layers with 16 hidden units each; the third layer outputs a scalar predicting the sentiment of the current review.

The intermediate layers use relu as their activation function, and the last layer uses a sigmoid activation to output a probability in the range 0~1 (indicating how likely the sample's target is to be 1, i.e., how likely the review is to be positive).

The relu (rectified linear unit) function zeroes out negative values, while the sigmoid function "squashes" arbitrary values into the [0, 1] interval, so its output can be interpreted as a probability.

(figures: the relu and sigmoid activation functions)
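In numpy, the two functions amount to the following (a minimal sketch):

import numpy as np

def relu(x):
    return np.maximum(x, 0.)          # negative values become 0, positive values pass through

def sigmoid(x):
    return 1. / (1. + np.exp(-x))     # squashes any value into the (0, 1) interval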

Note that without an activation function such as relu (also called a non-linearity), the Dense layer would consist of only two linear operations, a dot product and an addition: output = dot(W, input) + b. The layer could then only learn linear transformations (affine transformations) of the input data: its hypothesis space would be the set of all possible linear transformations of the input data into a 16-dimensional space. Such a hypothesis space is too limited and would not benefit from multiple layers of representation, because a stack of linear layers still implements a linear operation, so adding layers would not extend the hypothesis space. To get a much richer hypothesis space that takes full advantage of deep representations, you need a non-linearity, i.e., an activation function. relu is the most popular activation function in deep learning, but there are many others, all with similarly odd names: prelu, elu, and so on.
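A naive numpy version of what a single Dense layer with relu activation computes might look like this (W and b are stand-ins for the layer's learned weights; the shapes are assumptions for illustration):

import numpy as np

def naive_dense_forward(inputs, W, b):
    # inputs: (batch, input_dim), W: (input_dim, units), b: (units,)
    return np.maximum(np.dot(inputs, W) + b, 0.)    # relu(dot(W, input) + b)

x = np.random.random((2, 10000))       # a batch of 2 fake samples
W = np.random.random((10000, 16))      # weights of a 16-unit layer
b = np.zeros(16)
print(naive_dense_forward(x, W, b).shape)   # (2, 16)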

We also need to choose a loss function and an optimizer. Because this is a binary classification problem and the network outputs a probability (the last layer uses a sigmoid activation and contains a single unit), binary_crossentropy loss is the best choice. It is not the only viable option; you could also use mean_squared_error, for example. But crossentropy is usually the best choice for models that output probabilities. Crossentropy is a quantity from the field of information theory that measures the distance between probability distributions; in this example, between the true distribution and the predictions.
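For a single sample with true label y (0 or 1) and predicted probability p, binary crossentropy reduces to the following (a quick numerical illustration, not from the book):

import numpy as np

def binary_crossentropy(y_true, p):
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(binary_crossentropy(1, 0.9))   # ~0.105: confident and correct -> small loss
print(binary_crossentropy(1, 0.1))   # ~2.303: confident and wrong  -> large loss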

The code for the whole process follows (incidentally, plotting with MATLAB and Matplotlib is remarkably similar, both in syntax and in the look of the figures; they are clearly siblings, haha):

from keras.datasets import imdb
from keras import models
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# One-hot encode the integer sequences into binary vectors
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
x_train = vectorize_sequences(train_data)                # Vectorize the training data
x_test = vectorize_sequences(test_data)                  # Vectorize the test data
y_train = np.asarray(train_labels).astype('float32')     # Vectorize the labels
y_test = np.asarray(test_labels).astype('float32')       # Vectorize the labels

# Build the network by stacking layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# Set aside 10,000 samples from the original training data as a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Pass the optimizer, loss function, and metrics as strings (rmsprop, binary_crossentropy, and accuracy are all built into Keras)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

# Train the model for 20 epochs in mini-batches of 512 samples (i.e., 20 iterations over all samples in the x_train and y_train tensors),
# while monitoring loss and accuracy on the 10,000 held-out samples passed via the validation_data argument
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

# The History object has a history member, a dictionary containing data about everything that happened during training, which we use for plotting
history_dict = history.history 
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
acc = history_dict['acc']
val_acc = history_dict['val_acc']
epochs = range(1, len(loss_values) + 1)

# Plot the training and validation loss
plt.figure(1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')                # 'bo' means blue dots
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')           # 'b' means a solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()

# Plot the training and validation accuracy
plt.figure(2)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Run the program, and the results are shown in the following figure:

(figures: training and validation loss, and training and validation accuracy, by epoch)

I have some experience with this: it is obviously overfitting (the model does very well on the training set but much worse on the held-out data).

Based on the two figures above, we can stop training after the fourth epoch. The modified code is as follows:

from keras.datasets import imdb
from keras import models
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# One-hot encode the integer sequences into binary vectors
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
x_train = vectorize_sequences(train_data)                # Vectorize the training data
x_test = vectorize_sequences(test_data)                  # Vectorize the test data
y_train = np.asarray(train_labels).astype('float32')     # Vectorize the labels
y_test = np.asarray(test_labels).astype('float32')       # Vectorize the labels

# Build the network by stacking layers
model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# Set aside 10,000 samples from the original training data as a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]

# Pass the optimizer, loss function, and metrics as strings (rmsprop, binary_crossentropy, and accuracy are all built into Keras)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

# To prevent overfitting, train for only 4 epochs; with so few epochs there is no need to plot the curves
model.fit(partial_x_train, partial_y_train, epochs=4, batch_size=512, validation_data=(x_val, y_val))
results = model.evaluate(x_test, y_test)
print(results)

Run the program, and the results are as follows:

(program output: [test loss, test accuracy])

The second number (0.87508) is the accuracy: with only four epochs of training the network already reaches about 87% accuracy, which feels pretty decent.

The trained network can also be used for prediction by executing the following Python statement:

model.predict(x_test)

For the evaluate and predict functions, see this blog: https://blog.csdn.net/DoReAGON/article/details/88552348. In short, evaluate takes both x and y and measures performance, while predict takes only x and predicts y.
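As a quick sketch of the difference (using the variables from the code above):

# evaluate needs both x and y and returns the loss and metric values
loss, acc = model.evaluate(x_test, y_test)

# predict needs only x and returns the model's outputs (here, probabilities in [0, 1])
probs = model.predict(x_test)
pred_labels = (probs > 0.5).astype('int32')   # threshold at 0.5 to get class predictions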

Summary of this example:

  • The raw data usually needs to be preprocessed so it can be fed into a neural network as tensors. For example, sequences of words can be encoded as binary vectors.
  • A stack of Dense layers with relu activation can solve a wide range of problems (including sentiment classification).
  • For a binary classification problem (two output classes), the last layer of the network should be a Dense layer with one unit and a sigmoid activation; the network's output should be a scalar between 0 and 1, representing a probability.
  • With such a sigmoid scalar output on a binary classification problem, the loss function should be binary_crossentropy.
  • Whatever your problem, the rmsprop optimizer is usually a good enough choice; that is one less thing to worry about.
  • As a neural network gets better and better on its training data, it will eventually overfit and get worse and worse results on data it has never seen before. Always monitor performance on data outside the training set.

Example 2: Classifying newswires (single-label, multiclass classification)

Each data point belongs to exactly one category, so it is single-label; there are many categories, so it is multiclass.

This section uses the Reuters dataset, which contains short newswires and their corresponding topics, published by Reuters in 1986. It is a simple and widely used text classification dataset. There are 46 different topics: some topics have more samples than others, but each topic has at least 10 examples in the training set.

There are two ways to vectorize the labels: convert the label list to an integer tensor, or use one-hot encoding. In this example we one-hot encode the labels: each label is represented as an all-zero vector in which only the element at the label's index is 1. (This operation is built into Keras.)
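A manual equivalent of that built-in operation (Keras's to_categorical) would look roughly like this:

import numpy as np

def to_one_hot(labels, dimension=46):
    results = np.zeros((len(labels), dimension))
    for i, label in enumerate(labels):
        results[i, label] = 1.          # set only the element at the label's index to 1
    return results

one_hot_train_labels = to_one_hot(train_labels)
one_hot_test_labels = to_one_hot(test_labels)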

In a stack of Dense layers, each layer can only access the information present in the output of the previous layer. If one layer drops some information relevant to the classification problem, that information can never be recovered by later layers: each layer can become an information bottleneck. The previous example used 16-dimensional intermediate layers, but for this example a 16-dimensional space may be too small to learn to separate 46 different classes: such a small layer could become an information bottleneck and permanently drop relevant information. In other words, when we try to cram a lot of information (enough to recover the separating hyperplanes of 46 classes) into a low-dimensional intermediate space (such as a 4-dimensional one), the network may squeeze most of the necessary information into that compressed representation, but not all of it. For this reason, this example uses larger layers with 64 units.

For this example the best loss function is categorical_crossentropy. It measures the distance between two probability distributions: here, between the probability distribution output by the network and the true distribution of the labels. By minimizing this distance, training drives the network to output something as close as possible to the true labels.

The procedure is as follows:

from keras.datasets import reuters
from keras.utils.np_utils import to_categorical
from keras import models
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# One-hot encode the integer sequences into binary vectors
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

# Vectorize the training and test data (one-hot encoding)
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorize the training and test labels (one-hot encoding)
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

# Build the network
# The last layer is a Dense layer of size 46: for each input sample the network outputs a 46-dimensional vector,
# where each element corresponds to a different category
# The last layer uses a softmax activation, so the network outputs a probability distribution over the 46 categories:
# output[i] is the probability that the sample belongs to class i, and the 46 probabilities sum to 1
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

# The loss function is categorical crossentropy
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Set aside 1,000 samples from the training data as a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

# Start training network
history = model.fit(partial_x_train, partial_y_train, epochs=20, batch_size=512, validation_data=(x_val, y_val))

# Plot the training/validation loss and accuracy
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
# Note: depending on the Keras version, these keys may be 'accuracy' / 'val_accuracy' instead of 'acc' / 'val_acc'
acc = history_dict['acc']
val_acc = history_dict['val_acc']
epochs = range(1, len(loss_values) + 1)
# Plot the training and validation loss
plt.figure(1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')                # 'bo' means blue dots
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')           # 'b' means a solid blue line
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
# Plot the training and validation accuracy
plt.figure(2)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Run the program, and the results are shown in the following figure:

(figures: training and validation loss, and training and validation accuracy, by epoch)

To prevent overfitting, we stop after 9 epochs. The modified code is as follows:

from keras.datasets import reuters
from keras.utils.np_utils import to_categorical
from keras import models
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# One-hot encode the integer sequences into binary vectors
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)

# Vectorize the training and test data (one-hot encoding)
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorize the training and test labels (one-hot encoding)
one_hot_train_labels = to_categorical(train_labels)
one_hot_test_labels = to_categorical(test_labels)

# Build the network
# The last layer is a Dense layer of size 46: for each input sample the network outputs a 46-dimensional vector,
# where each element corresponds to a different category
# The last layer uses a softmax activation, so the network outputs a probability distribution over the 46 categories:
# output[i] is the probability that the sample belongs to class i, and the 46 probabilities sum to 1
model = models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(46, activation='softmax'))

# The loss function is categorical crossentropy
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])

# Set aside 1,000 samples from the training data as a validation set
x_val = x_train[:1000]
partial_x_train = x_train[1000:]
y_val = one_hot_train_labels[:1000]
partial_y_train = one_hot_train_labels[1000:]

# To prevent overfitting, train for only 9 epochs
model.fit(partial_x_train, partial_y_train, epochs=9, batch_size=512, validation_data=(x_val, y_val))
results = model.evaluate(x_test, one_hot_test_labels)
print(results)

Run the program and the results are as follows:

(program output: [test loss, test accuracy])

The accuracy is about 79%, well above what a random classifier would achieve.

The trained network can also be used to generate predictions:

predictions = model.predict(x_test)

Each element of predictions is a vector of length 46, and the elements of this vector sum to 1. The largest element is the predicted category, i.e., the class with the highest probability:

>>> np.argmax(predictions[0])
4

The labels could also be encoded differently, namely cast to an integer tensor:

y_train = np.array(train_labels)
y_test = np.array(test_labels)

In this case the only thing that changes is the choice of loss function: with integer labels you should use sparse_categorical_crossentropy.
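A sketch of the corresponding compile call (the rest of the network stays unchanged):

model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',   # expects integer labels instead of one-hot vectors
              metrics=['accuracy'])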

Summary of this example:

  • If you are trying to classify data points among N classes, the last layer of the network should be a Dense layer of size N.
  • For a single-label, multiclass classification problem, the last layer should use a softmax activation so that it outputs a probability distribution over the N output classes.
  • The loss function for such a problem should almost always be categorical crossentropy. It minimizes the distance between the probability distribution output by the network and the true distribution of the targets.
  • There are two ways to handle the labels in multiclass classification: encode them via categorical encoding (also known as one-hot encoding) and use categorical_crossentropy as the loss function, or encode them as integers and use the sparse_categorical_crossentropy loss function.
  • If you need to classify data into a large number of categories, avoid intermediate layers that are too small, which would create information bottlenecks in the network.

Example 3: Predicting house prices (regression)

This section predicts the median price of homes in Boston suburbs in the mid-1970s, given data points about the suburbs at that time such as the crime rate and the local property tax rate. The dataset used here has an interesting difference from the previous two examples: it has relatively few data points, only 506, split into 404 training samples and 102 test samples. Moreover, each feature of the input data (for example, the crime rate) has a different scale: some features are proportions taking values between 0 and 1, others take values between 1 and 12, others between 0 and 100, and so on.

Data with such heterogeneous ranges should be normalized before being fed into a neural network: for each feature of the input data (each column of the input data matrix), subtract the mean of the feature and divide by its standard deviation, so that the feature is centered on 0 and has unit standard deviation.

To evaluate the network while tuning its parameters (such as the number of training epochs), you could split the data into a training set and a validation set as in the previous examples. But because there are so few data points, the validation set would be very small (about 100 samples), and the validation score could fluctuate a lot depending on which points end up in the validation and training sets. In other words, the validation scores could have a high variance with respect to the validation split, which would prevent a reliable evaluation of the model.

In such a situation, the best practice is to use K-fold cross-validation (see Figure 3-11). This method splits the available data into K partitions (K is typically 4 or 5), instantiates K identical models, trains each one on K - 1 partitions while evaluating it on the remaining partition, and takes the average of the K validation scores as the model's validation score.

(Figure 3-11: K-fold cross-validation)

Write the following code:

from keras.datasets import boston_housing
from keras import models
from keras import layers
import numpy as np

# Because so few samples are available, we use a very small network with two hidden layers of 64 units each.
# In general, the less training data you have, the worse overfitting will be, and a small network is one way to mitigate it.
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))                # The last layer has a single unit and no activation: a linear layer, the typical setup for scalar regression
    # Compile with the mse loss function (mean squared error: the square of the difference between predictions and targets)
    # and monitor mean absolute error (MAE: the absolute value of the difference between predictions and targets)
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# Normalize the data using the mean and standard deviation of the training data
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std

# K-fold validation
k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []

for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition #k
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]        
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepare the training data: data from all the other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples], 
        train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], 
        train_targets[(i + 1) * num_val_samples:]], axis=0)
    # Build model
    model = build_model()
    # verbose=0 trains the model in silent mode
    model.fit(partial_train_data, partial_train_targets, epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate model on validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

print(all_scores)
print(np.mean(all_scores))

Next, we train for longer (500 epochs) and save the per-epoch validation score of each fold. The modified code is as follows:

from keras.datasets import boston_housing
from keras import models
from keras import layers
import numpy as np
import matplotlib.pyplot as plt

# Because so few samples are available, we use a very small network with two hidden layers of 64 units each.
# In general, the less training data you have, the worse overfitting will be, and a small network is one way to mitigate it.
def build_model():
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu', input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))                # The last layer has a single unit and no activation: a linear layer, the typical setup for scalar regression
    # Compile with the mse loss function (mean squared error: the square of the difference between predictions and targets)
    # and monitor mean absolute error (MAE: the absolute value of the difference between predictions and targets)
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# Normalize the data using the mean and standard deviation of the training data
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
test_data -= mean
test_data /= std

# K-fold validation
k = 4
num_val_samples = len(train_data) // k
num_epochs = 500
all_mae_histories = []

for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition #k
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]        
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # Prepare the training data: data from all the other partitions
    partial_train_data = np.concatenate([train_data[:i * num_val_samples], 
        train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate([train_targets[:i * num_val_samples], 
        train_targets[(i + 1) * num_val_samples:]], axis=0)
    # Build model
    model = build_model()
    # Train the model (verbose=0 for silent mode)
    history = model.fit(partial_train_data, partial_train_targets, 
        validation_data=(val_data, val_targets),
        epochs=num_epochs, batch_size=1, verbose=0)
    # Note: in newer Keras versions this key may be 'val_mae' instead of 'val_mean_absolute_error'
    mae_history = history.history['val_mean_absolute_error']
    all_mae_histories.append(mae_history)

# Compute the average of the per-epoch K-fold validation scores
# Each x in all_mae_histories is the history of one fold (4 in total, each an array of length 500);
# for a fixed epoch i, we average the four values x[i]
average_mae_history = [np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]

# Plot the validation scores
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

Run the program, and the results are shown in the following figure:

(figure: validation MAE by epoch)

Because the vertical axis covers a large range and the data is relatively noisy, it is hard to see a pattern in this figure. Let's redraw it:

Omit the first 10 data points, which are on a different scale from the rest of the curve, and replace each point with an exponential moving average of the previous points to obtain a smooth curve.

The relevant function code for smoothing the curve is as follows:

def smooth_curve(points, factor=0.9):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

Modify the code of drawing part as follows:

# Drop the first 10 points and smooth the rest
smooth_mae_history = smooth_curve(average_mae_history[10:])

# Plot the validation scores
plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

Rerun the code, and the results are shown in the following figure:

(figure: smoothed validation MAE by epoch)

The figure shows that validation MAE stops improving significantly after about 80 epochs; past that point the model starts overfitting. We can therefore limit training to 80 epochs and rerun the program.
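A sketch of that final run under these settings (batch_size=16 is an assumption, not fixed by the notes above):

# Train a fresh model on all of the training data for 80 epochs, then evaluate on the test set
model = build_model()
model.fit(train_data, train_targets, epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
print(test_mae_score)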

Summary of this example:

  • Regression problems use different loss functions than classification problems: the most common loss for regression is mean squared error (MSE). Likewise, the evaluation metrics differ: the most common regression metric is mean absolute error (MAE).
  • When features of the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.
  • When little data is available, K-fold validation is a reliable way to evaluate a model.
  • When little training data is available, it is preferable to use a small network with few hidden layers (typically only one or two) to avoid severe overfitting.
