Summary of Lecture 7 of "Ten Lectures on Machine Learning"

"Ten Lectures on Machine Learning" - Lecture 7 (Optimization)

The optimization goal of machine learning

Minimize the loss function:

Model - Loss Function - Expression:




Gradient descent:

batch gradient descent: using all training set samples, the computational cost is too high (n~10^6)

mini-batch gradient descent: randomly sample a subset (m~100 or 1000), then use the following formula:

mini-batch is an unbiased estimate. Larger batch sizes reduce the variance of gradient computations.

Stochastic Gradient Descent SGD:

Use mini-batch to calculate the result and then according to the formula of the gradient descent method:To update the parameters, the next step is to randomly sample a subset and repeat the operation. This method is called Stochastic Gradient Descent (SGD).

Therefore, there is a saying: mini-batch is the promotion of SGD, and usually the SGD to which it belongs is mini-batch.

Because it is a training sample to update the parameters once, the gradient is usually in the form of oscillation, but the general direction is still a downward trend:


The setting of the learning rate in SGD is the key.

Theoretically sufficient conditions for SGD convergence are guaranteed:. Therefore, the learning rate needs to be reduced as the number of iterations increases.

How to set the learning rate:


Problems with Gradient Descent in Practical Applications

Sick condition:

Right image: Assuming that the surrounding area is surrounded by cliffs, there will be a vertical descent when the gradient descent algorithm is performed.

Local minimum and global minimum:


As shown in the figure on the right, it can be intuitively seen that the minimum value is at the third concave. However, in the actual algorithm, depending on the set starting point, the obtained "minimum value" may not be the global minimum.

Saddle point:

The gradient is 0, and the Hessian matrix has both positive and negative values; the probability of all eigenvalues ​​of the Heissan matrix being positive is low; for high-dimensional cases, the number of saddle points and local minima is large. as the picture shows:

There are problems when using second-order optimization algorithms.


The gradient is 0, and the Hessian is also 0. as the picture shows:

Need to add noise to jump out of the platform area.

Gradient explosions and cliffs:

Very common in RNN s, resulting from constant multiplication of parameters; long-term time dependence. as the picture shows:

Workaround: Gradient truncation, where heuristic gradient truncation interferes to reduce the step size.

SGD method improvements:

Momentum method:

ρ can get larger as the number of iterations increases, and adjusting ρ over time is more important than shrinking η.

The momentum method overcomes two problems in SGD:

Ill-conditioned problem with Hessian matrix (illustrated):

Instability caused by the variance of the stochastic gradient.

Nesterov momentum method:


Learning rate adaptation: Inversely proportional to the square root of the sum of squared gradient history.

Effect: Greater progress is made in gentler leaning directions, possibly escaping from the saddle point.

Problem: The cumulative sum of squared gradients grows too fast, resulting in a rapidly smaller learning rate and early termination of learning.



On the basis of AdaGrad, the dependence on the early historical gradient is reduced; it is achieved by setting the decay coefficient β2 (recommended β2=0.9)



Consider both momentum and learning rate adaptation. β1 is usually set to 0.9 and β2 is set to 0.999




Second-order optimization method:

Comparison with first-order optimization methods



Newton's method: Using a second-order Taylor expansion:

Usually the Hessian matrix (H^-1) is not positive definite, increase the regularization strategy:

For high-dimensional cases, the cost is too high and an approximation is required: the L-BFGS method.

Graphical comparison of second-order optimization methods with gradient descent:



Optimizing Algorithm Selection

In big data scenarios (large sample size and large feature dimension), the first-order method is most practical (stochastic gradient).

The family of adaptive learning rate algorithms (represented by RMSProp) is quite robust.

Adam is probably the best choice.

The user's familiarity with the algorithm in order to tune the hyperparameters.



Python algorithm implementation and comparison:

#First introduce the algorithm-related packages, matplotlib for drawing
import matplotlib.pyplot as plt
import numpy as np

from mpl_toolkits.mplot3d import Axes3D
from matplotlib import animation
from IPython.display import HTML

from autograd import elementwise_grad, value_and_grad,grad
from scipy.optimize import minimize
from scipy import optimize
from collections import defaultdict
from itertools import zip_longest
plt.rcParams['axes.unicode_minus']=False  # Used to display the negative sign normally
#use python The anonymous function defines the objective function
f1 = lambda x1,x2 : x1**2 + 0.5*x2**2 #function definition
f1_grad = value_and_grad(lambda args : f1(*args)) #function gradient
#Gradient Descent
##Define the gradient_descent method to update the parameters
###func:f1 // func_grad:f1_grad // x0: initial point // learning_rate: learning rate // max_iteration: maximum number of steps
def gradient_descent(func, func_grad, x0, learning_rate=0.1, max_iteration=20):
    #Record how the step is taken (visual use)
    path_list = [x0]
    #Where are you currently
    best_x = x0
    step = 0
    while step < max_iteration:
        update = -learning_rate * np.array(func_grad(best_x)[1])
        if(np.linalg.norm(update) < 1e-4):
        best_x = best_x + update
        step = step + 1
    return best_x, np.array(path_list)
#For example, visualizing the solution path of gradient descent
best_x_gd, path_list_gd = gradient_descent(f1,f1_grad,[-4.0,4.0],0.1,30)

#draw function surface
##First use np.meshgrid to generate a grid point coordinate matrix. The display ranges from -5 to 5 for each of the two dimensions. The function value corresponding to the grid point is stored in z
x1,x2 = np.meshgrid(np.linspace(-5.0,5.0,50), np.linspace(-5.0,5.0,50))
z = f1(x1,x2 )
minima = np.array([0, 0]) #for function f1,We know that the minimum point is(0,0)
##The plot_surface function draws a 3D surface
%matplotlib inline
fig = plt.figure(figsize=(8, 8))
ax = plt.axes(projection='3d', elev=50, azim=-50)

ax.plot_surface(x1,x2, z, alpha=.8,
ax.plot([minima[0]],[minima[1]],[f1(*minima)], 'r*', markersize=10)


ax.set_xlim((-5, 5))
ax.set_ylim((-5, 5))

#Plot Contours and Gradient Fields
##The contour method can draw contour lines, and clabel can display the height (function value) of the corresponding line. Here we reserve two decimal places (fmt='%.2f').
dz_dx1 = elementwise_grad(f1, argnum=0)(x1, x2)
dz_dx2 = elementwise_grad(f1, argnum=1)(x1, x2)
fig, ax = plt.subplots(figsize=(6, 6))

contour = ax.contour(x1, x2, z,levels=20,
ax.plot(*minima, 'r*', markersize=18)


ax.set_xlim((-5, 5))
ax.set_ylim((-5, 5))

#Use within a gradient field quiver function to draw the path
fig, ax = plt.subplots(figsize=(6, 6))

ax.contour(x1, x2, z, levels=20, line
#draw trajectory arrows
ax.quiver(path_list_gd[:-1,0], path_list_gd[:-1,1], path_list_gd[1:,0]-path_list_gd[:-1,0], path_list_gd[1:,1]-path_list_gd[:-1,1], scale_units='xy', angles='xy', scale=1, color='k')
#Label the optimal value point
ax.plot(*minima, 'r*', markersize=18)


ax.set_xlim((-5, 5))
ax.set_ylim((-5, 5))

#use animation animation
path = path_list_gd #Optimization path for gradient descent
fig, ax = plt.subplots(figsize=(6, 6))
line, = ax.plot([], [], 'b', label='Gradient Descent', lw=2) #save route
point, = ax.plot([], [], 'bo') #save the last point of the path
#What did you draw first?
def init_draw(): 
    ax.contour(x1, x2, z, levels=20,
    ax.plot(*minima, 'r*', markersize=18) #Draw the minimum point as a red five-pointed star
    ax.set_xlim((-5, 5))
    ax.set_ylim((-5, 5))
    return line, point
#Update what to draw at each step
def update_draw(i):
    return line, point

anim = animation.FuncAnimation(fig, update_draw, init_func=init_draw,frames=path.shape[0], interval=60, repeat_delay=5, blit=True)

#use scipy.optimize The module solves optimization problems. Since we need to visualize the optimization path, minimize The function needs to specify a callback function parameter callback. 
x0 = np.array([-4, 4])
def make_minimize_cb(path=[]):
    def minimize_cb(xk):

    return minimize_cb
#Some common optimization methods
methods = [ "CG", "BFGS","Newton-CG","L-BFGS-B"]
#every method call minimize
import warnings
warnings.filterwarnings('ignore') #The purpose of this line of code is to hide the warning message
x0 = [-4.0,4.0]
paths = []
zpaths = []
for method in methods:
    path = [x0]
    #Callback make_minimize_cb
    res = minimize(fun=f1_grad, x0=x0,jac=True,method = method,callback=make_minimize_cb(path), bounds=[(-5, 5), (-5, 5)], tol=1e-20)
#Add the gradient descent method you just wrote to it
zpaths = [f1(path[:,0],path[:,1]) for path in paths]
#package one TrajectoryAnimation kind ,Animate the optimized paths obtained by different algorithms.
##The code is from the URL:
class TrajectoryAnimation(animation.FuncAnimation):
    def __init__(self, paths, labels=[], fig=None, ax=None, frames=None, 
                 interval=60, repeat_delay=5, blit=True, **kwargs):
        #If incoming fig and ax If the parameter is empty, create a new one fig object and ax object
        if fig is None:
            if ax is None:
                fig, ax = plt.subplots()
                fig = ax.get_figure()
            if ax is None:
                ax = fig.gca()
        self.fig = fig = ax 
        self.paths = paths
        #The number of frames of the animation is equal to the longest path length
        if frames is None:
            frames = max(path.shape[0] for path in paths) #Get the longest path length
        self.lines = [ax.plot([], [], label=label, lw=2)[0] 
                      for _, label in zip_longest(paths, labels)]
        self.points = [ax.plot([], [], 'o', color=line.get_color())[0] 
                       for line in self.lines]
        super(TrajectoryAnimation, self).__init__(fig, self.animate, init_func=self.init_anim,
                                                  frames=frames, interval=interval, blit=blit,
                                                  repeat_delay=repeat_delay, **kwargs)
    def init_anim(self):
        for line, point in zip(self.lines, self.points):
            line.set_data([], [])
            point.set_data([], [])
        return self.lines + self.points

    def animate(self, i):
        for line, point, path in zip(self.lines, self.points, self.paths):
        return self.lines + self.points
#Final animation comparison
fig, ax = plt.subplots(figsize=(8, 8))

ax.contour(x1, x2, z,
ax.plot(*minima, 'r*', markersize=10)


ax.set_xlim((-5, 5))
ax.set_ylim((-5, 5))

anim = TrajectoryAnimation(paths, labels=methods, ax=ax)

ax.legend(loc='upper left')

#3D visualization
##Here is a function with multiple local minima and saddle points
f2 = lambda x1, x2 :((4 - 2.1*x1**2 + x1**4 / 3.) * x1**2 + x1 * x2  + (-4 + 4*x2**2) * x2 **2)
f2_grad = value_and_grad(lambda args: f2(*args))

x1,x2 = np.meshgrid(np.linspace(-2.0,2.0,50), np.linspace(-1.0,1.0,50))
z = f2(x1,x2 )
#inline change to notebook Dynamic viewing can be set, and the platform runs on a white screen, the reason is unknown
%matplotlib inline
fig = plt.figure(figsize=(6, 6))
ax = plt.axes(projection='3d', elev=50, azim=-50)

ax.plot_surface(x1,x2, z, alpha=.8,


ax.set_xlim((-2.0, 2.0))
ax.set_ylim((-1.0, 1.0))

#use Scipy The different optimization methods implemented in and the gradient descent method we implemented in this case to solve.
x02 = [-1.0,-0.5]  #Initial point, try different initial points,[-1.0,-0.5] ,[1.5,0.75],[-0.8,0.25]
_, path_list_gd2 = gradient_descent(f2,f2_grad,x02,0.1,30) #Solve using gradient descent

paths = []
zpaths = []
#different methods
methods = [ "CG", "BFGS","Newton-CG","L-BFGS-B"]
for method in methods:
    path = [x02]
    res = minimize(fun=f2_grad, x0=x02,jac=True,method = method,callback=make_minimize_cb(path), bounds=[(-2.0, 2.0), (-1.0, 1.0)], tol=1e-20)
#Add your own gradient descent method
zpaths = [f2(path[:,0],path[:,1]) for path in paths]
#Display in animation
%matplotlib inline
fig, ax = plt.subplots(figsize=(8, 8))

contour = ax.contour(x1, x2, z, levels=50,

ax.set_xlim((-2.0, 2.0))
ax.set_ylim((-1.0, 1.0))

anim = TrajectoryAnimation(paths, labels=methods, ax=ax)
ax.legend(loc='upper left')

Here, the initial points are different in different diagrams. No modification is made in the code. Only the images of each initial point are given:

[-1.0, -0.5]


 [1.5, 0.75]

[-0.8, 0.25]

Use the MNIST dataset for algorithm application:

#load MNIST data set
import numpy as np
f = np.load("input/mnist.npz") 
X_train, y_train, X_test, y_test = f['x_train'], f['y_train'],f['x_test'], f['y_test']
x_train = X_train.reshape((-1, 28*28)) / 255.0
x_test = X_test.reshape((-1, 28*28)) / 255.0
#print some numbers randomly
rndperm = np.random.permutation(len(x_train))
%matplotlib inline
import matplotlib.pyplot as plt
fig = plt.figure( figsize=(8,8) )
for i in range(0,100):
    ax = fig.add_subplot(10,10,i+1)
    ax.matshow(x_train[rndperm[i]].reshape((28,28))) #remove border
    plt.axis("off")#Do not show axes

#One-Hot coding
import pandas as pd
y_train_onehot = pd.get_dummies(y_train)


We want to use TensorFlow to build a neural network. The neural network to be built is shown in the figure:

#Building a Neural Network for Handwritten Digit Recognition
import tensorflow as tf
import tensorflow.keras.layers as layers
#Build and print the model
inputs = layers.Input(shape=(28*28,), name='inputs')
#The first three layers of the activation function are set to relu,The last layer is softmax
hidden1 = layers.Dense(100, activation='relu', name='hidden1')(inputs)
hidden2 = layers.Dense(100, activation='relu', name='hidden2')(hidden1)
hidden3 = layers.Dense(50, activation='relu', name='hidden3')(hidden2)
outputs = layers.Dense(10, activation='softmax', name='outputs')(hidden3)
deep_networks = tf.keras.Model(inputs,outputs)

#loss function categorical_crossentropy,Optimization SGD
##The optimization method (optimizer) can also be defined as RMSprop,Adam,Adagrad, Nadam
###Training 10 times, verbose=1 shows the training process
%time history =, y_train_onehot, batch_size=500, epochs=10,validation_split=0.5,verbose=1)

#Printing error curve
fig, ax = plt.subplots(figsize=(20, 8))

ax.plot(history.epoch, history.history["loss"])


#Test and print accuracy
test_loss, test_acc = deep_networks.evaluate(x_test,  pd.get_dummies(y_test), verbose=2)

print('\nTest accuracy:', test_acc)


The accuracy given in the example is:


This is because there will be errors in each training, which is a normal phenomenon.


Posted by GravityFX on Fri, 03 Jun 2022 10:18:27 +0530