"Ten Lectures on Machine Learning" - Lecture 7 (Optimization)
The optimization goal of machine learning
Minimize the loss function:
Model - Loss Function - Expression:
Icon:
method
Gradient descent:
batch gradient descent: using all training set samples, the computational cost is too high (n~10^6)
mini-batch gradient descent: randomly sample a subset (m~100 or 1000), then use the following formula:
mini-batch is an unbiased estimate. Larger batch sizes reduce the variance of gradient computations.
Stochastic Gradient Descent SGD:
Use mini-batch to calculate the result and then according to the formula of the gradient descent method:To update the parameters, the next step is to randomly sample a subset and repeat the operation. This method is called Stochastic Gradient Descent (SGD).
Therefore, there is a saying: mini-batch is the promotion of SGD, and usually the SGD to which it belongs is mini-batch.
Because it is a training sample to update the parameters once, the gradient is usually in the form of oscillation, but the general direction is still a downward trend:
The setting of the learning rate in SGD is the key.
Theoretically sufficient conditions for SGD convergence are guaranteed:. Therefore, the learning rate needs to be reduced as the number of iterations increases.
How to set the learning rate:
Problems with Gradient Descent in Practical Applications
Sick condition:
Right image: Assuming that the surrounding area is surrounded by cliffs, there will be a vertical descent when the gradient descent algorithm is performed.
Local minimum and global minimum:
As shown in the figure on the right, it can be intuitively seen that the minimum value is at the third concave. However, in the actual algorithm, depending on the set starting point, the obtained "minimum value" may not be the global minimum.
Saddle point:
The gradient is 0, and the Hessian matrix has both positive and negative values; the probability of all eigenvalues of the Heissan matrix being positive is low; for high-dimensional cases, the number of saddle points and local minima is large. as the picture shows:
There are problems when using second-order optimization algorithms.
platform:
The gradient is 0, and the Hessian is also 0. as the picture shows:
Need to add noise to jump out of the platform area.
Gradient explosions and cliffs:
Very common in RNN s, resulting from constant multiplication of parameters; long-term time dependence. as the picture shows:
Workaround: Gradient truncation, where heuristic gradient truncation interferes to reduce the step size.
SGD method improvements:
Momentum method:
ρ can get larger as the number of iterations increases, and adjusting ρ over time is more important than shrinking η.
The momentum method overcomes two problems in SGD:
Ill-conditioned problem with Hessian matrix (illustrated):
Instability caused by the variance of the stochastic gradient.
Nesterov momentum method:
AdaGrad:
Learning rate adaptation: Inversely proportional to the square root of the sum of squared gradient history.
Effect: Greater progress is made in gentler leaning directions, possibly escaping from the saddle point.
Problem: The cumulative sum of squared gradients grows too fast, resulting in a rapidly smaller learning rate and early termination of learning.
formula:
RMSProp:
On the basis of AdaGrad, the dependence on the early historical gradient is reduced; it is achieved by setting the decay coefficient β2 (recommended β2=0.9)
formula:
Adam:
Consider both momentum and learning rate adaptation. β1 is usually set to 0.9 and β2 is set to 0.999
formula:
Second-order optimization method:
Comparison with first-order optimization methods
Newton's method: Using a second-order Taylor expansion:
Usually the Hessian matrix (H^-1) is not positive definite, increase the regularization strategy:
For high-dimensional cases, the cost is too high and an approximation is required: the L-BFGS method.
Graphical comparison of second-order optimization methods with gradient descent:
Optimizing Algorithm Selection
In big data scenarios (large sample size and large feature dimension), the first-order method is most practical (stochastic gradient).
The family of adaptive learning rate algorithms (represented by RMSProp) is quite robust.
Adam is probably the best choice.
The user's familiarity with the algorithm in order to tune the hyperparameters.
Example
URL: http://cookdata.cn/note/view_static_note/24b53e7838cde188f1dfa6b62824edbb/
Python algorithm implementation and comparison:
#First introduce the algorithm-related packages, matplotlib for drawing import matplotlib.pyplot as plt import numpy as np from mpl_toolkits.mplot3d import Axes3D from matplotlib import animation from IPython.display import HTML from autograd import elementwise_grad, value_and_grad,grad from scipy.optimize import minimize from scipy import optimize from collections import defaultdict from itertools import zip_longest plt.rcParams['axes.unicode_minus']=False # Used to display the negative sign normally #use python The anonymous function defines the objective function f1 = lambda x1,x2 : x1**2 + 0.5*x2**2 #function definition f1_grad = value_and_grad(lambda args : f1(*args)) #function gradient
#Gradient Descent ##Define the gradient_descent method to update the parameters ###func:f1 // func_grad:f1_grad // x0: initial point // learning_rate: learning rate // max_iteration: maximum number of steps def gradient_descent(func, func_grad, x0, learning_rate=0.1, max_iteration=20): #Record how the step is taken (visual use) path_list = [x0] #Where are you currently best_x = x0 step = 0 while step < max_iteration: update = -learning_rate * np.array(func_grad(best_x)[1]) if(np.linalg.norm(update) < 1e-4): break best_x = best_x + update path_list.append(best_x) step = step + 1 return best_x, np.array(path_list)
#For example, visualizing the solution path of gradient descent best_x_gd, path_list_gd = gradient_descent(f1,f1_grad,[-4.0,4.0],0.1,30) path_list_gd
#draw function surface ##First use np.meshgrid to generate a grid point coordinate matrix. The display ranges from -5 to 5 for each of the two dimensions. The function value corresponding to the grid point is stored in z x1,x2 = np.meshgrid(np.linspace(-5.0,5.0,50), np.linspace(-5.0,5.0,50)) z = f1(x1,x2 ) minima = np.array([0, 0]) #for function f1,We know that the minimum point is(0,0) ax.plot_surface? ##The plot_surface function draws a 3D surface %matplotlib inline fig = plt.figure(figsize=(8, 8)) ax = plt.axes(projection='3d', elev=50, azim=-50) ax.plot_surface(x1,x2, z, alpha=.8, cmap=plt.cm.jet) ax.plot([minima[0]],[minima[1]],[f1(*minima)], 'r*', markersize=10) ax.set_xlabel('$x1$') ax.set_ylabel('$x2$') ax.set_zlabel('$f$') ax.set_xlim((-5, 5)) ax.set_ylim((-5, 5)) plt.show()
#Plot Contours and Gradient Fields ##The contour method can draw contour lines, and clabel can display the height (function value) of the corresponding line. Here we reserve two decimal places (fmt='%.2f'). dz_dx1 = elementwise_grad(f1, argnum=0)(x1, x2) dz_dx2 = elementwise_grad(f1, argnum=1)(x1, x2) fig, ax = plt.subplots(figsize=(6, 6)) contour = ax.contour(x1, x2, z,levels=20,cmap=plt.cm.jet) ax.clabel(contour,fontsize=10,colors='k',fmt='%.2f') ax.plot(*minima, 'r*', markersize=18) ax.set_xlabel('$x1$') ax.set_ylabel('$x2$') ax.set_xlim((-5, 5)) ax.set_ylim((-5, 5)) plt.show()
#Use within a gradient field quiver function to draw the path fig, ax = plt.subplots(figsize=(6, 6)) ax.contour(x1, x2, z, levels=20,cmap=plt.cm.jet)#contour line #draw trajectory arrows ax.quiver(path_list_gd[:-1,0], path_list_gd[:-1,1], path_list_gd[1:,0]-path_list_gd[:-1,0], path_list_gd[1:,1]-path_list_gd[:-1,1], scale_units='xy', angles='xy', scale=1, color='k') #Label the optimal value point ax.plot(*minima, 'r*', markersize=18) ax.set_xlabel('$x1$') ax.set_ylabel('$x2$') ax.set_xlim((-5, 5)) ax.set_ylim((-5, 5)) plt.show()
#use animation animation path = path_list_gd #Optimization path for gradient descent fig, ax = plt.subplots(figsize=(6, 6)) line, = ax.plot([], [], 'b', label='Gradient Descent', lw=2) #save route point, = ax.plot([], [], 'bo') #save the last point of the path #What did you draw first? def init_draw(): ax.contour(x1, x2, z, levels=20, cmap=plt.cm.jet) ax.plot(*minima, 'r*', markersize=18) #Draw the minimum point as a red five-pointed star ax.set_xlabel('$x$') ax.set_ylabel('$y$') ax.set_xlim((-5, 5)) ax.set_ylim((-5, 5)) return line, point #Update what to draw at each step def update_draw(i): line.set_data(path[:i,0],path[:i,1]) point.set_data(path[i-1:i,0],path[i-1:i,1]) plt.close() return line, point anim = animation.FuncAnimation(fig, update_draw, init_func=init_draw,frames=path.shape[0], interval=60, repeat_delay=5, blit=True) HTML(anim.to_jshtml())
#use scipy.optimize The module solves optimization problems. Since we need to visualize the optimization path, minimize The function needs to specify a callback function parameter callback. x0 = np.array([-4, 4]) def make_minimize_cb(path=[]): def minimize_cb(xk): path.append(np.copy(xk)) return minimize_cb #Some common optimization methods methods = [ "CG", "BFGS","Newton-CG","L-BFGS-B"] #every method call minimize import warnings warnings.filterwarnings('ignore') #The purpose of this line of code is to hide the warning message x0 = [-4.0,4.0] paths = [] zpaths = [] for method in methods: path = [x0] #Callback make_minimize_cb res = minimize(fun=f1_grad, x0=x0,jac=True,method = method,callback=make_minimize_cb(path), bounds=[(-5, 5), (-5, 5)], tol=1e-20) paths.append(np.array(path)) #Add the gradient descent method you just wrote to it methods.append("GD") paths.append(path_list_gd) zpaths = [f1(path[:,0],path[:,1]) for path in paths] #package one TrajectoryAnimation kind ,Animate the optimized paths obtained by different algorithms. ##The code is from the URL: http://louistiao.me/notes/visualizing-and-animating-optimization-algorithms-with-matplotlib/ class TrajectoryAnimation(animation.FuncAnimation): def __init__(self, paths, labels=[], fig=None, ax=None, frames=None, interval=60, repeat_delay=5, blit=True, **kwargs): #If incoming fig and ax If the parameter is empty, create a new one fig object and ax object if fig is None: if ax is None: fig, ax = plt.subplots() else: fig = ax.get_figure() else: if ax is None: ax = fig.gca() self.fig = fig self.ax = ax self.paths = paths #The number of frames of the animation is equal to the longest path length if frames is None: frames = max(path.shape[0] for path in paths) #Get the longest path length self.lines = [ax.plot([], [], label=label, lw=2)[0] for _, label in zip_longest(paths, labels)] self.points = [ax.plot([], [], 'o', color=line.get_color())[0] for line in self.lines] super(TrajectoryAnimation, self).__init__(fig, self.animate, init_func=self.init_anim, frames=frames, interval=interval, blit=blit, repeat_delay=repeat_delay, **kwargs) def init_anim(self): for line, point in zip(self.lines, self.points): line.set_data([], []) point.set_data([], []) return self.lines + self.points def animate(self, i): for line, point, path in zip(self.lines, self.points, self.paths): line.set_data(path[:i,0],path[:i,1]) point.set_data(path[i-1:i,0],path[i-1:i,1]) plt.close() return self.lines + self.points #Final animation comparison fig, ax = plt.subplots(figsize=(8, 8)) ax.contour(x1, x2, z, cmap=plt.cm.jet) ax.plot(*minima, 'r*', markersize=10) ax.set_xlabel('$x1$') ax.set_ylabel('$x2$') ax.set_xlim((-5, 5)) ax.set_ylim((-5, 5)) anim = TrajectoryAnimation(paths, labels=methods, ax=ax) ax.legend(loc='upper left') HTML(anim.to_jshtml())
#3D visualization ##Here is a function with multiple local minima and saddle points f2 = lambda x1, x2 :((4 - 2.1*x1**2 + x1**4 / 3.) * x1**2 + x1 * x2 + (-4 + 4*x2**2) * x2 **2) f2_grad = value_and_grad(lambda args: f2(*args)) x1,x2 = np.meshgrid(np.linspace(-2.0,2.0,50), np.linspace(-1.0,1.0,50)) z = f2(x1,x2 ) #inline change to notebook Dynamic viewing can be set, and the platform runs on a white screen, the reason is unknown %matplotlib inline fig = plt.figure(figsize=(6, 6)) ax = plt.axes(projection='3d', elev=50, azim=-50) ax.plot_surface(x1,x2, z, alpha=.8, cmap=plt.cm.jet) ax.set_xlabel('$x1$') ax.set_ylabel('$x2$') ax.set_zlabel('$f$') ax.set_xlim((-2.0, 2.0)) ax.set_ylim((-1.0, 1.0)) plt.show()
#use Scipy The different optimization methods implemented in and the gradient descent method we implemented in this case to solve. x02 = [-1.0,-0.5] #Initial point, try different initial points,[-1.0,-0.5] ,[1.5,0.75],[-0.8,0.25] _, path_list_gd2 = gradient_descent(f2,f2_grad,x02,0.1,30) #Solve using gradient descent paths = [] zpaths = [] #different methods methods = [ "CG", "BFGS","Newton-CG","L-BFGS-B"] for method in methods: path = [x02] res = minimize(fun=f2_grad, x0=x02,jac=True,method = method,callback=make_minimize_cb(path), bounds=[(-2.0, 2.0), (-1.0, 1.0)], tol=1e-20) paths.append(np.array(path)) #Add your own gradient descent method methods.append("GD") paths.append(path_list_gd2) zpaths = [f2(path[:,0],path[:,1]) for path in paths] #Display in animation %matplotlib inline fig, ax = plt.subplots(figsize=(8, 8)) contour = ax.contour(x1, x2, z, levels=50, cmap=plt.cm.jet) ax.clabel(contour,fontsize=10,colors='k',fmt='%.2f') ax.set_xlabel('$x1$') ax.set_ylabel('$x2$') ax.set_xlim((-2.0, 2.0)) ax.set_ylim((-1.0, 1.0)) anim = TrajectoryAnimation(paths, labels=methods, ax=ax) ax.legend(loc='upper left') HTML(anim.to_jshtml())
Here, the initial points are different in different diagrams. No modification is made in the code. Only the images of each initial point are given:
[-1.0, -0.5]
[1.5, 0.75]
[-0.8, 0.25]
Use the MNIST dataset for algorithm application:
#load MNIST data set import numpy as np f = np.load("input/mnist.npz") X_train, y_train, X_test, y_test = f['x_train'], f['y_train'],f['x_test'], f['y_test'] f.close() x_train = X_train.reshape((-1, 28*28)) / 255.0 x_test = X_test.reshape((-1, 28*28)) / 255.0 #print some numbers randomly rndperm = np.random.permutation(len(x_train)) %matplotlib inline import matplotlib.pyplot as plt plt.gray() fig = plt.figure( figsize=(8,8) ) for i in range(0,100): ax = fig.add_subplot(10,10,i+1) ax.matshow(x_train[rndperm[i]].reshape((28,28))) plt.box(False) #remove border plt.axis("off")#Do not show axes plt.show()
#One-Hot coding import pandas as pd y_train_onehot = pd.get_dummies(y_train) y_train_onehot.head()
We want to use TensorFlow to build a neural network. The neural network to be built is shown in the figure:
#Building a Neural Network for Handwritten Digit Recognition import tensorflow as tf import tensorflow.keras.layers as layers #Build and print the model inputs = layers.Input(shape=(28*28,), name='inputs') #The first three layers of the activation function are set to relu,The last layer is softmax hidden1 = layers.Dense(100, activation='relu', name='hidden1')(inputs) hidden2 = layers.Dense(100, activation='relu', name='hidden2')(hidden1) hidden3 = layers.Dense(50, activation='relu', name='hidden3')(hidden2) outputs = layers.Dense(10, activation='softmax', name='outputs')(hidden3) deep_networks = tf.keras.Model(inputs,outputs) deep_networks.summary()
#loss function categorical_crossentropy,Optimization SGD ##The optimization method (optimizer) can also be defined as RMSprop,Adam,Adagrad, Nadam deep_networks.compile(optimizer='SGD',loss='categorical_crossentropy',metrics=['accuracy']) ###Training 10 times, verbose=1 shows the training process %time history = deep_networks.fit(x_train, y_train_onehot, batch_size=500, epochs=10,validation_split=0.5,verbose=1)
#Printing error curve fig, ax = plt.subplots(figsize=(20, 8)) ax.plot(history.epoch, history.history["loss"]) ax.set_xlabel('$epoch$') ax.set_ylabel('$loss$')
#Test and print accuracy test_loss, test_acc = deep_networks.evaluate(x_test, pd.get_dummies(y_test), verbose=2) print('\nTest accuracy:', test_acc)
The accuracy given in the example is:
This is because there will be errors in each training, which is a normal phenomenon.