1. problem and data
If you want to sell your house, you want to know what a good market price is. One way to do this is to first collect information about houses that have recently been sold. In this part of the exercise, you will use multiple linear regression to predict house prices.
The file ex1data2.txt contains 47 rows and 3 columns of data, shape (47, 3). The first column is the area of the house in square feet, the second is the number of bedrooms, and the third is the price of the house. The data are as follows:
2104,3,399900
1600,3,329900
2400,3,369000
1416,2,232000
3000,4,539900
1985,4,299900
1534,3,314900
1427,3,198999
1380,3,212000
1494,3,242500
1940,4,239999
2000,3,347000
1890,3,329999
4478,5,699900
1268,3,259900
2300,4,449900
1320,2,299900
1236,3,199900
2609,4,499998
3031,4,599000
1767,3,252900
1888,2,255000
1604,3,242900
1962,4,259900
3890,3,573900
1100,3,249900
1458,3,464500
2526,3,469000
2200,3,475000
2637,3,299900
1839,2,349900
1000,1,169900
2040,4,314900
3137,3,579900
1811,4,285900
1437,3,249900
1239,3,229900
2132,4,345000
4215,4,549000
2162,4,287000
1664,2,368500
2238,3,329900
2567,4,314000
1200,3,299000
852,2,179900
1852,4,299900
1203,3,239500
2. data import and preliminary analysis
Import the packages. numpy and pandas are libraries for numerical and tabular data operations, and matplotlib is a plotting library.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
path = 'ex1data2.txt'  # Data path; the file is in the same folder as the .py file
data2 = pd.read_csv(path, header=None)  # No header row in the file, so the columns are labeled 0, 1, 2
print(data2.head())  # Preview the first five rows of data
The output is as follows:
      0  1       2
0  2104  3  399900
1  1600  3  329900
2  2400  3  369000
3  1416  2  232000
4  3000  4  539900
At this point, because the features are on very different scales (area in the thousands, bedrooms in single digits), mean normalization is required.
The house price is normalized as well. Otherwise its order of magnitude (hundreds of thousands) is far from that of the normalized inputs (values close to zero). Regressing targets that large on inputs that small makes convergence hard to achieve in practice, and the error between predicted y and actual y stays too large.
# Standard mean normalization formula: subtract the column mean, divide by the column standard deviation
data2 = (data2 - data2.mean()) / data2.std()
print(data2.head())
After mean normalization, all columns are on a similar scale. The results are as follows:
          0         1         2
0  0.130010 -0.223675  0.475747
1 -0.504190 -0.223675 -0.084074
2  0.502476 -0.223675  0.228626
3 -0.735723 -1.537767 -0.867025
4  1.257476  1.090417  1.595389
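As a small illustration of the normalization step, the sketch below applies the same formula to only the first five rows of the dataset, so its numbers will not match the table above (which uses all 47 rows). The column names `size`, `bedrooms` and `price` are assumptions added for readability. It also keeps the mean and standard deviation around, since they are needed later to map normalized values back to real units.

```python
import pandas as pd

# First five rows of the dataset; the column names are illustrative only
df = pd.DataFrame(
    [[2104, 3, 399900],
     [1600, 3, 329900],
     [2400, 3, 369000],
     [1416, 2, 232000],
     [3000, 4, 539900]],
    columns=["size", "bedrooms", "price"],
)

mu = df.mean()     # per-column mean
sigma = df.std()   # per-column sample standard deviation (pandas default, ddof=1)

normalized = (df - mu) / sigma      # same formula as in the text
restored = normalized * sigma + mu  # inverse transform back to the original units

print(normalized)
print(normalized.mean())  # each column mean is (numerically) zero
print(normalized.std())   # each column standard deviation is one
```

Saving `mu` and `sigma` before overwriting the data is a design choice worth considering: without them, a normalized prediction cannot be converted back into a price.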
Gradient descent is now used to implement linear regression to minimize the cost function.
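First, we create the cost function, which takes θ (theta) as its parameter. (Note: the terms "cost function" and "loss function" mean the same thing here; it is written $J(\theta)$.)

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Since the problem is multivariable and has two x's, in this exercise:

$$h_\theta(x) = \theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Based on the above two formulas, the formula we model is as follows (if this is unfamiliar, see my post "Wu Enda (Andrew Ng) machine learning check-in, day1-P8"):

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} - y^{(i)} \right)^2$$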
Compute the cost function J(θ):
def computeCost(X2, y2, theta):
    # The term to the right of the summation symbol in the cost function; .T means transpose
    inner = np.power(((X2 * theta.T) - y2), 2)
    # np.sum() adds up all elements of the matrix (the summation sign); dividing by 2 * len(X2) gives the 1/(2m) factor
    # return hands the result back to the caller; without it the function would return None
    return np.sum(inner) / (2 * len(X2))
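To see the cost formula at work on numbers small enough to check by hand, here is a minimal sketch. It is a variant, not the function above: it uses plain NumPy arrays with the `@` operator instead of `np.matrix`, and the name `compute_cost` and the toy data are assumptions made for illustration.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((X @ theta.T - y)^2).

    X has shape (m, n), y has shape (m, 1), theta has shape (1, n).
    """
    residual = X @ theta.T - y  # prediction error of each sample, shape (m, 1)
    return float(np.sum(residual ** 2) / (2 * len(X)))

# Toy data lying exactly on the line y = x; the first column of X is the ones column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([[1.0], [2.0], [3.0]])

cost_zero = compute_cost(X, y, np.array([[0.0, 0.0]]))  # (1 + 4 + 9) / (2 * 3)
cost_fit = compute_cost(X, y, np.array([[0.0, 1.0]]))   # perfect fit, so the cost is 0

print(cost_zero, cost_fit)
```

With theta at zero every prediction is 0, so the cost is the sum of squared targets divided by 2m; with the exact line the residuals vanish.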
Next, add a column of ones to the training set so that the cost and gradient can be computed with a vectorized solution.
That is, in $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, the coefficient of $\theta_0$ is made 1.
# Insert a column of all ones at the leftmost position of the dataset, for the vectorized calculation
data2.insert(0, 'ones', 1)
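The effect of the ones column can be checked directly: with it in place, a single matrix product reproduces the term-by-term hypothesis. The sketch below assumes made-up feature values and a hypothetical `theta`; neither comes from the fitted model.

```python
import numpy as np

# Made-up normalized features (area, bedrooms) for three houses
features = np.array([[0.13, -0.22],
                     [-0.50, -0.22],
                     [0.50, -0.22]])
theta = np.array([[0.5, 2.0, -1.0]])  # hypothetical parameters theta0, theta1, theta2

# Prepend the ones column so theta0 is multiplied by 1 for every sample
X = np.hstack([np.ones((features.shape[0], 1)), features])

vectorized = X @ theta.T  # one matrix product for all samples, shape (3, 1)
explicit = (theta[0, 0]
            + theta[0, 1] * features[:, [0]]
            + theta[0, 2] * features[:, [1]])  # theta0 + theta1*x1 + theta2*x2

print(np.allclose(vectorized, explicit))
```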
Now let's do some variable initialization.
# .shape gives the dimensions of the matrix: shape[0] is the length of the first dimension (rows) and shape[1] is the length of the second dimension (columns).
# pandas iloc selects data by position; the part before ',' selects rows and the part after it selects columns.
# After inserting 'ones', data2 has four columns.
# Set X (training data) and y (target variable)
cols = data2.shape[1]  # Number of columns, i.e. cols = 4
X2 = data2.iloc[:, 0:cols-1]  # All rows, columns 0 to cols-2 (the first three columns: ones, area, bedrooms)
y2 = data2.iloc[:, cols-1:cols]  # All rows, column cols-1 (the last column: price)
Print the head of X2 and check that X (the training data) was extracted correctly.
Output of the X2 head:
   ones         0         1
0     1  0.130010 -0.223675
1     1 -0.504190 -0.223675
2     1  0.502476 -0.223675
3     1 -0.735723 -1.537767
4     1  1.257476  1.090417
Print the head of y2 and check that y (the target variable) was extracted correctly.
Output of the y2 head:
          2
0  0.475747
1 -0.084074
2  0.228626
3 -0.867025
4  1.595389
The cost function works on numpy matrices, so X2 and y2 must be converted to matrices before they can be used. We also need to initialize theta, setting all of its elements to 0.
# Data read by pandas is a DataFrame, which is convenient for many operations. For matrix operations, however, the DataFrame must be converted to a matrix, e.g. X = np.matrix(X.values)
X2 = np.matrix(X2.values)  # Convert X2 to a matrix
y2 = np.matrix(y2.values)  # Convert y2 to a matrix
# theta = np.matrix(np.array([0, 0, 0]))  # X2 has three columns, so theta has three entries
# np.full(shape, fill_value) fills an array with the given value; with 0 it behaves much like np.zeros()
# ravel() is not used here because it is not needed
theta = np.full((1, X2.shape[1]), 0)
# Print theta to check; it should be a zero matrix with one row and three columns
print('theta:', theta)
theta: [[0 0 0]]
Check the dimensions. The matrices can only be multiplied correctly when their dimensions match; errors in this kind of code are often caused by mismatched matrix dimensions.
print('X2.shape:', X2.shape)
print('theta.shape', theta.shape)
print('y2.shape', y2.shape)
Output dimensions of X2, theta, y2:
X2.shape: (47, 3)
theta.shape (1, 3)
y2.shape (47, 1)
Calculate the initial cost function (theta initial value is 0)
# Get the initial cost (iteration has not yet started)
print('Cost_init:', computeCost(X2, y2, theta))
Initial cost function result
3. batch gradient descent
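The key point is that all components of theta (theta0, theta1 and theta2) must be updated at the same time, so temp is used for temporary storage.
The line that assigns temp[0, j] is the update rule for $\theta_j$ and is the core of the gradient descent function. Namely:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$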
def gradientDescent(X2, y2, theta, alpha, iters):
    # Create a (1, 3) zero matrix temp with theta's shape; it temporarily stores the new theta during each update
    temp = np.matrix(np.zeros(theta.shape))
    # theta.shape[1] is 3: the number of parameters to solve for (ravel(), which flattens a multidimensional array to one dimension, is not needed)
    parameters = int(theta.shape[1])
    print(parameters, 'parameters')
    cost = np.zeros(iters)  # Array of iters zeros to store the cost of each iteration
    for i in range(iters):
        # Error of every sample; each row is the error of one sample
        error = (X2 * theta.T) - y2
        for j in range(parameters):
            # The partial derivative of the cost with respect to theta_j is (1/m) * sum((X * theta.T - y) * x_j)
            # np.multiply multiplies element-wise, so term holds the per-sample products before they are summed
            term = np.multiply(error, X2[:, j])
            # This looks different from the textbook formula but is equivalent: the textbook transpose handles the summation,
            # whereas here theta is transposed only so the matrices can be multiplied, and np.sum does the summation directly
            temp[0, j] = theta[0, j] - ((alpha / len(X2)) * np.sum(term))
        theta = temp
        # Compute the cost once per iteration and store it as the i-th value of the cost array
        cost[i] = computeCost(X2, y2, theta)
    return theta, cost
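The inner loop over j can also be replaced by a single matrix product that updates every parameter at once. The following is a sketch of that alternative, not the function above: the name `gradient_descent_vectorized`, the use of plain arrays with `@`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def gradient_descent_vectorized(X, y, theta, alpha, iters):
    """Batch gradient descent that updates all parameters in one step.

    X has shape (m, n), y has shape (m, 1), theta has shape (1, n).
    """
    m = len(X)
    theta = theta.astype(float)  # astype returns a copy, so the caller's array is untouched
    cost = np.zeros(iters)
    for i in range(iters):
        error = X @ theta.T - y           # shape (m, 1)
        gradient = (error.T @ X) / m      # shape (1, n): all partial derivatives together
        theta = theta - alpha * gradient  # simultaneous update, no temp needed
        cost[i] = np.sum((X @ theta.T - y) ** 2) / (2 * m)
    return theta, cost

# Toy data generated from y = 1 + 2x, so the ideal parameters are [1, 2]
x = np.linspace(-1, 1, 21).reshape(-1, 1)
X = np.hstack([np.ones_like(x), x])
y = 1 + 2 * x

theta, cost = gradient_descent_vectorized(X, y, np.zeros((1, 2)), alpha=0.1, iters=2000)
print(theta)
print(cost[0], cost[-1])
```

Because the new theta is built from the old one in a single expression, the simultaneous-update requirement is satisfied without a temporary matrix.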
Initialize learning rate and iterations
alpha = 0.01
iters = 1000
# Call the gradient descent function to compute the optimal theta after iterating
g2, cost2 = gradientDescent(X2, y2, theta, alpha, iters)  # g2 is theta after iteration (return order: theta, cost)
print(g2, "g2")  # g2 is the theta that minimizes the cost function
print(cost2, "cost2")
The output optimal g2 is:
[[-1.10910099e-16 8.78503652e-01 -4.69166570e-02]] g2
Substitute the optimal value theta to calculate the minimum cost
# Substitute the optimal theta to compute the minimum of the cost function
minCost = computeCost(X2, y2, g2)
print(minCost, "minCost")
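Because g2 was fitted on normalized data, using it to predict a price takes two extra steps: normalize the inputs with the training means and standard deviations, then map the normalized prediction back to currency units. The sketch below shows one way this could look. The function name `predict_price` is hypothetical, and the values of `mu`, `sigma` and `theta` are round placeholders, not the statistics or parameters of this dataset; in practice they would be the mean and standard deviation saved before normalization, and g2.

```python
import numpy as np

def predict_price(size, bedrooms, theta, mu, sigma):
    """Predict a price in original units from a model fitted on normalized data.

    mu and sigma are length-3 arrays (size, bedrooms, price) holding the means and
    standard deviations that were used to normalize the training data.
    """
    # Normalize the inputs exactly as the training data was normalized; the leading 1 is the ones column
    x = np.array([1.0, (size - mu[0]) / sigma[0], (bedrooms - mu[1]) / sigma[1]])
    normalized_price = float(x @ np.asarray(theta, dtype=float).ravel())
    # Undo the normalization of the target
    return normalized_price * sigma[2] + mu[2]

# Placeholder statistics and parameters, chosen only to make the arithmetic easy to follow
mu = np.array([2000.0, 3.0, 340000.0])
sigma = np.array([800.0, 0.75, 125000.0])
theta = np.array([[0.0, 0.88, -0.05]])

price_at_mean = predict_price(2000, 3, theta, mu, sigma)  # inputs at the mean give the mean price
price_larger = predict_price(2800, 3, theta, mu, sigma)   # one standard deviation larger in size

print(price_at_mean, price_larger)
```

With these placeholders, a house at the mean size and bedroom count maps to the mean price, and one standard deviation more area adds 0.88 standard deviations of price.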
The minimum output cost is:
Since the gradient descent function records the cost at every training iteration, we can also plot the cost curve. Note that the cost decreases at every iteration, which is what we expect for a convex problem like this one when the learning rate is small enough.
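Besides looking at the plot, the claim that the cost keeps decreasing can be checked numerically. The sketch below assumes a helper named `is_non_increasing` and two made-up cost histories; with real results, the array returned by the gradient descent function would be passed in instead.

```python
import numpy as np

def is_non_increasing(cost, tol=1e-12):
    """Return True if no entry of the cost history exceeds its predecessor by more than tol."""
    return bool(np.all(np.diff(cost) <= tol))

# Made-up cost histories for illustration
converging = np.array([0.49, 0.30, 0.20, 0.15, 0.131])
diverging = np.array([0.49, 0.60, 0.95, 1.80])  # typical shape when the learning rate is too large

print(is_non_increasing(converging))
print(is_non_increasing(diverging))
```

A False result is a hint to lower the learning rate before trusting the fitted parameters.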
# fig is the drawing window (Figure); ax is the coordinate system (Axes) on that window. Subsequent operations are generally done on ax
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(np.arange(iters), cost2, 'r')  # np.arange() returns an evenly spaced array
ax.set_xlabel('Iterations')  # Set x-axis name
ax.set_ylabel('Cost')  # Set y-axis name
ax.set_title('Error vs. Training Epoch')  # Set the title of the plot
plt.show()
Output the plotted iteration curve: