Andrew Ng (Wu Enda) machine learning homework 1.1 - linear regression with multiple variables

1. problem and data

If you want to sell your house, you want to know what a good market price is. One way to do this is to first collect information about houses that have recently been sold. In this part of the exercise, you will use multiple linear regression to predict house prices.
The data file ex1data2.txt contains 47 rows and 3 columns, i.e. shape (47, 3). The first column is the area of the house in square feet, the second column is the number of bedrooms, and the third column is the price of the house.


2. data import and preliminary analysis

Import the packages. numpy and pandas handle the numerical and tabular operations, and matplotlib is used for plotting.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Import the dataset

path = 'ex1data2.txt'   # Data path; the file sits in the same folder as the .py file
data2 = pd.read_csv(path, header=None)  # The file has no header row, so the columns get the default labels 0, 1, 2
print(data2.head())  # Preview the first five rows of data

The output is as follows:

      0  1       2
0  2104  3  399900
1  1600  3  329900
2  2400  3  369000
3  1416  2  232000
4  3000  4  539900
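
As a hedged illustration of the loading step, the sketch below reads the five rows shown above from an in-memory string instead of from disk (assuming the file is comma-separated, which is what read_csv expects by default). The column names Size, Bedrooms, and Price are assumptions chosen for readability; the rest of this post keeps the default labels 0, 1, 2.

```python
import io
import pandas as pd

# First five rows of the dataset as printed above (comma-separated, no header row)
sample = io.StringIO(
    "2104,3,399900\n"
    "1600,3,329900\n"
    "2400,3,369000\n"
    "1416,2,232000\n"
    "3000,4,539900\n"
)

# Column names are hypothetical labels chosen for this sketch
df = pd.read_csv(sample, header=None, names=['Size', 'Bedrooms', 'Price'])
print(df.shape)
print(df.dtypes)
```

Named columns make later selections easier to read, at the price of differing from the integer labels used in the outputs of this post.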

The features are on very different scales (area in the thousands of square feet, bedroom counts in single digits), so feature normalization is needed.
The price is normalized as well. Otherwise the target would be in the hundreds of thousands while the normalized inputs are of order one, and gradient descent with a fixed learning rate would converge very slowly or diverge, leaving a large error between the predicted y and the actual y.

data2 = (data2 - data2.mean()) / data2.std()  # Standardization (z-score): subtract each column's mean, then divide by its standard deviation

After normalization, all columns are on a comparable scale. The first rows of the normalized data are:

          0         1         2
0  0.130010 -0.223675  0.475747
1 -0.504190 -0.223675 -0.084074
2  0.502476 -0.223675  0.228626
3 -0.735723 -1.537767 -0.867025
4  1.257476  1.090417  1.595389
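
A small sketch of what the normalization does, using only the five raw rows printed earlier. Because it uses five rows rather than all 47, the numbers differ from the table above; the point is only that every column ends up with mean 0 and sample standard deviation 1, and that keeping the mean and standard deviation lets you map values back to the original units. The variable names are assumptions for this sketch.

```python
import pandas as pd

# The five rows printed earlier in this post (columns: area, bedrooms, price)
raw = pd.DataFrame([
    [2104, 3, 399900],
    [1600, 3, 329900],
    [2400, 3, 369000],
    [1416, 2, 232000],
    [3000, 4, 539900],
])

mu = raw.mean()      # per-column mean
sigma = raw.std()    # per-column sample standard deviation (pandas default, ddof=1)
scaled = (raw - mu) / sigma

# Each scaled column has mean ~0 and sample standard deviation 1
print(scaled.mean().round(12))
print(scaled.std())

# Keeping mu and sigma allows mapping values back to the original units
restored = scaled * sigma + mu
```

Note that pandas divides by the sample standard deviation (ddof=1), while np.std defaults to the population version (ddof=0), so the two give slightly different scaled values.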

Gradient descent is now used to implement linear regression to minimize the cost function.
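First, we define the cost function in terms of the parameter vector θ (theta). (Note: "cost function" and "loss function" refer to the same thing here, namely $J(\theta)$.)

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$$

Since the problem is multivariable and has two x's, the hypothesis in this exercise is:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Combining the two formulas above gives the model we implement (if this is unfamiliar, see my post "Wu Enda machine learning check-in day1-P8"):

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_1^{(i)} + \theta_2 x_2^{(i)} - y^{(i)} \right)^2$$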

Compute the cost function J(θ)

def computeCost(X2, y2, theta):
    # Squared residuals (X2 * theta.T - y2)^2, i.e. the term to the right of the summation sign in the cost function; .T is the transpose
    inner = np.power(((X2 * theta.T) - y2), 2)
    # np.sum() adds up all elements (the summation sign); dividing by 2 * len(X2) applies the 1/(2m) factor
    return np.sum(inner) / (2 * len(X2))
# return hands the computed value back to the caller; without it the function would return None
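
For comparison, here is a minimal sketch of the same cost written with plain ndarrays and the @ operator instead of np.matrix (the NumPy documentation no longer recommends np.matrix). The function name compute_cost_ndarray and the tiny dataset are assumptions for illustration; theta keeps the (1, n) row shape used in this post.

```python
import numpy as np

def compute_cost_ndarray(X, y, theta):
    """Cost J(theta) = sum((X @ theta.T - y)^2) / (2m) for X (m, n), y (m, 1), theta (1, n)."""
    m = X.shape[0]
    residuals = X @ theta.T - y          # shape (m, 1)
    return float(np.sum(residuals ** 2) / (2 * m))

# Tiny hand-checkable example: y = 1 + 2*x exactly
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])   # first column is the ones column
y = np.array([[1.0], [3.0], [5.0]])

print(compute_cost_ndarray(X, y, np.array([[1.0, 2.0]])))  # theta that fits the data exactly
print(compute_cost_ndarray(X, y, np.array([[0.0, 0.0]])))  # all-zero theta
```

With ndarrays, * would multiply element by element, so the matrix product has to be written with @ explicitly.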

Let's add a column of ones to the training set so that we can use a vectorized solution to compute the cost and the gradients.
That is, in $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$ the parameter $\theta_0$ is multiplied by a constant feature $x_0 = 1$.

# Insert a column of ones as the leftmost column of the dataset
data2.insert(0, 'ones', 1)
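
To see why the ones column helps, the sketch below checks that a single matrix-vector product reproduces the hypothesis written out term by term. The feature values are copied from the normalized rows shown earlier; the theta values are arbitrary and purely illustrative.

```python
import numpy as np
import pandas as pd

# Two normalized feature columns taken from the rows printed earlier in this post
features = pd.DataFrame({0: [0.130010, -0.504190, 0.502476],
                         1: [-0.223675, -0.223675, -0.223675]})
features.insert(0, 'ones', 1)            # constant feature x0 = 1

theta = np.array([0.5, 2.0, -1.0])       # arbitrary illustrative parameters

# Vectorized hypothesis: one matrix-vector product covers the intercept too
vectorized = features.values @ theta

# The same hypothesis written out term by term
explicit = theta[0] + theta[1] * features[0].values + theta[2] * features[1].values

print(np.allclose(vectorized, explicit))
```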

Now let's do some variable initialization.

# .shape gives the dimensions of the data: shape[0] is the length of the first dimension (rows) and shape[1] is the length of the second dimension (columns).
# pandas iloc selects data by position: the part before the ',' selects rows and the part after it selects columns. After inserting the ones column there are four columns.
# Set X (training data) and y (target variable)
cols = data2.shape[1]  # Number of columns (second dimension); cols = 4 here
X2 = data2.iloc[ : , 0:cols-1]  # All rows, columns 0 to cols-2 (the ones column and the two features)
y2 = data2.iloc[ : , cols-1: cols]  # All rows, the last column only (the price)
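
The slicing above can be tried on a toy frame. This is a minimal sketch with made-up values laid out like data2 after the ones column is inserted; the names toy, X_part, and y_part are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the same layout as data2 after the ones column is inserted
toy = pd.DataFrame({'ones': [1, 1], 0: [0.13, -0.50], 1: [-0.22, -0.22], 2: [0.48, -0.08]})

cols = toy.shape[1]                   # 4 columns in total
X_part = toy.iloc[:, 0:cols - 1]      # positions 0, 1, 2 (the end of a slice is exclusive)
y_part = toy.iloc[:, cols - 1:cols]   # position 3 only; slicing keeps it a DataFrame

print(X_part.shape, y_part.shape)
print(type(toy.iloc[:, cols - 1]).__name__)  # a single integer position returns a Series instead
```

Using the slice cols-1:cols rather than the single index cols-1 is what keeps y two-dimensional, so it later converts to an (m, 1) matrix.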

Print the head of X2 to check that X (the training inputs) was extracted correctly.

print(X2.head())

Output X header:

   ones         0         1
0     1  0.130010 -0.223675
1     1 -0.504190 -0.223675
2     1  0.502476 -0.223675
3     1 -0.735723 -1.537767
4     1  1.257476  1.090417

Print the head of y2 to check that y (the target variable) was extracted correctly.

print(y2.head())

Output y header:

0  0.475747
1 -0.084074
2  0.228626
3 -0.867025
4  1.595389

The cost function operates on numpy matrices, so we need to convert X2 and y2 into matrices before using them. We also need to initialize theta by setting all of its elements to 0.

# pandas reads the data as a DataFrame, which is convenient for data manipulation. For matrix operations, the DataFrame must be converted to a matrix, e.g. X = np.matrix(X.values)
X2 = np.matrix(X2.values)  # Convert X2 to a matrix
y2 = np.matrix(y2.values)  # Convert y2 to a matrix
# theta = np.matrix(np.array([0, 0, 0]))  # Alternative: X2 has three columns, so theta needs three entries
theta = np.full((1, X2.shape[1]), 0)   # np.full() fills an array with the given value; with 0 it is much like np.zeros(), except that the dtype here is integer. An earlier ravel() call was removed because it was not needed
# np.full((rows, columns), fill_value)

# Print theta to check it. It should be a zero matrix with one row and three columns

print('theta:', theta)

Output theta:

theta: [[0 0 0]]

Check the dimensions. Matrices can only be multiplied when their dimensions are compatible, and errors in this kind of code are often caused by mismatched matrix dimensions.

print('X2.shape:', X2.shape)
print('theta.shape', theta.shape)
print('y2.shape', y2.shape)

Output dimensions of X2, theta, y2:

X2.shape: (47, 3)
theta.shape (1, 3)
y2.shape (47, 1)
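
If you prefer to fail early with a readable message instead of reading shapes by eye, a small helper can do the check. This is only a sketch; check_shapes is a hypothetical helper, not part of the exercise, and it assumes the (m, n), (m, 1), (1, n) layout used in this post.

```python
import numpy as np

def check_shapes(X, y, theta):
    """Raise ValueError if the shapes do not allow X * theta.T - y to be computed."""
    m, n = X.shape
    if theta.shape != (1, n):
        raise ValueError(f"theta has shape {theta.shape}, expected (1, {n})")
    if y.shape != (m, 1):
        raise ValueError(f"y has shape {y.shape}, expected ({m}, 1)")

# Shapes matching the ones reported in this post
X = np.zeros((47, 3))
y = np.zeros((47, 1))
theta = np.zeros((1, 3))
check_shapes(X, y, theta)   # passes silently

try:
    check_shapes(X, y, np.zeros((1, 2)))   # theta with too few entries
except ValueError as exc:
    print(exc)
```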

Calculate the initial cost (theta is initialized to 0)

print('Cost_init:', computeCost(X2, y2, theta))  # Initial cost, before any iteration has run

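Initial cost function result (with theta = 0 and y standardized by the sample standard deviation, the cost is 46 / (2 * 47)):

Cost_init: 0.48936 (rounded to five decimals)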
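
The initial cost for a standardized target can be worked out by hand: with theta = 0 every prediction is 0, so the cost is sum(y^2) / (2m), and standardizing with the sample standard deviation makes sum(y^2) equal to m - 1. The sketch below checks this on synthetic prices; the random values, loc, and scale are stand-ins chosen for illustration, not the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 47
prices = rng.normal(loc=340000.0, scale=125000.0, size=m)   # synthetic stand-in for the price column

# Standardize with the sample standard deviation (ddof=1), matching the pandas default
y = (prices - prices.mean()) / prices.std(ddof=1)

# With theta = 0 every prediction is 0, so the cost is sum(y^2) / (2m)
initial_cost = np.sum(y ** 2) / (2 * m)

print(initial_cost)
print((m - 1) / (2 * m))
```

The result does not depend on the actual prices, only on m, which makes it a handy sanity check that the normalized data and the cost function are wired together correctly.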

3. batch gradient descent

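The key point is that all parameters (theta0, theta1, and theta2) must be updated simultaneously, which is why temp is used for temporary storage.
The line that assigns temp[0, j] is the update of $\theta_j$ and is the core of the gradient descent function. Namely:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$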

def gradientDescent(X2, y2, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))   # Zero matrix with theta's shape, (1, 3), that holds the updated theta during each iteration
    parameters = int(theta.shape[1])  # Number of parameters to solve for; theta.shape[1] is 3
    print(parameters, 'parameters')
    cost = np.zeros(iters)  # Array of zeros, one entry per iteration, to store the cost

    for i in range(iters):
        error = (X2 * theta.T) - y2  # Prediction error for every sample; each row is the error of one sample

        for j in range(parameters):
            # The partial derivative of the cost with respect to theta_j is (1/m) * sum((X * theta.T - y) * x_j)
            # X2 * theta.T gives a column of predictions, so error is a column that lines up with column j of X2
            # np.multiply multiplies element by element, so term holds the per-sample products before they are summed
            term = np.multiply(error, X2[ : , j])
            # This looks different from the transposed form in the course notes, which folds the sum into a matrix product;
            # here the products are formed element-wise and np.sum adds them up explicitly
            temp[0, j] = theta[0, j] - ((alpha / len(X2)) * np.sum(term))

        theta = temp  # theta and temp now share one matrix; the update stays simultaneous because error was computed before the inner loop
        cost[i] = computeCost(X2, y2, theta)  # Compute the cost after each iteration and store it as the i-th entry of the cost array

    return theta, cost
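
The inner loop over j can also be replaced by a single matrix product. The sketch below is one possible vectorized variant using plain ndarrays, tested on synthetic noiseless data generated from known parameters; the function name gradient_descent_vectorized, the synthetic data, and the choice alpha=0.1 with 2000 iterations are all assumptions for this illustration, not the settings used in this post.

```python
import numpy as np

def gradient_descent_vectorized(X, y, theta, alpha, iters):
    """Batch gradient descent for X (m, n), y (m, 1), theta (1, n); returns (theta, cost history)."""
    m = X.shape[0]
    theta = theta.astype(float).copy()       # work on a float copy, leave the caller's array untouched
    cost = np.zeros(iters)
    for i in range(iters):
        error = X @ theta.T - y              # (m, 1) residuals
        gradient = (error.T @ X) / m         # (1, n): all partial derivatives at once
        theta = theta - alpha * gradient     # simultaneous update of every parameter
        cost[i] = np.sum((X @ theta.T - y) ** 2) / (2 * m)
    return theta, cost

# Synthetic check: noiseless data generated from known parameters
rng = np.random.default_rng(0)
features = rng.normal(size=(50, 2))
X = np.hstack([np.ones((50, 1)), features])
true_theta = np.array([[0.5, 2.0, -1.0]])
y = X @ true_theta.T

theta, cost = gradient_descent_vectorized(X, y, np.zeros((1, 3)), alpha=0.1, iters=2000)
print(theta)
print(cost[0], cost[-1])
```

Because the whole gradient is computed from the old theta before any entry changes, the simultaneous update comes for free and no temp matrix is needed.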

Initialize learning rate and iterations

alpha = 0.01
iters = 1000

# Call the gradient descent function to compute theta after the iterations

g2, cost2 = gradientDescent(X2, y2, theta, alpha, iters)
# g2 is theta after the iterations; the function returns (theta, cost) in that order
print(g2, "g2")  # g2 is the theta that approximately minimizes the cost function after iters iterations
print(cost2, "cost2")

The resulting g2 is:

[[-1.10910099e-16  8.78503652e-01 -4.69166570e-02]] g2

Substitute the optimal value theta to calculate the minimum cost

minCost = computeCost(X2, y2, g2)  # Substitute the optimal value theta to calculate the minimum cost function
print(minCost, "minCost")

The minimum output cost is:

0.1307033696077189 minCost
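
Since the model was fitted on normalized data, a prediction has to normalize its inputs and then map the result back to price units. The sketch below does this with the g2 printed above. It recovers each column's mean and standard deviation from two raw rows and their normalized counterparts shown earlier in this post, so the statistics are only approximate (the printed values are rounded); in your own script you would simply store the mean and standard deviation before normalizing. The helper names and the query house (1650 square feet, 3 bedrooms) are assumptions for illustration.

```python
import numpy as np

# Parameters reported above (the first entry is numerically zero)
g2 = np.array([-1.10910099e-16, 8.78503652e-01, -4.69166570e-02])

def recover_stats(raw_a, raw_b, z_a, z_b):
    """Recover (mean, std) of a column from two raw values and their z-scores."""
    std = (raw_a - raw_b) / (z_a - z_b)
    mean = raw_a - z_a * std
    return mean, std

# Raw and normalized values copied from the two tables printed earlier in this post
area_mean, area_std = recover_stats(2104, 1600, 0.130010, -0.504190)
beds_mean, beds_std = recover_stats(3, 2, -0.223675, -1.537767)
price_mean, price_std = recover_stats(399900, 329900, 0.475747, -0.084074)

def predict_price(area, bedrooms):
    """Normalize the inputs, apply the linear model, and map the result back to price units."""
    x = np.array([1.0,
                  (area - area_mean) / area_std,
                  (bedrooms - beds_mean) / beds_std])
    return float(x @ g2) * price_std + price_mean

print(round(predict_price(1650, 3)))
```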

Since the gradient descent function records the cost at every training iteration, we can also plot the cost curve. Note that the cost decreases at every iteration, which is what you expect for a convex problem like this one when the learning rate is small enough.

# fig is the drawing window (Figure); ax is the coordinate system (Axes) on it. Most plotting calls are made on ax
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(np.arange(iters), cost2, 'r')  # np.arange(iters) returns the evenly spaced integers 0, 1, ..., iters-1
ax.set_xlabel('Iterations')  # Set the x-axis label
ax.set_ylabel('Cost')  # Set the y-axis label
ax.set_title('Error vs. Training Epoch')  # Set the title of the plot
plt.show()  # Display the figure

Output the plotted iteration curve:



Posted by artist-ink on Fri, 03 Jun 2022 03:42:53 +0530