Principle
Basic principles
Minimizing the loss function
- There are many methods for minimizing the loss function of binary logistic regression (equivalently, maximizing its log-likelihood); the most common are gradient descent, coordinate descent, and Newton's method. Gradient descent is used most often: it iteratively steps toward the optimal solution.
- Gradient descent comes in three variants: stochastic gradient descent (SGD), batch gradient descent (BGD), and mini-batch gradient descent (MBGD); see the sketch below.
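A minimal NumPy sketch (not from the original post; gd_step and train are hypothetical names) of how the three variants share one update rule, w = w + alpha * X.T * (y - sigmoid(X * w)), and differ only in how many samples feed each update:

import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def gd_step(w, X, y, alpha):
    # One update on the log-likelihood: w += alpha * X.T (y - sigmoid(X w))
    h = sigmoid(X.dot(w))
    return w + alpha * X.T.dot(y - h)

def train(X, y, alpha=0.01, epochs=100, batch_size=None):
    # batch_size=None -> BGD, 1 -> SGD, a small int -> MBGD
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(epochs):
        if batch_size is None:            # BGD: every sample in each update
            w = gd_step(w, X, y, alpha)
        else:                             # SGD / MBGD: shuffled (mini-)batches
            idx = np.random.permutation(m)
            for start in range(0, m, batch_size):
                batch = idx[start:start + batch_size]
                w = gd_step(w, X[batch], y[batch], alpha)
    return w

# Example usage with synthetic data:
# X = np.random.randn(200, 3); y = (X[:, 0] > 0).astype(float)
# w = train(X, y, batch_size=32)  # MBGD with batches of 32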
Advantages and disadvantages
Advantages:
(1) Training is fast, and the computational cost of classification depends only on the number of features;
(2) It is simple and easy to understand, and the model is highly interpretable: the learned feature weights show how strongly each feature influences the final result (see the sketch after this list);
(3) It is suitable for binary classification problems and does not require the input features to be scaled;
(4) Memory consumption is small, since only one weight per feature dimension needs to be stored;
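As a quick illustration of point (2) (a hedged sketch assuming scikit-learn is available; the dataset choice is arbitrary), the fitted weights can be read directly to gauge each feature's influence:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()
clf = LogisticRegression(max_iter=5000).fit(data.data, data.target)
# A positive weight pushes the prediction toward class 1, a negative one
# toward class 0; larger magnitude means a stronger influence (assuming
# comparably scaled features).
for name, w in zip(data.feature_names, clf.coef_[0]):
    print(f'{name}: {w:+.3f}')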
Disadvantages:
(1) Logistic regression cannot solve nonlinear problems, because its decision boundary is linear;
(2) It is sensitive to multicollinearity in the data;
(3) It has difficulty handling imbalanced data;
(4) Its accuracy is often limited: the model form is very simple (essentially a linear model), so it is hard to fit the true distribution of the data;
(5) Logistic regression itself cannot select features; in practice GBDT is sometimes used to select features first, and logistic regression is then applied, as sketched below.
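A hedged scikit-learn sketch of point (5), not part of the original post: GBDT feature importances select a subset of features, and logistic regression is then fit on the survivors.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
pipe = make_pipeline(
    SelectFromModel(GradientBoostingClassifier(random_state=0)),  # GBDT picks features
    LogisticRegression(max_iter=1000),                            # LR fits the survivors
)
pipe.fit(X, y)
print('accuracy on the training data:', pipe.score(X, y))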
Python implementation
LR_Train: training the model
# coding:UTF-8
# Author:xwj
# Date:2020-7-2
# Email:xwj770427414@126.com
# Environment:Python3.7
import numpy as np


def sig(x):
    """Log-odds (Sigmoid) function.

    :param x: mat, feature * w (the bias is folded into the features)
    :return: P(y=1|x,w,b)
    """
    return 1.0 / (1 + np.exp(-x))


def lr_train_bfd(feature, label, maxCycle, alpha):
    """Train the LR model with gradient descent.

    :param feature: mat, features
    :param label: mat, labels
    :param maxCycle: int, maximum number of iterations
    :param alpha: float, learning rate
    :return: w, the weights
    """
    n = np.shape(feature)[1]  # Number of features
    w = np.mat(np.ones((n, 1)))  # Initialize the weights
    i = 0
    while i <= maxCycle:  # Loop up to the maximum number of iterations
        i += 1
        h = sig(feature * w)  # Sigmoid of the current linear output
        err = label - h  # Error
        if i % 100 == 0:  # Report every 100 iterations
            print('\t--------Iterations = ' + str(i) +
                  ', Training error rate = ' + str(error_rate(h, label)))
        w = w + alpha * feature.T * err  # Weight update
    return w


def error_rate(h, label):
    """Compute the value of the loss function.

    :param h: mat, predicted values
    :param label: mat, actual values
    :return: float, err/m, the average loss
    """
    m = np.shape(h)[0]  # Number of predictions
    sum_err = 0.0  # Accumulated loss
    for i in range(m):  # Iterate over the m predictions
        if h[i, 0] > 0 and (1 - h[i, 0]) > 0:  # Guard against log(0)
            sum_err -= (label[i, 0] * np.log(h[i, 0]) +
                        (1 - label[i, 0]) * np.log(1 - h[i, 0]))  # Log loss
    return sum_err / m


def load_data(file_name):
    """Clean and import the data.

    :param file_name: name of the dataset file
    :return: features and labels, both as matrices
    """
    f = open(file_name)
    feature_data = []  # Feature rows
    label_data = []  # Label rows
    for line in f.readlines():  # Read the dataset line by line
        feature_tmp = []  # Features of the current sample
        label_tmp = []  # Label of the current sample
        lines = line.strip().split('\t')  # Strip trailing whitespace (\n etc.) and split on tabs
        feature_tmp.append(1)  # Bias term b: a constant 1 folded into the features
        for i in range(len(lines) - 1):  # Every column except the last is a feature
            feature_tmp.append(float(lines[i]))
        label_tmp.append(float(lines[-1]))  # The last column is the label
        feature_data.append(feature_tmp)
        label_data.append(label_tmp)
    f.close()
    return np.mat(feature_data), np.mat(label_data)  # Feature matrix and label matrix


def save_model(file_name, w):
    """Save the weights of the final model.

    :param file_name: output file name
    :param w: weights
    """
    m = np.shape(w)[0]
    f_w = open(file_name, 'w')
    w_array = []
    for i in range(m):
        w_array.append(str(w[i, 0]))
    f_w.write('\t'.join(w_array))  # One tab-separated row of weights
    f_w.close()


if __name__ == "__main__":
    print('------1.Import data------')
    feature, label = load_data("data.txt")
    print('------2.Training model------')
    w = lr_train_bfd(feature, label, 1000, 0.01)
    print('------3.Save model------')
    save_model("weights.txt", w)
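The script assumes data.txt is tab-separated with one sample per line, features first and the label in the last column; since error_rate uses the log loss, labels are assumed to be 0/1. A hypothetical two-feature file would look like:

1.2	0.7	1
-0.4	2.1	0
0.3	-1.5	0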
Testing the model
# coding:UTF-8
# Author:xwj
# Date:2020-7-2
# Email:xwj770427414@126.com
# Environment:Python3.7
import numpy as np


def sig(x):
    """Log-odds (Sigmoid) function.

    :param x: mat, feature * w (the bias is folded into the features)
    :return: P(y=1|x,w,b)
    """
    return 1.0 / (1 + np.exp(-x))


def load_weight(w):
    """Import the trained LR model.

    :param w: path of the stored weights
    :return: np.mat(w), the weights as a matrix
    """
    f = open(w)
    w = []
    for line in f.readlines():
        lines = line.strip().split('\t')
        w_tmp = []
        for x in lines:
            w_tmp.append(float(x))
        w.append(w_tmp)
    f.close()
    return np.mat(w)


def load_data(file_name, n):
    """Import the test data.

    :param file_name: path of the test data
    :param n: number of weights (features plus the bias)
    :return: np.mat(feature), the test-set features
    """
    f = open(file_name)
    feature_data = []
    for line in f.readlines():
        feature_tmp = []
        lines = line.strip().split('\t')
        if len(lines) != n - 1:  # Skip rows whose feature count does not match the model
            continue
        feature_tmp.append(1)  # Bias term b: a constant 1 folded into the features
        for x in lines:
            feature_tmp.append(float(x))
        feature_data.append(feature_tmp)
    f.close()
    return np.mat(feature_data)


def predict(data, w):
    """Predict on the test data.

    :param data: mat, test features
    :param w: parameters of the model
    :return: h, mat, the final predictions
    """
    h = sig(data * w.T)  # w is a 1 x n row matrix, hence the transpose
    m = np.shape(h)[0]
    for i in range(m):  # Threshold the probabilities at 0.5
        if h[i, 0] < 0.5:
            h[i, 0] = 0.0
        else:
            h[i, 0] = 1.0
    return h


def save_result(file_name, result):
    """Save the final predictions.

    :param file_name: output file name for the predictions
    :param result: mat, the predictions
    """
    m = np.shape(result)[0]
    tmp = []
    for i in range(m):
        tmp.append(str(result[i, 0]))
    f_result = open(file_name, "w")
    f_result.write("\t".join(tmp))
    f_result.close()


if __name__ == "__main__":
    print('------1.Import model------')
    w = load_weight("weights.txt")
    n = np.shape(w)[1]
    print('------2.Import data------')
    testData = load_data("test_data", n)
    print('------3.Forecast data------')
    h = predict(testData, w)
    print('------4.Save results------')
    save_result("result.txt", h)
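Two details worth noting about this script: save_model writes the weights as a single tab-separated row, so load_weight returns a 1 x n row matrix and predict must multiply by w.T; and predict thresholds the sigmoid output at 0.5, turning probabilities into hard 0/1 labels before they are saved.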