Logistic regression: principle and Python implementation (including data set)

Principle

Basic principles
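
Logistic regression is a linear classifier for binary problems. It passes the linear score w·x + b through the Sigmoid function, sig(z) = 1 / (1 + e^(-z)), to obtain a probability P(y=1|x, w, b); a sample is assigned to class 1 when this probability exceeds 0.5. Training chooses the weights w (with the bias b folded in as an extra feature, as in the code below) by maximizing the likelihood of the labels, which is equivalent to minimizing the cross-entropy loss computed in error_rate() below.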



Solving the loss function

  • There are many ways to minimize the loss function (the negative log-likelihood) of binary logistic regression; the most common are gradient descent, coordinate descent, and Newton's method, with gradient descent used most often in practice to approach the optimal solution iteratively.
  • Gradient descent comes in three main variants: stochastic gradient descent (SGD), batch gradient descent (BGD), and mini-batch gradient descent (MBGD); see the sketch below.
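
To make the second bullet concrete, here is a minimal NumPy sketch of the three update rules (the names X, y, alpha and the step functions are illustrative, not the article's code): X is an m x n feature matrix, y a 0/1 label vector, and alpha the learning rate.

import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bgd_step(w, X, y, alpha):
    """One batch (BGD) step: the gradient uses all m samples."""
    err = y - sigmoid(X.dot(w))
    return w + alpha * X.T.dot(err) / len(y)


def sgd_step(w, X, y, alpha):
    """One stochastic (SGD) step: the gradient uses one random sample."""
    i = np.random.randint(len(y))
    err = y[i] - sigmoid(X[i].dot(w))
    return w + alpha * err * X[i]


def mbgd_step(w, X, y, alpha, batch_size=32):
    """One mini-batch (MBGD) step: the gradient uses a random subset
    (batch_size must not exceed the number of samples)."""
    idx = np.random.choice(len(y), batch_size, replace=False)
    err = y[idx] - sigmoid(X[idx].dot(w))
    return w + alpha * X[idx].T.dot(err) / batch_size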

Advantages and disadvantages

Advantages:
(1) Training is fast, and at classification time the computation depends only on the number of features;
(2) It is simple and easy to understand, and the model is highly interpretable: the feature weights show how much each feature influences the final result;
(3) It is suitable for binary classification problems and does not require the input features to be scaled;
(4) Memory consumption is small, because only the weight of each feature dimension needs to be stored.

Disadvantages:
(1) Logistic regression cannot solve nonlinear problems, because its decision boundary is linear;
(2) It is sensitive to multicollinearity in the data;
(3) It has difficulty handling class-imbalanced data;
(4) Its accuracy is often not very high: because the model form is so simple (essentially a linear model), it is hard to fit the real distribution of the data;
(5) Logistic regression cannot select features by itself, so a GBDT is sometimes used to select features first, after which logistic regression is applied; see the sketch after this list.
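
As a hedged illustration of point (5): one common pattern is to fit a gradient-boosted tree model, keep only the features it finds important, and then train logistic regression on those. The scikit-learn code, the synthetic data, and the 0.01 importance cutoff below are illustrative assumptions, not part of the original article.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 1. Fit a GBDT and keep features whose importance clears a cutoff
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)
keep = gbdt.feature_importances_ > 0.01  # illustrative cutoff

# 2. Train logistic regression on the selected features only
lr = LogisticRegression(max_iter=1000).fit(X[:, keep], y)
print('kept features:', keep.sum(), 'accuracy:', lr.score(X[:, keep], y))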

Python implementation

LR_Train: training the model

# coding:UTF-8
# Author:xwj
# Date:2020-7-2
# Email:xwj770427414@126.com
# Environment:Python3.7
import numpy as np


def sig(x):
    """
    Logarithmic probability function  Sigmoid function
    :param x: feature * x + b
    :return:P(y=1|x,w,b)
    """
    return 1.0/(1+np.exp(-x))


def lr_train_bfd(feature, label, maxCycle, alpha):
    """
    Training with gradient descent method LR model
    :param feature:mat,feature 
    :param label: mar,label
    :param maxCycle: int,Maximum iterations
    :param alpha: float,Learning rate
    :return: w,weight
    """
    n = np.shape(feature)[1]  # Number of features
    w = np.mat(np.ones((n, 1)))  # Initialize weights
    i = 0
    while i <= maxCycle:  # Loop within maximum iterations
        i += 1
        h = sig(feature * w)  # Calculate the value of Sigmoid
        err = label - h  # error
        if i % 100 == 0:  # Once every 100 iterations
            print('\t--------Iterations = ' + str(i) + ',Training error rate = ' + str(error_rate(h, label)) )
            w = w + alpha * feature.T * err  # Weight correction
    return w


def error_rate(h, label):
    """
    Calculate loss function value
    :param h: mat,Estimate
    :param label: mat,actual value
    :return: float,err/m error rate
    """
    m = np.shape(h)[0]   # Number of predicted values
    sum_err = 0.0  # Initialization error rate
    for i in range(m):   # m predicted value iterations
        if h[i, 0] > 0 and (1 - h[i, 0]) > 0:  # Predictor slice
            sum_err -= (label[i, 0] * np.log(h[i, 0]) + (1 - label[i, 0]) * np.log(1 - h[i, 0]))  # Loss function formula calculation
        else:
            sum_err -= 0
    return sum_err / m


def load_data(file_name):
    """
    Cleaning data, importing data
    :param file_name: Dataset name
    :return: Eigenvalues and labels in matrix form
    """
    f = open(file_name)
    feature_data = []  # Characteristic data
    label_data = []  # Label data
    for line in f.readlines():  # Data set read row by row
        feature_tmp = []  # Staging characteristics
        label_tmp = []  # Temporary label
        lines = line.strip().split('\t')  # Remove the special symbols (\n, etc.) at the end of the data, and divide the data into lists at \t intervals.
        feature_tmp.append(1)  # The initial offset term b is 1 and merged into the feature
        for i in range(len(lines) - 1):  # Read the characteristic data one by one and remove the label at the end
            feature_tmp.append(float(lines[i]))  # The features are floating-point numbered one by one and added to the temporary features to form a list
        label_tmp.append(float(lines[-1]))  # Add labels to temporary labels to form a list
        feature_data.append(feature_tmp)  # Add staging list to general list
        label_data.append(label_tmp)  # Add staging list to general list
    f.close()
    return np.mat(feature_data), np.mat(label_data)  # Matrix feature sequence and label sequence


def save_model(file_name, w):
    """
    Save weights for final model
    :param file_name:Dataset name
    :param w: weight
    :return: Save model file
    """
    m = np.shape(w)[0]
    f_w = open(file_name, 'w')
    w_array = []
    for i in range(m):
        w_array.append(str(w[i, 0]))
    f_w.write('\t'.join(w_array))
    f_w.close()


if __name__ == "__main__":
    print('------1.Import data------')
    feature, label = load_data("data.txt")
    print('------2.Training model------')
    w = lr_train_bfd(feature, label, 1000, 0.01)
    print('------3.Save model------')
    save_model("weights.txt", w)
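
For reference, load_data() expects a tab-separated text file in which each row holds the feature values followed by the 0/1 label in the last column. A hypothetical two-feature data.txt (the values are illustrative only) might look like:

1.2	0.7	1
-0.3	2.1	0
0.5	-1.4	1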

Test model

# coding:UTF-8
# Author:xwj
# Date:2020-7-2
# Email:xwj770427414@126.com
# Environment:Python3.7
import numpy as np


def sig(x):
    """
    Logarithmic probability function  Sigmoid function
    :param x: feature * x + b
    :return:P(y=1|x,w,b)
    """
    return 1.0/(1+np.exp(-x))


def load_weight(w):
    """
    Import LR Training model
    :param w: w Weight storage location
    :return: np.mat(w),Matrix of weights
    """
    f = open(w)
    w = []
    for line in f.readlines():
        lines = line.strip().split('\t')
        w_tmp = []
        for x in lines:
            w_tmp.append(float(x))
        w.append(w_tmp)
    f.close()
    return np.mat(w)


def load_data(file_name, n):
    """
    Import test data
    :param file_name:Test data location
    :param n: Number of features
    :return: np.mat(feature)Characteristics of test sets
    """
    f = open(file_name)
    feature_data = []
    for line in f.readlines():
        feature_tmp = []
        lines = line.strip().split('\t')
        if len(lines) != n - 1:  # Skip rows whose feature count does not match the model (n includes the bias)
            continue
        feature_tmp.append(1)  # The initial offset term b is 1 and merged into the feature
        for x in lines:
            feature_tmp.append(float(x))
        feature_data.append(feature_tmp)
    f.close()
    return np.mat(feature_data)


def predict(data, w):
    """
    Predict test data
    :param data: mat,Characteristics of the model
    :param w: Parameters of the model
    :return: h,mat,Final forecast results
    """
    h = sig(data * w.T)
    m = np.shape(h)[0]
    for i in range(m):
        if h[i, 0] < 0/5:
            h[1, 0] = 0.0
        else:
            h[i, 0] = 1.0
    return h


def save_result(file_name, result):
    """
    Save final forecast results
    :param file_name: Forecast result save file name
    :param result: mat,Predicted results
    """
    m = np.shape(result)[0]
    tmp = []
    for i in range(m):
        tmp.append(str(h[i, 0]))
    f_result = open(file_name, "w")
    f_result.write("\t".join(tmp))
    f_result.close()


if __name__ == "__main__":
    print('------1.Import model------')
    w = load_weight("weights.txt")
    n = np.shape(w)[1]
    print('------2.Import data------')
    testData = load_data("test_data", n)
    print('------3.Forecast data------')
    h = predict(testData, w)
    print('------4.Save results------')
    save_result("result.txt", h)
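
Run the training script first so that weights.txt exists; the test script then writes the predictions, one 0.0/1.0 value per sample separated by tabs, to result.txt.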

Data set

  1. Training data set
  2. Test data set

Tags: Python Machine Learning logistic regression
