[Machine learning practice] - Titanic dataset -- Boosting (XGBoost)

1. Preface:

This article belongs to the hands-on practice part of the series and focuses on applying the algorithm in a real project. For a deeper understanding of the XGBoost algorithm itself, please refer to the following links, which were very helpful in my own learning:

[1] Summary of XGBoost algorithm principles: https://www.cnblogs.com/pinard/p/10979808.html

[2] XGBoost Parameters: https://xgboost.readthedocs.io/en/latest/parameter.html

[3] XGBoost Python API Reference: https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn

 

2. Dataset:

Dataset address: https://www.kaggle.com/c/titanic

The Titanic competition is one of the most popular on Kaggle. The data itself is small and simple, which makes it well suited for beginners to learn and compare various machine learning algorithms.

The dataset contains 11 feature variables: PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin and Embarked. These variables are used to predict whether a passenger survived the Titanic disaster (the Survived column).
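For a quick first look at the data, here is a minimal sketch (assuming the train.csv file from the Kaggle page above has been downloaded to the working directory):

import pandas as pd

data = pd.read_csv('train.csv')

# Column dtypes and non-null counts; Age, Cabin and Embarked contain missing values
data.info()

# Fraction of survivors in the training set (roughly 0.38)
print(data['Survived'].mean())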

 

3. Algorithm introduction:

XGBoost is essentially an optimization of GBDT. GBDT uses the first-order Taylor expansion of the loss to fit the negative gradient, whereas XGBoost expands the loss function to second order and solves, in a single pass, for all $J$ leaf-node regions $R_{tj}$ of the optimal decision tree as well as the optimal value $c_{tj}$ of each leaf-node region.
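As a brief sketch of the second-order idea (notation consistent with the paragraph above and with reference [1]: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous round's prediction, $J$ is the number of leaves, and $\gamma$, $\lambda$ are regularization coefficients), the approximate objective of round $t$ and the resulting optimal leaf value are:

$$
\tilde{L}_t \approx \sum_{i=1}^{m}\Big[g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i)\Big] + \gamma J + \frac{\lambda}{2}\sum_{j=1}^{J} c_{tj}^2,
\qquad
c_{tj}^{*} = -\frac{\sum_{x_i \in R_{tj}} g_i}{\sum_{x_i \in R_{tj}} h_i + \lambda}
$$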

 

4. Hands-on practice:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split, cross_validate
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, auc
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import class_weight

import xgboost as xgb


# Select a subset of DataFrame columns and return them as a numpy array,
# so that each pipeline only sees the attributes it is responsible for
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.attribute_name].values


# Load data
data = pd.read_csv('train.csv')

data_x = data.drop('Survived', axis=1)
data_y = data['Survived']

# Data cleaning: categorical, discrete and continuous attributes
cat_attribs = ['Pclass', 'Sex', 'Embarked']
dis_attribs = ['SibSp', 'Parch']
con_attribs = ['Age', 'Fare']

# encoder: OneHotEncoder(), OrdinalEncoder()
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

dis_pipeline = Pipeline([
    ('selector', DataFrameSelector(dis_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
])

full_pipeline = FeatureUnion(
    transformer_list=[
        ('con_pipeline', con_pipeline),
        ('dis_pipeline', dis_pipeline),
        ('cat_pipeline', cat_pipeline),
    ]
)

data_x_cleaned = full_pipeline.fit_transform(data_x)

X_train, X_test, y_train, y_test = train_test_split(data_x_cleaned, data_y, stratify=data_y, test_size=0.25, random_state=1992)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# XGBoost
xgb_cla = xgb.XGBClassifier(use_label_encoder=False, verbosity=1, objective='binary:logistic', random_state=1992)
cls_wt = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)  # scale_pos_weight candidate: cls_wt[1]/cls_wt[0]

param_grid = [{
    'learning_rate': [0.5],
    'n_estimators': [5],
    'max_depth': [5],
    'min_child_weight': [5],
    'gamma': [5],
    'scale_pos_weight': [1],
    'subsample': [0.8],
}]

grid_search = GridSearchCV(xgb_cla, param_grid=param_grid, cv=cv, scoring='accuracy', n_jobs=-1, return_train_score=True)
grid_search.fit(X_train, y_train)
cv_results_grid_search = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score')

predicted_train_xgb = grid_search.predict(X_train)
predicted_test_xgb = grid_search.predict(X_test)

print('------------XGBoost grid_search: Results of train------------')
print(accuracy_score(y_train, predicted_train_xgb))
print(precision_score(y_train, predicted_train_xgb))
print(recall_score(y_train, predicted_train_xgb))

print('------------XGBoost grid search: Results of test------------')
print(accuracy_score(y_test, predicted_test_xgb))
print(precision_score(y_test, predicted_test_xgb))
print(recall_score(y_test, predicted_test_xgb))

 

4.1 Custom eval_metric in XGBoost:

As with AdaBoost and GBDT, we want to evaluate the staged strong learner at each boosting round on both the training set and the validation set. Since xgboost does not provide a staged_predict method directly, we use the eval_metric parameter to specify the evaluation metric; the built-in metric used here is "error". I also wrote a custom "accuracy" eval metric so it can be compared against "error". However, I ran into a subtlety: the objective we set earlier is "binary:logistic". According to the official documentation this is "logistic regression for binary classification, output probability", but in fact a probability is only produced by the predict/predict_proba methods; the "y_predicted" passed into a custom eval_metric is actually $f(x)$, i.e. the raw, untransformed margin. The custom eval_metric therefore has to be written with care (here the margin is thresholded at 0 rather than the probability at 0.5).
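Before the custom metrics below, a small illustrative sketch (not part of the original script) of why thresholding the raw margin at 0 works: with objective='binary:logistic', the probability is just the sigmoid of $f(x)$, so "margin > 0" and "probability > 0.5" give exactly the same labels.

import numpy as np

def sigmoid(margin):
    # the transformation binary:logistic applies to the raw score f(x)
    return 1.0 / (1.0 + np.exp(-margin))

raw_margin = np.array([-2.3, -0.1, 0.0, 0.4, 1.7])  # hypothetical raw scores f(x)
prob = sigmoid(raw_margin)

print((raw_margin > 0).astype(int))  # [0 0 0 1 1]
print((prob > 0.5).astype(int))      # identical labels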

 

def metric_precision(y_predicted, y_true):
    # y_true is a DMatrix; y_predicted is the raw margin f(x), not a probability,
    # so the decision threshold is 0 (equivalent to a probability of 0.5)
    label = y_true.get_label()
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    precision = precision_score(label, predicted_binary)
    return 'Precision', precision


def metric_accuracy(y_predicted, y_true):
    label = y_true.get_label()
    # predicted_binary = np.round(y_predicted)
    # y_predicted = 1 / (1 + np.exp(-y_predicted))
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    accuracy = accuracy_score(label, predicted_binary)
    return 'Accuracy', accuracy


def metric_recall(y_predicted, y_true):
    label = y_true.get_label()
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    recall = recall_score(label, predicted_binary)
    return 'Recall', recall


xgb_cla2 = grid_search.best_estimator_
temp2 = xgb_cla2.predict(X_train)            # class labels
temp2_pro = xgb_cla2.predict_proba(X_train)  # probabilities (sigmoid applied)

# Built-in "error" metric for comparison
xgb_cla2.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric=['error'])

# Custom accuracy metric
xgb_cla2.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric=metric_accuracy)

plt.figure()
plt.plot(xgb_cla2.evals_result()['validation_1']['Accuracy'], label='test')
plt.plot(xgb_cla2.evals_result()['validation_0']['Accuracy'], label='train')

plt.legend()
plt.grid(axis='both')

After grid search on the training set, n_estimators (sklearn API) / num_boost_round (learning API) is optimal at 5. The plots show that the optimum on the test set is also reached at 5: beyond that point test accuracy no longer improves, while train accuracy keeps rising and the model gradually overfits.
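A related option (a sketch under the assumption that the xgboost version in use, like the 1.x sklearn wrapper here, still accepts early_stopping_rounds as a fit argument) is to let XGBoost stop adding rounds once the validation metric stops improving, instead of fixing n_estimators by grid search:

# Allow many rounds, but stop once the test-set error has not improved for 10 rounds
xgb_cla3 = xgb.XGBClassifier(use_label_encoder=False, objective='binary:logistic',
                             n_estimators=200, learning_rate=0.5, max_depth=5,
                             min_child_weight=5, gamma=5, subsample=0.8, random_state=1992)
xgb_cla3.fit(X_train, y_train,
             eval_set=[(X_test, y_test)],
             eval_metric='error',
             early_stopping_rounds=10)
print(xgb_cla3.best_iteration)  # round at which the validation error was lowest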

4.2 Result analysis:

As in the previous articles, XGBoost with the optimal parameters is applied to the prediction set and the results are uploaded to Kaggle. The results are as follows (note: "training set" here means the whole training set, i.e. both the train and test splits used in the code):

It can be seen that XGBoost performs about as well as AdaBoost and GBDT, and slightly worse than RF (random forest).

 

Model                          Training set accuracy   Training set precision   Training set recall   Prediction set accuracy (Kaggle submission)
Naive Bayes (optimal solution) 0.790                   0.731                    0.716                 0.756
Perceptron                     0.771                   0.694                    0.722                 0.722
Logistic regression            0.807                   0.781                    0.690                 0.768
Linear SVM                     0.801                   0.772                    0.684                 0.773
RBF kernel SVM                 0.834                   0.817                    0.731                 0.785
AdaBoost                       0.844                   0.814                    0.769                 0.789
GBDT                           0.843                   0.877                    0.687                 0.778
RF                             0.820                   0.917                    0.585                 0.792
XGBoost                        0.831                   0.847                    0.681                 0.780
