# 1. Preface:

This article is a hands-on piece: it focuses on applying the algorithm in a real project. For a deeper understanding of the XGBoost algorithm itself, please refer to the following link, which helped me a great deal while learning:

[1] Principle summary of XGBoost algorithm https://www.cnblogs.com/pinard/p/10979808.html

# 2. Dataset:

The Titanic dataset is one of the most popular competitions on Kaggle. The data is simple and compact, which makes it well suited for beginners to learn and compare different machine learning algorithms.

The dataset contains 11 variables: PassengerID, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. These variables are used to predict whether a passenger survived the sinking of the Titanic.

# 3. Algorithm introduction:

XGBoost is essentially an optimized version of GBDT. GBDT uses a first-order Taylor expansion of the loss function and fits the negative gradient. XGBoost instead uses a second-order Taylor expansion of the loss function and solves in one step for all $J$ leaf node regions $R_{tj}$ of the optimal decision tree together with the optimal value $c_{tj}$ of each leaf node region.
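To make the second-order step concrete, here is a minimal sketch, not XGBoost's actual implementation. It assumes the logistic loss and an L2 penalty `lam` on the leaf values (both are assumptions for illustration, as are the function names), and computes the closed-form value of a single leaf, $c = -G / (H + \lambda)$, where $G$ and $H$ are the sums of the first and second derivatives of the loss over the samples in that leaf.

```python
import numpy as np


def logistic_grad_hess(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score f(x)."""
    p = 1.0 / (1.0 + np.exp(-raw_score))  # sigmoid turns f(x) into a probability
    grad = p - y_true                     # first-order term g_i
    hess = p * (1.0 - p)                  # second-order term h_i
    return grad, hess


def optimal_leaf_value(grad, hess, lam=1.0):
    """Closed-form minimiser of the second-order approximation for one leaf."""
    return -grad.sum() / (hess.sum() + lam)


# Illustrative values only: four samples that fall into the same leaf,
# with the current model predicting f(x) = 0 for all of them.
y = np.array([1.0, 1.0, 1.0, 0.0])
f = np.zeros(4)
g, h = logistic_grad_hess(y, f)
c = optimal_leaf_value(g, h, lam=1.0)
print(c)
```

The leaf value comes out positive because three of the four samples in the leaf are positive, so the update pushes $f(x)$ upward for that region.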

# 4. Hands-on:

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split, cross_validate
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, auc
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils import class_weight

import xgboost as xgb


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select the given DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_name):
        self.attribute_name = attribute_name

    def fit(self, x, y=None):
        return self

    def transform(self, x):
        return x[self.attribute_name].values


# The loading step is missing from the original listing; the file name below
# is an assumption (Kaggle's Titanic training file).
data = pd.read_csv('train.csv')

data_x = data.drop('Survived', axis=1)
data_y = data['Survived']

# Data cleaning
cat_attribs = ['Pclass', 'Sex', 'Embarked']  # categorical
dis_attribs = ['SibSp', 'Parch']             # discrete
con_attribs = ['Age', 'Fare']                # continuous

# encoder: OneHotEncoder(),OrdinalEncoder()
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder()),
])

dis_pipeline = Pipeline([
    ('selector', DataFrameSelector(dis_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='most_frequent')),
])

con_pipeline = Pipeline([
    ('selector', DataFrameSelector(con_attribs)),
    ('scaler', MinMaxScaler()),
    ('imputer', SimpleImputer(strategy='mean')),
])

full_pipeline = FeatureUnion(
    transformer_list=[
        ('con_pipeline', con_pipeline),
        ('dis_pipeline', dis_pipeline),
        ('cat_pipeline', cat_pipeline),
    ]
)

data_x_cleaned = full_pipeline.fit_transform(data_x)

X_train, X_test, y_train, y_test = train_test_split(data_x_cleaned, data_y, stratify=data_y, test_size=0.25, random_state=1992)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

# XGBoost
xgb_cla = xgb.XGBClassifier(use_label_encoder=False, verbosity=1, objective='binary:logistic', random_state=1992)
# Keyword arguments are required by recent scikit-learn versions
cls_wt = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)  # cls_wt[1]/cls_wt[0]

param_grid = [{
    'learning_rate': [0.5],
    'n_estimators': [5],
    'max_depth': [5],
    'min_child_weight': [5],
    'gamma': [5],
    'scale_pos_weight': [1],
    'subsample': [0.8],
}]

grid_search = GridSearchCV(xgb_cla, param_grid=param_grid, cv=cv, scoring='accuracy', n_jobs=-1, return_train_score=True)
grid_search.fit(X_train, y_train)
cv_results_grid_search = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score')

predicted_train_xgb = grid_search.predict(X_train)
predicted_test_xgb = grid_search.predict(X_test)

print('------------XGBoost grid_search: Results of train------------')
print(accuracy_score(y_train, predicted_train_xgb))
print(precision_score(y_train, predicted_train_xgb))
print(recall_score(y_train, predicted_train_xgb))

print('------------XGBoost grid search: Results of test------------')
print(accuracy_score(y_test, predicted_test_xgb))
print(precision_score(y_test, predicted_test_xgb))
print(recall_score(y_test, predicted_test_xgb))
```
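The `cls_wt[1]/cls_wt[0]` comment in the listing hints at how a value for `scale_pos_weight` could be derived from the class weights. The following is a minimal sketch of that idea, assuming the "balanced" heuristic `n_samples / (n_classes * class_count)` that scikit-learn documents for `compute_class_weight`. The helper name `balanced_weights` and the toy labels are hypothetical and are not part of the original code.

```python
import numpy as np


def balanced_weights(y):
    """Per-class weights computed as n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))


# Toy labels (not the Titanic data): 6 negatives, 4 positives.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
w = balanced_weights(y_toy)

# Ratio of positive-class to negative-class weight; equals n_negative / n_positive.
ratio = w[1] / w[0]
print(w, ratio)
```

This ratio equals the number of negative samples divided by the number of positive samples, which is the typical starting value the XGBoost documentation suggests for `scale_pos_weight`.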

# 4.1 Custom eval_metric in XGBoost:

As with AdaBoost and GBDT, we also want to see how the strong learner performs on the training set and the validation set at every boosting step. Because XGBoost has no direct staged_predict method, you need to use the "eval_metric" parameter to specify an evaluation metric; "error" is built in. Here I also wrote an "accuracy" eval metric, which can be compared against "error". However, I ran into a problem: the objective we set earlier is "binary:logistic". According to the official documentation: "binary:logistic: logistic regression for binary classification, output probability". In fact, at least in the version I used, it only outputs a probability when you call the predict method. The so-called "y_predicted" passed into eval_metric is actually $f(x)$, that is, the raw value before any transformation. The definitions inside eval_metric therefore need some care.
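The relationship between the raw value $f(x)$ and the probability can be checked with a short sketch. It assumes the standard sigmoid link used by logistic objectives; the raw scores are illustrative and do not come from the model in this article. Thresholding $f(x)$ at 0 gives the same labels as thresholding the probability at 0.5, which is why a metric that receives raw values should compare against 0.

```python
import numpy as np


def sigmoid(raw_score):
    """Map the raw score f(x) to a probability."""
    return 1.0 / (1.0 + np.exp(-raw_score))


# Illustrative raw scores, not taken from the model in this article.
raw = np.array([-2.0, -0.1, 0.3, 1.5])

labels_from_margin = (raw > 0).astype(int)            # threshold f(x) at 0
labels_from_proba = (sigmoid(raw) > 0.5).astype(int)  # threshold probability at 0.5

print(labels_from_margin, labels_from_proba)
```

If the metric instead received probabilities, comparing against 0 would label every sample as positive, so it is worth printing the minimum and maximum of the predictions to see which scale they are on.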

```
def metric_precision(y_predicted, y_true):
    # y_true is a DMatrix; y_predicted holds raw scores f(x), not probabilities
    label = y_true.get_label()
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    precision = precision_score(label, predicted_binary)
    return 'Precision', precision


def metric_accuracy(y_predicted, y_true):
    label = y_true.get_label()
    # predicted_binary = np.round(y_predicted)
    # y_predicted = 1 / (1 + np.exp(-y_predicted))
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    accuracy = accuracy_score(label, predicted_binary)
    return 'Accuracy', accuracy


def metric_recall(y_predicted, y_true):
    label = y_true.get_label()
    predicted_binary = [1 if y_cont > 0 else 0 for y_cont in y_predicted]

    print('max_y_predicted:%f' % max(y_predicted))
    print('min_y_predicted:%f' % min(y_predicted))

    recall = recall_score(label, predicted_binary)
    return 'Recall', recall


xgb_cla2 = grid_search.best_estimator_
temp2 = xgb_cla2.predict(X_train)
temp2_pro = xgb_cla2.predict_proba(X_train)

# Note: passing eval_metric to fit() is the older scikit-learn-style XGBoost API;
# later releases deprecate it in favour of the constructor argument.

# Built-in metric
xgb_cla2.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric=['error'])

# Custom metric; this second fit replaces the results of the first one
xgb_cla2.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)], eval_metric=metric_accuracy)

# validation_0 is the first entry of eval_set (train), validation_1 the second (test)
plt.figure()
plt.plot(xgb_cla2.evals_result()['validation_1']['Accuracy'], label='test')
plt.plot(xgb_cla2.evals_result()['validation_0']['Accuracy'], label='train')

plt.legend()
plt.grid(axis='both')
plt.show()
```

On the training set, grid search finds that n_estimators (sklearn API) / num_boost_round (learning API) = 5 is optimal. On the test set the optimum is also reached at 5: beyond that point test_accuracy no longer improves, while train_accuracy keeps rising, which means the model gradually overfits.
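Reading the best round off the plot can also be done programmatically. The sketch below is only an illustration: `best_round` is a hypothetical helper, and the accuracy values are made up, not the results reported in this article. In practice the list would come from something like `evals_result()['validation_1']['Accuracy']`.

```python
def best_round(scores):
    """Return (1-based round number, score) of the first maximum in a per-round score list."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return best_idx + 1, scores[best_idx]


# Made-up test accuracy per boosting round, for illustration only.
test_accuracy = [0.78, 0.80, 0.81, 0.81, 0.82, 0.82, 0.81, 0.82]
round_no, score = best_round(test_accuracy)
print(round_no, score)
```

Taking the first maximum rather than the last favours the smaller model when several rounds tie, which is in line with stopping before the training accuracy starts to pull away.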

# 4.2 Result analysis:

As in the previous articles, the optimal XGBoost parameters are applied to the prediction set and the results are uploaded to Kaggle. The results are as follows (note: "training set" here means the entire training set, that is, both train and test in the code):

It can be seen that XGBoost performs similarly to AdaBoost and GBDT, but slightly worse than RF (random forest).

| Model | Training set accuracy | Training set precision | Training set recall | Prediction set accuracy (from Kaggle upload) |
| --- | --- | --- | --- | --- |
| Naive Bayes (optimal solution) | 0.790 | 0.731 | 0.716 | 0.756 |
| Perceptron | 0.771 | 0.694 | 0.722 | 0.722 |
| Logistic regression | 0.807 | 0.781 | 0.690 | 0.768 |
| Linear SVM | 0.801 | 0.772 | 0.684 | 0.773 |
| RBF kernel SVM | 0.834 | 0.817 | 0.731 | 0.785 |
| AdaBoost | 0.844 | 0.814 | 0.769 | 0.789 |
| GBDT | 0.843 | 0.877 | 0.687 | 0.778 |
| RF | 0.820 | 0.917 | 0.585 | 0.792 |
| XGBoost | 0.831 | 0.847 | 0.681 | 0.780 |

Posted by Omid on Fri, 03 Jun 2022 11:18:31 +0530