Ensemble learning study notes

Ensemble learning

The three main ensemble strategies are:

  1. bagging: train multiple classifiers independently and average (or vote on) their predictions
  2. boosting: start from weak learners and train them sequentially, re-weighting the samples at each step
  3. stacking: aggregate multiple classification or regression models through a meta-learner (see the sketch below)
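
Stacking is not shown elsewhere in these notes, so here is a minimal sketch using sklearn's StackingClassifier; the choice of base models and the logistic-regression meta-learner are illustrative assumptions, and X_train/X_test are the same splits used in the examples below.

# Minimal stacking sketch: base model predictions are combined by a meta-learner
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

stack = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=1, random_state=1)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=1))],
    final_estimator=LogisticRegression(),
    cv=5)
stack.fit(X_train, y_train)
print(f"StackingClassifier accuracy: {stack.score(X_test, y_test):.3f}")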

Bagging

Notice:
Bagging requires base models (sub-models), and the accuracy of each sub-model must be better than random guessing.

# call bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Build a bagging classifier: each base model is the decision tree trained previously,
# and the maximum number of base learners is 50
base_model = DecisionTreeClassifier(max_depth=1, criterion='gini', random_state=1)
model = BaggingClassifier(base_estimator=base_model,
                          n_estimators=50,
                          random_state=1)
model.fit(X_train, y_train)     # train
y_pred = model.predict(X_test)  # predict
print(f"BaggingClassifier accuracy: {accuracy_score(y_test, y_pred):.3f}")

base_estimator : object or None, optional (default=None)
The base estimator to fit on random subsets of the dataset. If None, the base estimator is a decision tree.
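
Any classifier can be plugged in as base_estimator; as a hedged illustration, the sketch below bags k-nearest-neighbour models instead of decision trees (KNeighborsClassifier and n_neighbors=5 are assumptions, not part of the original notes).

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Bagging over k-NN base models instead of the default decision tree
knn_bagging = BaggingClassifier(base_estimator=KNeighborsClassifier(n_neighbors=5),
                                n_estimators=50,
                                random_state=1)
knn_bagging.fit(X_train, y_train)
print(f"Bagging with k-NN accuracy: {knn_bagging.score(X_test, y_test):.3f}")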

Random forest

Random forest is a variant of bagging: the base learner is fixed to be a decision tree, and random attribute (feature) selection is added during training.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(n_estimators=50,
                               random_state=1)
model.fit(X_train, y_train)     # train
y_pred = model.predict(X_test)  # predict
print(f"RandomForestClassifier accuracy: {accuracy_score(y_test, y_pred):.3f}")

AdaBoost

The algorithm is adaptive: samples misclassified by the previous base classifier are given higher weights, and the re-weighted samples are used to train the next base classifier. A new weak classifier is added in each round until the error rate falls below a predetermined threshold or a pre-specified maximum number of iterations is reached.
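
To make the re-weighting concrete, here is a minimal NumPy sketch of one round of the classic binary AdaBoost weight update; the toy labels and predictions are assumptions for illustration, not sklearn's implementation.

import numpy as np

# One round of the classic binary AdaBoost weight update (labels in {-1, +1})
y = np.array([1, 1, -1, -1, 1])        # true labels (toy data)
pred = np.array([1, -1, -1, 1, 1])     # the weak learner misclassifies samples 2 and 4
w = np.full(len(y), 1 / len(y))        # initial uniform sample weights

err = w[pred != y].sum()               # weighted error of the weak learner
alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this weak learner
w = w * np.exp(-alpha * y * pred)      # misclassified samples get larger weights
w = w / w.sum()                        # renormalise; the next learner trains on w
print(w)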

# Define the model: the maximum number of weak learners is 50 and the learning rate is 0.8
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics

base_model = DecisionTreeClassifier(max_depth=1, criterion='gini', random_state=1)
model = AdaBoostClassifier(base_estimator=base_model, n_estimators=50, learning_rate=0.8)
# train
model.fit(X_train, y_train)
# predict
y_pred = model.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred)  # accuracy
print(f"Accuracy: {acc:.2f}")

Automatic parameter tuning with GridSearchCV

GridSearchCV automatically searches the parameters within a specified range and returns the optimised result together with the corresponding parameters. Compared with manual tuning it saves time and effort, and it is more concise, more flexible, and less error-prone than writing nested for loops.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier

hyperparameter_space = {'n_estimators': list(range(2, 102, 2)),
                        'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]}

# Use accuracy as the scoring metric and output the parameters with the highest accuracy.
# cv=5 means five-fold cross-validation; n_jobs=-1 means running in parallel on all CPU cores.
gs = GridSearchCV(AdaBoostClassifier(algorithm='SAMME.R',
                                     random_state=1),
                  param_grid=hyperparameter_space,
                  scoring="accuracy", n_jobs=-1, cv=5)

gs.fit(X_train, y_train)
print("optimal hyperparameters:", gs.best_params_)

The common methods and attributes of GridSearchCV are as follows (a short usage sketch follows the list):
grid.fit(): run the grid search (required)
grid.score(): score of the refitted best model after the grid search
cv_results_: gives the evaluation results for every parameter combination
best_params_: the parameter combination that gave the best result
best_score_: the best cross-validated score observed during the search
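
As a small usage sketch assuming the gs object fitted above (the pandas import is an assumption added only for inspecting cv_results_):

print("best score:", gs.best_score_)                # best mean cross-validated accuracy
print("best params:", gs.best_params_)              # hyperparameters that achieved it
print("test accuracy:", gs.score(X_test, y_test))   # refitted best model scored on the test set
# cv_results_ holds the evaluation results for every parameter combination
import pandas as pd
results = pd.DataFrame(gs.cv_results_)
print(results[['param_n_estimators', 'param_learning_rate', 'mean_test_score']].head())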

Reference blog:
https://blog.csdn.net/MsSpark/article/details/84495949
