Ensemble Learning
The main approaches to ensemble learning:
- bagging: train multiple classifiers independently on bootstrap samples and average (or vote on) their predictions
- boosting: start from weak learners and train them sequentially, re-weighting the samples at each round
- stacking: aggregate multiple classification or regression models by training a meta-model on their outputs (see the sketch after this list)
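Bagging and boosting both get code examples below, but stacking does not, so here is a minimal sketch using scikit-learn's StackingClassifier; the choice of base estimators and of a logistic-regression meta-model is an illustrative assumption, not something prescribed by these notes:

from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Level-0 models whose predictions become the input features of a level-1 meta-model
estimators = [('rf', RandomForestClassifier(n_estimators=50, random_state=1)),
              ('svc', SVC(probability=True, random_state=1))]
model = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(f"StackingClassifier accuracy: {accuracy_score(y_test, y_pred):.3f}")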
Bagging
Note:
Bagging requires sub-models, and the accuracy of the sub-models must not be lower than random guessing; otherwise averaging their predictions does not help the ensemble.
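The bagging snippet below assumes the data have already been split and a base decision tree has already been trained. A minimal setup along those lines could look as follows (the iris dataset, the 70/30 split and the depth-1 tree are illustrative assumptions, not part of the original notes):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative data: any classification dataset split into train/test works the same way
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Base learner reused by the ensembles below: a shallow ("weak") decision tree
base_model = DecisionTreeClassifier(max_depth=1, criterion='gini', random_state=1)
base_model.fit(X_train, y_train)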
# call bagging
from sklearn.ensemble import BaggingClassifier

# Build a bagging classifier: each base model is the decision tree trained above,
# and the number of base learners is 50
model = BaggingClassifier(base_estimator=base_model, n_estimators=50, random_state=1)
model.fit(X_train, y_train)   # train
y_pred = model.predict(X_test)  # predict
print(f"BaggingClassifier accuracy: {accuracy_score(y_test, y_pred):.3f}")
base_estimator : object or None, optional (default=None)
The base estimator to fit on random subsets of the dataset. If None, the base estimator is a decision tree. (In scikit-learn >= 1.2 this parameter is renamed to estimator.)
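For example, leaving base_estimator at its default bags plain decision trees; a minimal sketch reusing the X_train/y_train split assumed above (the variable name default_bagging is just illustrative):

# base_estimator=None, so each of the 50 ensemble members is a default decision tree
default_bagging = BaggingClassifier(n_estimators=50, random_state=1)
default_bagging.fit(X_train, y_train)
y_pred = default_bagging.predict(X_test)
print(f"Default-base BaggingClassifier accuracy: {accuracy_score(y_test, y_pred):.3f}")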
Random Forest
Random forest is a variant of bagging in which the base learner is fixed to be a decision tree, and random feature (attribute) selection is added during training.
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(X_train, y_train)   # train
y_pred = model.predict(X_test)  # predict
print(f"RandomForestClassifier accuracy: {accuracy_score(y_test, y_pred):.3f}")
AdaBoost
The algorithm is adaptive: samples misclassified by the previous base classifier have their weights increased, and the re-weighted training set is used to train the next base classifier. A new weak classifier is added in each round until a sufficiently small error rate is reached or a pre-specified maximum number of iterations is hit.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Define the model: the base learner is a depth-1 decision tree,
# the maximum number of weak classifiers is 50 and the learning rate is 0.8
base_model = DecisionTreeClassifier(max_depth=1, criterion='gini', random_state=1).fit(X_train, y_train)
model = AdaBoostClassifier(base_estimator=base_model, n_estimators=50, learning_rate=0.8)
# train
model.fit(X_train, y_train)
# predict
y_pred = model.predict(X_test)
acc = metrics.accuracy_score(y_test, y_pred)  # accuracy
print(f"Accuracy: {acc:.2f}")
Automatic parameter tuning with GridSearchCV
GridSearchCV automatically searches the hyperparameters within the specified ranges and returns the best score and the corresponding parameter combination. Compared with manual tuning it saves time and effort, and it is more concise, more flexible and less error-prone than writing the search as nested for loops.
from sklearn.model_selection import GridSearchCV

hyperparameter_space = {'n_estimators': list(range(2, 102, 2)),
                        'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]}

# Use accuracy as the scoring metric and output the parameters with the highest accuracy.
# cv=5 selects five-fold cross-validation; n_jobs=-1 runs the search in parallel on all CPU cores.
gs = GridSearchCV(AdaBoostClassifier(algorithm='SAMME.R', random_state=1),
                  param_grid=hyperparameter_space,
                  scoring="accuracy", n_jobs=-1, cv=5)
gs.fit(X_train, y_train)
print("Optimal hyperparameters:", gs.best_params_)
The common methods and properties of GridSearchCV are as follows:
grid.fit(): run a grid search (required)
grid.score(): score of the best model after running the grid search (optional)
grid_scores_: gives the evaluation results under the different parameters (replaced by cv_results_ in scikit-learn 0.20+)
best_params_: Describes the combination of parameters that have yielded the best results
best_score_: Provides the best score observed during the optimization process
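As a short usage sketch, the methods and attributes above can be inspected on the fitted gs object from the grid-search snippet (X_test and y_test come from the train/test split assumed earlier):

print("Best parameters:", gs.best_params_)                 # best parameter combination
print(f"Best CV accuracy: {gs.best_score_:.3f}")           # best cross-validation score
print(f"Test accuracy: {gs.score(X_test, y_test):.3f}")    # score of the refitted best model
# Per-candidate results live in cv_results_ (a dict of arrays)
print(gs.cv_results_['mean_test_score'][:5])               # mean CV score of the first few candidates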
Reference blog:
https://blog.csdn.net/MsSpark/article/details/84495949