catalogue
1, GridSearchCV overview
2, Important parameters
3, Breast cancer dataset grid search
1. Import library
2. Find the optimal n_estimators parameter through multiple cross validation
3. Grid search tuning
4. Adjust max_features parameter
1, GridSearchCV overview
The purpose of GridSearchCV is to automate parameter tuning: given several candidate values for each parameter, it exhaustively evaluates every combination and reports the parameter values that achieve the best acc score. This exhaustive search is only practical for small datasets. When the amount of data is large, a faster coordinate-descent style of tuning is needed. It is essentially a greedy algorithm: the single most influential parameter is optimized first; with that parameter fixed at its best value, the second most influential parameter is then tuned, and so on. The drawback of this greedy procedure is that it may settle on a local rather than a global optimum.
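The greedy, one-parameter-at-a-time idea can be sketched as follows. This is a minimal illustration on the breast cancer dataset; the parameter order and the small grids are chosen for speed and are my own assumptions, not values from the text:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

cancer = load_breast_cancer()

# Parameters ordered by assumed influence; each round fixes the best value
# found so far before tuning the next parameter (greedy coordinate descent).
search_order = [
    ('n_estimators', [10, 50, 100]),
    ('max_depth', [3, 5, 7]),
    ('max_features', [3, 5, 7]),
]

best_params = {}
for name, grid in search_order:
    rfc = RandomForestClassifier(random_state=90, n_jobs=-1, **best_params)
    GS = GridSearchCV(rfc, {name: grid}, cv=3)
    GS.fit(cancer.data, cancer.target)
    best_params[name] = GS.best_params_[name]
    print(name, GS.best_params_, GS.best_score_)

print(best_params)  # one value fixed per round; may be a local optimum
```

Because each round searches only one parameter, the total number of fits grows linearly with the number of parameters instead of multiplicatively, which is exactly why this is faster than a full grid over all combinations.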
2, Important parameters
sklearn.model_selection.GridSearchCV

def __init__(self,
             estimator,
             param_grid,
             *,
             scoring=None,
             n_jobs=None,
             refit=True,
             cv=None,
             verbose=0,
             pre_dispatch="2*n_jobs",
             error_score=np.nan,
             return_train_score=False)
estimator | The classifier to tune; pass in fixed values for all parameters other than the ones being searched |
param_grid | The parameters to optimize, as a dict or a list of dicts; keys are parameter names and values are the ranges to search |
n_jobs | Number of CPU cores used in parallel; the default None is equivalent to 1, and -1 uses all CPU cores |
cv | Cross-validation strategy; the default None means 5-fold cross validation |
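As a minimal example of how these parameters fit together (the grid values and the fixed n_estimators here are arbitrary choices for illustration, not tuned values):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

cancer = load_breast_cancer()

# estimator: the classifier, with the non-searched parameters fixed
rfc = RandomForestClassifier(n_estimators=50, random_state=90)

# param_grid: dict mapping parameter name -> candidate values
param_grid = {'max_depth': [3, 5, 7]}

# n_jobs=-1 uses all CPU cores; cv=5 means 5-fold cross validation
GS = GridSearchCV(rfc, param_grid, n_jobs=-1, cv=5)
GS.fit(cancer.data, cancer.target)
print(GS.best_params_, GS.best_score_)
```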
3, Breast cancer dataset grid search
1. Import library
from sklearn.datasets import load_breast_cancer    # breast cancer dataset
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.model_selection import cross_val_score  # cross validation
from sklearn.model_selection import GridSearchCV     # grid search
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np  # needed below for np.arange
2. Find the optimal n_estimators parameter through multiple cross validation
cancer = load_breast_cancer()

score_l = []
for i in range(0, 500, 10):
    rfc = RandomForestClassifier(n_estimators=i+1, n_jobs=-1, random_state=90)
    score = cross_val_score(rfc, cancer.data, cancer.target, cv=10).mean()
    score_l.append(score)
print(max(score_l), score_l.index(max(score_l))*10 + 1)
plt.figure(figsize=[20, 5])
plt.plot(range(1, 501, 10), score_l)
plt.show()
In this test the optimal n_estimators value is 371, with an acc value of about 0.96488. With this optimal value fixed, the remaining parameters are then tuned one by one in order of influence.
3. Grid search tuning
cancer = load_breast_cancer()
param_grid = {'max_depth': np.arange(1, 20, 1)}
rfc = RandomForestClassifier(n_estimators=371,
                             random_state=90,
                             n_jobs=-1)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(cancer.data, cancer.target)
print(GS.best_params_)  # best value of the searched parameter
print(GS.best_score_)   # acc value at that best parameter value
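Beyond best_params_ and best_score_, the fitted search object also exposes the refit model and the per-candidate scores, which are useful for inspecting a search like the one above. A small sketch, using a lighter grid than the one in the text so it runs quickly:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

cancer = load_breast_cancer()
rfc = RandomForestClassifier(n_estimators=50, random_state=90, n_jobs=-1)
GS = GridSearchCV(rfc, {'max_depth': [3, 5, 7]}, cv=5)
GS.fit(cancer.data, cancer.target)

# best_estimator_ is the model refit on the full data with the best
# parameters (available because refit=True by default)
print(GS.best_estimator_.get_params()['max_depth'])

# cv_results_ holds the mean test score of every candidate in the grid
results = pd.DataFrame(GS.cv_results_)
print(results[['param_max_depth', 'mean_test_score']])
```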
Tuning max_depth, min_samples_split, and min_samples_leaf one by one in this way leaves the acc value unchanged, and clearly further tuning along these lines will not change it. This suggests the model sits on the left side of the optimal complexity, which can only be tested by adjusting max_features, the one parameter that can move model complexity in either direction and therefore still raise the acc value.
4. Adjust max_features parameter
param_grid = {'max_features': np.arange(1, 20, 1)}
rfc = RandomForestClassifier(n_estimators=371,
                             max_depth=7,
                             random_state=90,
                             n_jobs=-1)
GS = GridSearchCV(rfc, param_grid, cv=10)
GS.fit(cancer.data, cancer.target)
print(GS.best_params_)
print(GS.best_score_)
The test yields max_features = 7 with an acc value of 0.96667, an improvement over the previous best.
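Putting the tuned values together, a quick sanity-check sketch of the final model; the parameter values come from the searches above, and the exact score may vary slightly with the scikit-learn version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

cancer = load_breast_cancer()

# Final model assembled from the tuned parameters above
rfc = RandomForestClassifier(n_estimators=371,
                             max_depth=7,
                             max_features=7,
                             random_state=90,
                             n_jobs=-1)
score = cross_val_score(rfc, cancer.data, cancer.target, cv=10).mean()
print(score)  # reported as about 0.96667 in the runs above
```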