Machine learning -- breast cancer random forest grid search

Contents

1, GridSearchCV overview

2, Important parameters

3, Breast cancer dataset grid search

1. Import library

2. Find the optimal n_estimators parameter through repeated cross-validation

3. Grid search tuning

4. Adjust max_features parameter

1, GridSearchCV overview

The purpose of GridSearchCV is to tune parameters automatically: given several candidate values for each parameter, it reports the best accuracy (acc) and the parameter values that produced it. Because it tries every combination of candidates, this method is suited to small datasets. When the amount of data is large, a faster, coordinate-descent style of tuning is needed. This is in effect a greedy algorithm: the parameter with the greatest influence is optimized first; with that value fixed, the parameter with the next-greatest influence is tuned, and so on. The drawback is that this procedure may settle on a local optimum rather than the global one.
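
The greedy, one-parameter-at-a-time procedure described above can be sketched as follows. This is only a minimal illustration: the tuning order, the candidate values, the small forests, and cv=3 are assumptions chosen to keep the run short, not settings taken from this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical tuning order and candidate values, for illustration only.
search_order = [
    ("n_estimators", [10, 30, 50]),
    ("max_depth", [3, 5, None]),
    ("max_features", [3, 5, 7]),
]

best_params = {}   # parameters fixed so far
best_score = None  # mean CV accuracy of the most recent step

for name, candidates in search_order:
    # Earlier winners are fixed on the estimator; only `name` is searched.
    base = RandomForestClassifier(random_state=90, **best_params)
    gs = GridSearchCV(base, {name: candidates}, cv=3)
    gs.fit(X, y)
    best_params[name] = gs.best_params_[name]
    best_score = gs.best_score_
    print(name, "->", best_params[name], "score:", round(best_score, 4))
```

With these candidates the greedy search evaluates 3 + 3 + 3 = 9 settings, whereas a single joint grid over the same values would evaluate 3 * 3 * 3 = 27. The saving comes at the price mentioned above: combinations that are never visited cannot be found.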

2, Important parameters

sklearn.model_selection._search.GridSearchCV def __init__(self,
             estimator: Any,
             param_grid: Union[dict, list],
             *,
             scoring: Any = None,
             n_jobs: Any = None,
             refit: Any = True,
             cv: Any = None,
             verbose: int = 0,
             pre_dispatch: Any = "2*n_jobs",
             error_score: Any = np.nan,
             return_train_score: Any = False) -> None

estimator: The selected classifier. Any parameters that are not being searched are passed to it directly.
param_grid: The parameters to optimize, given as a dictionary or a list of dictionaries. Each key is a parameter name and each value is the list of candidate values to try.
n_jobs: Number of jobs to run in parallel. The default is None, which means 1 (unless a joblib parallel backend context is active). A value of -1 uses all CPU cores.
cv: Cross-validation setting. The default is None, which uses 5-fold cross-validation; an integer sets the number of folds.
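
As a rough illustration of how these four arguments fit together, the sketch below passes a list of dictionaries as param_grid. The grid values, the small n_estimators, and scoring="accuracy" are assumptions made for the example; they are not the settings used later in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Parameters that are NOT being searched are fixed on the estimator itself.
estimator = RandomForestClassifier(n_estimators=20, random_state=90)

# A list of dicts defines separate grids; each one is expanded on its own.
param_grid = [
    {"max_depth": [3, 5]},                                       # 2 candidates
    {"criterion": ["gini", "entropy"], "max_features": [3, 5]},  # 2 * 2 = 4 candidates
]

# n_jobs=1 keeps the sketch single-process; -1 would use all CPU cores.
gs = GridSearchCV(estimator, param_grid, scoring="accuracy", cv=5, n_jobs=1)
gs.fit(X, y)

n_candidates = len(gs.cv_results_["params"])
print(n_candidates, gs.best_params_, round(gs.best_score_, 4))
```

Because refit defaults to True, the fitted search object also exposes best_estimator_, a model retrained on the full data with the winning parameters.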

3, Breast cancer dataset grid search

1. Import library

from sklearn.datasets import load_breast_cancer         #Import the breast cancer database
from sklearn.ensemble import RandomForestClassifier     #Import random forest Library
from sklearn.model_selection import cross_val_score     #Cross validation
from sklearn.model_selection import GridSearchCV        #Grid search
import matplotlib.pyplot as plt
import numpy as np                                      #Needed for np.arange in the parameter grids
import pandas as pd

2. Find the optimal n_estimators parameter through repeated cross-validation

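cancer=load_breast_cancer()             #Load the dataset used by the loop below

score_l=[]
for i in range(0,500,10):               #n_estimators = 1, 11, 21, ..., 491
    rfc=RandomForestClassifier(n_estimators=i+1
                               ,n_jobs=-1
                               ,random_state=90)
    score=cross_val_score(rfc,cancer.data,cancer.target,cv=10).mean()
    score_l.append(score)

#Best mean score and the n_estimators value that produced it
print(max(score_l),score_l.index(max(score_l))*10+1)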
plt.figure(figsize=[20,5])
plt.plot(range(1,501,10),score_l)
plt.show()

In this test, the optimal n_estimators value is 371, with an acc value of about 0.96488. The next tuning step builds on this value and continues through the remaining parameters in the tuning order.
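
The scan above recovers the best n_estimators from a list index with index*10+1. One alternative, sketched here under assumptions, is to keep the candidate values in an explicit list so that the index maps straight back to the parameter. The shortened range (1 to 61) and cv=5 are illustrative choices to keep the run quick, so the result will not match the 371 reported here.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Shortened, illustrative candidate list: 1, 21, 41, 61.
candidates = list(range(1, 62, 20))

# Mean cross-validated accuracy for each candidate forest size.
scores = [
    cross_val_score(
        RandomForestClassifier(n_estimators=n, random_state=90), X, y, cv=5
    ).mean()
    for n in candidates
]

best_idx = int(np.argmax(scores))  # position of the highest mean score
best_n = candidates[best_idx]      # no index arithmetic needed
print(best_n, scores[best_idx])
```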

3. Grid search tuning

cancer=load_breast_cancer()

param_grid={'max_depth':np.arange(1,20,1)}
rfc=RandomForestClassifier(n_estimators=371
                           ,random_state=90
                           ,n_jobs=-1)
GS=GridSearchCV(rfc,param_grid,cv=10)
GS.fit(cancer.data,cancer.target)
print(GS.best_params_)            #Output the optimal value of the parameter
print(GS.best_score_)             #Output the acc value when the parameter is optimal

Tuning max_depth, min_samples_split, and min_samples_leaf one at a time leaves the acc value unchanged, so further tuning of these parameters will evidently not change it. These parameters can only reduce model complexity, so the model appears to sit on the left side of the optimal model complexity. The remaining option is to adjust max_features, the only parameter that can move the model's complexity either to the left or to the right, to try to raise the acc value.
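
One way to check whether a parameter actually moves the score is to look at every candidate in cv_results_ rather than only best_score_. The following is a minimal sketch; the small forest, the four max_depth values, and cv=5 are assumptions chosen for speed, not the post's settings.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

gs = GridSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=90),
    {"max_depth": [2, 4, 8, 16]},  # illustrative candidates
    cv=5,
)
gs.fit(X, y)

# One row per candidate: mean and standard deviation of the fold scores.
results = pd.DataFrame(gs.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
]
# Gap between the best and the worst candidate's mean score.
spread = results["mean_test_score"].max() - results["mean_test_score"].min()

print(results.sort_values("rank_test_score"))
print("spread:", round(spread, 4))
```

A spread that is small compared with std_test_score suggests the parameter has little effect over the range tried, which is the situation described above.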

4. Adjust max_features parameter

param_grid={'max_features':np.arange(1,20,1)}
rfc=RandomForestClassifier(n_estimators=371
                           ,max_depth=7
                           ,random_state=90
                           ,n_jobs=-1)
GS=GridSearchCV(rfc,param_grid,cv=10)
GS.fit(cancer.data,cancer.target)
print(GS.best_params_)
print(GS.best_score_)

Testing gives max_features = 7 with an acc value of 0.96667, so a better parameter setting has been found.
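
To see how much the tuning helped, the default forest and the tuned one can be scored with the same cross-validation routine. This is a hedged sketch: mean_cv_accuracy is a hypothetical helper introduced for the example, and the defaults n_estimators=50 and cv=5 are assumptions that keep it fast. Passing n_estimators=371 and cv=10 would mirror the settings used in this post.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def mean_cv_accuracy(n_estimators=50, cv=5, **params):
    """Hypothetical helper: mean cross-validated accuracy of a random forest."""
    model = RandomForestClassifier(
        n_estimators=n_estimators, random_state=90, **params
    )
    return cross_val_score(model, X, y, cv=cv).mean()

baseline = mean_cv_accuracy()                          # default max_depth / max_features
tuned = mean_cv_accuracy(max_depth=7, max_features=7)  # values used in the post

print("baseline:", round(baseline, 5))
print("tuned:   ", round(tuned, 5))
```

Since the same data is used both to choose the parameters and to score them, the tuned figure is somewhat optimistic; a held-out test set would give a fairer estimate.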

Tags: Algorithm Python AI Machine Learning

Posted by elangsru on Mon, 29 Aug 2022 02:38:16 +0530