1, Data cleaning
A note up front: the columns in the dataset are as follows.
preg: number of times pregnant
plas: plasma glucose concentration at 2 hours in an oral glucose tolerance test
pres: diastolic blood pressure (mm Hg)
skin: triceps skin fold thickness (mm)
insu: 2-hour serum insulin (mu U/ml)
mass: body mass index (weight in kg / (height in m)^2)
pedi: diabetes pedigree function
age: age (years)
class: class variable (0 or 1)
1. Data import
As before, we use the following function to import the data:
diabetes_data = pd.read_csv(r"C:\Users\86137\PycharmProjects\pythonProject\venv\Diabetes test\diabetes.csv")
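The code in this post assumes the usual imports, which the original does not show; a minimal sketch (the aliases pd, np, sns, plt and the Counter import are assumptions based on how they are used below):

# Assumed imports for the rest of the post (not shown in the original)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter  # used by the outlier-detection function below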
2. Data view
Step 1
diabetes_data.head()  # View the first five rows of data
Step 2
diabetes_data  # View all the data
As shown in the output, there are 768 rows of data.
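To confirm the row count without printing the whole table, a quick check can be used (a small sketch, not part of the original post):

print(diabetes_data.shape)  # expected to print (768, 9): 768 rows and 9 columns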
Step 3
diabetes_data.info()  # View dtypes and non-null counts to check for missing values
Since a value of 0 is not plausible for these measurements, we treat 0 as a missing value.
# Set all 0 values to NaN: work on a deep copy so the original data is unchanged
diabetes_data_copy = diabetes_data.copy(deep = True)
diabetes_data_copy[['plas','pres','skin','insu','mass']] = diabetes_data_copy[['plas','pres','skin','insu','mass']].replace(0, np.NaN)
print(diabetes_data_copy.isnull().sum())
The following results are obtained
From the counts we can see that plas and mass are missing only a few values, pres is missing a moderate number, while skin and insu are missing far too many values.
Step 4
diabetes_data.describe()  # View summary statistics for each column
3. Outlier detection
We reuse the same outlier-detection code as in the Titanic example.
# Outlier detection (Tukey method)
def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers
    according to the Tukey method.
    """
    outlier_indices = []
    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        # outlier step
        outlier_step = 1.5 * IQR
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)
    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)
    return multiple_outliers
Outliers_to_drop = detect_outliers(diabetes_data, 4, ["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age"])
print(diabetes_data.loc[Outliers_to_drop])  # Show the outlier rows
# A row is only a candidate for deletion if more than four of its features are outliers
For missing-value handling, see the CSDN blog post "3000-word explanation of four commonly used missing value processing methods" from the "One line playing python" blog.
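For reference, a minimal sketch of four common missing-value strategies applied to the NaN copy created above (illustrative only; whether these match the four methods in the linked post is not confirmed):

# 1. Drop rows that contain missing values
dropped = diabetes_data_copy.dropna()
# 2. Fill missing values with a statistic such as the median (or mean / mode)
filled_median = diabetes_data_copy.fillna(diabetes_data_copy.median())
# 3. Fill missing values by propagating the previous valid value forward
filled_ffill = diabetes_data_copy.fillna(method='ffill')
# 4. Interpolate missing values from neighbouring rows
interpolated = diabetes_data_copy.interpolate()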
4. Converting the non-numeric column to numeric
# Create a new OutCome column: map b'tested_positive' to 1 and b'tested_negative' to 0, then drop the original class column
diabetes_data['OutCome'] = diabetes_data['class'].map({"b'tested_positive'": 1, "b'tested_negative'": 0})
diabetes_data.drop(labels = ["class"], axis = 1, inplace = True)
Then we look at the data types
print("Show sample data types")
diabetes_data.dtypes
Result: all columns are now of numeric type.
Let's look at the first five rows of data again.
diabetes_data.head()
5. Analyze the feature columns
(1) View the correlation matrix
print("Correlation matrix")
# Correlation matrix between the numerical features and OutCome
g = sns.heatmap(diabetes_data[["preg", "plas", "pres", "skin", "insu", "mass", "pedi", "age", "OutCome"]].corr(),
                annot=True, fmt=".2f", cmap="coolwarm")
plt.show()
Analysis: the heatmap shows that plas, mass, age and preg are the features most closely related to OutCome. Features with more missing values tend to have smaller correlation coefficients.
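The same comparison can be read off numerically by sorting the correlations with OutCome (a small sketch, not part of the original analysis):

# Correlation of each feature with OutCome, strongest first
print(diabetes_data.corr()["OutCome"].drop("OutCome").sort_values(ascending=False))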
(2) Analyze the relationship between plas and OutCome
According to the missing-value counts above, plas has only 5 missing values, so we consider replacing them with the median.
# fillna returns a new Series, so assign the result back (it does not modify the column in place)
diabetes_data['plas'] = diabetes_data['plas'].fillna(value = diabetes_data['plas'].median())
If a bar chart is used, the image is as follows:
# Bar plot: explore the plas feature vs OutCome
print("Relationship between plas and OutCome")
g = sns.catplot(x="plas", y="OutCome", data=diabetes_data, kind="bar", height=6, palette="muted")
g.despine(left=True)
g = g.set_ylabels("OutCome")
plt.show()
Because plas takes many distinct values, the bars are too crowded to read, so we use a histogram instead.
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "plas")
plt.show()
Analysis: the chart shows that plas for positive diabetes tests is concentrated around 20-40, while plas for negative tests falls in the 0-60 range, suggesting that plas has a large impact on the result.
(3) Analyze the relationship between mass and OutCome
According to the missing-value counts above, mass has only 11 missing values, so we again replace them with the median.
print("Relationship between mass and OutCome")
# Fill the missing mass values with the median and assign the result back
diabetes_data['mass'] = diabetes_data['mass'].fillna(value = diabetes_data['mass'].median())
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "mass")
plt.show()
Analysis: mass values for positive diabetes tests are lower than for negative tests and do not exceed 60.
(4) Analyze the relationship between age and OutCome
print("Relationship between age and OutCome")
# Explore age vs OutCome
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "age")
plt.show()
Analysis: positive diabetes tests are spread fairly evenly across ages 20-60, while tests in the 60-80 age range are negative.
(5) Analyze the relationship between preg and OutCome
print("Relationship between preg and OutCome")
# Explore the preg feature vs OutCome
g = sns.catplot(x="preg", y="OutCome", data=diabetes_data, kind="bar", height=6, palette="muted")
g.despine(left=True)
g = g.set_ylabels("OutCome")
plt.show()
Analysis: the higher the preg, the higher the possibility of positive diabetes detection.
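This can be checked numerically by computing the share of positive tests for each preg value (a sketch that is not in the original post):

# Mean of OutCome per number of pregnancies = proportion of positive tests
print(diabetes_data.groupby("preg")["OutCome"].mean())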
(6) Analyze the distribution of age
print("age distribution")
# Explore the age distribution for each outcome
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 0) & (diabetes_data["age"].notnull())], color="Red", shade=True)
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 1) & (diabetes_data["age"].notnull())], ax=g, color="Blue", shade=True)
g.set_xlabel("age")
g.set_ylabel("Frequency")
g = g.legend(["0", "1"])
plt.show()
Analysis: people aged 30-60 are more likely to test positive for diabetes.
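A numeric cross-check of the age effect can be made with age bands (a sketch; the band edges are an arbitrary choice):

# Proportion of positive tests per age band (band edges chosen only for illustration)
age_bands = pd.cut(diabetes_data["age"], bins=[20, 30, 40, 50, 60, 70, 90])
print(diabetes_data.groupby(age_bands)["OutCome"].mean())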
(7) Analyze the relationship between age and plas, mass and preg
print("Relationship between age and plas, mass, preg")
# Explore age vs plas, mass and preg
g = sns.catplot(y="age", x="plas", hue="OutCome", data=diabetes_data, kind="box")
g = sns.catplot(y="age", x="mass", hue="OutCome", data=diabetes_data, kind="box")
g = sns.catplot(y="age", x="preg", hue="OutCome", data=diabetes_data, kind="box")
plt.show()
2, Model training
Cross validate models
# Cross validate models with stratified 10-fold cross validation
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

kfold = StratifiedKFold(n_splits=10)

# Modeling step: test different algorithms
random_state = 2
classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state=random_state))
classifiers.append(LinearDiscriminantAnalysis())

diabetes_data["OutCome"] = diabetes_data["OutCome"].astype(int)
Y_diabetes_data = diabetes_data["OutCome"]
X_diabetes_data = diabetes_data.drop(labels=["OutCome"], axis=1)

# Evaluate each classifier with 10-fold cross-validated accuracy
cv_results = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X_diabetes_data, y=Y_diabetes_data, scoring="accuracy", cv=kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans": cv_means,
                       "CrossValerrors": cv_std,
                       "Algorithm": ["SVC", "DecisionTree", "AdaBoost", "RandomForest", "ExtraTrees",
                                     "GradientBoosting", "MultipleLayerPerceptron", "KNeighboors",
                                     "LogisticRegression", "LinearDiscriminantAnalysis"]})

# Plot the mean accuracy of each algorithm with its standard deviation as error bars
g = sns.barplot(x="CrossValMeans", y="Algorithm", data=cv_res, palette="Set3", orient="h", **{'xerr': cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
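Besides the bar chart, it can be useful to print the cross-validation table ranked by mean accuracy (a small usage sketch):

# Show the cross-validation results sorted by mean accuracy
print(cv_res.sort_values("CrossValMeans", ascending=False))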
Hyperparameter tuning for the best models
### META MODELING WITH ADABOOST, RF, EXTRATREES and GRADIENTBOOSTING
from sklearn.model_selection import GridSearchCV

# AdaBoost
DTC = DecisionTreeClassifier()
adaDTC = AdaBoostClassifier(DTC, random_state=7)

ada_param_grid = {"base_estimator__criterion": ["gini", "entropy"],
                  "base_estimator__splitter": ["best", "random"],
                  "algorithm": ["SAMME", "SAMME.R"],
                  "n_estimators": [1, 2],
                  "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 1.5]}

gsadaDTC = GridSearchCV(adaDTC, param_grid=ada_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)
gsadaDTC.fit(X_diabetes_data, Y_diabetes_data)

ada_best = gsadaDTC.best_estimator_

# Best score
gsadaDTC.best_score_
# ExtraTrees
ExtC = ExtraTreesClassifier()

## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
                 "max_features": [1, 3, 10],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "bootstrap": [False],
                 "n_estimators": [100, 300],
                 "criterion": ["gini"]}

gsExtC = GridSearchCV(ExtC, param_grid=ex_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)
gsExtC.fit(X_diabetes_data, Y_diabetes_data)

ExtC_best = gsExtC.best_estimator_

# Best score
gsExtC.best_score_
# RFC parameter tuning
RFC = RandomForestClassifier()

## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
                 "max_features": [1, 3, 10],
                 "min_samples_split": [2, 3, 10],
                 "min_samples_leaf": [1, 3, 10],
                 "bootstrap": [False],
                 "n_estimators": [100, 300],
                 "criterion": ["gini"]}

gsRFC = GridSearchCV(RFC, param_grid=rf_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)
gsRFC.fit(X_diabetes_data, Y_diabetes_data)

RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_
# Gradient boosting tuning
GBC = GradientBoostingClassifier()

gb_param_grid = {'loss': ["deviance"],
                 'n_estimators': [100, 200, 300],
                 'learning_rate': [0.1, 0.05, 0.01],
                 'max_depth': [4, 8],
                 'min_samples_leaf': [100, 150],
                 'max_features': [0.3, 0.1]}

gsGBC = GridSearchCV(GBC, param_grid=gb_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)
gsGBC.fit(X_diabetes_data, Y_diabetes_data)

GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_
### SVC classifier
SVMC = SVC(probability=True)

svc_param_grid = {'kernel': ['rbf'],
                  'gamma': [0.001, 0.01, 0.1, 1],
                  'C': [1, 10, 50, 100, 200, 300, 1000]}

gsSVMC = GridSearchCV(SVMC, param_grid=svc_param_grid, cv=kfold, scoring="accuracy", n_jobs=4, verbose=1)
gsSVMC.fit(X_diabetes_data, Y_diabetes_data)

SVMC_best = gsSVMC.best_estimator_

# Best score
gsSVMC.best_score_
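Once all grid searches have finished, the tuned models can be compared side by side (a sketch that reuses the grid-search objects defined above):

# Compare the best cross-validated accuracy and parameters of each tuned model
for name, gs in [("AdaBoost", gsadaDTC), ("ExtraTrees", gsExtC), ("RandomForest", gsRFC),
                 ("GradientBoosting", gsGBC), ("SVC", gsSVMC)]:
    print(name, gs.best_score_)
    print(gs.best_params_)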
Plot learning curves
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_, "RF learning curves", X_diabetes_data, Y_diabetes_data, cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_, "ExtraTrees learning curves", X_diabetes_data, Y_diabetes_data, cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_, "SVC learning curves", X_diabetes_data, Y_diabetes_data, cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_, "AdaBoost learning curves", X_diabetes_data, Y_diabetes_data, cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_, "GradientBoosting learning curves", X_diabetes_data, Y_diabetes_data, cv=kfold)
Feature importance of tree-based classifiers
# Plot the feature importances of the four tuned tree-based models in a 2x2 grid
nrows = ncols = 2
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, sharex="all", figsize=(15, 15))
names_classifiers = [("AdaBoosting", ada_best), ("ExtraTrees", ExtC_best), ("RandomForest", RFC_best), ("GradientBoosting", GBC_best)]
nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_diabetes_data.columns[indices][:40],
                        x=classifier.feature_importances_[indices][:40],
                        orient='h', ax=axes[row][col])
        g.set_xlabel("Relative importance", fontsize=12)
        g.set_ylabel("Features", fontsize=12)
        g.tick_params(labelsize=9)
        g.set_title(name + " feature importance")
        nclassifier += 1
I do not yet know how to apply the models beyond this point.
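One common way to continue, though it is not what this post does, would be to combine the tuned models into a soft-voting ensemble and use it to predict; a minimal sketch under that assumption:

from sklearn.ensemble import VotingClassifier

# Combine the tuned models into a soft-voting ensemble (one possible next step, not the author's method)
voting = VotingClassifier(estimators=[('rfc', RFC_best), ('extc', ExtC_best), ('svc', SVMC_best),
                                      ('ada', ada_best), ('gbc', GBC_best)],
                          voting='soft', n_jobs=4)
voting = voting.fit(X_diabetes_data, Y_diabetes_data)

# Predict outcomes for data with the same feature columns as X_diabetes_data
print(voting.predict(X_diabetes_data)[:10])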