Kaggle Chapter 3: Diabetes Detection for Beginners

1, Data cleaning

Preface

The columns in the dataset are:

preg: Number of times pregnant
plas: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
pres: Diastolic blood pressure (mm Hg)
skin: Triceps skin fold thickness (mm)
insu: 2-hour serum insulin (mu U/ml)
mass: Body mass index (weight in kg / (height in m)^2)
pedi: Diabetes pedigree function
age: Age (years)
class: Class variable (0 or 1)

1. Data import

As before, we use pd.read_csv to import the data

diabetes_data = pd.read_csv(r"C:\Users\86137\PycharmProjects\pythonProject\venv\Diabetes test\diabetes.csv")


2. Data view

Step 1

diabetes_data.head()  # View the first five rows of data


Step 2

diabetes_data  # View all data

The output shows that there are 768 rows of data.

Step 3

diabetes_data.info()  # Check for missing values

In the columns below a value of 0 is not a plausible measurement, so we treat 0 as a missing value

# Replace every 0 in these columns with NaN
diabetes_data_copy = diabetes_data.copy(deep = True)
diabetes_data_copy[['plas','pres','skin','insu','mass']] = diabetes_data_copy[['plas','pres','skin','insu','mass']].replace(0,np.nan)


The following results are obtained


The figure shows that plas and mass have very few missing values, pres has relatively few, and skin and insu have a large number of missing values.
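The per-column counts behind a figure like this can be obtained with isnull().sum(). A minimal sketch on a made-up toy frame (the values and the name toy are assumptions for illustration, not rows from diabetes.csv):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for diabetes.csv (values are invented)
toy = pd.DataFrame({
    "plas": [148, 0, 183, 89],
    "pres": [72, 66, 0, 0],
    "skin": [35, 0, 0, 23],
})

# Treat 0 as "not measured" in these columns, then count NaNs per column
cols = ["plas", "pres", "skin"]
toy_copy = toy.copy(deep=True)
toy_copy[cols] = toy_copy[cols].replace(0, np.nan)

missing_counts = toy_copy.isnull().sum()
print(missing_counts)
```

On the real data the same call would be made on diabetes_data_copy.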

Step 4

diabetes_data.describe()  # View summary statistics


3. Outlier detection

We reuse the outlier detection code from the Titanic chapter

# Outlier detection function
from collections import Counter

def detect_outliers(df, n, features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []

    # iterate over features (columns)
    for col in features:
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col], 75)
        # Interquartile range (IQR)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = 1.5 * IQR

        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index

        # append the found outlier indices for col to the list of outlier indices
        outlier_indices.extend(outlier_list_col)

    # select observations containing more than n outliers
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(k for k, v in outlier_indices.items() if v > n)

    return multiple_outliers

Outliers_to_drop = detect_outliers(diabetes_data, 4, ["preg", "plas", "pres", "skin","insu","mass","pedi","age"])
print(diabetes_data.loc[Outliers_to_drop])  # Show the outlier rows
# Rows with no more than four outlying feature values are not deleted
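To make the Tukey rule concrete, here is a small worked example on a single made-up array (the numbers are invented for illustration). A value counts as an outlier when it lies more than 1.5 * IQR below Q1 or above Q3:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 100])

q1 = np.percentile(values, 25)   # 2.25
q3 = np.percentile(values, 75)   # 4.75
iqr = q3 - q1                    # 2.5

# Tukey fences: anything outside [lower, upper] is flagged
lower = q1 - 1.5 * iqr           # -1.5
upper = q3 + 1.5 * iqr           # 8.5

outliers = values[(values < lower) | (values > upper)]
print(outliers)                  # [100]
```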

For handling missing values, see the CSDN blog post "3000-word explanation of four commonly used missing value processing methods".


4. Converting non-numeric types

diabetes_data['OutCome'] = diabetes_data['class'].map({"b'tested_positive'":1,"b'tested_negative'":0})
diabetes_data.drop(labels = ["class"], axis = 1, inplace = True)
# Create the new OutCome column, replacing b'tested_positive' with 1 and b'tested_negative' with 0, then delete the original class column.
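Series.map returns NaN for any label that is not a key of the dictionary, so it is worth checking for unmapped values after the conversion. A minimal sketch on a made-up label column (the third label is deliberately written differently to show the effect):

```python
import pandas as pd

# Invented labels; the last one does not match either dictionary key
labels = pd.Series(["b'tested_positive'", "b'tested_negative'", "tested_positive"])
mapping = {"b'tested_positive'": 1, "b'tested_negative'": 0}

mapped = labels.map(mapping)

# Labels that were silently turned into NaN by map()
unmapped = labels[mapped.isnull()].unique()
print(mapped.isnull().sum(), list(unmapped))
```

If unmapped is not empty on the real data, the dictionary keys need to be adjusted to the actual label strings.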

Then we look at the data types

print("Show the data types")
print(diabetes_data.dtypes)

Result: all columns are now numerical

Let's review the first five rows of data


5. Analyze the characteristic column

(1) View the correlation matrix

print("Correlation matrix")

# Correlation matrix between the numerical features and OutCome
g = sns.heatmap(diabetes_data[["preg", "plas", "pres", "skin","insu","mass","pedi","age","OutCome"]].corr(),annot=True, fmt = ".2f", cmap = "coolwarm")

Analysis: the chart shows that plas, mass, age and preg are the features most strongly correlated with OutCome. Features with more missing values have smaller correlation coefficients.
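Besides reading the heatmap, the correlations with the target can be listed in sorted order. A minimal sketch on a made-up frame (values and the name toy are assumptions for illustration):

```python
import pandas as pd

# Invented data: plas rises with OutCome, pres barely changes
toy = pd.DataFrame({
    "plas": [90, 110, 150, 170, 95, 160],
    "pres": [70, 72, 68, 74, 71, 69],
    "OutCome": [0, 0, 1, 1, 0, 1],
})

# Correlation of every feature with OutCome, strongest positive first
target_corr = toy.corr()["OutCome"].drop("OutCome").sort_values(ascending=False)
print(target_corr)
```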

(2) Analyze the relationship between plas and OutCome

According to the missing-value statistics above, plas has only 5 missing values, so we fill them with the median.

# fillna returns a new Series, so the result has to be assigned back
diabetes_data['plas'] = diabetes_data['plas'].fillna(value = diabetes_data['plas'].median())
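Note that the zeros were replaced by NaN only in diabetes_data_copy, so the fill only has an effect on a frame where that replacement has been made. A minimal sketch of the full sequence on a made-up column (values are invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"plas": [100.0, 0.0, 120.0, 140.0]})

# Step 1: mark zeros as missing
toy["plas"] = toy["plas"].replace(0, np.nan)

# Step 2: the median ignores NaN, so it is computed from 100, 120 and 140
median = toy["plas"].median()

# Step 3: fill and assign back
toy["plas"] = toy["plas"].fillna(median)
print(toy["plas"].tolist())
```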

If a bar chart is used, the image is as follows:

# Bar chart
print("Relationship between plas and OutCome")

# Explore plas feature vs OutCome
g = sns.catplot(x="plas",y="OutCome",data=diabetes_data,kind="bar", height = 6 ,
palette = "muted")
g = g.set_ylabels("OutCome")

A bar chart draws one bar per distinct plas value, which is hard to read for a continuous feature, so we use a histogram instead.

g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "plas")


Analysis: according to the chart, plas for positive diabetes tests is concentrated around 20-40, while plas for negative tests lies in the range 0-60, which suggests that plas has a strong influence on the result.

(3) Analyze the relationship between mass and OutCome

According to the missing-value statistics above, mass has only 11 missing values, so we fill them with the median.

print("Relationship between mass and OutCome")

# fillna returns a new Series, so the result has to be assigned back
diabetes_data['mass'] = diabetes_data['mass'].fillna(value = diabetes_data['mass'].median())
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "mass")


Analysis: the mass values of positive diabetes tests are lower than those of negative tests and do not exceed 60.


(4) Analyze the relationship between age and OutCome

print("Relationship between age and OutCome")

# Explore age vs OutCome
g = sns.FacetGrid(diabetes_data, col='OutCome')
g = g.map(sns.histplot, "age")

Analysis: positive diabetes tests are spread fairly evenly over ages 20-60, while tests in the 60-80 age range are negative.

(5) Analyze the relationship between preg and OutCome

print("Relationship between preg and OutCome")

# Explore preg feature vs OutCome
g = sns.catplot(x="preg",y="OutCome",data=diabetes_data,kind="bar", height = 6 ,
palette = "muted")
g = g.set_ylabels("OutCome")

Analysis: the higher the preg value, the more likely a positive diabetes test.

(6) Analyze the distribution of age

print("Distribution of age")

# Explore age distribution
# (fill=True replaces shade=True, which is deprecated in newer seaborn versions)
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 0) & (diabetes_data["age"].notnull())], color="Red", fill = True)
g = sns.kdeplot(diabetes_data["age"][(diabetes_data["OutCome"] == 1) & (diabetes_data["age"].notnull())], ax =g, color="Blue", fill= True)
g = g.legend(["0","1"])

Analysis: people aged 30-60 are more likely to test positive for diabetes.

(7) Analyze the relationship between age and plas, mass and preg

print("Relationship between age and plas, mass and preg")

# Explore age vs plas, mass and preg

g = sns.catplot(y="age",x="plas",hue="OutCome", data=diabetes_data,kind="box")
g = sns.catplot(y="age",x="mass", hue="OutCome",data=diabetes_data,kind="box")
g = sns.catplot(y="age",x="preg",hue="OutCome",data=diabetes_data,kind="box")



2, Model training

Cross validate models

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Cross-validate the models with stratified k-fold cross-validation
kfold = StratifiedKFold(n_splits=10)

# Modeling step: test different algorithms
random_state = 2
classifiers = []
classifiers.append(LogisticRegression(random_state = random_state))

diabetes_data["OutCome"] = diabetes_data["OutCome"].astype(int)

Y_diabetes_data = diabetes_data["OutCome"]

X_diabetes_data = diabetes_data.drop(labels = ["OutCome"],axis = 1)

cv_results = []
for classifier in classifiers :
    cv_results.append(cross_val_score(classifier, X_diabetes_data, y = Y_diabetes_data, scoring = "accuracy", cv = kfold, n_jobs=4))

# Collect the mean and standard deviation of the fold scores for each model
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

# The algorithm list in the original post is truncated, so the names are
# taken from the classifiers list to keep both the same length
cv_res = pd.DataFrame({"CrossValMeans":cv_means,"CrossValerrors": cv_std,"Algorithm":[type(clf).__name__ for clf in classifiers]})

g = sns.barplot(x="CrossValMeans",y="Algorithm",data = cv_res, palette="Set3",orient = "h",**{'xerr':cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
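The same loop extends to several classifiers at once. A minimal sketch, assuming synthetic data from make_classification in place of diabetes.csv and three of the algorithm names that appear in the post (the choice of models and all parameter values here are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 8 diabetes features and the binary target
X, y = make_classification(n_samples=200, n_features=8, random_state=2)

kfold = StratifiedKFold(n_splits=5)
models = {
    "SVC": SVC(random_state=2),
    "DecisionTree": DecisionTreeClassifier(random_state=2),
    "AdaBoost": AdaBoostClassifier(random_state=2),
}

# One row per model: mean and standard deviation of the fold accuracies
rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=kfold)
    rows.append({"Algorithm": name,
                 "CrossValMeans": scores.mean(),
                 "CrossValerrors": scores.std()})

cv_res = pd.DataFrame(rows)
print(cv_res)
```

Keeping the names and the models in one dictionary avoids the name list and the classifier list drifting out of sync.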

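Hyperparameter tuning for best models


from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# AdaBoost
DTC = DecisionTreeClassifier()

adaDTC = AdaBoostClassifier(DTC, random_state=7)

# Note: scikit-learn 1.2 renamed "base_estimator" to "estimator"; on newer
# versions use the "estimator__" prefix. "SAMME.R" is also deprecated in
# recent releases.
ada_param_grid = {"base_estimator__criterion" : ["gini", "entropy"],
              "base_estimator__splitter" :   ["best", "random"],
              "algorithm" : ["SAMME","SAMME.R"],
              "n_estimators" :[1,2],
              "learning_rate":  [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3,1.5]}

gsadaDTC = GridSearchCV(adaDTC,param_grid = ada_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

# The search must be fitted before best_estimator_ is available
gsadaDTC.fit(X_diabetes_data, Y_diabetes_data)

ada_best = gsadaDTC.best_estimator_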




# ExtraTrees
ExtC = ExtraTreesClassifier()

## Search grid for optimal parameters
## (the data has only 8 features, so the largest max_features is 8, not 10)
ex_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 8],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}

gsExtC = GridSearchCV(ExtC,param_grid = ex_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsExtC.fit(X_diabetes_data, Y_diabetes_data)

ExtC_best = gsExtC.best_estimator_

# Best score
gsExtC.best_score_



# RFC parameter tuning
RFC = RandomForestClassifier()

## Search grid for optimal parameters
## (the data has only 8 features, so the largest max_features is 8, not 10)
rf_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 8],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}

gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_diabetes_data, Y_diabetes_data)

RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_


# Gradient boosting tuning

GBC = GradientBoostingClassifier()
# Note: the "deviance" loss was renamed to "log_loss" in newer scikit-learn versions
gb_param_grid = {'loss' : ["deviance"],
              'n_estimators' : [100,200,300],
              'learning_rate': [0.1, 0.05, 0.01],
              'max_depth': [4, 8],
              'min_samples_leaf': [100,150],
              'max_features': [0.3, 0.1]
              }

gsGBC = GridSearchCV(GBC,param_grid = gb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_diabetes_data, Y_diabetes_data)

GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_

### SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'], 
                  'gamma': [ 0.001, 0.01, 0.1, 1],
                  'C': [1, 10, 50, 100,200,300, 1000]}

gsSVMC = GridSearchCV(SVMC,param_grid = svc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)

gsSVMC.fit(X_diabetes_data, Y_diabetes_data)

SVMC_best = gsSVMC.best_estimator_

# Best score
gsSVMC.best_score_
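All of the searches above follow the same pattern: build the grid, call fit, then read best_estimator_, best_params_ and best_score_. A compact sketch of that pattern on synthetic data (make_classification stands in for diabetes.csv, and the small grid is an assumption chosen only to keep the run short):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in with 8 features, like the diabetes data
X, y = make_classification(n_samples=120, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_features": [1, 3], "min_samples_leaf": [1, 3]},
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
)

# best_estimator_, best_params_ and best_score_ only exist after fit()
search.fit(X, y)

best_model = search.best_estimator_
print(search.best_params_, search.best_score_)
```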


Plot learning curves 

import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generate a simple plot of the test and training learning curve"""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    # Shaded bands show one standard deviation around the mean score
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_,"RF learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_,"ExtraTrees learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsSVMC.best_estimator_,"SVC learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_,"AdaBoost learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_,"GradientBoosting learning curves",X_diabetes_data,Y_diabetes_data,cv=kfold)



Feature importance of tree based classifiers 

nrows = ncols = 2
fig, axes = plt.subplots(nrows = nrows, ncols = ncols, sharex="all", figsize=(15,15))

names_classifiers = [("AdaBoosting", ada_best),("ExtraTrees",ExtC_best),("RandomForest",RFC_best),("GradientBoosting",GBC_best)]

nclassifier = 0
for row in range(nrows):
    for col in range(ncols):
        name = names_classifiers[nclassifier][0]
        classifier = names_classifiers[nclassifier][1]
        indices = np.argsort(classifier.feature_importances_)[::-1][:40]
        g = sns.barplot(y=X_diabetes_data.columns[indices][:40],x = classifier.feature_importances_[indices][:40] , orient='h',ax=axes[row][col])
        g.set_xlabel("Relative importance",fontsize=12)
        g.set_title(name + " feature importance")
        nclassifier += 1


I do not yet know how to apply the trained models afterwards.
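One common way to apply a fitted model is to hold out part of the data, fit on the rest and call predict on the held-out rows. A minimal sketch on synthetic data (make_classification stands in for diabetes.csv and LogisticRegression for any of the tuned models; both are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the OutCome column
X, y = make_classification(n_samples=200, n_features=8, random_state=2)

# Keep 25% of the rows aside; the model never sees them during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2)

model = LogisticRegression(max_iter=1000, random_state=2)
model.fit(X_train, y_train)

# Class predictions (0 or 1) and the predicted probability of class 1
predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)[:, 1]

test_accuracy = accuracy_score(y_test, predictions)
print(test_accuracy)
```

With the real data, X and y would be X_diabetes_data and Y_diabetes_data, and model could be one of the best_estimator_ objects from the grid searches.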

Tags: Python Data Mining programming language

Posted by blanius on Sat, 23 Jul 2022 23:54:38 +0530