This article is a write-up of my solution to the Tianchi teaching competition on predicting whether bank customers will subscribe to a product. The competition page is here:
1. Read data
import pandas as pd

# Read the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
2. Data processing
2.1 Merge data
# Merge the training set and the test set so the feature processing only has to be done once
df = pd.concat([train, test], axis=0)  # concatenate the training and test data along the row axis
df
The results obtained:
        id  age  job           marital   education            default  housing  loan  contact    month  ...  campaign  pdays  previous  poutcome     emp_var_rate  cons_price_index  cons_conf_index  lending_rate3m  nr_employed  subscribe
0        1   51  admin.        divorced  professional.course  no       yes      yes   cellular   aug    ...         1    112         2  failure               1.4             90.81           -35.53            0.69      5219.74         no
1        2   50  services      married   high.school          unknown  yes      no    cellular   may    ...         1    412         2  nonexistent          -1.8             96.33           -40.58            4.05      4974.79        yes
2        3   48  blue-collar   divorced  basic.9y             no       no       no    cellular   apr    ...         0   1027         1  failure              -1.8             96.33           -44.74            1.50      5022.61         no
3        4   26  entrepreneur  single    high.school          yes      yes      yes   cellular   aug    ...        26    998         0  nonexistent           1.4             97.08           -35.55            5.11      5222.87        yes
4        5   45  admin.        single    university.degree    no       no       no    cellular   nov    ...         1    240         4  success              -3.4             89.82           -33.83            1.17      4884.70         no
...    ...  ...  ...           ...       ...                  ...      ...      ...   ...        ...    ...       ...    ...       ...  ...                   ...               ...              ...             ...          ...        ...
7495 29996   49  admin.        unknown   university.degree    unknown  yes      yes   telephone  apr    ...        50    302         1  failure              -1.8             95.77           -40.50            3.86      5058.64        NaN
7496 29997   34  blue-collar   married   basic.4y             no       no       no    cellular   jul    ...         8    440         3  failure               1.4             90.59           -47.29            1.77      5156.70        NaN
7497 29998   50  retired       single    basic.4y             no       yes      no    cellular   jun    ...         3    997         0  nonexistent          -2.9             97.42           -39.69            1.29      5116.80        NaN
7498 29999   31  technician    married   professional.course  no       no       no    cellular   aug    ...         3   1028         0  nonexistent           1.4             96.90           -37.68            5.18      5144.45        NaN
7499 30000   46  admin.        divorced  university.degree    no       yes      no    cellular   aug    ...         2    387         3  success               1.4             97.49           -31.54            3.79      5082.25        NaN

30000 rows × 22 columns
The data contains both numeric and text columns; the text columns need to be converted to numbers.
2.2 Convert non-numeric features to numbers
# First select all object (non-numeric) columns, which need to be encoded
cat_columns = df.select_dtypes(include='object').columns
df[cat_columns]
# Encode the non-numeric features
from sklearn.preprocessing import LabelEncoder

job_le = LabelEncoder()
df['job'] = job_le.fit_transform(df['job'])
df['marital'] = df['marital'].map({'unknown': 0, 'single': 1, 'married': 2, 'divorced': 3})
df['education'] = df['education'].map({'unknown': 0, 'basic.4y': 1, 'basic.6y': 2, 'basic.9y': 3,
                                       'high.school': 4, 'university.degree': 5,
                                       'professional.course': 6, 'illiterate': 7})
df['housing'] = df['housing'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['loan'] = df['loan'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['contact'] = df['contact'].map({'cellular': 0, 'telephone': 1})
df['day_of_week'] = df['day_of_week'].map({'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4})
df['poutcome'] = df['poutcome'].map({'nonexistent': 0, 'failure': 1, 'success': 2})
df['default'] = df['default'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['month'] = df['month'].map({'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
                               'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12})
df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
2.3 Split data
# Split back into training and test sets by checking whether 'subscribe' is missing
train = df[df['subscribe'].notnull()]
test = df[df['subscribe'].isnull()]

# Check the proportion of labels 0 and 1 in the training set: the classes are
# imbalanced, with roughly 6.6 times as many 0s as 1s
train['subscribe'].value_counts()
Output:
0.0    19548
1.0     2952
Name: subscribe, dtype: int64
2.4 Analyze data
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

# Draw box plots of the numeric features to look for outliers
num_features = [x for x in train.columns if x not in cat_columns and x != 'id']
fig = plt.figure(figsize=(80, 60))
for i in range(len(num_features)):
    plt.subplot(7, 2, i + 1)
    sns.boxplot(train[num_features[i]])
    plt.ylabel(num_features[i], fontsize=36)
plt.show()
The box plots show outliers, which are handled next.
2.5 Dealing with outliers
# Clip each numeric feature so that values far outside the interquartile range are capped
for colum in num_features:
    temp = train[colum]
    q1 = temp.quantile(0.25)
    q2 = temp.quantile(0.75)
    delta = (q2 - q1) * 10
    train[colum] = np.clip(temp, q1 - delta, q2 + delta)  # cap values more than 10 IQRs outside the quartiles
2.6 Other processing
Data balancing and feature selection were also tried, but both made the final classification results worse, so they are left out of the pipeline. The original code is posted below for reference.
'''# Oversampling with SMOTE/ADASYN. Although the training scores looked good, the
# final classification result got worse, so oversampling is not used here.
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

# smo = SMOTE(random_state=0, k_neighbors=10)
adasyn = ADASYN()
X_smo, y_smo = adasyn.fit_resample(train.iloc[:, :-1], train.iloc[:, -1])
train_smo = pd.concat([X_smo, y_smo], axis=1)
train_smo['subscribe'].value_counts()'''
'''# Feature selection with SelectFromModel, using a tree model (ExtraTreesClassifier)
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Extract the training features and labels
train_X = train.iloc[:, :-1]
train_y = train.iloc[:, -1]

# clf_etc is the tree model, FeaSel is the feature-selection model
clf_etc = ExtraTreesClassifier(n_estimators=50)
clf_etc = clf_etc.fit(train_X, train_y)
FeaSel = SelectFromModel(clf_etc, prefit=True)
train_sel = FeaSel.transform(train_X)
test_sel = FeaSel.transform(test.iloc[:, :-1])

# Recover the selected feature names and write them back onto the reduced data
train_new = pd.DataFrame(train_sel)
feature_idx = FeaSel.get_support()                 # mask of the selected columns
train_new.columns = train_X.columns[feature_idx]   # write the column names back
train_new = pd.concat([train_new, train_y], axis=1)
test_new = pd.DataFrame(test_sel)
test_new.columns = train_X.columns[feature_idx]'''
The variable naming in this commented-out section may be inconsistent, since it was not used in the final run.
2.7 Data storage
# Write the processed data back to train_new and test_new and save them
train_new = train
test_new = test
train_new.to_csv('train_new.csv', index=False)
test_new.to_csv('test_new.csv', index=False)
3. Model training
3.1 Import packages and data
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBRFClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
import time

# Candidate models to compare
clf_lr = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf_dt = DecisionTreeClassifier()
clf_rf = RandomForestClassifier()
clf_gb = GradientBoostingClassifier()
clf_adab = AdaBoostClassifier()
clf_xgbrf = XGBRFClassifier()
clf_lgb = LGBMClassifier()

from sklearn.model_selection import train_test_split

train_new = pd.read_csv('train_new.csv')
test_new = pd.read_csv('test_new.csv')
feature_columns = [col for col in train_new.columns if col not in ['subscribe']]
train_data = train_new[feature_columns]
target_data = train_new['subscribe']
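The comparison loop for these seven candidate models is not shown in the original code. As a minimal sketch (my own illustration; the `models` dictionary is an assumed helper, not part of the original), they could be scored against each other with cross_val_score before settling on one:

# Sketch only: score each candidate model with 5-fold cross-validation
models = {'lr': clf_lr, 'dt': clf_dt, 'rf': clf_rf, 'gb': clf_gb,
          'adab': clf_adab, 'xgbrf': clf_xgbrf, 'lgb': clf_lgb}
for name, model in models.items():
    start = time.time()
    scores = cross_val_score(model, train_data, target_data, cv=5, scoring='accuracy')
    print(f'{name}: mean accuracy {scores.mean():.4f}, time {time.time() - start:.1f}s')

In my runs LightGBM came out best (see the summary), which is why the next section tunes it.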
3.2 Model tuning
from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_data, target_data, test_size=0.2,
                                                    shuffle=True, random_state=2023)
# X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5, shuffle=True, random_state=2023)

n_estimators = [300]
learning_rate = [0.02]               # best value found in an earlier search
subsample = [0.6]
colsample_bytree = [0.7]             # 0.6 was optimal among [0.5, 0.6, 0.7]
max_depth = [9, 11, 13]              # 11 was optimal among [7, 9, 11, 13]
is_unbalance = [False]
early_stopping_rounds = [300]
num_boost_round = [5000]
metric = ['binary_logloss']
feature_fraction = [0.6, 0.75, 0.9]
bagging_fraction = [0.6, 0.75, 0.9]
bagging_freq = [2, 4, 5, 8]
lambda_l1 = [0, 0.1, 0.4, 0.5]
lambda_l2 = [0, 10, 15, 35]
cat_smooth = [1, 10, 15, 20]

param = {'n_estimators': n_estimators, 'learning_rate': learning_rate, 'subsample': subsample,
         'colsample_bytree': colsample_bytree, 'max_depth': max_depth, 'is_unbalance': is_unbalance,
         'early_stopping_rounds': early_stopping_rounds, 'num_boost_round': num_boost_round,
         'metric': metric, 'feature_fraction': feature_fraction, 'bagging_fraction': bagging_fraction,
         'lambda_l1': lambda_l1, 'lambda_l2': lambda_l2, 'cat_smooth': cat_smooth}

model = LGBMClassifier()
clf = GridSearchCV(model, param, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
clf.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])
print(clf.best_params_, clf.best_score_)
Parameters whose list contains only one value are optima already found in earlier GridSearchCV runs. The code above searches over the last six parameters; tuning everything together would take too long, so I searched the parameters in separate batches.
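To illustrate that batched search (a sketch of my own, not the exact original runs; `base_params`, `stage_grid`, and `stage_clf` are assumed names), each batch is tuned with the previously found optima held fixed:

# Illustrative staged search: fix the parameters already tuned, search only the next batch
base_params = {'n_estimators': 300, 'learning_rate': 0.02, 'subsample': 0.6,
               'colsample_bytree': 0.7, 'max_depth': 11}
stage_grid = {'lambda_l1': [0, 0.1, 0.4, 0.5],
              'lambda_l2': [0, 10, 15, 35]}
stage_clf = GridSearchCV(LGBMClassifier(**base_params), stage_grid,
                         cv=3, scoring='accuracy', n_jobs=-1)
stage_clf.fit(X_train, y_train)
print(stage_clf.best_params_, stage_clf.best_score_)
# The best values found here are merged into base_params before tuning the next batch.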
Running the full search above gives:
Early stopping, best iteration is:
[287]	training's binary_logloss: 0.22302	valid_1's binary_logloss: 0.253303

{'bagging_fraction': 0.6, 'cat_smooth': 1, 'colsample_bytree': 0.7, 'early_stopping_rounds': 300,
 'feature_fraction': 0.75, 'is_unbalance': False, 'lambda_l1': 0.4, 'lambda_l2': 10,
 'learning_rate': 0.02, 'max_depth': 11, 'metric': 'binary_logloss', 'n_estimators': 300,
 'num_boost_round': 5000, 'subsample': 0.6} 0.8853333333333334
3.3 Prediction Results
# Evaluate the tuned model on the held-out split
y_true, y_pred = y_test, clf.predict(X_test)
accuracy = accuracy_score(y_true, y_pred)
print(classification_report(y_true, y_pred))
print('Accuracy', accuracy)
Result:
              precision    recall  f1-score   support

         0.0       0.91      0.97      0.94      3933
         1.0       0.60      0.32      0.42       567

    accuracy                           0.89      4500
   macro avg       0.75      0.64      0.68      4500
weighted avg       0.87      0.89      0.87      4500

Accuracy 0.8875555555555555
View confusion matrix
from sklearn import metrics

# Plot the confusion matrix as a heatmap
confusion_matrix_result = metrics.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('predict')
plt.ylabel('true')
plt.show()
4. Output the result
# Predict on the test set and write the submission file
test_x = test[feature_columns]
pred_test = clf.predict(test_x)
result = pd.read_csv('./submission.csv')
subscribe_map = {1: 'yes', 0: 'no'}
result['subscribe'] = [subscribe_map[x] for x in pred_test]
result.to_csv('./baseline_lgb1.csv', index=False)
result['subscribe'].value_counts()
Result:
no     6987
yes     513
Name: subscribe, dtype: int64
5. Submit results
6. Summary
My approach only scored 0.9676. I hope you can build on this program to get better results; if you find a better way, please share it in the comments so we can discuss.
Ideas for improvement:
1. Data processing: when I balanced the classes, the training scores were very good but the final score dropped, which is probably overfitting. The handling of outliers could also be considered further.
2. Model choice: I compared lr, dt, rf, gb, adab, xgbrf, and lgb; lgb gave the best results, so I tuned it. Combining several models could be worth trying (see the sketch after this list).
3. Parameter tuning on top of lgb: this has the least technical content, but spending more time on it should still yield better results than mine.
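For idea 2, one simple way to combine several of the models above is soft voting. The following is only a sketch under my own assumptions (it is not part of the original solution, and the choice of estimators is arbitrary):

# Illustrative only: combine three of the candidate models with soft voting
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from lightgbm import LGBMClassifier

ensemble = VotingClassifier(
    estimators=[('rf', RandomForestClassifier()),
                ('gb', GradientBoostingClassifier()),
                ('lgb', LGBMClassifier())],
    voting='soft')  # average the predicted probabilities of the three models

scores = cross_val_score(ensemble, train_data, target_data, cv=5, scoring='accuracy')
print('Voting ensemble mean accuracy:', scores.mean())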