[Teaching Competition] Financial Data Analysis Question 1: Bank Customer Subscription Product Prediction (0.9676)

This article is a write-up of my entry in the Tianchi teaching competition on predicting whether bank customers will subscribe to a product. The competition page is here:

[Teaching Competition] Financial Data Analysis Competition Question 1: Bank Customer Subscription Product Prediction_Learning Competition_Tianchi Contest-Alibaba Cloud Tianchi

1. Read data

import pandas as pd

# Read the training and test data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

2. Data processing

2.1 Merge data

# Merge the training set and the test set so that the feature columns can be processed together
df = pd.concat([train, test], axis=0)  # concatenate the training and test data along the row axis
df

The results obtained:

id	age	job	marital	education	default	housing	loan	contact	month	...	campaign	pdays	previous	poutcome	emp_var_rate	cons_price_index	cons_conf_index	lending_rate3m	nr_employed	subscribe
0	1	51	admin.	divorced	professional.course	no	yes	yes	cellular	aug	...	1	112	2	failure	1.4	90.81	-35.53	0.69	5219.74	no
1	2	50	services	married	high.school	unknown	yes	no	cellular	may	...	1	412	2	nonexistent	-1.8	96.33	-40.58	4.05	4974.79	yes
2	3	48	blue-collar	divorced	basic.9y	no	no	no	cellular	apr	...	0	1027	1	failure	-1.8	96.33	-44.74	1.50	5022.61	no
3	4	26	entrepreneur	single	high.school	yes	yes	yes	cellular	aug	...	26	998	0	nonexistent	1.4	97.08	-35.55	5.11	5222.87	yes
4	5	45	admin.	single	university.degree	no	no	no	cellular	nov	...	1	240	4	success	-3.4	89.82	-33.83	1.17	4884.70	no
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7495	29996	49	admin.	unknown	university.degree	unknown	yes	yes	telephone	apr	...	50	302	1	failure	-1.8	95.77	-40.50	3.86	5058.64	NaN
7496	29997	34	blue-collar	married	basic.4y	no	no	no	cellular	jul	...	8	440	3	failure	1.4	90.59	-47.29	1.77	5156.70	NaN
7497	29998	50	retired	single	basic.4y	no	yes	no	cellular	jun	...	3	997	0	nonexistent	-2.9	97.42	-39.69	1.29	5116.80	NaN
7498	29999	31	technician	married	professional.course	no	no	no	cellular	aug	...	3	1028	0	nonexistent	1.4	96.90	-37.68	5.18	5144.45	NaN
7499	30000	46	admin.	divorced	university.degree	no	yes	no	cellular	aug	...	2	387	3	success	1.4	97.49	-31.54	3.79	5082.25	NaN
30000 rows × 22 columns

The data contains both numeric and text columns, so the text columns need to be converted into numbers.

2.2 Convert non-numeric features to numbers

# First select all columns whose dtype is object (non-numeric)
cat_columns = df.select_dtypes(include='object').columns  # non-numeric columns to be processed
df[cat_columns]
# Encode non-numeric features
from sklearn.preprocessing import LabelEncoder

job_le = LabelEncoder()
df['job'] = job_le.fit_transform(df['job'])
df['marital'] = df['marital'].map({'unknown':0, 'single':1, 'married':2, 'divorced':3})
df['education'] = df['education'].map({'unknown':0, 'basic.4y':1, 'basic.6y':2, 'basic.9y':3, 'high.school':4, 'university.degree':5, 'professional.course':6, 'illiterate':7})
df['housing'] = df['housing'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['loan'] = df['loan'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['contact'] = df['contact'].map({'cellular': 0, 'telephone': 1})
df['day_of_week'] = df['day_of_week'].map({'mon': 0, 'tue': 1, 'wed': 2, 'thu': 3, 'fri': 4})
df['poutcome'] = df['poutcome'].map({'nonexistent': 0, 'failure': 1, 'success': 2})
df['default'] = df['default'].map({'unknown': 0, 'no': 1, 'yes': 2})
df['month'] = df['month'].map({'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 'jul': 7, 'aug': 8, \
                 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12})
df['subscribe'] = df['subscribe'].map({'no': 0, 'yes': 1})
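
The manual mappings above preserve a natural ordering for some columns. As a more compact alternative (a sketch only, not what is used in this write-up), all object columns except the label could be encoded in one loop with LabelEncoder; the integer codes would then simply follow alphabetical order:

# Sketch of an alternative: one LabelEncoder per object column (codes follow alphabetical order)
encoders = {}
for col in cat_columns:
    if col == 'subscribe':        # leave the label column to the explicit map above
        continue
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le            # keep the encoders in case inverse_transform is needed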

2.3 Split data

# Split the data back into training and test sets according to whether 'subscribe' is null
train = df[df['subscribe'].notnull()]
test = df[df['subscribe'].isnull()]

# Look at the proportion of labels 0 and 1 in the training set: the classes are unbalanced, with about 6.6 times as many 0s as 1s
train['subscribe'].value_counts()

The result:

0.0    19548
1.0     2952
Name: subscribe, dtype: int64
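
A quick check of the imbalance ratio quoted above:

counts = train['subscribe'].value_counts()
print(counts[0.0] / counts[1.0])   # 19548 / 2952 ≈ 6.6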

2.4 Analyze data

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline
num_features = [x for x in train.columns if x not in cat_columns and x!='id']

fig = plt.figure(figsize=(80,60))

for i in range(len(num_features)):
    plt.subplot(7,2,i+1)
    sns.boxplot(train[num_features[i]])
    plt.ylabel(num_features[i], fontsize=36)
plt.show()

The boxplots show that there are outliers, which are handled next.

2.5 Dealing with outliers

for column in num_features:
    temp = train[column]
    q1 = temp.quantile(0.25)
    q3 = temp.quantile(0.75)
    delta = (q3 - q1) * 10
    train[column] = np.clip(temp, q1 - delta, q3 + delta)
## Clip values lying more than 10 times the interquartile range below q1 or above q3
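
For comparison, the conventional boxplot rule clips at 1.5 times the interquartile range, which is far stricter than the 10x threshold used above. A minimal sketch of that variant (not used here) would be:

# Sketch of the standard 1.5*IQR rule, stricter than the 10*IQR clipping above (not used here)
for column in num_features:
    q1 = train[column].quantile(0.25)
    q3 = train[column].quantile(0.75)
    iqr = q3 - q1
    train[column] = np.clip(train[column], q1 - 1.5 * iqr, q3 + 1.5 * iqr)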

2.6 Other processing

Class balancing and feature selection were also tried, but both made the final classification worse, so they are omitted from the pipeline. The original code is kept below for reference.

'''# Oversampling with SMOTE/ADASYN: training performance improves, but the final classification score drops, so oversampling is not used here.
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import ADASYN

#smo = SMOTE(random_state=0, k_neighbors=10)
adasyn = ADASYN()
X_smo, y_smo = adasyn.fit_resample(train.iloc[:,:-1], train.iloc[:,-1])
train_smo = pd.concat([X_smo, y_smo], axis=1)

train_smo['subscribe'].value_counts()'''
'''# Feature selection with SelectFromModel, using an extra-trees model as the estimator
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Extract training data and labels
train_X = train.iloc[:,:-1]
train_y = train.iloc[:,-1]

# clf_etc is the tree model, and FeaSel is the feature-selection model
clf_etc = ExtraTreesClassifier(n_estimators=50)
clf_etc = clf_etc.fit(train_X, train_y)
FeaSel = SelectFromModel(clf_etc, prefit=True)
train_sel = FeaSel.transform(train_X)
test_sel = FeaSel.transform(test.iloc[:,:-1])

# Recover the selected feature names and write them back into the new data
train_new = pd.DataFrame(train_sel)
feature_idx = FeaSel.get_support()  # boolean mask of the selected columns
train_new.columns = train_X.columns[feature_idx]  # write the column names back onto the selected data

train_new = pd.concat([train_new, train_y],axis=1)
test_new = pd.DataFrame(test_sel)
test_new.columns = train_X.columns[feature_idx]'''

Note: there may be issues with variable naming in this part of the code.
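
Assuming the feature-selection block were actually executed (rather than kept as a quoted string), one way to avoid depending on column order via iloc is to select columns by name, roughly as follows (a sketch, with 'subscribe' assumed to be the label column):

# Sketch: apply the selected feature names to train and test by name instead of position
selected_cols = train_X.columns[FeaSel.get_support()]
train_new = pd.concat([train_X[selected_cols], train_y], axis=1)
test_new = test[selected_cols]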

2.7 Data storage

train_new = train
test_new = test

# Save the processed data as train_new.csv and test_new.csv
train_new.to_csv('train_new.csv', index=False)
test_new.to_csv('test_new.csv', index=False)

3. Model training

3.1 Import package and data

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBRFClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import cross_val_score
import time

clf_lr = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')
clf_dt = DecisionTreeClassifier()
clf_rf = RandomForestClassifier()
clf_gb = GradientBoostingClassifier()
clf_adab = AdaBoostClassifier()
clf_xgbrf = XGBRFClassifier()
clf_lgb = LGBMClassifier()

from sklearn.model_selection import train_test_split
train_new = pd.read_csv('train_new.csv')
test_new = pd.read_csv('test_new.csv')
feature_columns = [col for col in train_new.columns if col not in ['subscribe']]
train_data = train_new[feature_columns]
target_data = train_new['subscribe']
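
The summary at the end mentions comparing lr, dt, rf, gb, adab, xgbrf and lgb before settling on lgb. The original comparison code is not shown; a minimal sketch using the already-imported cross_val_score might look like this (the settings are illustrative, not the author's exact ones):

# Sketch: rough 5-fold cross-validation comparison of the candidate classifiers
classifiers = {'lr': clf_lr, 'dt': clf_dt, 'rf': clf_rf, 'gb': clf_gb,
               'adab': clf_adab, 'xgbrf': clf_xgbrf, 'lgb': clf_lgb}
for name, model in classifiers.items():
    start = time.time()
    scores = cross_val_score(model, train_data, target_data, cv=5, scoring='accuracy')
    print(f'{name}: accuracy {scores.mean():.4f} +/- {scores.std():.4f} ({time.time() - start:.1f}s)')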

3.2 Model tuning

from lightgbm import LGBMClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(train_data, target_data, test_size=0.2,shuffle=True, random_state=2023)
#X_test, X_valid, y_test, y_valid = train_test_split(X_test, y_test, test_size=0.5,shuffle=True,random_state=2023)

n_estimators = [300]
learning_rate = [0.02]   # 0.02 was the best among the values tried
subsample = [0.6]
colsample_bytree = [0.7]   ##0.6 is optimal in [0.5, 0.6, 0.7]
max_depth = [9, 11, 13] ##11 is optimal among [7, 9, 11, 13]
is_unbalance = [False]
early_stopping_rounds = [300]
num_boost_round = [5000]
metric = ['binary_logloss']
feature_fraction = [0.6, 0.75, 0.9]
bagging_fraction = [0.6, 0.75, 0.9]
bagging_freq = [2, 4, 5, 8]
lambda_l1 = [0, 0.1, 0.4, 0.5]
lambda_l2 =  [0, 10, 15, 35]
cat_smooth = [1, 10, 15, 20]


param = {'n_estimators':n_estimators,
         'learning_rate':learning_rate,
         'subsample':subsample,
         'colsample_bytree':colsample_bytree,
         'max_depth':max_depth,
         'is_unbalance':is_unbalance,
         'early_stopping_rounds':early_stopping_rounds,
         'num_boost_round':num_boost_round,
         'metric':metric,
         'feature_fraction':feature_fraction,
         'bagging_fraction':bagging_fraction,
         'lambda_l1':lambda_l1,
         'lambda_l2':lambda_l2,
         'cat_smooth':cat_smooth}

model = LGBMClassifier()

clf = GridSearchCV(model, param, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
clf.fit(X_train, y_train, eval_set=[(X_train, y_train),(X_test, y_test)])

print(clf.best_params_, clf.best_score_)

Parameters listed with a single value are ones whose optimum has already been found by GridSearchCV; the code above searches over the last six parameters. Tuning everything at once takes too long, so I searched the parameters in separate groups.

The results obtained:

Early stopping, best iteration is:
[287]	training's binary_logloss: 0.22302	valid_1's binary_logloss: 0.253303
{'bagging_fraction': 0.6, 'cat_smooth': 1, 'colsample_bytree': 0.7, 'early_stopping_rounds': 300, 'feature_fraction': 0.75, 'is_unbalance': False, 'lambda_l1': 0.4, 'lambda_l2': 10, 'learning_rate': 0.02, 'max_depth': 11, 'metric': 'binary_logloss', 'n_estimators': 300, 'num_boost_round': 5000, 'subsample': 0.6} 0.8853333333333334
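
A sketch of the group-by-group search described above: tune one group of parameters, fix the winners, then move on to the next group (the grouping here is illustrative):

# Sketch of staged tuning: fix the best values of each stage before searching the next group
base_params = {'n_estimators': 300, 'learning_rate': 0.02, 'subsample': 0.6}

stage1 = {'max_depth': [7, 9, 11, 13], 'colsample_bytree': [0.5, 0.6, 0.7]}
search1 = GridSearchCV(LGBMClassifier(**base_params), stage1, cv=3, scoring='accuracy', n_jobs=-1)
search1.fit(X_train, y_train)
base_params.update(search1.best_params_)

stage2 = {'lambda_l1': [0, 0.1, 0.4, 0.5], 'lambda_l2': [0, 10, 15, 35]}
search2 = GridSearchCV(LGBMClassifier(**base_params), stage2, cv=3, scoring='accuracy', n_jobs=-1)
search2.fit(X_train, y_train)
base_params.update(search2.best_params_)
print(base_params)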

3.3 Prediction Results

y_true, y_pred = y_test, clf.predict(X_test)
accuracy = accuracy_score(y_true,y_pred)
print(classification_report(y_true, y_pred))
print('Accuracy',accuracy)

The result:

precision    recall  f1-score   support

         0.0       0.91      0.97      0.94      3933
         1.0       0.60      0.32      0.42       567

    accuracy                           0.89      4500
   macro avg       0.75      0.64      0.68      4500
weighted avg       0.87      0.89      0.87      4500

Accuracy 0.8875555555555555

View the confusion matrix:

from sklearn import metrics
confusion_matrix_result = metrics.confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8,6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('predict')
plt.ylabel('true')
plt.show()

4. Output the result

test_x = test[feature_columns]
pred_test = clf.predict(test_x)
result = pd.read_csv('./submission.csv')
subscribe_map ={1: 'yes', 0: 'no'}
result['subscribe'] = [subscribe_map[x] for x in pred_test]
result.to_csv('./baseline_lgb1.csv', index=False)
result['subscribe'].value_counts()

The result:

no     6987
yes     513
Name: subscribe, dtype: int64

5. Submit results

6. Summary

My approach only reached a score of 0.9676; I hope you can build on this program and get better results. If you have a better approach, please leave a comment so we can discuss it.

Ideas for improvement:

1. Data processing: when I balanced the classes, training performance was very good but the final score dropped, which is probably due to overfitting. The handling of outliers could also be explored further;

2. Method improvements: I compared lr, dt, rf, gb, adab, xgbrf and lgb; lgb performed best, so I chose it for parameter tuning. Combining several methods for training could also be considered (see the sketch after this list);

3. Further parameter tuning on top of lgb: this requires the least technical insight, but spending more time on it should yield better results than mine.
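
For idea 2, a minimal sketch of combining several of the compared models with soft voting (untuned, illustrative settings) could look like this:

# Sketch: soft-voting ensemble of three of the compared models (untuned, illustrative)
from sklearn.ensemble import VotingClassifier

voting = VotingClassifier(
    estimators=[('rf', RandomForestClassifier()),
                ('gb', GradientBoostingClassifier()),
                ('lgb', LGBMClassifier())],
    voting='soft')
voting.fit(X_train, y_train)
print('voting accuracy:', accuracy_score(y_test, voting.predict(X_test)))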

Tags: Data Analysis AI Machine Learning

Posted by zhopa on Tue, 17 Jan 2023 22:48:00 +0530