Predicting Credit Card Default - Classification

11 minute read

Context

The dataset used for this notebook is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients); it is also available on Kaggle (https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/discussion/34608).

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

  • X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
  • X2: Gender (1 = male; 2 = female).
  • X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
  • X4: Marital status (1 = married; 2 = single; 3 = others).
  • X5: Age (year).
  • X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  • X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
  • X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
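Note that the Kaggle CSV uses its own column names rather than the X1-X23 labels above. My reading of the correspondence (worth double-checking against the data dictionary) is sketched below:

# hypothetical mapping from the paper's X1-X23 scheme to the CSV columns
x_to_column = {
    'X1': 'LIMIT_BAL', 'X2': 'SEX', 'X3': 'EDUCATION', 'X4': 'MARRIAGE', 'X5': 'AGE',
    # X6-X11: repayment status from September back to April (the CSV has no PAY_1)
    'X6': 'PAY_0', 'X7': 'PAY_2', 'X8': 'PAY_3', 'X9': 'PAY_4', 'X10': 'PAY_5', 'X11': 'PAY_6',
    # X12-X17: bill statements; X18-X23: previous payments (same month order)
    **{'X{}'.format(11 + i): 'BILL_AMT{}'.format(i) for i in range(1, 7)},
    **{'X{}'.format(17 + i): 'PAY_AMT{}'.format(i) for i in range(1, 7)},
}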

Importing all packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import shapiro, mannwhitneyu, chi2_contingency
import ppscore as pps

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, matthews_corrcoef, precision_score, recall_score, accuracy_score

from jcopml.tuning import random_search_params as rsp
from jcopml.tuning.space import Real, Integer

from yellowbrick.model_selection import LearningCurve

import warnings
warnings.filterwarnings("ignore") 
sns.set_style('whitegrid')
cc = pd.read_csv('UCI_Credit_Card.csv')
cc.shape
(30000, 25)

This dataset contains 30,000 observations and 25 columns.

cc.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
0 1 20000.0 2 2 1 24 2 2 -1 -1 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2 120000.0 2 2 2 26 -1 2 0 0 ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3 90000.0 2 2 2 34 0 0 0 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4 50000.0 2 2 1 37 0 0 0 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5 50000.0 1 2 1 57 -1 0 -1 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0

5 rows × 25 columns

pd.DataFrame({'dataFeatures' : cc.columns, 'dataType' : cc.dtypes.values, 
              'null' : [cc[i].isna().sum() for i in cc.columns],
              'nullPct' : [((cc[i].isna().sum()/len(cc[i]))*100).round(2) for i in cc.columns],
             'Nunique' : [cc[i].nunique() for i in cc.columns],
             'uniqueSample' : [list(pd.Series(cc[i].unique()).sample(2)) for i in cc.columns]}).reset_index(drop = True)
dataFeatures dataType null nullPct Nunique uniqueSample
0 ID int64 0 0.0 30000 [24194, 23057]
1 LIMIT_BAL float64 0 0.0 81 [40000.0, 110000.0]
2 SEX int64 0 0.0 2 [2, 1]
3 EDUCATION int64 0 0.0 7 [5, 4]
4 MARRIAGE int64 0 0.0 4 [3, 2]
5 AGE int64 0 0.0 56 [61, 30]
6 PAY_0 int64 0 0.0 11 [7, 1]
7 PAY_2 int64 0 0.0 11 [-1, 5]
8 PAY_3 int64 0 0.0 11 [8, 2]
9 PAY_4 int64 0 0.0 11 [6, 0]
10 PAY_5 int64 0 0.0 10 [-2, 0]
11 PAY_6 int64 0 0.0 10 [4, 6]
12 BILL_AMT1 float64 0 0.0 22723 [105105.0, 7065.0]
13 BILL_AMT2 float64 0 0.0 22346 [7277.0, 94153.0]
14 BILL_AMT3 float64 0 0.0 22026 [7251.0, 76767.0]
15 BILL_AMT4 float64 0 0.0 21548 [5733.0, 11950.0]
16 BILL_AMT5 float64 0 0.0 21010 [45617.0, 41630.0]
17 BILL_AMT6 float64 0 0.0 20604 [84237.0, 67700.0]
18 PAY_AMT1 float64 0 0.0 7943 [14180.0, 66.0]
19 PAY_AMT2 float64 0 0.0 7899 [4070.0, 1450.0]
20 PAY_AMT3 float64 0 0.0 7518 [15413.0, 179.0]
21 PAY_AMT4 float64 0 0.0 6937 [6666.0, 8337.0]
22 PAY_AMT5 float64 0 0.0 6897 [6017.0, 15362.0]
23 PAY_AMT6 float64 0 0.0 6939 [1691.0, 2056.0]
24 default.payment.next.month int64 0 0.0 2 [1, 0]

No missing values! Before I explore the dataset, I’ll rename the target column first.

cc.rename(columns = {'default.payment.next.month' : 'target'}, inplace = True)
cc.dtypes
ID             int64
LIMIT_BAL    float64
SEX            int64
EDUCATION      int64
MARRIAGE       int64
AGE            int64
PAY_0          int64
PAY_2          int64
PAY_3          int64
PAY_4          int64
PAY_5          int64
PAY_6          int64
BILL_AMT1    float64
BILL_AMT2    float64
BILL_AMT3    float64
BILL_AMT4    float64
BILL_AMT5    float64
BILL_AMT6    float64
PAY_AMT1     float64
PAY_AMT2     float64
PAY_AMT3     float64
PAY_AMT4     float64
PAY_AMT5     float64
PAY_AMT6     float64
target         int64
dtype: object

Much better…

Exploratory Data Analysis

cc['target'].value_counts(normalize = True)
0    0.7788
1    0.2212
Name: target, dtype: float64

The default rate is 22.12%, so the classes are fairly imbalanced.
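Because of that imbalance, accuracy alone will flatter any model. As a minimal baseline sketch (my addition, using scikit-learn's DummyClassifier), always predicting the majority class already scores about 77.9% accuracy:

from sklearn.dummy import DummyClassifier

# majority-class baseline: never predicts default
baseline = DummyClassifier(strategy = 'most_frequent')
baseline.fit(cc.drop(columns = 'target'), cc['target'])
print(baseline.score(cc.drop(columns = 'target'), cc['target']))  # ~0.7788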

Hypothesis Testing

Before we go further, I'll run significance tests. But choosing the right significance test requires a normality test first, to determine whether each feature is normally distributed. I'll use the Shapiro-Wilk test for that.

Normality Test

It's a little bit tricky: first we need to determine which features are numerical and which are categorical.

cc.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 target
0 1 20000.0 2 2 1 24 2 2 -1 -1 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2 120000.0 2 2 2 26 -1 2 0 0 ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3 90000.0 2 2 2 34 0 0 0 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4 50000.0 2 2 1 37 0 0 0 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5 50000.0 1 2 1 57 -1 0 -1 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0

5 rows × 25 columns

numerical = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3',
            'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
            'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
shap = []

# Shapiro-Wilk null hypothesis: the sample is drawn from a normal distribution
for i in numerical:
    if shapiro(cc[i])[1] < 0.05:
        shap.append('Reject Null Hypothesis')
    else:
        shap.append('Fail To Reject Null Hypothesis')
        
pd.DataFrame({'Hypothesis' : shap}, index = numerical)
/home/user/anaconda3/lib/python3.7/site-packages/scipy/stats/morestats.py:1681: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
Hypothesis
LIMIT_BAL Reject Null Hypothesis
AGE Reject Null Hypothesis
BILL_AMT1 Reject Null Hypothesis
BILL_AMT2 Reject Null Hypothesis
BILL_AMT3 Reject Null Hypothesis
BILL_AMT4 Reject Null Hypothesis
BILL_AMT5 Reject Null Hypothesis
BILL_AMT6 Reject Null Hypothesis
PAY_AMT1 Reject Null Hypothesis
PAY_AMT2 Reject Null Hypothesis
PAY_AMT3 Reject Null Hypothesis
PAY_AMT4 Reject Null Hypothesis
PAY_AMT5 Reject Null Hypothesis
PAY_AMT6 Reject Null Hypothesis

Significance Test

Every numerical feature rejects the null hypothesis of normality, so a parametric test like the t-test is off the table. Let's use the non-parametric Mann-Whitney U test instead, comparing defaulters and non-defaulters on each feature…

mann = []

# compare each feature's distribution between non-defaulters (0) and defaulters (1)
for i in numerical:
    if mannwhitneyu(cc[cc['target'] == 0][i], cc[cc['target'] == 1][i])[1] < 0.05:
        mann.append('Reject Null Hypothesis')
    else:
        mann.append('Fail To Reject Null Hypothesis')
    
pd.DataFrame(mann, columns = ['Hypothesis'], index = numerical)
Hypothesis
LIMIT_BAL Reject Null Hypothesis
AGE Fail To Reject Null Hypothesis
BILL_AMT1 Reject Null Hypothesis
BILL_AMT2 Reject Null Hypothesis
BILL_AMT3 Reject Null Hypothesis
BILL_AMT4 Fail To Reject Null Hypothesis
BILL_AMT5 Fail To Reject Null Hypothesis
BILL_AMT6 Fail To Reject Null Hypothesis
PAY_AMT1 Reject Null Hypothesis
PAY_AMT2 Reject Null Hypothesis
PAY_AMT3 Reject Null Hypothesis
PAY_AMT4 Reject Null Hypothesis
PAY_AMT5 Reject Null Hypothesis
PAY_AMT6 Reject Null Hypothesis

Chi2 Test

# everything that isn't numerical, the ID, or the target is treated as categorical
categorical = cc.drop(columns = numerical + ['ID', 'target']).columns
chi2 = []
chi2 = []

for i in categorical:
    if chi2_contingency(pd.crosstab(cc['target'], cc[i]))[1] < 0.05:
        chi2.append('Reject Null Hypothesis')
    else:
        chi2.append('Fail To Reject Null Hypothesis')
        
pd.DataFrame({'Hypothesis' : chi2}, index = categorical)
Hypothesis
SEX Reject Null Hypothesis
EDUCATION Reject Null Hypothesis
MARRIAGE Reject Null Hypothesis
PAY_0 Reject Null Hypothesis
PAY_2 Reject Null Hypothesis
PAY_3 Reject Null Hypothesis
PAY_4 Reject Null Hypothesis
PAY_5 Reject Null Hypothesis
PAY_6 Reject Null Hypothesis

From the significance tests, we know that AGE and the bill statement amounts for June, May, and April (BILL_AMT4-BILL_AMT6) show no significant difference between defaulters and non-defaulters. My recommendation: focus on the bill statement amounts for September, August, and July.

Wait, AGE is not significant? That's interesting. But why? I thought younger people would default more often. Let's visualize it.

cc[cc['target'] == 0]['AGE'].plot(kind = 'kde', color = 'blue', label = 'Default Payment = 0')
cc[cc['target'] == 1]['AGE'].plot(kind = 'kde', color = 'red', label = 'Default Payment = 1')
plt.legend()
plt.title('Age distribution')

png

sns.boxplot(y = cc['AGE'], x = cc['target'])

png

Oh, I see! There's almost no difference at all in the age distributions between defaulters and non-defaulters. That's surprising. Thanks, significance test!

fig, ax = plt.subplots(1, 3, figsize = (20, 7))

for i,j in zip(ax.flatten(), ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3']):
    sns.distplot(cc[j], ax = i)

png

The bill statement amounts for September, August, and July follow the same pattern. That makes sense: most people don't want too much debt, and the vast majority have bills below about NT$50,000 (a quick check follows). How about the credit limit? Let me guess: I think people with a lower credit limit are more likely to default. Prove me wrong…
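A one-liner to sanity-check that claim (my addition):

# fraction of September bill statements below NT$50,000
print((cc['BILL_AMT1'] < 50000).mean())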

cc[cc['target'] == 0]['LIMIT_BAL'].plot(kind = 'kde', color = 'blue', label = 'Default Payment = 0')
cc[cc['target'] == 1]['LIMIT_BAL'].plot(kind = 'kde', color = 'red', label = 'Default Payment = 1')
plt.legend()
plt.title('LIMIT_BAL distribution')

png

:( It's not surprising at all. People with a lower credit limit are more likely to default than people with a higher one. Why does this make sense? Because if the bank doesn't quite trust you, or sees you as a high-risk applicant, it will usually only give you a smaller line of credit. Is education significant (quantitatively)?
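To put numbers on that picture (my addition), a quick summary of the credit limit by class:

# credit-limit summary statistics for non-defaulters (0) vs defaulters (1)
cc.groupby('target')['LIMIT_BAL'].describe()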

plt.figure(figsize = (10, 5))
sns.countplot(cc['EDUCATION'], hue = cc['target'])
plt.xticks((0,1,2,3,4,5,6),('<12 grade','graduate school','university',
                            'high school','other','trade school','Not disclosed'))
plt.tight_layout()

png

In these observations, our debtors are mostly educated people. Let's do feature selection using the Predictive Power Score…

plt.figure(figsize = (20, 5))
df_predictors = pps.predictors(cc, y="target")
sns.barplot(data=df_predictors, x="x", y="ppscore")
plt.tight_layout()

png

That's interesting: the payment history features (PAY_0-PAY_6) should help the model predict whether a debtor will default or not.
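To read exact scores off the chart (my addition), pps.predictors returns a DataFrame we can sort:

# strongest predictors of the target by Predictive Power Score
df_predictors[['x', 'ppscore']].sort_values('ppscore', ascending = False).head(8)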

fig, ax = plt.subplots(2,3, figsize = (20, 8))

for i,j in zip([0,2,3,4,5,6],ax.flatten()):
    sns.countplot(cc['PAY_{}'.format(i)], hue = cc['target'], ax = j)

png

They all look the same. These plots tell us that not everyone pays their debt on time, and debtors who are already a few months late look far more likely to end up in default than debtors who pay their debt off.
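To quantify how quickly risk rises once a debtor falls behind (my addition), here is the default rate by the most recent repayment status:

# mean of the binary target = default rate within each PAY_0 level
cc.groupby('PAY_0')['target'].mean()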

Splitting Dataset

X = cc.drop(columns = 'target')
y = cc['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101, stratify = y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((24000, 24), (6000, 24), (24000,), (6000,))

Preprocessing

categorical = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

# Note: ColumnTransformer drops all remaining columns by default (remainder = 'drop'),
# so the models below only see the one-hot-encoded repayment-status features.
preprocessor = ColumnTransformer([
    ('cat', categorical, ['PAY_{}'.format(i) for i in [0,2,3,4,5,6]])
])
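A quick sanity check (my addition) of what actually reaches the models:

# each PAY_* column expands into roughly 10-11 one-hot levels
print(preprocessor.fit_transform(X_train).shape)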

Modeling

def evaluationMetrics(y_test, y_pred):
    # append one row of scores per model, in the same order as metric's index below
    f1.append(f1_score(y_test, y_pred))
    prec.append(precision_score(y_test, y_pred))
    recall.append(recall_score(y_test, y_pred))
    auc.append(roc_auc_score(y_test, y_pred))
    acc.append(accuracy_score(y_test, y_pred))
    matthews.append(matthews_corrcoef(y_test, y_pred))
metric = pd.DataFrame(index = ['AdaBoost Classifier', 'KNN Classifier', 'XGBoost Classifier']) 
f1 = []
prec = []
recall = []
matthews = []
auc = []
acc = []

Adaboost

ada_params = {
    'algo__learning_rate': Real(low=-2, high=0, prior='log-uniform'),
    'algo__n_estimators': Integer(low=100, high=200)
}
pipeline = Pipeline([
    ('prep', preprocessor),
    ('algo', AdaBoostClassifier())
])

ada = RandomizedSearchCV(pipeline, ada_params, cv = 3, n_jobs = -1, random_state = 101)
ada.fit(X_train, y_train)

print(ada.best_params_)
print(ada.score(X_train, y_train), ada.best_score_, ada.score(X_test, y_test))
{'algo__learning_rate': 0.36122336008063605, 'algo__n_estimators': 105}
0.8197083333333334 0.8192083333333334 0.8256666666666667
y_pred_ada = ada.best_estimator_.predict(X_test)
evaluationMetrics(y_test, y_pred_ada)

KNN

knn_params = {
    'algo__n_neighbors': Integer(low=1, high=20),
}
pipeline = Pipeline([
    ('prep', preprocessor),
    ('algo', KNeighborsClassifier())
])

knn = RandomizedSearchCV(pipeline, knn_params, cv = 3, n_jobs = -1, random_state = 101)
knn.fit(X_train, y_train)

print(knn.best_params_)
print(knn.score(X_train, y_train), knn.best_score_, knn.score(X_test, y_test))
{'algo__n_neighbors': 16}
0.8184166666666667 0.8162916666666667 0.8191666666666667
y_pred_knn = knn.best_estimator_.predict(X_test)
evaluationMetrics(y_test, y_pred_knn)

XGBoost Classifier

pipeline = Pipeline([
    ('prep', preprocessor),
    ('algo', XGBClassifier())
])

xgb = RandomizedSearchCV(pipeline, rsp.xgb_params, cv = 3, n_jobs = -1, random_state = 101)
xgb.fit(X_train, y_train)

print(xgb.best_params_)
print(xgb.score(X_train, y_train), xgb.best_score_, xgb.score(X_test, y_test))
{'algo__colsample_bytree': 0.14363608368376052, 'algo__gamma': 8, 'algo__learning_rate': 0.4501489968769524, 'algo__max_depth': 9, 'algo__n_estimators': 162, 'algo__reg_alpha': 0.2573510611415207, 'algo__reg_lambda': 0.23179283837105844, 'algo__subsample': 0.5208429341570253}
0.821125 0.8199166666666667 0.8258333333333333
y_pred_xgb = xgb.best_estimator_.predict(X_test)
evaluationMetrics(y_test, y_pred_xgb)

Evaluation

Metrics

metric['F1'] = f1
metric['Precision'] = prec
metric['Recall'] = recall
metric['MCC'] = matthews
metric['AUC'] = auc
metric['accuracy'] = acc
metric
F1 Precision Recall MCC AUC accuracy
AdaBoost Classifier 0.485236 0.699291 0.371515 0.420359 0.663074 0.825667
KNN Classifier 0.440433 0.697712 0.321778 0.386976 0.641095 0.819167
XGBoost Classifier 0.490989 0.694215 0.379804 0.422884 0.666149 0.825833
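Recall hovers around 0.32-0.38 for all three models: with only 22% positives, the classifiers lean toward the majority class. One tweak worth trying (a hedged sketch of my own, not part of the experiment above) is XGBoost's scale_pos_weight, which upweights the positive class:

# reweight defaults by the negative/positive class ratio (~3.5 here)
ratio = (y_train == 0).sum() / (y_train == 1).sum()
pipeline_w = Pipeline([
    ('prep', preprocessor),
    ('algo', XGBClassifier(scale_pos_weight = ratio))
])
pipeline_w.fit(X_train, y_train)
print(recall_score(y_test, pipeline_w.predict(X_test)))  # expect higher recall, lower precision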

Confusion Matrix

Adaboost

sns.heatmap(confusion_matrix(y_test, y_pred_ada), annot = True, fmt = 'd')

png

KNN

sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot = True, fmt = 'd')

png

XGBoost

sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot = True, fmt = 'd')

png

Learning Curves

Adaboost

sizes = np.linspace(0.3, 1, 10)

visualizer = LearningCurve(
    ada.best_estimator_, scoring = 'accuracy', train_sizes = sizes, random_state = 101, cv = 3
)

visualizer.fit(X_train, y_train)
visualizer.show()

png


KNN

sizes = np.linspace(0.3, 1, 10)

visualizer = LearningCurve(
    knn.best_estimator_, scoring = 'accuracy', train_sizes = sizes, random_state = 101, cv = 3
)

visualizer.fit(X_train, y_train)
visualizer.show()

png


XGBoost

sizes = np.linspace(0.3, 1, 10)

visualizer = LearningCurve(
    xgb.best_estimator_, scoring = 'accuracy', train_sizes = sizes, random_state = 101, cv = 3
)

visualizer.fit(X_train, y_train)
visualizer.show()

png


Conclusion

Our best model is XGBoost, with 82-83% accuracy, and its learning curve indicates a good fit: the training and cross-validation scores sit close together and converge as more data is added.
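For reuse, the winning pipeline can be persisted (a minimal sketch, assuming joblib is installed; the filename is mine):

import joblib

# saves the fitted preprocessing + XGBoost pipeline as one artifact
joblib.dump(xgb.best_estimator_, 'xgb_credit_default.joblib')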

Thank you