Predicting Credit Card Default - Classification

11 minute read

Context

The dataset used for this notebook is from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients); it is also available on Kaggle (https://www.kaggle.com/uciml/default-of-credit-card-clients-dataset/discussion/34608).

Attribute Information:

This research employed a binary variable, default payment (Yes = 1, No = 0), as the response variable. This study reviewed the literature and used the following 23 variables as explanatory variables:

  • X1: Amount of the given credit (NT dollar): it includes both the individual consumer credit and his/her family (supplementary) credit.
  • X2: Gender (1 = male; 2 = female).
  • X3: Education (1 = graduate school; 2 = university; 3 = high school; 4 = others).
  • X4: Marital status (1 = married; 2 = single; 3 = others).
  • X5: Age (year).
  • X6 - X11: History of past payment. We tracked the past monthly payment records (from April to September, 2005) as follows: X6 = the repayment status in September, 2005; X7 = the repayment status in August, 2005; . . .;X11 = the repayment status in April, 2005. The measurement scale for the repayment status is: -1 = pay duly; 1 = payment delay for one month; 2 = payment delay for two months; . . .; 8 = payment delay for eight months; 9 = payment delay for nine months and above.
  • X12-X17: Amount of bill statement (NT dollar). X12 = amount of bill statement in September, 2005; X13 = amount of bill statement in August, 2005; . . .; X17 = amount of bill statement in April, 2005.
  • X18-X23: Amount of previous payment (NT dollar). X18 = amount paid in September, 2005; X19 = amount paid in August, 2005; . . .;X23 = amount paid in April, 2005.
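Note that the Kaggle CSV uses its own column names rather than the X1-X23 labels above. My reading of the correspondence (worth double-checking against the data dictionary) is sketched below:

# hypothetical mapping from the paper's X1-X23 scheme to the CSV columns
x_to_column = {
    'X1': 'LIMIT_BAL', 'X2': 'SEX', 'X3': 'EDUCATION', 'X4': 'MARRIAGE', 'X5': 'AGE',
    # X6-X11: repayment status from September back to April (the CSV has no PAY_1)
    'X6': 'PAY_0', 'X7': 'PAY_2', 'X8': 'PAY_3', 'X9': 'PAY_4', 'X10': 'PAY_5', 'X11': 'PAY_6',
    # X12-X17: bill statements; X18-X23: previous payments (same month order)
    **{'X{}'.format(11 + i): 'BILL_AMT{}'.format(i) for i in range(1, 7)},
    **{'X{}'.format(17 + i): 'PAY_AMT{}'.format(i) for i in range(1, 7)},
}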

Importing all packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from scipy.stats import shapiro, mannwhitneyu, chi2_contingency
import ppscore as pps

from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score, matthews_corrcoef, precision_score, recall_score, accuracy_score

from jcopml.tuning import random_search_params as rsp
from jcopml.tuning.space import Real, Integer

from yellowbrick.model_selection import LearningCurve

import warnings
warnings.filterwarnings("ignore") 
sns.set_style('whitegrid')
cc = pd.read_csv('UCI_Credit_Card.csv')
cc.shape
(30000, 25)

This dataset contains 30,000 observations and 25 columns.

cc.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default.payment.next.month
0 1 20000.0 2 2 1 24 2 2 -1 -1 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2 120000.0 2 2 2 26 -1 2 0 0 ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3 90000.0 2 2 2 34 0 0 0 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4 50000.0 2 2 1 37 0 0 0 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5 50000.0 1 2 1 57 -1 0 -1 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0

5 rows × 25 columns

pd.DataFrame({'dataFeatures' : cc.columns, 'dataType' : cc.dtypes.values, 
              'null' : [cc[i].isna().sum() for i in cc.columns],
              'nullPct' : [((cc[i].isna().sum()/len(cc[i]))*100).round(2) for i in cc.columns],
             'Nunique' : [cc[i].nunique() for i in cc.columns],
             'uniqueSample' : [list(pd.Series(cc[i].unique()).sample(2)) for i in cc.columns]}).reset_index(drop = True)
dataFeatures dataType null nullPct Nunique uniqueSample
0 ID int64 0 0.0 30000 [24194, 23057]
1 LIMIT_BAL float64 0 0.0 81 [40000.0, 110000.0]
2 SEX int64 0 0.0 2 [2, 1]
3 EDUCATION int64 0 0.0 7 [5, 4]
4 MARRIAGE int64 0 0.0 4 [3, 2]
5 AGE int64 0 0.0 56 [61, 30]
6 PAY_0 int64 0 0.0 11 [7, 1]
7 PAY_2 int64 0 0.0 11 [-1, 5]
8 PAY_3 int64 0 0.0 11 [8, 2]
9 PAY_4 int64 0 0.0 11 [6, 0]
10 PAY_5 int64 0 0.0 10 [-2, 0]
11 PAY_6 int64 0 0.0 10 [4, 6]
12 BILL_AMT1 float64 0 0.0 22723 [105105.0, 7065.0]
13 BILL_AMT2 float64 0 0.0 22346 [7277.0, 94153.0]
14 BILL_AMT3 float64 0 0.0 22026 [7251.0, 76767.0]
15 BILL_AMT4 float64 0 0.0 21548 [5733.0, 11950.0]
16 BILL_AMT5 float64 0 0.0 21010 [45617.0, 41630.0]
17 BILL_AMT6 float64 0 0.0 20604 [84237.0, 67700.0]
18 PAY_AMT1 float64 0 0.0 7943 [14180.0, 66.0]
19 PAY_AMT2 float64 0 0.0 7899 [4070.0, 1450.0]
20 PAY_AMT3 float64 0 0.0 7518 [15413.0, 179.0]
21 PAY_AMT4 float64 0 0.0 6937 [6666.0, 8337.0]
22 PAY_AMT5 float64 0 0.0 6897 [6017.0, 15362.0]
23 PAY_AMT6 float64 0 0.0 6939 [1691.0, 2056.0]
24 default.payment.next.month int64 0 0.0 2 [1, 0]

No missing values! Before I explore the dataset, I’ll rename the target column first.

cc.rename(columns = {'default.payment.next.month' : 'target'}, inplace = True)
cc.dtypes
ID             int64
LIMIT_BAL    float64
SEX            int64
EDUCATION      int64
MARRIAGE       int64
AGE            int64
PAY_0          int64
PAY_2          int64
PAY_3          int64
PAY_4          int64
PAY_5          int64
PAY_6          int64
BILL_AMT1    float64
BILL_AMT2    float64
BILL_AMT3    float64
BILL_AMT4    float64
BILL_AMT5    float64
BILL_AMT6    float64
PAY_AMT1     float64
PAY_AMT2     float64
PAY_AMT3     float64
PAY_AMT4     float64
PAY_AMT5     float64
PAY_AMT6     float64
target         int64
dtype: object

Much better…

Exploratory Data Analysis

cc['target'].value_counts(normalize = True)
0    0.7788
1    0.2212
Name: target, dtype: float64

The default rate is 22.12%, so the classes are fairly imbalanced.
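Because of that imbalance, accuracy alone will flatter any model. As a minimal baseline sketch (my addition, using scikit-learn's DummyClassifier), always predicting the majority class already scores about 77.9% accuracy:

from sklearn.dummy import DummyClassifier

# majority-class baseline: never predicts default
baseline = DummyClassifier(strategy = 'most_frequent')
baseline.fit(cc.drop(columns = 'target'), cc['target'])
print(baseline.score(cc.drop(columns = 'target'), cc['target']))  # ~0.7788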

Hypothesis Testing

Before we go further, I'll run significance tests. But choosing the right significance test requires a normality test first, to determine whether each feature is normally distributed. I'll use the Shapiro-Wilk test for that.

Normality Test

It's a little bit tricky: first we need to determine which features are numerical and which are categorical.

cc.head()
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_0 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 target
0 1 20000.0 2 2 1 24 2 2 -1 -1 ... 0.0 0.0 0.0 0.0 689.0 0.0 0.0 0.0 0.0 1
1 2 120000.0 2 2 2 26 -1 2 0 0 ... 3272.0 3455.0 3261.0 0.0 1000.0 1000.0 1000.0 0.0 2000.0 1
2 3 90000.0 2 2 2 34 0 0 0 0 ... 14331.0 14948.0 15549.0 1518.0 1500.0 1000.0 1000.0 1000.0 5000.0 0
3 4 50000.0 2 2 1 37 0 0 0 0 ... 28314.0 28959.0 29547.0 2000.0 2019.0 1200.0 1100.0 1069.0 1000.0 0
4 5 50000.0 1 2 1 57 -1 0 -1 0 ... 20940.0 19146.0 19131.0 2000.0 36681.0 10000.0 9000.0 689.0 679.0 0

5 rows × 25 columns

numerical = ['LIMIT_BAL', 'AGE', 'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3',
            'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1', 'PAY_AMT2',
            'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6']
shap = []

# Shapiro-Wilk null hypothesis: the sample is drawn from a normal distribution
for i in numerical:
    if shapiro(cc[i])[1] < 0.05:
        shap.append('Reject Null Hypothesis')
    else:
        shap.append('Fail To Reject Null Hypothesis')
        
pd.DataFrame({'Hypothesis' : shap}, index = numerical)
/home/user/anaconda3/lib/python3.7/site-packages/scipy/stats/morestats.py:1681: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")
Hypothesis
LIMIT_BAL Reject Null Hypothesis
AGE Reject Null Hypothesis
BILL_AMT1 Reject Null Hypothesis
BILL_AMT2 Reject Null Hypothesis
BILL_AMT3 Reject Null Hypothesis
BILL_AMT4 Reject Null Hypothesis
BILL_AMT5 Reject Null Hypothesis
BILL_AMT6 Reject Null Hypothesis
PAY_AMT1 Reject Null Hypothesis
PAY_AMT2 Reject Null Hypothesis
PAY_AMT3 Reject Null Hypothesis
PAY_AMT4 Reject Null Hypothesis
PAY_AMT5 Reject Null Hypothesis
PAY_AMT6 Reject Null Hypothesis

Significance Test

Every numerical feature rejects the null hypothesis of normality, so a parametric test like the t-test is off the table. Let's use the non-parametric Mann-Whitney U test instead, comparing defaulters and non-defaulters on each feature…

mann = []

# compare each feature's distribution between non-defaulters (0) and defaulters (1)
for i in numerical:
    if mannwhitneyu(cc[cc['target'] == 0][i], cc[cc['target'] == 1][i])[1] < 0.05:
        mann.append('Reject Null Hypothesis')
    else:
        mann.append('Fail To Reject Null Hypothesis')
    
pd.DataFrame(mann, columns = ['Hypothesis'], index = numerical)
Hypothesis
LIMIT_BAL Reject Null Hypothesis
AGE Fail To Reject Null Hypothesis
BILL_AMT1 Reject Null Hypothesis
BILL_AMT2 Reject Null Hypothesis
BILL_AMT3 Reject Null Hypothesis
BILL_AMT4 Fail To Reject Null Hypothesis
BILL_AMT5 Fail To Reject Null Hypothesis
BILL_AMT6 Fail To Reject Null Hypothesis
PAY_AMT1 Reject Null Hypothesis
PAY_AMT2 Reject Null Hypothesis
PAY_AMT3 Reject Null Hypothesis
PAY_AMT4 Reject Null Hypothesis
PAY_AMT5 Reject Null Hypothesis
PAY_AMT6 Reject Null Hypothesis

Chi2 Test

# everything that isn't numerical, the ID, or the target is treated as categorical
categorical = cc.drop(columns = numerical + ['ID', 'target']).columns
chi2 = []
chi2 = []

for i in categorical:
    if chi2_contingency(pd.crosstab(cc['target'], cc[i]))[1] < 0.05:
        chi2.append('Reject Null Hypothesis')
    else:
        chi2.append('Fail To Reject Null Hypothesis')
        
pd.DataFrame({'Hypothesis' : chi2}, index = categorical)
Hypothesis
SEX Reject Null Hypothesis
EDUCATION Reject Null Hypothesis
MARRIAGE Reject Null Hypothesis
PAY_0 Reject Null Hypothesis
PAY_2 Reject Null Hypothesis
PAY_3 Reject Null Hypothesis
PAY_4 Reject Null Hypothesis
PAY_5 Reject Null Hypothesis
PAY_6 Reject Null Hypothesis

From the significance tests, we know that AGE and the bill statement amounts for June, May, and April (BILL_AMT4-BILL_AMT6) show no significant difference between defaulters and non-defaulters. My recommendation: focus on the bill statement amounts for September, August, and July.

Wait, AGE is not significant? That's interesting. But why? I thought younger people would default more often. Let's visualize it.

cc[cc['target'] == 0]['AGE'].plot(kind = 'kde', color = 'blue', label = 'Default Payment = 0')
cc[cc['target'] == 1]['AGE'].plot(kind = 'kde', color = 'red', label = 'Default Payment = 1')
plt.legend()
plt.title('Age distribution')

png

sns.boxplot(y = cc['AGE'], x = cc['target'])

png

Oh, I see! There's almost no difference at all in the age distributions between defaulters and non-defaulters. That's surprising. Thanks, significance test!

fig, ax = plt.subplots(1, 3, figsize = (20, 7))

for i,j in zip(ax.flatten(), ['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3']):
    sns.distplot(cc[j], ax = i)

png

The bill statement amounts for September, August, and July follow the same pattern. That makes sense: most people don't want too much debt, and the vast majority have bills below about NT$50,000 (a quick check follows). How about the credit limit? Let me guess: I think people with a lower credit limit are more likely to default. Prove me wrong…
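A one-liner to sanity-check that claim (my addition):

# fraction of September bill statements below NT$50,000
print((cc['BILL_AMT1'] < 50000).mean())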

cc[cc['target'] == 0]['LIMIT_BAL'].plot(kind = 'kde', color = 'blue', label = 'Default Payment = 0')
cc[cc['target'] == 1]['LIMIT_BAL'].plot(kind = 'kde', color = 'red', label = 'Default Payment = 1')
plt.legend()
plt.title('LIMIT_BAL distribution')

png

:( It's not surprising at all. People with a lower credit limit are more likely to default than people with a higher one. Why does this make sense? Because if the bank doesn't quite trust you, or sees you as a high-risk applicant, it will usually only give you a smaller line of credit. Is education significant (quantitatively)?
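To put numbers on that picture (my addition), a quick summary of the credit limit by class:

# credit-limit summary statistics for non-defaulters (0) vs defaulters (1)
cc.groupby('target')['LIMIT_BAL'].describe()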

plt.figure(figsize = (10, 5))
sns.countplot(cc['EDUCATION'], hue = cc['target'])
plt.xticks((0,1,2,3,4,5,6),('<12 grade','graduate school','university',
                            'high school','other','trade school','Not disclosed'))
plt.tight_layout()

png

In these observations, our debtors are mostly educated people. Let's do feature selection using the Predictive Power Score…

plt.figure(figsize = (20, 5))
df_predictors = pps.predictors(cc, y="target")
sns.barplot(data=df_predictors, x="x", y="ppscore")
plt.tight_layout()

png

That's interesting: the payment history features (PAY_0-PAY_6) should help the model predict whether a debtor will default or not.
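To read exact scores off the chart (my addition), pps.predictors returns a DataFrame we can sort:

# strongest predictors of the target by Predictive Power Score
df_predictors[['x', 'ppscore']].sort_values('ppscore', ascending = False).head(8)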

fig, ax = plt.subplots(2,3, figsize = (20, 8))

for i,j in zip([0,2,3,4,5,6],ax.flatten()):
    sns.countplot(cc['PAY_{}'.format(i)], hue = cc['target'], ax = j)

png

They all look the same. These plots tell us that not everyone pays their debt on time, and debtors who are already a few months late look far more likely to end up in default than debtors who pay their debt off.
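To quantify how quickly risk rises once a debtor falls behind (my addition), here is the default rate by the most recent repayment status:

# mean of the binary target = default rate within each PAY_0 level
cc.groupby('PAY_0')['target'].mean()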

Splitting Dataset

X = cc.drop(columns = 'target')
y = cc['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 101, stratify = y)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((24000, 24), (6000, 24), (24000,), (6000,))

Preprocessing

categorical = Pipeline([
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

# Note: ColumnTransformer drops all remaining columns by default (remainder = 'drop'),
# so the models below only see the one-hot-encoded repayment-status features.
preprocessor = ColumnTransformer([
    ('cat', categorical, ['PAY_{}'.format(i) for i in [0,2,3,4,5,6]])
])
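A quick sanity check (my addition) of what actually reaches the models:

# each PAY_* column expands into roughly 10-11 one-hot levels
print(preprocessor.fit_transform(X_train).shape)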

Modeling

def evaluationMetrics(y_test, y_pred):
    # append one row of scores per model, in the same order as metric's index below
    f1.append(f1_score(y_test, y_pred))
    prec.append(precision_score(y_test, y_pred))
    recall.append(recall_score(y_test, y_pred))
    auc.append(roc_auc_score(y_test, y_pred))
    acc.append(accuracy_score(y_test, y_pred))
    matthews.append(matthews_corrcoef(y_test, y_pred))
metric = pd.DataFrame(index = ['AdaBoost Classifier', 'KNN Classifier', 'XGBoost Classifier']) 
f1 = []
prec = []
recall = []
matthews = []
auc = []
acc = []

Adaboost

ada_params = {
    'algo__learning_rate': Real(low=-2, high=0, prior='log-uniform'),
    'algo__n_estimators': Integer(low=100, high=200)
}
pipeline = Pipeline([
    ('prep', preprocessor),
    ('algo', AdaBoostClassifier())
])

ada = RandomizedSearchCV(pipeline, ada_params, cv = 3, n_jobs = -1, random_state = 101)
ada.fit(X_train, y_train)

print(ada.best_params_)
print(ada.score(X_train, y_train), ada.best_score_, ada.score(X_test, y_test))
{'algo__learning_rate': 0.36122336008063605, 'algo__n_estimators': 105}
0.8197083333333334 0.8192083333333334 0.8256666666666667
y_pred_ada = ada.best_estimator_.predict(X_test)
evaluationMetrics(y_test, y_pred_ada)

KNN

knn_params = {
    'algo__n_neighbors': Integer(low=1, high=20),
}
pipeline = Pipeline([
    ('prep', preprocessor),
    ('algo', KNeighborsClassifier())
])

knn = RandomizedSearchCV(pipeline, knn_params, cv = 3, n_jobs = -1, random_state = 101)
knn.fit(X_train, y_train)

print(knn.best_params_)
print(knn.score(X_train, y_train), knn.best_score_, knn.score(X_test, y_test))
{'algo__n_neighbors': 16}
0.8184166666666667 0.8162916666666667 0.8191666666666667
y_pred_knn = knn.best_estimator_.predict(X_test)
evaluationMetrics(y_test, y_pred_knn)

XGBoost Classifier

pipeline = Pipeline([
    ('prep', preprocessor),
    ('algo', XGBClassifier())
])

xgb = RandomizedSearchCV(pipeline, rsp.xgb_params, cv = 3, n_jobs = -1, random_state = 101)
xgb.fit(X_train, y_train)

print(xgb.best_params_)
print(xgb.score(X_train, y_train), xgb.best_score_, xgb.score(X_test, y_test))
{'algo__colsample_bytree': 0.14363608368376052, 'algo__gamma': 8, 'algo__learning_rate': 0.4501489968769524, 'algo__max_depth': 9, 'algo__n_estimators': 162, 'algo__reg_alpha': 0.2573510611415207, 'algo__reg_lambda': 0.23179283837105844, 'algo__subsample': 0.5208429341570253}
0.821125 0.8199166666666667 0.8258333333333333
y_pred_xgb = xgb.best_estimator_.predict(X_test)
evaluationMetrics(y_test, y_pred_xgb)

Evaluation

Metrics

metric['F1'] = f1
metric['Precision'] = prec
metric['Recall'] = recall
metric['MCC'] = matthews
metric['AUC'] = auc
metric['accuracy'] = acc
metric
F1 Precision Recall MCC AUC accuracy
AdaBoost Classifier 0.485236 0.699291 0.371515 0.420359 0.663074 0.825667
KNN Classifier 0.440433 0.697712 0.321778 0.386976 0.641095 0.819167
XGBoost Classifier 0.490989 0.694215 0.379804 0.422884 0.666149 0.825833
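Recall hovers around 0.32-0.38 for all three models: with only 22% positives, the classifiers lean toward the majority class. One tweak worth trying (a hedged sketch of my own, not part of the experiment above) is XGBoost's scale_pos_weight, which upweights the positive class:

# reweight defaults by the negative/positive class ratio (~3.5 here)
ratio = (y_train == 0).sum() / (y_train == 1).sum()
pipeline_w = Pipeline([
    ('prep', preprocessor),
    ('algo', XGBClassifier(scale_pos_weight = ratio))
])
pipeline_w.fit(X_train, y_train)
print(recall_score(y_test, pipeline_w.predict(X_test)))  # expect higher recall, lower precision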

Confusion Matrix

Adaboost

sns.heatmap(confusion_matrix(y_test, y_pred_ada), annot = True, fmt = 'd')

png

KNN

sns.heatmap(confusion_matrix(y_test, y_pred_knn), annot = True, fmt = 'd')

png

XGBoost

sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot = True, fmt = 'd')

png

Learning Curves

Adaboost

sizes = np.linspace(0.3, 1, 10)

visualizer = LearningCurve(
    ada.best_estimator_, scoring = 'accuracy', train_sizes = sizes, random_state = 101, cv = 3
)

visualizer.fit(X_train, y_train)
visualizer.show()

png


KNN

sizes = np.linspace(0.3, 1, 10)

visualizer = LearningCurve(
    knn.best_estimator_, scoring = 'accuracy', train_sizes = sizes, random_state = 101, cv = 3
)

visualizer.fit(X_train, y_train)
visualizer.show()

png


XGBoost

sizes = np.linspace(0.3, 1, 10)

visualizer = LearningCurve(
    xgb.best_estimator_, scoring = 'accuracy', train_sizes = sizes, random_state = 101, cv = 3
)

visualizer.fit(X_train, y_train)
visualizer.show()

png


Conclusion

Our best model is XGBoost, with 82-83% accuracy, and its learning curve indicates a good fit: the training and cross-validation scores sit close together and converge as more data is added.
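For reuse, the winning pipeline can be persisted (a minimal sketch, assuming joblib is installed; the filename is mine):

import joblib

# saves the fitted preprocessing + XGBoost pipeline as one artifact
joblib.dump(xgb.best_estimator_, 'xgb_credit_default.joblib')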

Thank you