Bank customers are potential candidates who ask the bank for credit loans, with the stipulation that they repay the amount in monthly payments plus interest. In a perfect world, credit loans would be handed out freely and people would pay them back without issues. Unfortunately, we are not living in a utopian world, so there will be customers who default on their credit loans and are unable to repay, causing huge losses to the bank. Credit risk analysis is therefore one of the crucial areas that banks focus on, analyzing detailed information pertaining to customers and their credit history.
Now, we need to analyze the dataset pertaining to these customers, build a predictive model using machine learning algorithms, and predict whether a customer is likely to default on paying the credit loan and should be labeled a potential credit risk.
Banks get value only by using domain knowledge to translate prediction outcomes and raw numbers from machine learning algorithms into data-driven decisions which, when executed at the right time, help grow the business.
Logistic regression is a special case of regression models used for classification, where the algorithm estimates the odds that an observation belongs to one of the class labels as a function of the other features. Here we will predict the credit rating for customers of a bank, where the rating can either be good, denoted by 1, or bad, denoted by 0.
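Under the hood, the model turns a linear combination of the features into a probability through the sigmoid (logistic) function. A minimal sketch with made-up coefficient values (these numbers are for illustration only, not taken from the credit model):

```r
# Logistic regression maps a linear predictor z to a probability
# via the sigmoid: p = 1 / (1 + exp(-z)).
z <- -1.5 + 0.8 * 2      # hypothetical intercept + coefficient * feature value
p <- 1 / (1 + exp(-z))   # probability of the positive class (rating = 1)
p                        # equivalent to plogis(z) in base R
```

When p crosses a chosen cutoff (0.5 by default), the observation is classified as the positive class.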
We will use the dataset german_credit.csv, and we rename the columns so that they are meaningful. Load the dataset:
credit.df <- read.csv("german_credit.csv", header = TRUE, stringsAsFactors = FALSE)
We must transform the data: mainly, we factor the categorical variables, converting their data type from numeric to factor. There are also several numeric variables, including credit.amount, age, and credit.duration.months, which take a wide range of values and have skewed distributions. This has multiple adverse effects, such as induced collinearity and models taking longer to converge, so we will use z-score normalization on them.
## data type transformations - factoring
to.factors <- function(df, variables) {
  for (variable in variables) {
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}
## normalizing - scaling
scale.features <- function(df, variables) {
  for (variable in variables) {
    df[[variable]] <- scale(df[[variable]], center = T, scale = T)
  }
  return(df)
}
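As a quick sanity check, scale() with centering and scaling enabled is exactly the z-score transformation (x - mean(x)) / sd(x); the small vector below is just an illustration:

```r
# Verify that scale() performs z-score normalization
x <- c(10, 20, 30, 40)
z <- as.numeric(scale(x, center = TRUE, scale = TRUE))
manual <- (x - mean(x)) / sd(x)   # z-score computed by hand
```

After this transformation the variable has mean 0 and standard deviation 1, putting all numeric features on a comparable scale.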
# normalize variables
numeric.vars <- c("credit.duration.months", "age", "credit.amount")
credit.df <- scale.features(credit.df, numeric.vars)
# factor variables
categorical.vars <- c('credit.rating', 'account.balance', 'previous.credit.payment.status',
                      'credit.purpose', 'savings', 'employment.duration', 'installment.rate',
                      'marital.status', 'guarantor', 'residence.duration', 'current.assets',
                      'other.credits', 'apartment.type', 'bank.credits', 'occupation',
                      'dependents', 'telephone', 'foreign.worker')
credit.df <- to.factors(df = credit.df, variables = categorical.vars)
We split our data into training and test datasets in a 60:40 ratio.
indexes <- sample(1:nrow(credit.df), size = 0.6 * nrow(credit.df))
train.data <- credit.df[indexes, ]
test.data <- credit.df[-indexes, ]
We load the required libraries:
library(caret)
library(ROCR)
library(e1071)
# separate feature and class variables
test.feature.vars <- test.data[, -1]
test.class.vars <- test.data[, 1]
Now, we will train the initial model with all the independent variables as follows:
formula.init <- "credit.rating ~ ."
formula.init <- as.formula(formula.init)
lr.model <- glm(formula = formula.init, data = train.data, family = "binomial")
# view model details
summary(lr.model)
Call:
glm(formula = formula.init, family = "binomial", data = train.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.9964  -0.5453   0.2840   0.6255   2.0303

Coefficients:
                                 Estimate Std. Error z value Pr(>|z|)
(Intercept)                     -3.144809   1.382285  -2.275 0.022901 *
account.balance2                -0.004871   0.309540  -0.016 0.987444
account.balance3                 0.834650   0.468293   1.782 0.074696 .
account.balance4                 1.914813   0.338392   5.659 1.53e-08 ***
credit.duration.months          -0.218489   0.157883  -1.384 0.166400
previous.credit.payment.status1  0.188968   0.760591   0.248 0.803787
previous.credit.payment.status2  1.030730   0.579406   1.779 0.075249 .
previous.credit.payment.status3  1.872516   0.643985   2.908 0.003641 **
previous.credit.payment.status4  2.733108   0.614143   4.450 8.58e-06 ***
credit.purpose1                  1.547360   0.498281   3.105 0.001900 **
credit.purpose2                  0.953451   0.383743   2.485 0.012969 *
credit.purpose3                  1.049455   0.352351   2.978 0.002897 **
credit.purpose4                  0.435514   0.873065   0.499 0.617897
credit.purpose5                  1.659976   0.791140   2.098 0.035887 *
credit.purpose6                 -0.215683   0.587505  -0.367 0.713532
credit.purpose8                 16.107223 589.297502   0.027 0.978194
credit.purpose9                  0.862159   0.471874   1.827 0.067685 .
credit.purpose10                 2.225815   1.163834   1.912 0.055814 .
credit.amount                   -0.470781   0.184143  -2.557 0.010570 *
savings2                         0.720460   0.418107   1.723 0.084862 .
savings3                         0.251267   0.631581   0.398 0.690750
savings4                         2.087306   0.729554   2.861 0.004222 **
savings5                         1.296324   0.366558   3.536 0.000405 ***
employment.duration2             1.212411   0.629362   1.926 0.054053 .
employment.duration3             1.450061   0.594548   2.439 0.014731 *
employment.duration4             2.062809   0.650394   3.172 0.001516 **
employment.duration5             1.064146   0.582952   1.825 0.067934 .
installment.rate2                0.405239   0.438865   0.923 0.355809
installment.rate3               -0.104170   0.475769  -0.219 0.826688
installment.rate4               -0.840081   0.433362  -1.939 0.052560 .
marital.status2                  0.259297   0.548115   0.473 0.636162
marital.status3                  0.633599   0.535577   1.183 0.236801
marital.status4                  0.375088   0.648738   0.578 0.563142
guarantor2                      -1.099162   0.628273  -1.749 0.080205 .
guarantor3                       1.106218   0.654839   1.689 0.091162 .
residence.duration2             -0.576764   0.415320  -1.389 0.164917
residence.duration3             -0.446258   0.471767  -0.946 0.344186
residence.duration4             -0.393400   0.433344  -0.908 0.363970
current.assets2                  0.302149   0.370269   0.816 0.414485
current.assets3                 -0.012336   0.328560  -0.038 0.970049
current.assets4                 -0.620318   0.554436  -1.119 0.263214
age                              0.303512   0.147159   2.062 0.039163 *
other.credits2                   0.411071   0.594295   0.692 0.489128
other.credits3                   0.344513   0.362462   0.950 0.341868
apartment.type2                  0.650850   0.337055   1.931 0.053484 .
apartment.type3                  1.063537   0.640660   1.660 0.096902 .
bank.credits2                   -0.744306   0.360400  -2.065 0.038902 *
bank.credits3                   -0.635698   0.912803  -0.696 0.486163
bank.credits4                   -1.779649   1.495298  -1.190 0.233982
occupation2                     -0.973953   0.981885  -0.992 0.321236
occupation3                     -0.919371   0.947409  -0.970 0.331844
occupation4                     -0.581313   0.960812  -0.605 0.545164
dependents2                     -0.467817   0.335598  -1.394 0.163324
telephone2                       0.426096   0.297590   1.432 0.152194
foreign.worker2                  2.676072   1.093251   2.448 0.014373 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 738.05  on 599  degrees of freedom
Residual deviance: 477.09  on 545  degrees of freedom
AIC: 587.09

Number of Fisher Scoring iterations: 14
lr.predictions <- predict(lr.model, test.data, type = "response")
lr.predictions <- round(lr.predictions)
# confusionMatrix expects factors, so we convert the rounded predictions
confusionMatrix(data = as.factor(lr.predictions), reference = test.class.vars, positive = '1')
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  56  49
         1  61 234

               Accuracy : 0.725
                 95% CI : (0.6784, 0.7682)
    No Information Rate : 0.7075
    P-Value [Acc > NIR] : 0.2387

                  Kappa : 0.315
 Mcnemar's Test P-Value : 0.2943

            Sensitivity : 0.8269
            Specificity : 0.4786
         Pos Pred Value : 0.7932
         Neg Pred Value : 0.5333
             Prevalence : 0.7075
         Detection Rate : 0.5850
   Detection Prevalence : 0.7375
      Balanced Accuracy : 0.6527

       'Positive' Class : 1
lr.prediction.values <- predict(lr.model, test.feature.vars, type = "response")
predictions <- prediction(lr.prediction.values, test.class.vars)
perf <- performance(predictions, "tpr", "fpr")
plot(perf, col = "black")
auc <- performance(predictions, "auc")
auc@y.values[[1]]
The AUC is 0.73, which is pretty good for a start. Now we have a model, but whether it is good enough does not depend solely on accuracy; it also depends on the domain and business requirements of the problem. If we predict a customer with a bad credit rating (0) as good (1), it means we will approve a credit loan for a customer who will end up not paying it, causing losses to the bank. However, if we predict a customer with a good credit rating (1) as bad (0), we will deny them the loan, in which case the bank will neither profit nor incur any losses. This is much better than incurring huge losses by wrongly predicting bad credit ratings as good.
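One practical way to act on this asymmetry is to raise the classification cutoff above the default 0.5, so the model approves fewer risky loans. A minimal sketch on a hypothetical vector of predicted probabilities (the 0.7 cutoff is an arbitrary choice for illustration, not tuned on this dataset):

```r
# Hypothetical predicted probabilities of a good rating for four customers
probs <- c(0.45, 0.55, 0.72, 0.90)
pred.default <- ifelse(probs >= 0.5, 1, 0)  # standard cutoff
pred.strict  <- ifelse(probs >= 0.7, 1, 0)  # stricter cutoff approves fewer loans
```

A stricter cutoff trades some sensitivity for higher specificity, which matches the bank's preference for avoiding costly false approvals.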
Now, it's up to you. Using the p-values shown in the summary, try to apply the same procedure to build a better model with only the selected variables. Also try implementing another algorithm, such as an SVM or a decision tree, and compare the results. You will be amazed by them.
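The p-value-based selection could be sketched as follows on simulated stand-in data (the variables x1, x2, x3 and their effect sizes are made up for illustration, not taken from the credit dataset; on the credit data you would pick the predictors with low p-values in summary(lr.model)):

```r
# Fit a full logistic model, keep predictors with p < 0.05, and refit
set.seed(1)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- rbinom(n, 1, plogis(1.5 * df$x1 - 1.0 * df$x2))  # x3 is pure noise

full  <- glm(y ~ ., data = df, family = "binomial")
pvals <- summary(full)$coefficients[-1, 4]               # p-values, intercept dropped
keep  <- names(pvals)[pvals < 0.05]                      # significant predictors

reduced <- glm(reformulate(keep, response = "y"), data = df, family = "binomial")
summary(reduced)
```

The reduced model is simpler and often generalizes better, though stepwise p-value selection should be validated on held-out data rather than trusted blindly.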