Tuesday, April 4, 2017

Modeling Credit Risk in R: A Case Study



Bank customers are potential candidates who ask the bank for credit loans, with the stipulation that they make monthly payments, with some interest, to repay the credit amount. In a perfect world, credit loans would be dished out freely and people would pay them back without issues. Unfortunately, we are not living in a utopian world, so there will be customers who default on their credit loans and are unable to repay the amount, causing huge losses to the bank. Therefore, credit risk analysis is one of the crucial areas banks focus on, where they analyze detailed information pertaining to customers and their credit history.

Now, we need to analyze the dataset pertaining to customers, build a predictive model using machine learning algorithms, and predict whether a customer is likely to default on the credit loan and should be labeled a potential credit risk.


Banks get value only from using domain knowledge to translate prediction outcomes and raw numbers from machine learning algorithms into data-driven decisions, which, when executed at the right time, help grow the business.

Logistic regression is a special case of regression models used for classification, where the algorithm estimates the odds that an observation belongs to one of the class labels as a function of the other features. Here we will predict the credit rating for customers of a bank, where the credit rating can be either good, denoted by 1, or bad, denoted by 0.
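The odds relationship can be made concrete: for a linear score b0 + b1*x, the logistic (sigmoid) link turns the score into a probability, and the log-odds recover the linear score exactly. A minimal sketch with hypothetical coefficients (b0, b1 and x are made up for illustration):

```r
# logistic (sigmoid) link: maps any real-valued linear score to (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# hypothetical intercept and slope for a single feature
b0 <- -1.5
b1 <- 0.8
x  <- c(-2, 0, 2, 4)

# predicted probability of the positive class (credit.rating = 1)
p <- sigmoid(b0 + b1 * x)
round(p, 3)

# odds = p / (1 - p); the log-odds recover the linear score b0 + b1*x
log(p / (1 - p))
```

This is the link function glm() with family = "binomial" uses under the hood; it estimates b0 and b1 by maximum likelihood.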

We will use the dataset german_credit.csv, and we rename the columns slightly so they are meaningful.

# load the dataset
credit.df <- read.csv("german_credit.csv", header = TRUE, stringsAsFactors = FALSE)

We must transform the data: we mainly factor the categorical variables, converting their data type from numeric to factor. There are also several numeric variables, including credit.amount, age and credit.duration.months, which take a wide range of values and have skewed distributions. This has multiple adverse effects, such as induced collinearity and models taking longer to converge, so we will use z-score normalization.

##data type transformations - factoring
to.factors <- function(df, variables) {
                for(variable in variables) {
                               df[[variable]] <- as.factor(df[[variable]])
                }
                return(df)
}

## normalizing - scaling
scale.features <- function(df, variables) {
                for(variable in variables) {
                               df[[variable]] <- scale(df[[variable]], center=T, scale=T)
                }
                return(df)
}

#normalize variables
numeric.vars <- c("credit.duration.months", "age", "credit.amount")
credit.df <- scale.features(credit.df, numeric.vars)

#factor variables
categorical.vars <- c('credit.rating','account.balance','previous.credit.payment.status','credit.purpose','savings',
'employment.duration','installment.rate','marital.status','guarantor','residence.duration','current.assets','other.credits', 'apartment.type','bank.credits','occupation','dependents','telephone','foreign.worker')

credit.df <- to.factors(df=credit.df, variables = categorical.vars)
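It is worth sanity-checking the transformations before modeling: scaled columns should have mean roughly 0 and standard deviation 1, and the categorical columns should now be factors. A self-contained sketch (a toy data frame stands in for the real german_credit.csv, and the two helper functions are repeated so the block runs on its own):

```r
# helper functions as defined above
to.factors <- function(df, variables) {
  for (variable in variables) df[[variable]] <- as.factor(df[[variable]])
  return(df)
}
scale.features <- function(df, variables) {
  for (variable in variables) df[[variable]] <- scale(df[[variable]], center = TRUE, scale = TRUE)
  return(df)
}

# toy stand-in for the real credit data
credit.df <- data.frame(credit.rating = c(1, 0, 1, 1, 0),
                        credit.amount = c(1000, 5000, 2500, 12000, 750),
                        age           = c(23, 45, 31, 52, 60))

credit.df <- scale.features(credit.df, c("credit.amount", "age"))
credit.df <- to.factors(credit.df, "credit.rating")

round(mean(credit.df$credit.amount), 10)   # ~ 0 after z-score scaling
sd(as.numeric(credit.df$age))              # ~ 1 after z-score scaling
is.factor(credit.df$credit.rating)         # TRUE
```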

We split our data into training and test datasets in a 60:40 ratio.
set.seed(42)  # fix the random seed so the split is reproducible
indexes <- sample(1:nrow(credit.df), size = 0.6*nrow(credit.df))

train.data <- credit.df[indexes, ]
test.data <- credit.df[-indexes, ]

We load the required libraries:

library(caret)
library(ROCR)
library(e1071)

#separate feature and class variable
test.feature.vars <- test.data[,-1]
test.class.vars <- test.data[,1]

Now, we will train the initial model with all the independent variables as follows:
formula.init <- "credit.rating ~ ."
formula.init <- as.formula(formula.init)
lr.model <- glm(formula=formula.init, data=train.data, family="binomial")

# view model details
summary(lr.model)

Call:
glm(formula = formula.init, family = "binomial", data = train.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.9964  -0.5453   0.2840   0.6255   2.0303 

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)                      -3.144809   1.382285  -2.275 0.022901 * 
account.balance2                 -0.004871   0.309540  -0.016 0.987444   
account.balance3                  0.834650   0.468293   1.782 0.074696 . 
account.balance4                  1.914813   0.338392   5.659 1.53e-08 ***
credit.duration.months           -0.218489   0.157883  -1.384 0.166400   
previous.credit.payment.status1   0.188968   0.760591   0.248 0.803787   
previous.credit.payment.status2   1.030730   0.579406   1.779 0.075249 . 
previous.credit.payment.status3   1.872516   0.643985   2.908 0.003641 **
previous.credit.payment.status4   2.733108   0.614143   4.450 8.58e-06 ***
credit.purpose1                   1.547360   0.498281   3.105 0.001900 **
credit.purpose2                   0.953451   0.383743   2.485 0.012969 * 
credit.purpose3                   1.049455   0.352351   2.978 0.002897 **
credit.purpose4                   0.435514   0.873065   0.499 0.617897   
credit.purpose5                   1.659976   0.791140   2.098 0.035887 * 
credit.purpose6                  -0.215683   0.587505  -0.367 0.713532   
credit.purpose8                  16.107223 589.297502   0.027 0.978194   
credit.purpose9                   0.862159   0.471874   1.827 0.067685 . 
credit.purpose10                  2.225815   1.163834   1.912 0.055814 . 
credit.amount                    -0.470781   0.184143  -2.557 0.010570 * 
savings2                          0.720460   0.418107   1.723 0.084862 . 
savings3                          0.251267   0.631581   0.398 0.690750   
savings4                          2.087306   0.729554   2.861 0.004222 **
savings5                          1.296324   0.366558   3.536 0.000405 ***
employment.duration2              1.212411   0.629362   1.926 0.054053 . 
employment.duration3              1.450061   0.594548   2.439 0.014731 * 
employment.duration4              2.062809   0.650394   3.172 0.001516 **
employment.duration5              1.064146   0.582952   1.825 0.067934 . 
installment.rate2                 0.405239   0.438865   0.923 0.355809   
installment.rate3                -0.104170   0.475769  -0.219 0.826688   
installment.rate4                -0.840081   0.433362  -1.939 0.052560 . 
marital.status2                   0.259297   0.548115   0.473 0.636162   
marital.status3                   0.633599   0.535577   1.183 0.236801   
marital.status4                   0.375088   0.648738   0.578 0.563142   
guarantor2                       -1.099162   0.628273  -1.749 0.080205 . 
guarantor3                        1.106218   0.654839   1.689 0.091162 . 
residence.duration2              -0.576764   0.415320  -1.389 0.164917   
residence.duration3              -0.446258   0.471767  -0.946 0.344186   
residence.duration4              -0.393400   0.433344  -0.908 0.363970   
current.assets2                   0.302149   0.370269   0.816 0.414485   
current.assets3                  -0.012336   0.328560  -0.038 0.970049   
current.assets4                  -0.620318   0.554436  -1.119 0.263214   
age                               0.303512   0.147159   2.062 0.039163 * 
other.credits2                    0.411071   0.594295   0.692 0.489128   
other.credits3                    0.344513   0.362462   0.950 0.341868   
apartment.type2                   0.650850   0.337055   1.931 0.053484 . 
apartment.type3                   1.063537   0.640660   1.660 0.096902 . 
bank.credits2                    -0.744306   0.360400  -2.065 0.038902 * 
bank.credits3                    -0.635698   0.912803  -0.696 0.486163   
bank.credits4                    -1.779649   1.495298  -1.190 0.233982   
occupation2                      -0.973953   0.981885  -0.992 0.321236   
occupation3                      -0.919371   0.947409  -0.970 0.331844   
occupation4                      -0.581313   0.960812  -0.605 0.545164   
dependents2                      -0.467817   0.335598  -1.394 0.163324   
telephone2                        0.426096   0.297590   1.432 0.152194   
foreign.worker2                   2.676072   1.093251   2.448 0.014373 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 738.05  on 599  degrees of freedom
Residual deviance: 477.09  on 545  degrees of freedom
AIC: 587.09

Number of Fisher Scoring iterations: 14

lr.predictions <- predict(lr.model, test.data, type = "response")
lr.predictions <- round(lr.predictions)
# confusionMatrix expects factor inputs with matching levels
confusionMatrix(data=as.factor(lr.predictions), reference=test.class.vars, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  56  49
         1  61 234
                                         
               Accuracy : 0.725          
                 95% CI : (0.6784, 0.7682)
    No Information Rate : 0.7075         
    P-Value [Acc > NIR] : 0.2387         
                                         
                  Kappa : 0.315          
 Mcnemar's Test P-Value : 0.2943         
                                         
            Sensitivity : 0.8269         
            Specificity : 0.4786         
         Pos Pred Value : 0.7932         
         Neg Pred Value : 0.5333         
             Prevalence : 0.7075         
         Detection Rate : 0.5850         
   Detection Prevalence : 0.7375          
      Balanced Accuracy : 0.6527         
                                         
       'Positive' Class : 1  
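The headline metrics can be verified by hand from the counts in the confusion matrix above (rows are predictions, columns are the reference):

```r
# confusion matrix counts taken from the output above
tn <- 56   # predicted 0, actual 0
fn <- 49   # predicted 0, actual 1
fp <- 61   # predicted 1, actual 0
tp <- 234  # predicted 1, actual 1
n  <- tn + fn + fp + tp

accuracy    <- (tp + tn) / n    # 0.725
sensitivity <- tp / (tp + fn)   # 0.8269 (recall for class 1)
specificity <- tn / (tn + fp)   # 0.4786
ppv         <- tp / (tp + fp)   # 0.7932 (Pos Pred Value, i.e. precision)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv), 4)
```

The low specificity (0.4786) is the number to watch here: it says that more than half of the genuinely bad customers are being predicted as good.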

lr.prediction.values <- predict(lr.model, test.feature.vars, type = "response")
predictions <- prediction(lr.prediction.values, test.class.vars)

perf <- performance(predictions,"tpr","fpr")
plot(perf,col="black")


auc <- performance(predictions,"auc")
auc@y.values[[1]]  # extract the numeric AUC from the S4 slot

The AUC is 0.73, which is pretty good for a start. Now we have a model, but choosing one does not solely depend on accuracy; it also depends on the domain and business requirements of the problem. If we predict a customer with a bad credit rating (0) as good (1), it means we are going to approve a credit loan for a customer who will end up not paying it back, which will cause losses to the bank. However, if we predict a customer with a good credit rating (1) as bad (0), we will deny them the loan, in which case the bank neither profits nor incurs any losses. This is much better than incurring huge losses by wrongly predicting bad credit ratings as good.
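Given this cost asymmetry, the 0.5 cutoff implied by round() is not sacred: a stricter threshold approves fewer risky customers at the price of rejecting some good ones. A sketch with made-up probabilities (in practice you would use the output of predict(..., type = "response") and tune the threshold on validation data):

```r
# illustrative predicted probabilities and true labels (made up)
probs  <- c(0.30, 0.45, 0.55, 0.62, 0.80, 0.91)
actual <- c(0,    0,    1,    0,    1,    1)

classify <- function(probs, threshold) as.integer(probs >= threshold)

# default cutoff 0.5 vs. a stricter 0.7 cutoff
table(pred = classify(probs, 0.5), actual = actual)
table(pred = classify(probs, 0.7), actual = actual)

# at 0.7, the customer with p = 0.62 and true rating 0 is no longer
# wrongly approved: fewer costly false positives, at the price of
# rejecting some genuinely good customers
```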

Now, it's up to you. Looking at the p-values in the summary above, try applying the same procedure to get a better model with variables selected based on their p-values. Then try implementing another algorithm, such as an SVM or a decision tree, and compare the results! You will be amazed.
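As a starting point for that exercise, selection based on the summary p-values might look like the sketch below. The data frame is simulated so the block runs standalone, and the chosen subset of predictors is illustrative, not a tuned selection:

```r
set.seed(1)
# simulated stand-in for train.data (the real columns come from german_credit.csv)
n <- 200
train.data <- data.frame(
  credit.rating   = as.factor(rbinom(n, 1, 0.7)),
  account.balance = as.factor(sample(1:4, n, replace = TRUE)),
  savings         = as.factor(sample(1:5, n, replace = TRUE)),
  credit.amount   = rnorm(n)
)

# reduced formula keeping only predictors that looked significant in
# summary(lr.model); this subset is an illustrative guess
formula.new <- as.formula("credit.rating ~ account.balance + savings + credit.amount")
lr.model.new <- glm(formula = formula.new, data = train.data, family = "binomial")

# intercept + 3 account.balance dummies + 4 savings dummies + credit.amount = 9
length(coef(lr.model.new))
```

Compare lr.model.new against lr.model on the test set with the same workflow as above: predict, round, confusionMatrix, and AUC via ROCR.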
