
Tuesday, April 4, 2017

Modeling Credit Risk in R. Case Study



Bank customers are potential candidates who ask the bank for credit loans, with the stipulation that they make monthly payments, with some interest, to repay the credit amount. In a perfect world, credit loans would be dished out freely and people would pay them back without issues. Unfortunately, we are not living in a utopian world, so some customers will default on their credit loans and be unable to repay the amount, causing huge losses to the bank. Credit risk analysis is therefore one of the crucial areas banks focus on, where they analyze detailed information pertaining to customers and their credit history.

Now, we need to analyze a dataset pertaining to customers, build a predictive model using machine learning algorithms, and predict whether a customer is likely to default on paying the credit loan and should therefore be labeled a potential credit risk.


Banks get value only when domain knowledge is used to translate the prediction outcomes and raw numbers from machine learning algorithms into data-driven decisions which, when executed at the right time, help grow the business.

Logistic regression is a special case of regression models used for classification, where the algorithm estimates the odds that an observation belongs to one of the class labels as a function of the other features. Our task is to predict the credit rating for customers of a bank, where the credit rating can either be good, denoted by 1, or bad, denoted by 0.
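As a quick illustration (not part of the case study itself), the logistic function maps any linear combination of features to a probability between 0 and 1:

# logistic (sigmoid) function: converts log-odds into a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(0)    # 0.5, the decision boundary
sigmoid(2.5)  # ~0.92, strong evidence for the "good" class (1)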

We will use the German Credit dataset (german_credit.csv), where we have slightly renamed the columns to make them meaningful.

Load the dataset
credit.df <- read.csv("german_credit.csv", header = TRUE, stringsAsFactors = FALSE)

Next we transform the data: we mainly factor the categorical variables, changing their data type from numeric to factor. There are also several numeric variables, namely credit.amount, age and credit.duration.months, which take widely varying values and have skewed distributions. This has multiple adverse effects, such as induced collinearity and models taking longer to converge, so we normalize them using z-score normalization.

## data type transformations - factoring
to.factors <- function(df, variables) {
  # convert each named column to a factor so R treats it as categorical
  for (variable in variables) {
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}

## normalizing - scaling
scale.features <- function(df, variables) {
  # z-score normalization: (x - mean(x)) / sd(x)
  for (variable in variables) {
    df[[variable]] <- scale(df[[variable]], center = TRUE, scale = TRUE)
  }
  return(df)
}

#normalize variables
numeric.vars <- c("credit.duration.months", "age", "credit.amount")
credit.df <- scale.features(credit.df, numeric.vars)

#factor variables
categorical.vars <- c('credit.rating', 'account.balance', 'previous.credit.payment.status',
                      'credit.purpose', 'savings', 'employment.duration', 'installment.rate',
                      'marital.status', 'guarantor', 'residence.duration', 'current.assets',
                      'other.credits', 'apartment.type', 'bank.credits', 'occupation',
                      'dependents', 'telephone', 'foreign.worker')

credit.df <- to.factors(df=credit.df, variables = categorical.vars)
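To verify that the transformations worked as expected, you can inspect the structure of the data frame:

str(credit.df)  # categorical columns should now be factors; the three numeric ones, scaled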

We split our data into training and test datasets in a 60:40 ratio.
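The split is random, so results will vary between runs; to make it reproducible you can fix the seed first (the seed value here is arbitrary):

set.seed(123)  # arbitrary seed, only for reproducibility of the random split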
indexes <- sample(1:nrow(credit.df), size = 0.6*nrow(credit.df))

train.data <- credit.df[indexes, ]
test.data <- credit.df[-indexes, ]

We load the libraries: caret for model evaluation, ROCR for ROC curves and AUC, and e1071, which also provides the SVM implementation suggested at the end.

library(caret)
library(ROCR)
library(e1071)

# separate the feature and class variables
test.feature.vars <- test.data[,-1]
test.class.vars <- test.data[,1]

Now, we will train the initial model with all the independent variables as follows:
formula.init <- "credit.rating ~ ."
formula.init <- as.formula(formula.init)
lr.model <- glm(formula=formula.init, data=train.data, family="binomial")

View the model details:
summary(lr.model)

Call:
glm(formula = formula.init, family = "binomial", data = train.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.9964  -0.5453   0.2840   0.6255   2.0303 

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)                      -3.144809   1.382285  -2.275 0.022901 * 
account.balance2                 -0.004871   0.309540  -0.016 0.987444   
account.balance3                  0.834650   0.468293   1.782 0.074696 . 
account.balance4                  1.914813   0.338392   5.659 1.53e-08 ***
credit.duration.months           -0.218489   0.157883  -1.384 0.166400   
previous.credit.payment.status1   0.188968   0.760591   0.248 0.803787   
previous.credit.payment.status2   1.030730   0.579406   1.779 0.075249 . 
previous.credit.payment.status3   1.872516   0.643985   2.908 0.003641 **
previous.credit.payment.status4   2.733108   0.614143   4.450 8.58e-06 ***
credit.purpose1                   1.547360   0.498281   3.105 0.001900 **
credit.purpose2                   0.953451   0.383743   2.485 0.012969 * 
credit.purpose3                   1.049455   0.352351   2.978 0.002897 **
credit.purpose4                   0.435514   0.873065   0.499 0.617897   
credit.purpose5                   1.659976   0.791140   2.098 0.035887 * 
credit.purpose6                  -0.215683   0.587505  -0.367 0.713532   
credit.purpose8                  16.107223 589.297502   0.027 0.978194   
credit.purpose9                   0.862159   0.471874   1.827 0.067685 . 
credit.purpose10                  2.225815   1.163834   1.912 0.055814 . 
credit.amount                    -0.470781   0.184143  -2.557 0.010570 * 
savings2                          0.720460   0.418107   1.723 0.084862 . 
savings3                          0.251267   0.631581   0.398 0.690750   
savings4                          2.087306   0.729554   2.861 0.004222 **
savings5                          1.296324   0.366558   3.536 0.000405 ***
employment.duration2              1.212411   0.629362   1.926 0.054053 . 
employment.duration3              1.450061   0.594548   2.439 0.014731 * 
employment.duration4              2.062809   0.650394   3.172 0.001516 **
employment.duration5              1.064146   0.582952   1.825 0.067934 . 
installment.rate2                 0.405239   0.438865   0.923 0.355809   
installment.rate3                -0.104170   0.475769  -0.219 0.826688   
installment.rate4                -0.840081   0.433362  -1.939 0.052560 . 
marital.status2                   0.259297   0.548115   0.473 0.636162   
marital.status3                   0.633599   0.535577   1.183 0.236801   
marital.status4                   0.375088   0.648738   0.578 0.563142   
guarantor2                       -1.099162   0.628273  -1.749 0.080205 . 
guarantor3                        1.106218   0.654839   1.689 0.091162 . 
residence.duration2              -0.576764   0.415320  -1.389 0.164917   
residence.duration3              -0.446258   0.471767  -0.946 0.344186   
residence.duration4              -0.393400   0.433344  -0.908 0.363970   
current.assets2                   0.302149   0.370269   0.816 0.414485   
current.assets3                  -0.012336   0.328560  -0.038 0.970049   
current.assets4                  -0.620318   0.554436  -1.119 0.263214   
age                               0.303512   0.147159   2.062 0.039163 * 
other.credits2                    0.411071   0.594295   0.692 0.489128   
other.credits3                    0.344513   0.362462   0.950 0.341868   
apartment.type2                   0.650850   0.337055   1.931 0.053484 . 
apartment.type3                   1.063537   0.640660   1.660 0.096902 . 
bank.credits2                    -0.744306   0.360400  -2.065 0.038902 * 
bank.credits3                    -0.635698   0.912803  -0.696 0.486163   
bank.credits4                    -1.779649   1.495298  -1.190 0.233982   
occupation2                      -0.973953   0.981885  -0.992 0.321236   
occupation3                      -0.919371   0.947409  -0.970 0.331844   
occupation4                      -0.581313   0.960812  -0.605 0.545164   
dependents2                      -0.467817   0.335598  -1.394 0.163324   
telephone2                        0.426096   0.297590   1.432 0.152194   
foreign.worker2                   2.676072   1.093251   2.448 0.014373 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 738.05  on 599  degrees of freedom
Residual deviance: 477.09  on 545  degrees of freedom
AIC: 587.09

Number of Fisher Scoring iterations: 14

lr.predictions <- predict(lr.model, test.data, type = "response")
# round probabilities to class labels and convert to a factor, which confusionMatrix() expects
lr.predictions <- factor(round(lr.predictions), levels = levels(test.class.vars))
confusionMatrix(data = lr.predictions, reference = test.class.vars, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  56  49
         1  61 234
                                         
               Accuracy : 0.725          
                 95% CI : (0.6784, 0.7682)
    No Information Rate : 0.7075         
    P-Value [Acc > NIR] : 0.2387         
                                         
                  Kappa : 0.315          
 Mcnemar's Test P-Value : 0.2943         
                                         
            Sensitivity : 0.8269         
            Specificity : 0.4786         
         Pos Pred Value : 0.7932         
         Neg Pred Value : 0.5333         
             Prevalence : 0.7075         
         Detection Rate : 0.5850         
   Detection Prevalence : 0.7375          
      Balanced Accuracy : 0.6527         
                                         
       'Positive' Class : 1  

lr.prediction.values <- predict(lr.model, test.feature.vars, type = "response")
predictions <- prediction(lr.prediction.values, test.class.vars)

perf <- performance(predictions,"tpr","fpr")
plot(perf,col="black")
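Optionally, you can overlay the diagonal of a random classifier as a reference:

abline(a = 0, b = 1, lty = 2)  # chance line: a useful model should stay above it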


auc <- performance(predictions,"auc")
auc
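Printing the performance object shows the AUC buried in an S4 slot; you can pull out the numeric value directly:

auc.value <- unlist(auc@y.values)  # extract the AUC as a plain number
auc.value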

The AUC is 0.73, which is pretty good for a start. Now we have a model, but its value does not depend solely on accuracy: it also depends on the domain and business requirements of the problem. If we predict a customer with a bad credit rating (0) as good (1), it means we are going to approve a credit loan for a customer who will end up not paying it, which will cause losses to the bank. However, if we predict a customer with a good credit rating (1) as bad (0), it means we will deny the loan, in which case the bank will neither profit nor incur any losses. This is much better than incurring huge losses by wrongly predicting bad credit ratings as good.

Now, it's up to you. Looking at the p-values in the summary, try to apply the same procedure with a reduced set of variables selected on the basis of those p-values, in order to get a better model. Then try another algorithm such as SVM or a decision tree and compare them; a starting sketch follows. You will be amazed by the results.
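As a minimal sketch of both ideas (the variable subset below is just an example read off the p-values printed above, not a definitive selection, and the SVM hyperparameters are illustrative values you should tune):

# reduced logistic model using variables that looked significant in the summary above
formula.reduced <- as.formula("credit.rating ~ account.balance + previous.credit.payment.status + credit.purpose + savings + credit.amount")
lr.model.reduced <- glm(formula = formula.reduced, data = train.data, family = "binomial")
summary(lr.model.reduced)

# SVM on the same task, using e1071
svm.model <- svm(formula = formula.init, data = train.data, kernel = "radial", cost = 100, gamma = 1)
svm.predictions <- predict(svm.model, test.feature.vars)
confusionMatrix(data = svm.predictions, reference = test.class.vars, positive = '1')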

Wednesday, January 4, 2017

Data Science Unleashed

In today’s world, data science has grown immensely across a multitude of industries, including finance, energy, travel, and government; even more importantly, universities have begun to recognize the importance of offering courses and programs in this field.
Data Science and Analytics will continue to be one of the cornerstones of innovation as businesses explore their revolutionary potential to transform business processes, generate new business models, boost operational efficiency and catalyze innovation.

Data Science is a multidisciplinary field involving processes and systems to extract knowledge. It focuses on the future, performing exploratory analysis to provide recommendations based on models identified from past and present data, which represents high value for the business.

While data science asks: What will happen next? And what should be done to prevent...?

Data analysis asks: What happened? And why did it happen?



The following table explains the differences with respect to processes, tools, techniques, skills and outputs:


                    Data Analysis                             Data Science
Perspective         Looking backward.                         Looking forward.
Nature of work      Report and optimize.                      Explore, discover, investigate and visualize.
Outputs             Reports and dashboards.                   Data products.
Tools used          Hive, Impala, Spark SQL and HBase.        MLlib and Mahout.
Techniques used     ETL and exploratory analytics.            Predictive analytics and sentiment analysis.
Skills needed       Data engineering, SQL and programming.    Statistics, machine learning and programming.