
Tuesday, April 4, 2017

Modeling Credit Risk in R. Case Study



Bank customers are potential candidates who ask the bank for credit loans, with the stipulation that they make monthly payments, with some interest, to repay the credit amount. In a perfect world, credit loans would be dished out freely and people would pay them back without issues. Unfortunately, we are not living in a utopian world, so some customers will default on their credit loans and be unable to repay the amount, causing huge losses to the bank. Credit risk analysis is therefore one of the crucial areas banks focus on, where they analyze detailed information pertaining to customers and their credit history.

Now, we need to analyze a dataset pertaining to customers, build a predictive model using machine learning algorithms, and predict whether a customer is likely to default on paying the credit loan and should therefore be labeled a potential credit risk.


Banks get value only when domain knowledge is used to translate the prediction outcomes and raw numbers from machine learning algorithms into data-driven decisions which, when executed at the right time, help grow the business.

Logistic regression is a special case of regression models used for classification, where the algorithm estimates the odds that an observation belongs to one of the class labels as a function of the other features. Our task is to predict the credit rating for customers of a bank, where the credit rating can either be good, denoted by 1, or bad, denoted by 0.
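As a quick illustration (not part of the case study itself), the logistic function maps any linear combination of features to a probability between 0 and 1:

# logistic (sigmoid) function: converts log-odds into a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))
sigmoid(0)    # 0.5, the decision boundary
sigmoid(2.5)  # ~0.92, strong evidence for the "good" class (1)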

We will use the German Credit dataset (german_credit.csv), where we have slightly renamed the columns to make them meaningful.

Load the dataset
credit.df <- read.csv("german_credit.csv", header = TRUE, stringsAsFactors = FALSE)

Next we transform the data: we mainly factor the categorical variables, changing their data type from numeric to factor. There are also several numeric variables, namely credit.amount, age and credit.duration.months, which take widely varying values and have skewed distributions. This has multiple adverse effects, such as induced collinearity and models taking longer to converge, so we normalize them using z-score normalization.

## data type transformations - factoring
to.factors <- function(df, variables) {
  # convert each named column to a factor so R treats it as categorical
  for (variable in variables) {
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}

## normalizing - scaling
scale.features <- function(df, variables) {
  # z-score normalization: (x - mean(x)) / sd(x)
  for (variable in variables) {
    df[[variable]] <- scale(df[[variable]], center = TRUE, scale = TRUE)
  }
  return(df)
}

#normalize variables
numeric.vars <- c("credit.duration.months", "age", "credit.amount")
credit.df <- scale.features(credit.df, numeric.vars)

#factor variables
categorical.vars <- c('credit.rating', 'account.balance', 'previous.credit.payment.status',
                      'credit.purpose', 'savings', 'employment.duration', 'installment.rate',
                      'marital.status', 'guarantor', 'residence.duration', 'current.assets',
                      'other.credits', 'apartment.type', 'bank.credits', 'occupation',
                      'dependents', 'telephone', 'foreign.worker')

credit.df <- to.factors(df=credit.df, variables = categorical.vars)
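To verify that the transformations worked as expected, you can inspect the structure of the data frame:

str(credit.df)  # categorical columns should now be factors; the three numeric ones, scaled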

We split our data into training and test datasets in a 60:40 ratio.
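The split is random, so results will vary between runs; to make it reproducible you can fix the seed first (the seed value here is arbitrary):

set.seed(123)  # arbitrary seed, only for reproducibility of the random split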
indexes <- sample(1:nrow(credit.df), size = 0.6*nrow(credit.df))

train.data <- credit.df[indexes, ]
test.data <- credit.df[-indexes, ]

We load the libraries: caret for model evaluation, ROCR for ROC curves and AUC, and e1071, which also provides the SVM implementation suggested at the end.

library(caret)
library(ROCR)
library(e1071)

# separate the feature and class variables
test.feature.vars <- test.data[,-1]
test.class.vars <- test.data[,1]

Now, we will train the initial model with all the independent variables as follows:
formula.init <- "credit.rating ~ ."
formula.init <- as.formula(formula.init)
lr.model <- glm(formula=formula.init, data=train.data, family="binomial")

View the model details:
summary(lr.model)

Call:
glm(formula = formula.init, family = "binomial", data = train.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.9964  -0.5453   0.2840   0.6255   2.0303 

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)                      -3.144809   1.382285  -2.275 0.022901 * 
account.balance2                 -0.004871   0.309540  -0.016 0.987444   
account.balance3                  0.834650   0.468293   1.782 0.074696 . 
account.balance4                  1.914813   0.338392   5.659 1.53e-08 ***
credit.duration.months           -0.218489   0.157883  -1.384 0.166400   
previous.credit.payment.status1   0.188968   0.760591   0.248 0.803787   
previous.credit.payment.status2   1.030730   0.579406   1.779 0.075249 . 
previous.credit.payment.status3   1.872516   0.643985   2.908 0.003641 **
previous.credit.payment.status4   2.733108   0.614143   4.450 8.58e-06 ***
credit.purpose1                   1.547360   0.498281   3.105 0.001900 **
credit.purpose2                   0.953451   0.383743   2.485 0.012969 * 
credit.purpose3                   1.049455   0.352351   2.978 0.002897 **
credit.purpose4                   0.435514   0.873065   0.499 0.617897   
credit.purpose5                   1.659976   0.791140   2.098 0.035887 * 
credit.purpose6                  -0.215683   0.587505  -0.367 0.713532   
credit.purpose8                  16.107223 589.297502   0.027 0.978194   
credit.purpose9                   0.862159   0.471874   1.827 0.067685 . 
credit.purpose10                  2.225815   1.163834   1.912 0.055814 . 
credit.amount                    -0.470781   0.184143  -2.557 0.010570 * 
savings2                          0.720460   0.418107   1.723 0.084862 . 
savings3                          0.251267   0.631581   0.398 0.690750   
savings4                          2.087306   0.729554   2.861 0.004222 **
savings5                          1.296324   0.366558   3.536 0.000405 ***
employment.duration2              1.212411   0.629362   1.926 0.054053 . 
employment.duration3              1.450061   0.594548   2.439 0.014731 * 
employment.duration4              2.062809   0.650394   3.172 0.001516 **
employment.duration5              1.064146   0.582952   1.825 0.067934 . 
installment.rate2                 0.405239   0.438865   0.923 0.355809   
installment.rate3                -0.104170   0.475769  -0.219 0.826688   
installment.rate4                -0.840081   0.433362  -1.939 0.052560 . 
marital.status2                   0.259297   0.548115   0.473 0.636162   
marital.status3                   0.633599   0.535577   1.183 0.236801   
marital.status4                   0.375088   0.648738   0.578 0.563142   
guarantor2                       -1.099162   0.628273  -1.749 0.080205 . 
guarantor3                        1.106218   0.654839   1.689 0.091162 . 
residence.duration2              -0.576764   0.415320  -1.389 0.164917   
residence.duration3              -0.446258   0.471767  -0.946 0.344186   
residence.duration4              -0.393400   0.433344  -0.908 0.363970   
current.assets2                   0.302149   0.370269   0.816 0.414485   
current.assets3                  -0.012336   0.328560  -0.038 0.970049   
current.assets4                  -0.620318   0.554436  -1.119 0.263214   
age                               0.303512   0.147159   2.062 0.039163 * 
other.credits2                    0.411071   0.594295   0.692 0.489128   
other.credits3                    0.344513   0.362462   0.950 0.341868   
apartment.type2                   0.650850   0.337055   1.931 0.053484 . 
apartment.type3                   1.063537   0.640660   1.660 0.096902 . 
bank.credits2                    -0.744306   0.360400  -2.065 0.038902 * 
bank.credits3                    -0.635698   0.912803  -0.696 0.486163   
bank.credits4                    -1.779649   1.495298  -1.190 0.233982   
occupation2                      -0.973953   0.981885  -0.992 0.321236   
occupation3                      -0.919371   0.947409  -0.970 0.331844   
occupation4                      -0.581313   0.960812  -0.605 0.545164   
dependents2                      -0.467817   0.335598  -1.394 0.163324   
telephone2                        0.426096   0.297590   1.432 0.152194   
foreign.worker2                   2.676072   1.093251   2.448 0.014373 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 738.05  on 599  degrees of freedom
Residual deviance: 477.09  on 545  degrees of freedom
AIC: 587.09

Number of Fisher Scoring iterations: 14

lr.predictions <- predict(lr.model, test.data, type = "response")
# round probabilities to class labels and convert to a factor, which confusionMatrix() expects
lr.predictions <- factor(round(lr.predictions), levels = levels(test.class.vars))
confusionMatrix(data = lr.predictions, reference = test.class.vars, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  56  49
         1  61 234
                                         
               Accuracy : 0.725          
                 95% CI : (0.6784, 0.7682)
    No Information Rate : 0.7075         
    P-Value [Acc > NIR] : 0.2387         
                                         
                  Kappa : 0.315          
 Mcnemar's Test P-Value : 0.2943         
                                         
            Sensitivity : 0.8269         
            Specificity : 0.4786         
         Pos Pred Value : 0.7932         
         Neg Pred Value : 0.5333         
             Prevalence : 0.7075         
         Detection Rate : 0.5850         
   Detection Prevalence : 0.7375          
      Balanced Accuracy : 0.6527         
                                         
       'Positive' Class : 1  

lr.prediction.values <- predict(lr.model, test.feature.vars, type = "response")
predictions <- prediction(lr.prediction.values, test.class.vars)

perf <- performance(predictions,"tpr","fpr")
plot(perf,col="black")
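Optionally, you can overlay the diagonal of a random classifier as a reference:

abline(a = 0, b = 1, lty = 2)  # chance line: a useful model should stay above it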


auc <- performance(predictions,"auc")
auc
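Printing the performance object shows the AUC buried in an S4 slot; you can pull out the numeric value directly:

auc.value <- unlist(auc@y.values)  # extract the AUC as a plain number
auc.value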

The AUC is 0.73, which is pretty good for a start. Now we have a model, but its value does not depend solely on accuracy: it also depends on the domain and business requirements of the problem. If we predict a customer with a bad credit rating (0) as good (1), it means we are going to approve a credit loan for a customer who will end up not paying it, which will cause losses to the bank. However, if we predict a customer with a good credit rating (1) as bad (0), it means we will deny the loan, in which case the bank will neither profit nor incur any losses. This is much better than incurring huge losses by wrongly predicting bad credit ratings as good.

Now, it's up to you. Looking at the p-values in the summary, try to apply the same procedure with a reduced set of variables selected on the basis of those p-values, in order to get a better model. Then try another algorithm such as SVM or a decision tree and compare them; a starting sketch follows. You will be amazed by the results.
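As a minimal sketch of both ideas (the variable subset below is just an example read off the p-values printed above, not a definitive selection, and the SVM hyperparameters are illustrative values you should tune):

# reduced logistic model using variables that looked significant in the summary above
formula.reduced <- as.formula("credit.rating ~ account.balance + previous.credit.payment.status + credit.purpose + savings + credit.amount")
lr.model.reduced <- glm(formula = formula.reduced, data = train.data, family = "binomial")
summary(lr.model.reduced)

# SVM on the same task, using e1071
svm.model <- svm(formula = formula.init, data = train.data, kernel = "radial", cost = 100, gamma = 1)
svm.predictions <- predict(svm.model, test.feature.vars)
confusionMatrix(data = svm.predictions, reference = test.class.vars, positive = '1')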

Wednesday, January 4, 2017

Data Science Unleashed

In today’s world, data science has grown immensely across a multitude of industries, including finance, energy, travel, and government; even more importantly, universities have begun to recognize the importance of offering courses and programs in this field.
Data Science and Analytics will continue to be one of the cornerstones of innovation as businesses explore their revolutionary potential to transform business processes, generate new business models, boost operational efficiency and catalyze innovation.

Data Science is a multidisciplinary field involving processes and systems to extract knowledge. It focuses on the future, performing exploratory analysis to provide recommendations based on models identified from past and present data, which represents high value for the business.

While data science asks: What will happen next? And what should be done to prevent...?

Data analysis asks: What happened? And why did it happen?



The following table explains the differences with respect to processes, tools, techniques, skills and outputs:


                    Data Analysis                             Data Science
Perspective         Looking backward.                         Looking forward.
Nature of work      Report and optimize.                      Explore, discover, investigate and visualize.
Outputs             Reports and dashboards.                   Data products.
Tools used          Hive, Impala, Spark SQL and HBase.        MLlib and Mahout.
Techniques used     ETL and exploratory analytics.            Predictive analytics and sentiment analysis.
Skills needed       Data engineering, SQL and programming.    Statistics, machine learning and programming.