Monday, April 10, 2017

Preparing to model the data: Overfitting and Underfitting




Usually, the accuracy of the provisional model is not as high on the test set as it is on the training set, often because the provisional model is overfitting on the training set.

There is an eternal tension in model building between model complexity (resulting in high accuracy on the training set) and generalizability to the test and validation sets. Increasing the complexity of the model in order to increase the accuracy on the training set eventually and inevitably leads to a degradation in the generalizability of the provisional model to the test set.

As the provisional model begins to grow in complexity from the null model (with little or no complexity), the error rates on both the training set and the test set fall. As the model complexity increases further, the error rate on the training set continues to fall monotonically. The test set error rate, however, soon flattens out and then begins to increase, because the provisional model has memorized the training set rather than leaving room for generalizing to unseen data.

The point where the minimal error rate on the test set is encountered is the optimal level of model complexity. Complexity greater than this is considered to be overfitting; complexity less than this is considered to be underfitting.
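The relationship between complexity and the two error rates can be sketched with a toy experiment. The example below is only an illustration under stated assumptions: it uses k-nearest-neighbor regression (not a method discussed in this post) because its complexity is controlled by a single number, and the data (y = x^2 plus noise), the helper names and the values of k are all made up. A small k gives a complex model that memorizes the training set; k equal to the training set size gives the null model, which always predicts the global mean.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    """Mean squared error of the k-NN model (fitted on train) over data."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(n, rng):
    """Hypothetical data: y = x^2 plus Gaussian noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        data.append((x, x * x + rng.gauss(0, 0.1)))
    return data

rng = random.Random(0)
train_set = make_data(60, rng)
test_set = make_data(60, rng)

# k = 1 is the most complex model; k = 60 is the null model (global mean).
errors = {}
for k in (1, 3, 5, 10, 20, 40, 60):
    errors[k] = (mean_squared_error(train_set, train_set, k),
                 mean_squared_error(train_set, test_set, k))
    print(f"k={k:2d}  train MSE={errors[k][0]:.4f}  test MSE={errors[k][1]:.4f}")
```

With k = 1 the training error is exactly zero, because every training point is its own nearest neighbor, while the test error is not. The optimal complexity is the k with the lowest test error in the printed table.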

Over time, as the algorithm learns, the error for the model on the training data goes down and so does the error on the test dataset. If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time, the error for the test set starts to rise again as the model’s ability to generalize decreases.

Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data and it describes random error or noise instead of the underlying relationship.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would have poor predictive performance.

Overfitting in Regression Analysis

In regression analysis, overfitting a model is a real problem. An overfit model can cause the regression coefficients, p-values, and R-squared to be misleading.

Overfitting a regression model occurs when you attempt to estimate too many parameters from a sample that is too small. Regression analysis uses one sample to estimate the values of the coefficients for all of the terms in the equation.

Larger sample sizes allow you to specify more complex models. For trustworthy results, your sample size must be large enough to support the level of complexity that is required by your research question. If your sample size isn’t large enough, you won’t be able to fit a model that adequately approximates the true model for your response variable. You won’t be able to trust the results.

You must have a sufficient number of observations for each term in a regression model. Simulation studies show that a good rule of thumb is to have 10-15 observations per term in multiple linear regression. However, if the effect size is small or there is high multicollinearity, you may need more observations per term.

Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.

For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.

A Good Fit in Machine Learning

To avoid overfitting your model in the first place, collect a sample that is large enough so you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables, the model that you are likely to specify, and use that information to estimate a good sample size.

Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.

Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.

I have not encountered underfitting very often.  Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few.  Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.

How to Limit Overfitting

Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.

Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.

There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:
  1. Use a resampling technique to estimate model accuracy.
  2. Hold back a validation dataset.

The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of a machine learning model on unseen data.

A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.

Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice. 
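The k-fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the toy data and the "always predict the training mean" model are assumptions made for the example.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(xs, ys, k, fit, predict):
    """Return the mean squared error averaged over the k held-out folds."""
    scores = []
    for train_idx, test_idx in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in test_idx]
        scores.append(sum(fold_errors) / len(fold_errors))
    return sum(scores) / k

def fit_mean(xs, ys):
    """Hypothetical baseline model: remember the mean of the training targets."""
    return sum(ys) / len(ys)

def predict_mean(model, x):
    """Predict the stored mean regardless of the input."""
    return model

xs = list(range(20))
ys = [2.0 * x for x in xs]
print(cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean))
```

Every observation is used for testing exactly once and never appears in the training part of its own fold, which is what makes the averaged score an estimate of performance on unseen data.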

Friday, April 7, 2017

Genetic Algorithm with Java and Python


Genetic Algorithms (GAs) attempt to computationally mimic the processes by which natural selection operates, and apply them to solve business and research problems. Developed by John Holland in the 1960s and 1970s, GAs provide a framework for studying the effects of such biologically inspired factors as mate selection, reproduction, mutation and crossover of genetic information. In the natural world, the constraints and stresses of a particular environment force the different species (and different individuals within species) to compete to produce the fittest offspring. In the world of GAs, the fittest potential solutions evolve to produce even more optimal solutions.

Not surprisingly, the field of GAs has borrowed heavily from genomic terminology. Each cell in our body contains the same set of chromosomes, strings of DNA that function as a blueprint for making one of us. Each chromosome can in turn be partitioned into genes, which are blocks of DNA designed to encode a particular trait, such as eye color. Mutation is the altering of a single gene in a chromosome of the offspring. The offspring's fitness is then evaluated in terms of viability (living long enough to reproduce).

Now, in the field of GAs, a chromosome refers to one of the candidate solutions to the problem, and a gene is a single bit or digit of the candidate solution. Most GAs function by iteratively updating a collection of potential solutions, called a population. Each member of the population is evaluated for fitness on each cycle. A new population, made up of the fittest members, then replaces the old population.

The fitness function f(x) is a real-valued function operating on the chromosome (potential solution), not the gene, so that the x in f(x) refers to the numeric value taken by the chromosome at the time of fitness evaluation.

Goal oriented problem solving

Genetic algorithms are one of the tools we can use to apply machine learning to finding good, sometimes even optimal, solutions to problems that have billions of potential solutions. They use biological processes in software to find answers to problems that have really large search spaces by continuously generating candidate solutions, evaluating how well the solutions fit the desired outcome, and refining the best solutions.

When solving a problem with a genetic algorithm, instead of asking for a specific solution, you provide characteristics that the solution must have or rules its solution must pass to be accepted. The more constraints you add the more potential solutions are blocked. Genetic algorithms are commonly used to generate high-quality solutions to optimization and search problems by relying on bio-inspired operators such as mutation, crossover and selection.

Imagine you are given 10 chances to guess a number between 1 and 1000 and the only feedback you get is whether your guess is right or wrong. Could you reliably guess the number? With only right or wrong as feedback, you have no way to improve your guesses, so you have at best a 1 in 100 chance of guessing the number. A fundamental aspect of solving problems using genetic algorithms is that they must provide feedback that helps the engine select the better of two guesses. That feedback is called the fitness, for how closely the guess fits the desired result. More importantly, it implies a general progression.
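To see how much difference feedback makes in the number-guessing game, consider a richer signal than right or wrong. The sketch below assumes the feedback is "too low" or "too high" and uses plain binary search, which is not a genetic algorithm; it is only meant to show that informative feedback turns a 1 in 100 chance into a certainty. The function name and its parameters are made up for the example.

```python
def guess_with_feedback(secret, low=1, high=1000):
    """Find secret using only 'too low' / 'too high' feedback.

    Returns the number of guesses that were needed.
    """
    guesses = 0
    while low <= high:
        guess = (low + high) // 2
        guesses += 1
        if guess == secret:
            return guesses
        if guess < secret:
            low = guess + 1    # feedback: too low
        else:
            high = guess - 1   # feedback: too high
    raise ValueError("secret is outside the search range")

# Largest number of guesses needed over every possible secret.
worst_case = max(guess_with_feedback(secret) for secret in range(1, 1001))
print(worst_case)
```

Each guess halves the remaining range, so no number between 1 and 1000 needs more than 10 guesses.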

You might be able to do better than random guessing if you have problem-specific knowledge that helps you eliminate certain number combinations. Using problem-specific knowledge to guide the genetic algorithm's creation and modification of potential solutions can help them find a solution orders of magnitude faster.

Genetic algorithms and genetic programming are very good at finding solutions to very large problems. They do it by taking millions of samples from the search space, making small changes, possibly recombining parts of the best solutions, comparing the resultant fitness against that of the current best solution, and keeping the better of the two. This process repeats until a stop condition like one of the following occurs: the known solution is found, a solution meeting all requirements is found, a certain number of generations has passed, a specific amount of time has passed, etc.

Guess the password

Imagine you are asked to guess a password; what kind of feedback would you want? Decisions like this are some of the ones you have to make when planning to implement a genetic algorithm to find a solution to your problem. Genetic algorithms are good at finding good solutions to problems with large search spaces because they can quickly find the parts of the guesses that improve fitness values or lead to better solutions.

This is an intuitive sample that you understand and can fall back upon when learning to use other machine learning tools and techniques, or applying genetic algorithms in your own field of expertise.

Genetic Algorithms use random exploration of the problem space combined with evolutionary processes like mutation and crossover (exchange of genetic information) to improve guesses. But also, because they have no experience in the problem domain, they try things a human would never think to try. Thus, a person using a genetic algorithm may learn more about the problem space and potential solutions. This gives them the ability to make improvements to the algorithm, in a virtuous cycle.

The genetic algorithm should make informed guesses.

Genes
To begin with, the genetic algorithm needs a gene set to use for building guesses. For this sample, it will be a generic set of letters. It also needs a target password to guess:

Java:
/**
      Gene Set, we set a generic set of letters and target/desired result  
*/
private static final String geneSet = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,";
private static final String target = "This is a test for genetics aldorithms";

Python:
geneset = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,"



Generate a guess

Next the algorithm needs a way to generate a random string from the gene set.

Java:
private static final Random random = new Random();

/**
 * We need a way to generate a random string (a guess) from the geneSet.
 * @param length the number of random chars to be generated from the geneSet.
 */
private static String generate_parent(int length) {       
    StringBuilder genes = new StringBuilder();

    for (int i = 0; i < length; i++) {
        genes.append(geneSet.charAt(random.nextInt(geneSet.length())));
    }

    return genes.toString();
}


Python:
import random
     
def _generate_parent(length, geneSet, get_fitness):
    genes = []
    while len(genes) < length:
        sampleSize = min(length - len(genes), len(geneSet))
        genes.extend(random.sample(geneSet, sampleSize))
    return ''.join(genes)


Fitness
The fitness value the genetic algorithm provides is the only feedback the engine gets to guide it toward a solution. In this sample, the fitness value is the total number of letters in the guess that match the letter in the same position of the password.

Java:
/**
 * We need to define a fitness.
 * The fitness is the only feedback the engine gets to guide it towards a solution.
 * Fitness is the total number of letters in the guess that match the letter in the same position of the password.
 */
private static int get_fitness(String guess){
    char[] guess_array = guess.toCharArray();
    char[] target_array = target.toCharArray();
   
    int contador = 0;
   
    for(int i = 0; i < target_array.length; i++) {
        if (guess_array[i] == target_array[i]) {
            contador++;
        }
    }
   
    return contador;       
   
}

Python:
def get_fitness(guess, target):
    return sum(1 for expected, actual in zip(target, guess)
               if expected == actual)



Mutate
Next, the engine needs a way to produce a new guess by mutating the current one. The following implementation converts the parent string to an array, then replaces 1 letter in the array with a randomly selected one from geneSet, and finally recombines the result into a new string.

Java:
/**
 * Produces a new guess by mutating the current one. Converts the parent string to an array
 * and replaces letters in the array with ones randomly selected from the geneSet.
 * Note: this Java version mutates every position that does not yet match the target.
 *
 * @param parent the current best guess
 * @return the mutated guess
 */
private static String mutate(String parent){
    char[] guess_array = parent.toCharArray();
    char[] target_array = target.toCharArray();

    for(int i = 0; i < target_array.length; i++) {
        if (guess_array[i] != target_array[i]) {
            char gene = geneSet.charAt(random.nextInt(geneSet.length()));
            // Draw again if the new gene is the same as the one it is supposed to replace,
            // which can prevent a significant number of wasted guesses.
            while (gene == guess_array[i]) {
                gene = geneSet.charAt(random.nextInt(geneSet.length()));
            }
            guess_array[i] = gene;
        }
    }

    return String.valueOf(guess_array);
}

Python:
def _mutate(parent, geneSet, get_fitness):
    index = random.randrange(0, len(parent.Genes))
    childGenes = list(parent.Genes)
    newGene, alternate = random.sample(geneSet, 2)
    childGenes[index] = alternate \
        if newGene == childGenes[index] \
        else newGene
    return ''.join(childGenes)



Display

Next, it is important to monitor what is happening so that the engine can be stopped if it gets stuck. Having a visual representation of the gene sequence, which may not be the literal gene sequence, is often critical to identifying what works and what does not so that the algorithm can be improved.
Normally the display function also outputs the fitness value and how much time has elapsed.

Java:
/**
 * It is important to monitor what is happening so that the engine can be stopped if it
 * gets stuck. Having a visual representation of the gene sequence, which may not be
 * the literal sequence, is often critical to identifying what works and what does not so
 * that the algorithm can be improved.
 * The display function also outputs the fitness value and how much time has elapsed.
 */
private static void display(String guess, Date startTime) {
    Date time = new Date();
    System.out.println(String.format("Guess: " + guess+ ", Fitness: " + get_fitness(guess) + ", Time: " + (time.getTime() - startTime.getTime()) + " milisegundos" ));       
}


Python:
import datetime

def display(candidate, startTime):
    timeDiff = datetime.datetime.now() - startTime
    print("{0}\t{1}\t{2}".format(
        candidate.Genes, candidate.Fitness, str(timeDiff)))


Main
The main program begins by initializing bestParent to a random sequence of letters and calling the display function.

The final piece is the heart of the genetic engine. It is a loop that:
  1. generates a guess,
  2. requests the fitness for that guess, then
  3. compares the fitness to that of the previous best guess, and
  4. keeps the guess with the better fitness.

Java:
public static void main(String[] args) {

    // Seed the generator so that runs are reproducible
    random.setSeed(12345);
    Date startTime = new Date();
    String best_parent = generate_parent(target.length());
    int best_fitness = get_fitness(best_parent);
    display(best_parent, startTime);

    while (true) {
        String child = mutate(best_parent);
        int child_fitness = get_fitness(child);
        // Discard the child unless it improves on the best guess so far
        if (best_fitness >= child_fitness) {
            continue;
        }

        display(child, startTime);
        // Stop once every letter matches the target
        if (child_fitness >= best_parent.length()) {
            break;
        }

        best_fitness = child_fitness;
        best_parent = child;
    }

}


This cycle repeats until a stop condition occurs, in this case when all the letters in the guess match those in the target.

Run the code and you'll see output similar to the following. Success!

Guess: spytwdwrmCCvnDBxIoBMGoVJlCiUMEaKmWEA w, Fitness: 0, Time: 0 milisegundos
.
.
.
.
Guess: IhX  is a GlKt fnV txnewiesvaNdorithms, Fitness: 24, Time: 5 milisegundos
Guess: Vhi  is a cAZt fu  gTnesiXsrasdorithms, Fitness: 26, Time: 5 milisegundos
Guess: PhiI is a QKGt fhj gVneyi,sfafdorithms, Fitness: 26, Time: 5 milisegundos
Guess: ghi. is a fjGt fD. gOneYissVaNdorithms, Fitness: 26, Time: 5 milisegundos
Guess: LhiV is a ZWEt fgK gUnexinsHaOdorithms, Fitness: 26, Time: 5 milisegundos
.
.
.
.
Guess: Thiv is a .yst fo! gtneIicspaldorithms, Fitness: 31, Time: 10 milisegundos
Guess: ThiR is a Dyst foa geneeicsDaldorithms, Fitness: 32, Time: 10 milisegundos
Guess: Thiu is a yJst fow geneeicsTaldorithms, Fitness: 32, Time: 10 milisegundos
Guess: ThiA is a oCst foY genexicsJaldorithms, Fitness: 32, Time: 10 milisegundos
Guess: Thil is a xEst fod geneWicsYaldorithms, Fitness: 32, Time: 10 milisegundos
.
.
.
.
Guess: This is a Zest for genetics aldorithms, Fitness: 37, Time: 15 milisegundos
Guess: This is a Vest for genetics aldorithms, Fitness: 37, Time: 15 milisegundos
Guess: This is a test for genetics aldorithms, Fitness: 38, Time: 15 milisegundos

Guess: This is a test for genetics algorithms, Fitness: 38, Time: 15 milisegundos
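For comparison, the whole loop can also be sketched in Python on plain strings. This is a minimal sketch rather than the project's actual Python code: `get_best`, the `rng` parameter and the `seed` default are names assumed for the example, and the mutation follows the single-gene replacement described in the Mutate section. Every character of the target must appear in the gene set, otherwise the loop never ends.

```python
import random

GENE_SET = " abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!.,"

def generate_parent(length, rng):
    """Build a random string of the given length from the gene set."""
    return "".join(rng.choice(GENE_SET) for _ in range(length))

def get_fitness(guess, target):
    """Count the positions where the guess matches the target."""
    return sum(1 for expected, actual in zip(target, guess) if expected == actual)

def mutate(parent, rng):
    """Replace one randomly chosen gene with a different one."""
    index = rng.randrange(len(parent))
    new_gene, alternate = rng.sample(GENE_SET, 2)
    child = list(parent)
    child[index] = alternate if new_gene == child[index] else new_gene
    return "".join(child)

def get_best(target, seed=12345):
    """Mutate the best guess until it matches the target.

    Returns the final guess and the number of mutations that were tried.
    """
    rng = random.Random(seed)
    best = generate_parent(len(target), rng)
    best_fitness = get_fitness(best, target)
    generations = 0
    while best_fitness < len(target):
        child = mutate(best, rng)
        child_fitness = get_fitness(child, target)
        generations += 1
        if child_fitness > best_fitness:  # keep the child only if it is better
            best, best_fitness = child, child_fitness
    return best, generations

best, generations = get_best("This is a test")
print(best, generations)
```

Because a child replaces the parent only when its fitness is strictly higher, the fitness never decreases and the loop is guaranteed to make progress toward the target.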


Chromosome Object
Next, we must introduce a Chromosome object that has Genes and Fitness attributes. This will make the genetic engine more flexible by making it possible to pass those values around as a unit.

public class Chromosome {
   
    private int fitness;
    private String genes;

    public Chromosome(int fitness, String genes) {
        this.fitness = fitness;
        this.genes = genes;
    }

    public int getFitness() {
        return fitness;
    }

    public void setFitness(int fitness) {
        this.fitness = fitness;
    }

    public String getGenes() {
        return genes;
    }

    public void setGenes(String genes) {
        this.genes = genes;
    }
   
}
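The Python snippets earlier in this post already read `parent.Genes`, `candidate.Genes` and `candidate.Fitness`, so a Python counterpart of this class might look like the sketch below. The attribute names are taken from those snippets; the constructor argument order is an assumption.

```python
class Chromosome:
    """Bundles a candidate solution with its fitness so they travel as a unit."""

    def __init__(self, genes, fitness):
        self.Genes = genes      # the candidate solution, e.g. a string of letters
        self.Fitness = fitness  # the score returned by the fitness function

candidate = Chromosome("This is a tesX", 13)
print(candidate.Genes, candidate.Fitness)
```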



We have a working engine, but it is currently a tightly coupled class, so the next task is up to you:

  1. Extract the genetic engine code from the code specific to guessing the password so it can be reused for other projects.
  2. Create a package called genetic and include the Chromosome class there.
  3. Try to use unit tests.
  4. Try the genetic engine with a longer password.
  5. Add support for benchmarking to genetic, because it is useful to know how long the engine takes to find a solution on average and the standard deviation.
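As a starting point for the benchmarking task, the helper below times repeated runs of a function and reports the mean and the standard deviation. It is a hedged sketch: the name `benchmark`, its parameters and the workload being timed are all placeholders, and in practice the function passed in would be one full run of the genetic engine.

```python
import statistics
import time

def benchmark(function, runs=100):
    """Run function `runs` times; return (mean, standard deviation) in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        function()
        timings.append(time.perf_counter() - start)
    mean = statistics.mean(timings)
    # The sample standard deviation needs at least two measurements.
    stdev = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return mean, stdev

# Placeholder workload standing in for one run of the engine.
mean, stdev = benchmark(lambda: sum(range(10_000)), runs=20)
print(f"mean {mean:.6f}s, stdev {stdev:.6f}s")
```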


Email me at miguelurbin@gmail.com and I will send you the project in Python and Java so that you can run your own tests or make improvements.

Tuesday, April 4, 2017

Modeling Credit Risk in R. Case Study



Bank customers are potential candidates who ask the bank for credit loans, with the stipulation that they make monthly payments, with some interest, to repay the credit amount. In a perfect world, credit loans would be dished out freely and people would pay them back without issues. Unfortunately, we are not living in a utopian world, so there will be customers who default on their credit loans and are unable to repay the amount, causing huge losses to the bank. Therefore, credit risk analysis is one of the crucial areas that banks focus on, where they analyze detailed information pertaining to customers and their credit history.

Now, we need to analyze the dataset pertaining to customers, build a predictive model using machine learning algorithms, and predict whether a customer is likely to default on paying the credit loan and could be labeled as a potential credit risk.


Banks get value only from using domain knowledge to translate prediction outcomes and raw numbers from machine learning algorithms into data-driven decisions, which, when executed at the right time, help grow the business.

Logistic regression is a special case of regression models that is used for classification, where the algorithm estimates the odds that a variable is in one of the class labels as a function of the other features. Here we predict the credit rating for customers of a bank, where the credit rating can be either good, denoted by 1, or bad, denoted by 0.
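Logistic regression models the log-odds of the positive class as a linear function of the features, and the logistic (sigmoid) function turns those log-odds into a probability. The sketch below shows only that last step; the function names and the 0.5 threshold are assumptions chosen for the example.

```python
import math

def log_odds_to_probability(log_odds):
    """Logistic (sigmoid) function: map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def predict_class(log_odds, threshold=0.5):
    """Label a customer as good (1) or bad (0) by thresholding the probability."""
    return 1 if log_odds_to_probability(log_odds) >= threshold else 0

for value in (-2.0, 0.0, 2.0):
    print(value, round(log_odds_to_probability(value), 3), predict_class(value))
```

Log-odds of zero correspond to a probability of exactly 0.5, so positive log-odds lean toward a good rating and negative ones toward a bad rating.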

We will use the dataset german_credit_dataset.csv, with the column names changed slightly to make them more meaningful.

Load the dataset
credit.df <- read.csv("german_credit.csv", header = TRUE, stringsAsFactors = FALSE)

We must transform the data. We mainly factor the categorical variables, transforming the data type of the categorical features from numeric to factor. There are several numeric variables, including credit.amount, age and credit.duration.months, which take many different values and all have skewed distributions. This has multiple adverse effects, such as induced collinearity and models taking longer to converge. We will use z-score normalization.

## data type transformations - factoring
to.factors <- function(df, variables) {
  for(variable in variables) {
    df[[variable]] <- as.factor(df[[variable]])
  }
  return(df)
}

## normalizing - scaling
scale.features <- function(df, variables) {
  for(variable in variables) {
    df[[variable]] <- scale(df[[variable]], center=TRUE, scale=TRUE)
  }
  return(df)
}

#normalize variables
numeric.vars <- c("credit.duration.months", "age", "credit.amount")
credit.df <- scale.features(credit.df, numeric.vars)

#factor variables
categorical.vars <- c('credit.rating','account.balance','previous.credit.payment.status','credit.purpose','savings',
'employment.duration','installment.rate','marital.status','guarantor','residence.duration','current.assets','other.credits', 'apartment.type','bank.credits','occupation','dependents','telephone','foreign.worker')

credit.df <- to.factors(df=credit.df, variables = categorical.vars)
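To make the z-score normalization used in scale.features concrete: each value has the column mean subtracted and is then divided by the sample standard deviation, which is what R's scale() does with center and scale enabled. The following Python sketch reproduces that arithmetic on a handful of made-up ages; the function name and the data are assumptions.

```python
import statistics

def z_score(values):
    """Center the values on their mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(value - mean) / sd for value in values]

ages = [22, 35, 47, 58, 63]  # hypothetical values, not taken from the dataset
scaled = z_score(ages)
print([round(value, 3) for value in scaled])
```

After the transformation the column has mean 0 and standard deviation 1, so variables measured on very different scales become comparable.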

We split our data into training and test datasets in the ratio 60:40.
indexes <- sample(1:nrow(credit.df), size = 0.6*nrow(credit.df))

train.data <- credit.df[indexes, ]
test.data <- credit.df[-indexes, ]

We load the libraries

library(caret)
library(ROCR)
library(e1071)

#separate feature and class variable
test.feature.vars <- test.data[,-1]
test.class.vars <- test.data[,1]

Now, we will train the initial model with all the independent variables as follows:
formula.init <- "credit.rating ~ ."
formula.init <- as.formula(formula.init)
lr.model <- glm(formula=formula.init, data=train.data, family="binomial")

View the model details:
summary(lr.model)

Call:
glm(formula = formula.init, family = "binomial", data = train.data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max 
-2.9964  -0.5453   0.2840   0.6255   2.0303 

Coefficients:
                                  Estimate Std. Error z value Pr(>|z|)   
(Intercept)                      -3.144809   1.382285  -2.275 0.022901 * 
account.balance2                 -0.004871   0.309540  -0.016 0.987444   
account.balance3                  0.834650   0.468293   1.782 0.074696 . 
account.balance4                  1.914813   0.338392   5.659 1.53e-08 ***
credit.duration.months           -0.218489   0.157883  -1.384 0.166400   
previous.credit.payment.status1   0.188968   0.760591   0.248 0.803787   
previous.credit.payment.status2   1.030730   0.579406   1.779 0.075249 . 
previous.credit.payment.status3   1.872516   0.643985   2.908 0.003641 **
previous.credit.payment.status4   2.733108   0.614143   4.450 8.58e-06 ***
credit.purpose1                   1.547360   0.498281   3.105 0.001900 **
credit.purpose2                   0.953451   0.383743   2.485 0.012969 * 
credit.purpose3                   1.049455   0.352351   2.978 0.002897 **
credit.purpose4                   0.435514   0.873065   0.499 0.617897   
credit.purpose5                   1.659976   0.791140   2.098 0.035887 * 
credit.purpose6                  -0.215683   0.587505  -0.367 0.713532   
credit.purpose8                  16.107223 589.297502   0.027 0.978194   
credit.purpose9                   0.862159   0.471874   1.827 0.067685 . 
credit.purpose10                  2.225815   1.163834   1.912 0.055814 . 
credit.amount                    -0.470781   0.184143  -2.557 0.010570 * 
savings2                          0.720460   0.418107   1.723 0.084862 . 
savings3                          0.251267   0.631581   0.398 0.690750   
savings4                          2.087306   0.729554   2.861 0.004222 **
savings5                          1.296324   0.366558   3.536 0.000405 ***
employment.duration2              1.212411   0.629362   1.926 0.054053 . 
employment.duration3              1.450061   0.594548   2.439 0.014731 * 
employment.duration4              2.062809   0.650394   3.172 0.001516 **
employment.duration5              1.064146   0.582952   1.825 0.067934 . 
installment.rate2                 0.405239   0.438865   0.923 0.355809   
installment.rate3                -0.104170   0.475769  -0.219 0.826688   
installment.rate4                -0.840081   0.433362  -1.939 0.052560 . 
marital.status2                   0.259297   0.548115   0.473 0.636162   
marital.status3                   0.633599   0.535577   1.183 0.236801   
marital.status4                   0.375088   0.648738   0.578 0.563142   
guarantor2                       -1.099162   0.628273  -1.749 0.080205 . 
guarantor3                        1.106218   0.654839   1.689 0.091162 . 
residence.duration2              -0.576764   0.415320  -1.389 0.164917   
residence.duration3              -0.446258   0.471767  -0.946 0.344186   
residence.duration4              -0.393400   0.433344  -0.908 0.363970   
current.assets2                   0.302149   0.370269   0.816 0.414485   
current.assets3                  -0.012336   0.328560  -0.038 0.970049   
current.assets4                  -0.620318   0.554436  -1.119 0.263214   
age                               0.303512   0.147159   2.062 0.039163 * 
other.credits2                    0.411071   0.594295   0.692 0.489128   
other.credits3                    0.344513   0.362462   0.950 0.341868   
apartment.type2                   0.650850   0.337055   1.931 0.053484 . 
apartment.type3                   1.063537   0.640660   1.660 0.096902 . 
bank.credits2                    -0.744306   0.360400  -2.065 0.038902 * 
bank.credits3                    -0.635698   0.912803  -0.696 0.486163   
bank.credits4                    -1.779649   1.495298  -1.190 0.233982   
occupation2                      -0.973953   0.981885  -0.992 0.321236   
occupation3                      -0.919371   0.947409  -0.970 0.331844   
occupation4                      -0.581313   0.960812  -0.605 0.545164   
dependents2                      -0.467817   0.335598  -1.394 0.163324   
telephone2                        0.426096   0.297590   1.432 0.152194   
foreign.worker2                   2.676072   1.093251   2.448 0.014373 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 738.05  on 599  degrees of freedom
Residual deviance: 477.09  on 545  degrees of freedom
AIC: 587.09

Number of Fisher Scoring iterations: 14

lr.predictions <- predict(lr.model, test.data, type = "response")
lr.predictions <- round(lr.predictions)
confusionMatrix(data=lr.predictions, reference=test.class.vars, positive = '1')

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  56  49
         1  61 234
                                         
               Accuracy : 0.725          
                 95% CI : (0.6784, 0.7682)
    No Information Rate : 0.7075         
    P-Value [Acc > NIR] : 0.2387         
                                         
                  Kappa : 0.315          
 Mcnemar's Test P-Value : 0.2943         
                                         
            Sensitivity : 0.8269         
            Specificity : 0.4786         
         Pos Pred Value : 0.7932         
         Neg Pred Value : 0.5333         
             Prevalence : 0.7075         
         Detection Rate : 0.5850         
   Detection Prevalence : 0.7375          
      Balanced Accuracy : 0.6527         
                                         
       'Positive' Class : 1  
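The statistics reported above follow directly from the four cells of the confusion matrix. As a check, the short Python sketch below recomputes them from the counts in the table, with class 1 as the positive class; the variable names are chosen for the example.

```python
# Counts copied from the confusion matrix above (rows: prediction, columns: reference).
true_negatives, false_negatives = 56, 49    # row "Prediction 0"
false_positives, true_positives = 61, 234   # row "Prediction 1"

total = true_negatives + false_negatives + false_positives + true_positives

accuracy = (true_positives + true_negatives) / total
sensitivity = true_positives / (true_positives + false_negatives)   # recall of class 1
specificity = true_negatives / (true_negatives + false_positives)   # recall of class 0
pos_pred_value = true_positives / (true_positives + false_positives)
neg_pred_value = true_negatives / (true_negatives + false_negatives)
balanced_accuracy = (sensitivity + specificity) / 2

print(f"Accuracy          {accuracy:.4f}")
print(f"Sensitivity       {sensitivity:.4f}")
print(f"Specificity       {specificity:.4f}")
print(f"Pos Pred Value    {pos_pred_value:.4f}")
print(f"Neg Pred Value    {neg_pred_value:.4f}")
print(f"Balanced Accuracy {balanced_accuracy:.4f}")
```

The low specificity is the number to watch here: of the 117 customers who actually had a bad rating, 61 were predicted as good.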

lr.prediction.values <- predict(lr.model, test.feature.vars, type = "response")
predictions <- prediction(lr.prediction.values, test.class.vars)

perf <- performance(predictions,"tpr","fpr")
plot(perf,col="black")


auc <- performance(predictions,"auc")
auc

The AUC is 0.73, which is pretty good for a start. Now we have a model, but judging it does not depend solely on the accuracy; it also depends on the domain and business requirements of the problem. If we predict a customer with a bad credit rating (0) as good (1), we are going to approve the credit loan for a customer who will end up not paying it, which will cause losses to the bank. However, if we predict a customer with a good credit rating (1) as bad (0), we will deny them the loan, in which case the bank will neither profit nor incur any losses. This is much better than incurring huge losses by wrongly predicting bad credit ratings as good.
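One way to act on this asymmetry is to choose the classification threshold by cost instead of always rounding at 0.5. The sketch below is purely illustrative: the probabilities, labels, candidate thresholds and the 5-to-1 cost ratio are invented for the example and are not taken from the German credit data.

```python
def total_cost(probabilities, labels, threshold, cost_fp, cost_fn):
    """Sum the cost of every misclassification at the given threshold."""
    cost = 0
    for probability, label in zip(probabilities, labels):
        predicted = 1 if probability >= threshold else 0
        if predicted == 1 and label == 0:
            cost += cost_fp   # bad customer approved: the bank loses money
        elif predicted == 0 and label == 1:
            cost += cost_fn   # good customer rejected: a missed opportunity
    return cost

def best_threshold(probabilities, labels, thresholds, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(thresholds,
               key=lambda t: total_cost(probabilities, labels, t, cost_fp, cost_fn))

# Hypothetical predicted probabilities of a good rating and the true labels.
probabilities = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
candidates = [0.3, 0.5, 0.65, 0.75]

chosen = best_threshold(probabilities, labels, candidates, cost_fp=5, cost_fn=1)
print(chosen, total_cost(probabilities, labels, chosen, cost_fp=5, cost_fn=1))
```

With these made-up numbers, approving a bad customer costs five times as much as rejecting a good one, so the stricter threshold of 0.75 comes out cheaper than the default 0.5.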

Now it's up to you. Using the p-values in the summary, try applying the same procedure to get a better model with variables selected on the basis of their p-values. Try implementing another algorithm, such as an SVM or a decision tree, and compare the results! You will be amazed by what you find.