Preparing to model the data: Overfitting and Underfitting
Usually, the accuracy of the provisional model is not as high on the test set as it is on the training set, often because the provisional model is overfitting on the training set.
There is an eternal tension in model building between model complexity (resulting in high accuracy on the training set) and generalizability to the test and validation sets. Increasing the complexity of the model in order to increase the accuracy on the training set eventually and inevitably leads to a degradation in the generalizability of the provisional model to the test set.
The provisional model begins to grow in complexity from the null model (with little or no complexity), the error rates on both the training set and the test set fall. As the model complexity increases, the error rate on the training set continues to fall in a monotone manner. However, as the model complexity increases, the test set error rate soon begins to flatten out and increase because the provisional model has memorized the training set rather than leaving room for generalizing to unseen data.
The point where the minimal error rate on the test set is encountered is the optimal level of model complexity. Complexity greater than this is considered to be overfitting; complexity less than this is considered to be underfitting.
Over time, as the algorithm learns, the error for the model on the training data goes down and so does the error on the test dataset. If we train for too long, the performance on the training dataset may continue to decrease because the model is overfitting and learning the irrelevant detail and noise in the training dataset. At the same time the error for the test set starts to rise again as the model’s ability to generalize decreases.
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data and it describes random error or noise instead of the underlying relationship.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model would have poor predictive performance.
Overfitting in Regression Analysis
In regression analysis, overfitting a model is a real problem. An overfit model can cause the regression coefficients, p-values, and R-squared to be misleading.
Overfitting a regression model occurs when you attempt to estimate too many parameters from a sample that is too small. Regression analysis uses one sample to estimate the values of the coefficients for all of the terms in the equation.
Larger sample sizes allow you to specify more complex models. For trustworthy results, your sample size must be large enough to support the level of complexity that is required by your research question. If your sample size isn’t large enough, you won’t be able to fit a model that adequately approximates the true model for your response variable. You won’t be able to trust the results.
You must have a sufficient number of observations for each term in a regression model. Simulation studies show that a good rule of thumb is to have 10-15 observations per term in multiple linear regression. However, if the effect size is small or there is high multicollinearity, you may need more observations per term.
Overfitting is more likely with nonparametric and nonlinear models that have more flexibility when learning a target function. As such, many nonparametric machine learning algorithms also include parameters or techniques to limit and constrain how much detail the model learns.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and is subject to overfitting training data. This problem can be addressed by pruning a tree after it has learned in order to remove some of the detail it has picked up.
A Good Fit in Machine Learning
To avoid overfitting your model in the first place, collect a sample that is large enough so you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables, the model that you are likely to specify, and use that information to estimate a good sample size.
Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit machine learning model is not a suitable model and will be obvious as it will have poor performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric. The remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a good contrast to the problem of overfitting.
I have not encountered underfitting very often. Data sets that are used for predictive modelling nowadays often come with too many predictors, not too few. Nonetheless, when building any model in machine learning for predictive modelling, use validation or cross-validation to assess predictive accuracy – whether you are trying to avoid overfitting or underfitting.
How To Limit
Both overfitting and underfitting can lead to poor model performance. But by far the most common problem in applied machine learning is overfitting.
Overfitting is such a problem because the evaluation of machine learning algorithms on training data is different from the evaluation we actually care the most about, namely how well the algorithm performs on unseen data.
There are two important techniques that you can use when evaluating machine learning algorithms to limit overfitting:
- Use a resampling technique to estimate model accuracy.
- Hold back a validation dataset.
The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k-times on different subsets of training data and build up an estimate of the performance of a machine learning model on unseen data.
A validation dataset is simply a subset of your training data that you hold back from your machine learning algorithms until the very end of your project. After you have selected and tuned your machine learning algorithms on your training dataset you can evaluate the learned models on the validation dataset to get a final objective idea of how the models might perform on unseen data.
Using cross validation is a gold standard in applied machine learning for estimating model accuracy on unseen data. If you have the data, using a validation dataset is also an excellent practice.