Preparing to model the data: Overfitting and Underfitting
Usually, the accuracy of the
provisional model is not as high on the test set as it is on the training set,
often because the provisional model is overfitting on the training set.
There is an eternal
tension in model building between model complexity (resulting in high accuracy
on the training set) and generalizability to the test and validation sets.
Increasing the complexity of the model in order to increase the accuracy on the
training set eventually and inevitably leads to a degradation in the generalizability
of the provisional model to the test set.
As the provisional model grows in complexity from the null model (which has little or no complexity), the error rates on both the training set and the test set fall. As the complexity continues to increase, the error rate on the training set keeps falling monotonically. The test set error rate, however, soon flattens out and then begins to rise, because the provisional model has started to memorize the training set rather than generalize to unseen data.
The point where the
minimal error rate on the test set is encountered is the optimal level of model
complexity. Complexity greater than this is considered to be overfitting;
complexity less than this is considered to be underfitting.
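To make the complexity curve concrete, here is a minimal sketch (not from the original text; it assumes scikit-learn and uses synthetic data) that fits polynomial regressions of increasing degree and prints the training and test error. The training error keeps falling as the degree grows, while the test error typically bottoms out at an intermediate degree and then rises.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(120, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=120)  # noisy non-linear target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in range(1, 13):
    # Higher degree = more model complexity.
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```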
Over time, as the algorithm learns, the error of the model on the training data goes down, and so does the error on the test dataset. If we train for too long, the error on the training dataset may continue to decrease because the model is overfitting, learning the irrelevant detail and noise in the training dataset. At the same time, the error on the test set starts to rise again as the model's ability to generalize decreases.
Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, because it overreacts to minor fluctuations in the training data and describes random error or noise instead of the underlying relationship.
Underfitting occurs when a statistical model or machine
learning algorithm cannot capture the underlying trend of the data.
Underfitting would occur, for example, when fitting a linear model to non-linear
data. Such a model would have poor predictive performance.
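As a small illustration of that example (not from the original text; it assumes scikit-learn and synthetic data), a straight line fitted to clearly non-linear data scores poorly even on the data it was trained on, which is the hallmark of underfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)  # clearly quadratic relationship

line = LinearRegression().fit(X, y)
# R^2 is poor even on the training data: the straight line cannot capture the trend.
print(f"R^2 on the training data: {line.score(X, y):.3f}")
```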
Overfitting in Regression Analysis
In
regression analysis, overfitting a model is a real problem. An overfit model
can cause the regression coefficients, p-values, and R-squared to be misleading.
Overfitting
a regression model occurs when you attempt to estimate too many parameters from
a sample that is too small. Regression analysis uses one sample to estimate the
values of the coefficients for all
of the terms in the equation.
Larger
sample sizes allow you to specify more complex models. For trustworthy results,
your sample size must be large enough to support the level of complexity that
is required by your research question. If your sample size isn’t large enough,
you won’t be able to fit a model that adequately approximates the true model
for your response variable. You won’t be able to trust the results.
You
must have a sufficient number of observations for each term in a regression
model. Simulation studies show that a good rule of thumb is to have 10-15
observations per term in multiple linear regression. However, if the effect
size is small or there is high multicollinearity, you may need more
observations per term.
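As a rough illustration of that guideline (a back-of-the-envelope sketch only, not a substitute for a proper power analysis), the arithmetic is simply the number of terms multiplied by 10-15:

```python
# Back-of-the-envelope check of the 10-15 observations-per-term rule of thumb.
# More observations per term are needed if effect sizes are small or
# multicollinearity is high.
n_terms = 7  # hypothetical count of predictors, interactions, and polynomial terms
low, high = 10 * n_terms, 15 * n_terms
print(f"A model with {n_terms} terms calls for roughly {low}-{high} observations.")
```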
Overfitting
is more likely with nonparametric and nonlinear models that have more
flexibility when learning a target function. As such, many nonparametric
machine learning algorithms also include parameters or techniques to limit and
constrain how much detail the model learns.
For example, the decision tree is a nonparametric machine learning algorithm that is very flexible and therefore prone to overfitting the training data. This problem can be addressed by pruning the tree after it has learned, in order to remove some of the detail it has picked up.
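A minimal sketch of that idea, assuming scikit-learn and one of its bundled datasets: an unconstrained tree is compared with one pruned via cost-complexity pruning (the ccp_alpha parameter), and the pruned tree typically gives up some training accuracy in exchange for better accuracy on held-out data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A fully grown tree tends to memorize the training data.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Cost-complexity pruning (ccp_alpha) removes detail after the tree has been grown.
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print("unpruned  train/test accuracy:",
      full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned    train/test accuracy:",
      pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))
```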
A Good Fit in Machine Learning
To avoid overfitting your model in the first place, collect a sample that is large enough that you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables and the model that you are likely to specify, and use that information to estimate a good sample size.
Underfitting
refers to a model that can neither model the training data nor generalize to
new data. An underfit machine learning model is not a suitable model and will
be obvious as it will have poor performance on the training data.
Underfitting is often not discussed because it is easy to detect given a good performance metric; the remedy is to move on and try alternative machine learning algorithms. Nevertheless, it provides a useful contrast to the problem of overfitting.
I
have not encountered underfitting very often. Data sets that are used for
predictive modelling nowadays often come with too many predictors, not too few.
Nonetheless, when building any model in machine learning for predictive
modelling, use validation or cross-validation to assess predictive accuracy –
whether you are trying to avoid overfitting or underfitting.
How To Limit Overfitting
Both
overfitting and underfitting can lead to poor model performance. But by far the
most common problem in applied machine learning is overfitting.
Overfitting
is such a problem because the evaluation of machine learning algorithms on
training data is different from the evaluation we actually care the most about,
namely how well the algorithm performs on unseen data.
There
are two important techniques that you can use when evaluating machine
learning algorithms to limit overfitting:
- Use a resampling
technique to estimate model accuracy.
- Hold back a
validation dataset.
The most popular resampling technique is k-fold cross validation. It allows you to train and test your model k times on different subsets of the training data and build up an estimate of the performance of the model on unseen data.
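A minimal sketch of k-fold cross validation, assuming scikit-learn and one of its bundled datasets (the model and the choice of k = 10 are illustrative, not prescribed by the text):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train and test the model k = 10 times, each time holding out a different fold.
scores = cross_val_score(model, X, y, cv=10)
print(f"mean accuracy across folds: {scores.mean():.3f} (+/- {scores.std():.3f})")
```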
A
validation dataset is simply a subset of your training data that you hold back
from your machine learning algorithms until the very end of your project. After
you have selected and tuned your machine learning algorithms on your training
dataset you can evaluate the learned models on the validation dataset to get a
final objective idea of how the models might perform on unseen data.
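And a minimal sketch of holding back a validation dataset (again assuming scikit-learn; the model, parameter grid, and split sizes are illustrative): all tuning and selection happen on the training portion, and the held-back data is scored only once, at the end.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hold back 20% of the data; it is not touched during tuning or model selection.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Tune on the training portion only, using 5-fold cross validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 6, None]}, cv=5)
search.fit(X_train, y_train)

# One final, objective check on the held-back validation dataset.
print("validation accuracy of the selected model:", search.score(X_val, y_val))
```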
Using
cross validation is a gold standard in applied machine learning for estimating
model accuracy on unseen data. If you have the data, using a validation dataset
is also an excellent practice.