Irreducible Error, Bias, and Variance
The bias-variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.
In any prediction (or classification) model, we cannot determine the true and exact relation between the predictors and the dependent variable. Consider, for example, the price of a house: it depends on many factors that are measurable, such as the area, location, and condition, among others. However, there are also factors that are either impossible to measure or cannot be measured accurately, like how a buyer feels about a house, or the personal relationship the owner may have with the buyer. The information we cannot capture is the noise, which is inherently present in data. The noise has a mean and a variance (i.e., the variation in house prices we are likely to see based on the noise present). The unfortunate part is that we cannot control this error, as it has nothing to do with the model or any estimation procedure. It is therefore known as the irreducible error, because we cannot do anything to reduce it; it is represented by the bar in the left-hand plot of Figure 1 below.
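In symbols, this setup is conventionally written as follows (the notation here is a standard formulation, not taken from the figure):

$$y = f(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathrm{Var}(\varepsilon) = \sigma^2,$$

where $f$ is the true function and $\sigma^2$ is the irreducible error that no choice of model can remove.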
The quantity that measures how closely the model's average prediction, over all possible training sets, matches the target is the Squared Bias. Intuitively, it is how well the estimated model fits the true relationship between the predictors and the target. To understand this a little better, let us ask: if we consider different data sets of size m, what do we expect our fits to be? The answer is that there is a continuum of possible fits, one for each data set. Averaging these fits, weighted by how likely each was to appear, gives a mean fit. The squared bias is then the squared difference between this average fit and the true function.
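Written out, with $\hat{f}_D$ denoting the model fitted to a training set $D$ of size m (again a conventional notation, not the book's), the squared bias at a point $x$ is

$$\mathrm{Bias}^2(x) = \big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2,$$

the squared difference between the average fit $\mathbb{E}_D[\hat{f}_D(x)]$ and the true function $f(x)$.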
In the middle plot of Figure 1, we consider a continuum of very simple fits, called train functions, for different data sets, and we average these train functions to get a mean train function. The true function is represented by the blue regression line, and the bias is represented by the shaded region, which is the difference between the true function and the average fit.
The quantity that measures how much the model's prediction fluctuates across different training sets of the given size is the Variance. It is the tendency of the model to learn random things irrespective of the real signal.
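In the same notation, the variance at a point $x$ is the expected squared deviation of a single fit from the average fit:

$$\mathrm{Var}(x) = \mathbb{E}_D\Big[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\Big].$$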
To understand this, let us refer to the right-hand plot in Figure 1, where we consider a continuum of simple fits from different data sets. Since we have used a very simple function to fit the different data sets, the individual fits vary little around the mean fit, i.e., the variance is low. If, instead, the specific fits had varied widely, we would have erratic predictions: the model would be very sensitive to the particular data set under consideration, and that sensitivity would be a source of error in the predictions.
Now, let us fit two complex polynomial models trained on two different data sets sampled from the same population. In Fig. 2 the dark-orange line is the true function, and the blue and green lines are the fits of two degree-25 polynomial models. It is apparent that the models have high variance, leading us to define Variance as the expected squared deviation of a specific fit from the average fit over all data sets.
If we consider a continuum of highly complex models fitted to specific data sets, the average of all these models will not closely represent any one of the individual fits. The variation among the fits would be very large, which implies that High Complexity models have High Variance. A small simulation, sketched below, makes this concrete.
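The Python sketch below fits degree-25 polynomials to two data sets drawn from the same population. The true function, noise level, and sample size are illustrative assumptions, not the exact settings behind Fig. 2.

```python
import numpy as np

def true_f(x):
    # Hypothetical true function; the text does not say which f was used in Fig. 2.
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0.05, 0.95, 200)  # grid on which to compare the two fits

fits = []
for seed in (1, 2):  # two data sets sampled from the same population
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, 50)
    y = true_f(x) + rng.normal(0, 0.2, 50)  # assumed noise level of 0.2
    # Degree-25 polynomial fit (high degrees are numerically ill-conditioned;
    # this is only a sketch).
    coefs = np.polyfit(x, y, deg=25)
    fits.append(np.polyval(coefs, x_grid))

# The two high-degree fits disagree sharply even though both data sets come
# from the same population: the signature of high variance.
print("max disagreement between the two fits:", np.max(np.abs(fits[0] - fits[1])))
```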
As discussed above:
• As model complexity increases, bias decreases
• As model complexity increases, variance increases
We can also now say that:
• Bias error is the difference between the average prediction of our model and the correct value that we are trying to predict.
• Variance error is how much the predictions for a given point vary between different realizations of the model.
The bias-variance trade-off can be represented by the decomposition of the expected mean squared error:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible error}}.$$
All machine learning algorithms are intrinsically intertwined with bias and variance (along with the noise), and the goal is to find the sweet spot. The sweet spot (the dark-orange circle in Fig. 3) represents the level of complexity at which any further decrease in bias would be outweighed by the accompanying increase in variance. Any model complexity short of the sweet spot results in an underfitted model, and any complexity beyond it results in an overfitted model.
This implies that we need to find the bias and variance of our model to locate the sweet spot. The challenge is that we cannot compute either quantity exactly: both are defined with respect to the true function and require fits over all possible data sets, neither of which we have. The good part is that even though we cannot compute them exactly, we can still optimize the trade-off between bias and variance empirically.
In Fig. 3, polynomial regression models (with degrees 1 to 30) are trained on 100 different data sets, each with a sample size of 50. Each trained model is then used to predict unseen (test) data with a sample size of 1,000. The average bias and variance are calculated from the fitted models and plotted. As expected, the bias decreases monotonically while the variance increases. The test MSE initially decreases, but as the model becomes more complex (signifying overfitting, discussed in the next section) the error starts increasing. An overfitted model fits the noise present in the data rather than the true relationships between the variables. It turns out that the degree-5 polynomial has the lowest test MSE; it is identified by the sweet spot, denoted by the orange circle.
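A sketch of such an experiment follows. It matches the setup described above (100 training sets of size 50, polynomial degrees 1 to 30, a test set of 1,000 points), but the true function and noise level are assumptions made for illustration; the original settings are not given in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def true_f(x):
    # Hypothetical true function; the one behind Fig. 3 is not specified.
    return np.sin(2 * np.pi * x)

n_sets, n_train, n_test = 100, 50, 1000   # as described in the text
sigma = 0.3                               # assumed noise standard deviation
rng = np.random.default_rng(0)
x_test = np.linspace(0, 1, n_test).reshape(-1, 1)

for degree in range(1, 31):               # model complexity: degree 1 to 30
    preds = np.empty((n_sets, n_test))
    for i in range(n_sets):
        x = rng.uniform(0, 1, n_train).reshape(-1, 1)
        y = true_f(x).ravel() + rng.normal(0, sigma, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x, y)
        preds[i] = model.predict(x_test)
    avg_fit = preds.mean(axis=0)                              # mean fit over data sets
    bias2 = np.mean((avg_fit - true_f(x_test).ravel()) ** 2)  # average squared bias
    variance = np.mean(preds.var(axis=0))                     # average variance
    print(f"degree {degree:2d}: bias^2 = {bias2:.4f}, variance = {variance:.4f}, "
          f"est. test MSE = {bias2 + variance + sigma ** 2:.4f}")
```

In a run of this sketch, bias squared should fall and variance rise with degree, their sum (plus the irreducible error) tracing the U-shaped test-MSE curve described above; the exact minimizing degree depends on the assumed true function and noise level.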
REFERENCE:
Ghatak, Abhijit (2017). Machine Learning with R. Springer. Library of Congress Control Number: 2017954482.