Medium Last updated on May 7, 2022, 1:14 a.m.
The main challenge in machine learning is that our models must perform well on the test set (unseen data), not just on the training data; that is, the model should have a low prediction error. The prediction error can be decomposed into three components:
$$ \mathrm{Err}(x) = \mathrm{Bias}^2 + \mathrm{Variance} + \text{Irreducible Error} $$
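As a sanity check on this decomposition, here is a minimal NumPy simulation; the setup (a sinusoidal ground truth, a degree-1 polynomial model, and the noise level) is my own illustrative choice. It fits a straight line to many freshly sampled training sets, then verifies that the directly measured prediction error at a query point matches bias² + variance + irreducible noise.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(2 * np.pi * x)

NOISE_STD = 0.3            # irreducible noise on the labels
x0 = 0.2                   # a single query point, for clarity

# Fit a straight line (a high-bias model for a sinusoid) on many
# independently drawn training sets and record its prediction at x0.
preds = []
for _ in range(2000):
    x_tr = rng.uniform(0, 1, 30)
    y_tr = true_f(x_tr) + rng.normal(0, NOISE_STD, 30)
    coefs = np.polyfit(x_tr, y_tr, deg=1)
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias_sq = (preds.mean() - true_f(x0)) ** 2   # (average fit - truth)^2
variance = preds.var()                        # spread of fits across samples
irreducible = NOISE_STD ** 2

# Directly estimated expected squared error on fresh noisy labels at x0;
# it should agree with bias_sq + variance + irreducible.
y_new = true_f(x0) + rng.normal(0, NOISE_STD, 1000)
err = np.mean((y_new[:, None] - preds[None, :]) ** 2)
```

For this underpowered linear model, the bias² term dominates the variance term, which is exactly the high-bias regime discussed below.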
By understanding the bias and variance of a model, we can better analyze its results and avoid the mistakes of overfitting and underfitting. To tune bias and variance, we must understand three important terms:
1. Capacity: The capacity of a model refers to its ability to represent complex boundaries. A model is said to have low capacity if it can only represent simple boundaries, such as a linear classifier.
2. Bias: Bias refers to the error introduced by using an oversimplified model for complex data, and it can lead to underfitting. For example, linear regression assumes a linear relationship between the Y and X variables; if we use a linear model to approximate a complex non-linear dataset, the trained model will have a high prediction error.
In simpler terms, a strong assumption about the data distribution means the model has high bias, while a weak assumption (or no assumption) means low bias. For example, a linear classifier assumes the decision boundary must be linear, so it has high bias; with the help of basis expansion, we can reduce the model's bias and avoid underfitting.
3. Variance: Variance refers to the variability of a model's predictions across different samples of the training data. In simpler words, the function estimated from training data should not change too much when the training sample changes, so that a model trained on one sample still performs well on the test set. However, if a model has high variance, small changes in the training data can result in significant changes in the estimated function.
Variance can also be understood in terms of the decision boundary of a classification model. A model is said to have high variance when the decision boundary changes completely as we add or remove data points, and low variance when it remains almost constant.
For a good model, we need low generalization error, i.e., low bias and low variance. This is difficult to accomplish, so in practice we trade off bias for variance or variance for bias. We can understand bias vs. variance with the help of the popular bulls-eye diagram. Imagine that the center (the red bull's-eye) of the target (concentric circles) is a model that perfectly predicts the correct values; predictions worsen as we move away from the center. We repeat the entire model-building process to get several separate outcomes (blue dots) on the target, each representing an individual model trained on a sampled training set. Sometimes the sampled training distribution matches the real-world distribution, and we predict close to the center very well. Other times our training data is skewed, resulting in poorer predictions. These different realizations produce scattered outcomes (blue dots) on the target.
We can plot four different cases representing both high and low bias and variance combinations.
To minimize the expected test error or prediction error, we need to select a model that simultaneously achieves low bias and low variance, as our expected error depends on both bias and variance of the model. In practice, we generally see two main issues, either High Bias or High Variance.
| | Symptoms | Solutions |
|---|---|---|
| High Bias | High training error | Increase model complexity (Basis Expansion), Ensemble (Boosting), Enhance training data |
| High Variance | Training error is much lower than test error | Add more training data, Ensemble (Bagging), Reduce model capacity |
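To see why bagging appears as a remedy for high variance, here is a small NumPy sketch under my own toy setup (sinusoidal ground truth, a 1-nearest-neighbor regressor as the deliberately high-variance base learner). It compares the prediction variance of a single fit against the average of 25 bootstrap refits:

```python
import numpy as np

rng = np.random.default_rng(1)
true_f = lambda x: np.sin(2 * np.pi * x)
x_grid = np.linspace(0, 1, 50)   # fixed evaluation points

def one_nn_predict(x_tr, y_tr, x_query):
    # 1-nearest-neighbor regression: a low-bias, high-variance learner.
    idx = np.abs(x_query[:, None] - x_tr[None, :]).argmin(axis=1)
    return y_tr[idx]

single_preds, bagged_preds = [], []
for _ in range(300):                          # 300 independent training sets
    x_tr = rng.uniform(0, 1, 40)
    y_tr = true_f(x_tr) + rng.normal(0, 0.3, 40)
    single_preds.append(one_nn_predict(x_tr, y_tr, x_grid))
    # Bagging: average 25 bootstrap refits of the same learner.
    boots = []
    for _ in range(25):
        b = rng.integers(0, 40, 40)           # bootstrap resample indices
        boots.append(one_nn_predict(x_tr[b], y_tr[b], x_grid))
    bagged_preds.append(np.mean(boots, axis=0))

# Average pointwise variance of the predictions across training sets.
var_single = np.array(single_preds).var(axis=0).mean()
var_bagged = np.array(bagged_preds).var(axis=0).mean()
```

The bagged ensemble's predictions vary noticeably less across training sets than the single learner's, at the cost of a slightly smoother (more biased) fit.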
Many times, enriching the data or adding more data is not in our hands. In such scenarios, all we can do is tune the model complexity. To understand this in more detail, take a look at the figure below:
Typically, as we increase the complexity of the model, we observe a reduction in error due to lower bias. However, this only holds up to a point: if we continue to make the model more complex, we start overfitting and enter the high-variance region. Therefore, we must tune the model complexity (capacity) carefully to find the optimal bias-variance point.
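This U-shaped behavior is easy to reproduce. The sketch below (my own toy setup: a sinusoidal ground truth and polynomial degree as the capacity knob) sweeps model complexity and records training and test error:

```python
import numpy as np

rng = np.random.default_rng(2)
true_f = lambda x: np.sin(2 * np.pi * x)

x_tr = rng.uniform(0, 1, 30)                      # small training set
y_tr = true_f(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(0, 1, 500)                     # large test set
y_te = true_f(x_te) + rng.normal(0, 0.3, 500)

train_err, test_err = [], []
for deg in range(1, 13):                          # capacity knob
    coefs = np.polyfit(x_tr, y_tr, deg)
    train_err.append(np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2))
    test_err.append(np.mean((y_te - np.polyval(coefs, x_te)) ** 2))

# Training error falls monotonically with capacity, while test error
# first falls (bias shrinking) and then rises again (variance growing).
```

The degree that minimizes test error sits between the underfit linear model and the overfit high-degree polynomial, which is exactly the "optimal bias-variance point" described above.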
Let's review some commonly used classification models in terms of bias vs. variance vs. model capacity.
K-nearest neighbor:
a. Low bias and high variance.
b. Tune the value of k (the number of neighbors contributing to the classification). Increasing k, i.e., including more neighbors in each prediction, makes the model less flexible (it uses not only the closest data points but also ones farther away), which increases bias but lowers variance.
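The effect of k is visible even in a from-scratch implementation. Below is a minimal NumPy k-NN classifier on my own toy data (two overlapping Gaussian classes); with k=1 every training point is its own nearest neighbor, so training accuracy is perfect, and it drops as k grows and the boundary smooths out:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy data: two overlapping 2-D Gaussian classes (my own choice).
n = 100
X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(1.5, 1, (n, 2))])
y = np.array([0] * n + [1] * n)

def knn_predict(X_train, y_train, X_query, k):
    # Majority vote among the k nearest training points (Euclidean).
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

# Training accuracy for increasing k (odd values avoid vote ties).
train_acc = {}
for k in (1, 15, 51):
    train_acc[k] = (knn_predict(X, y, X, k) == y).mean()
```

A k=1 model memorizes the training set (low bias, high variance); large k averages over a wide neighborhood (higher bias, lower variance).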
Support Vector Machine:
a. Low bias and high variance.
b. The C parameter controls how many violations of the margin are allowed in the training data. In the formulation where C is a budget for violations, increasing it widens the margin, which increases bias but decreases variance. (Note that libraries such as scikit-learn define C inversely, as the penalty on violations, so there a larger C lowers bias and raises variance.)
Decision trees:
a. Low bias and high variance.
b. Tune the depth of the tree to avoid high variance.
Linear Regression:
a. High bias and low variance.
b. Can use Basis Expansion to reduce bias.
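A quick NumPy illustration of basis expansion reducing linear regression's bias (the quadratic ground truth and coefficients here are my own toy choices): a plain linear fit cannot capture the curvature, while the same least-squares machinery on an expanded basis that includes x² can.

```python
import numpy as np

rng = np.random.default_rng(4)

# Quadratic ground truth: a plain linear model is biased here.
x = rng.uniform(-1, 1, 200)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, 200)

# Plain linear fit vs. a fit on the expanded basis [1, x, x^2]
# (np.polyfit with deg=2 is equivalent to adding the x^2 feature).
lin = np.polyfit(x, y, 1)
quad = np.polyfit(x, y, 2)

mse_lin = np.mean((y - np.polyval(lin, x)) ** 2)
mse_quad = np.mean((y - np.polyval(quad, x)) ** 2)
```

The expanded model's error collapses to roughly the noise floor, while the plain linear fit retains a large bias term from the missing x² feature.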