Bagging and boosting are two of the most important ideas in machine learning for building models that generalize well. They are used to solve a wide range of problems across many different fields.
By the time you finish reading this, you’ll understand what an ensemble is, why combining models works, and how bagging, boosting, and stacking differ.
An ensemble is a collection of models trained to accomplish the same goal. The models that make up an ensemble may all be variations of the same type, or they may all be entirely distinct from one another. Typically, the final output of an ensemble of classifiers is derived from a (weighted) average or vote of the predictions of the many models comprising the ensemble. Oftentimes, an ensemble of models with similar generalization performance outperforms each individual model.
But how is this possible? Let’s go over the intuition behind an ensemble to understand it further.
Suppose we have an ensemble of binary classification functions $f_{k}(x)$ for $k = 1, \ldots, K$. Assume also that, on average, they have the same expected error rate $\epsilon = E_{p(x,y)}[y \neq f_{k}(x)] < 0.5$, but that the errors they make are independent of each other.
The intuition is that, for most instances on which a single classifier makes an error, the majority of the $K$ classifiers in the ensemble will still be correct. In this case, a simple majority vote can considerably improve classification performance by reducing variance.
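The short simulation below is a minimal sketch of this intuition (the values of $K$, $\epsilon$, and the number of trials are illustrative assumptions, not from the article): each of $K$ classifiers errs independently with probability $\epsilon < 0.5$, and the ensemble errs only when more than half of them are wrong at the same time.

```python
# Minimal sketch: majority vote over K independent classifiers, each with
# error rate eps < 0.5. The ensemble is wrong only when > K/2 classifiers err.
import numpy as np

rng = np.random.default_rng(0)
K, eps, n_trials = 15, 0.3, 100_000   # illustrative values

# For each trial, simulate whether each of the K classifiers makes an error.
errors = rng.random((n_trials, K)) < eps          # True = classifier is wrong
majority_wrong = errors.sum(axis=1) > K / 2       # ensemble errs if > K/2 are wrong

print(f"Single-classifier error rate: {eps:.3f}")
print(f"Majority-vote error rate:     {majority_wrong.mean():.3f}")
```

With these settings the majority-vote error rate comes out far below the individual error rate, which is exactly the variance reduction described above.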
There are three main types of ensemble learning methods:
1. Bagging (e.g., Random Forest),
2. Boosting (e.g., AdaBoost, XGBoost), and
3. Stacking.
Bootstrap aggregation, or bagging, approximates this ideal setting by taking a single training set $T_r$ and randomly sub-sampling from it $K$ times (with replacement) to form $K$ training sets $T_{r_1},\ldots,T_{r_K}$. Each of these training sets is used to train a different instance of the same classifier, yielding $K$ classification functions $f_{1}(x),\ldots,f_{K}(x)$. The resulting errors won’t be completely independent, since the data sets aren’t independent; however, the random re-sampling generally introduces enough diversity to decrease the variance and improve performance.
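Here is a rough sketch of that procedure in scikit-learn terms: draw $K$ bootstrap samples, fit one decision tree per sample, and combine the $K$ classification functions by majority vote. The dataset and hyperparameters are illustrative assumptions, not taken from the article.

```python
# Sketch of bagging: K bootstrap samples -> K trees -> majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data
rng = np.random.default_rng(0)
K = 25

models = []
for _ in range(K):
    idx = rng.integers(0, len(X), size=len(X))   # sample with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote over the K trees.
votes = np.stack([m.predict(X) for m in models])
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("Training accuracy of the bagged ensemble:", (y_pred == y).mean())
```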
High-variance, high-capacity models benefit the most from bagging. Historically, decision tree models are most closely associated with it. The random forest classifier is an extension of bagged trees that has been quite successful. The random forests algorithm further decorrelates the learned trees by only considering a random subset of the available features when deciding which variable to split on at each node in the tree.
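As a short illustration of that feature sub-sampling (the data and settings here are assumptions for the example), scikit-learn’s `RandomForestClassifier` exposes it through the `max_features` parameter, which limits how many features each split may consider:

```python
# Random forest = bagged trees + random feature subsets at each split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bagged trees
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
print("Cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())
```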
Boosting is an ensemble method that iteratively re-weights the data set rather than randomly re-sampling it. The main idea is to increase the weight of data cases that are misclassified by the ensemble’s current classifiers, and then add a new classifier that focuses on the cases causing the errors. Assuming that the base classifier can always achieve an error of less than 0.5 on any weighted data sample, it can be shown that the boosting ensemble reduces error.
Boosting begins with the construction of a model from the base training data. A second model is then built based on the results of the first, with the goal of correcting the errors made by the first model. Because this is done in sequence, models are added until the entire dataset is predicted correctly or a maximum number of models has been reached. To accomplish this, the misclassified points are given higher weights, so that each subsequent classifier concentrates on getting those points right.
AdaBoost, short for Adaptive Boosting, is a widely used boosting algorithm for binary classification.
Algorithmically, the process of boosting proceeds as follows:
Step 1: Initialize the dataset by assigning an equal weight to every data point.
Step 2: The weighted data points are fed into the model as input.
Step 3: The model is trained, and the data points that are classified incorrectly are identified.
Step 4: The weights of the incorrectly classified points are increased.
Step 5: If the accuracy obtained is sufficient, go to Step 6; otherwise, go to Step 2.
Step 6: End
As previously stated, each new classifier is trained on the re-weighted dataset, with extra emphasis on the previously misclassified points rather than treating every point equally. This is the fundamental principle that gives boosting its strong performance.
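The following is a brief sketch of that loop using scikit-learn’s `AdaBoostClassifier` (the dataset and number of estimators are illustrative assumptions); by default it adds shallow decision trees one at a time, re-weighting the points the ensemble still gets wrong after each round.

```python
# AdaBoost: sequentially added weak learners on a re-weighted dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# The default base learner is a depth-1 decision tree (a "stump").
boost = AdaBoostClassifier(n_estimators=100, random_state=0)
boost.fit(X_tr, y_tr)
print("Test accuracy:", boost.score(X_te, y_te))
```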
Stacking, unlike bagging and boosting, is an algorithm for combining multiple types of models. The main idea is to split the training data into train, validation, and test sets and train several classifiers on the training portion. The trained classifiers are then used to predict on the validation set, and a new feature representation is created in which each data case consists of the vector of predictions from every classifier in the ensemble. Finally, a combiner meta-classifier is trained on these predictions to minimize the validation error. The additional layer of combiner learning can handle correlated classifiers as well as underperforming ones.
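Below is a compact sketch of stacking with scikit-learn’s `StackingClassifier` (the base models, combiner, and data are assumptions chosen for illustration): the base classifiers’ out-of-fold predictions become the new feature representation on which a logistic-regression combiner is trained.

```python
# Stacking: base model predictions become features for a meta-classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)  # illustrative data

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # the combiner meta-classifier
    cv=5,                                  # out-of-fold predictions for the combiner
)
print("Cross-validated accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```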