Bagging and Python

Aqib Gul
4 min read · Oct 20, 2022

Bootstrap Aggregation

Bagging, also known as Bootstrap Aggregation, is a machine learning ensemble algorithm designed to overcome the overfitting problem of the Decision Tree algorithm and many other individual algorithms. In bagging, several base learners (individual algorithms) are trained in combination, and the input data provided to each of them is generated by random sampling with replacement. Although the base learners are usually Decision Trees, any other method or algorithm can be used.

The method is used in classification as well as regression tasks. The name bootstrap aggregation comes from the bootstrapping of the input data and the final aggregation into a single output prediction, either by majority voting in the case of classification or by averaging in the case of regression.
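To make those two ingredients concrete, here is a toy sketch (my own, not from the article) of a bootstrap sample and of the two aggregation rules, using plain NumPy:

import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                      # a pretend training set of 10 rows

# Bootstrapping: draw n rows *with replacement*, so duplicates appear and
# roughly a third of the rows are left out of each sample.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)                   # e.g. [8 6 5 2 3 0 0 0 1 8]

# Aggregation for classification: majority vote over the base learners...
votes = np.array([1, 0, 1, 1, 0])         # labels predicted by 5 base learners
print(np.bincount(votes).argmax())        # -> 1

# ...and for regression: averaging.
preds = np.array([2.1, 1.9, 2.4, 2.0, 2.2])
print(preds.mean())                       # -> 2.12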


As stated above, we can use ensemble methods to combine different models in two ways: either using a single base learning algorithm that remains the same across all models (a homogeneous ensemble model), or using multiple base learning algorithms that differ from model to model (a heterogeneous ensemble model).
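In scikit-learn terms, the two flavours might look like the following sketch (my own illustration, not from the article; the particular learners chosen for the heterogeneous ensemble are arbitrary):

from sklearn.ensemble import BaggingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Homogeneous: many copies of one base learner, each trained on its own
# bootstrap sample of the data.
homogeneous = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10)

# Heterogeneous: structurally different base learners combined by voting.
heterogeneous = VotingClassifier(estimators=[
    ('tree', DecisionTreeClassifier()),
    ('logreg', LogisticRegression(max_iter=1000)),
    ('knn', KNeighborsClassifier()),
])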

It all started in 1994, when Leo Breiman proposed this algorithm in a paper titled “Bagging Predictors”. Bagging is most effective when the learners are unstable and tend to overfit, i.e., when minor changes in the training data result in significant changes in the predicted output. By combining individual learners trained on samples with varied statistical properties, such as different means and standard deviations, it effectively reduces variance. It performs well for high-variance models like decision trees, but has little impact on learning when used with low-variance models like linear regression. The properties of the dataset determine how many base learners (trees) should be used; using too many trees can be computationally expensive, but it does not result in overfitting.
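As a rough check of the variance-reduction claim (my own sketch, not part of the original article), we can compare the spread of cross-validation scores for a single tree against a bagged ensemble on the same wine dataset used below:

from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

x, y = load_wine(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0)

tree_scores = cross_val_score(tree, x, y, cv=10)
bag_scores = cross_val_score(bagged, x, y, cv=10)

# The bagged scores are typically higher on average and less spread out.
print('tree  : mean %.3f, std %.3f' % (tree_scores.mean(), tree_scores.std()))
print('bagged: mean %.3f, std %.3f' % (bag_scores.mean(), bag_scores.std()))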

Bagging can be represented mathematically as:

$$\hat{f}(x) = \frac{1}{B} \sum_{b=1}^{B} f_b(x)$$

The right-hand side sums over the $B$ base learners $f_b$, and the left-hand side is the averaged strong learner, i.e., the bagged/averaged prediction.

Bagging Classifier in Python

First, let us fit a decision tree algorithm and then move on to the bagging classifier.

1. Importing the data

from sklearn import datasets
data = datasets.load_wine(as_frame=True)

2. Dividing the dataset into the predictors ‘x’ and the target variable ‘y’

x = data.data
y = data.target

3. Splitting the data into a test and training set

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

4. Fitting the Decision Tree algorithm

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(random_state=0)
classifier.fit(x_train, y_train)

5. Predicting the test data using the fitted algorithm

y_pred = classifier.predict(x_test)

6. Checking the accuracy of the results

from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(y_test, y_pred)
print('Accuracy', accuracy_score(y_test, y_pred))  # accuracy of the model on the test data
print('Accuracy', accuracy_score(y_train, classifier.predict(x_train)))  # accuracy on the training data

The accuracies for the test data and training data are:

Accuracy 0.9444444444444444   # test data accuracy
Accuracy 1.0                  # training data accuracy

Bagging Classifier

For bagging we need to set the parameter n_estimators, which is the number of base classifiers that our model will aggregate together.

1. Importing the model

from sklearn.ensemble import BaggingClassifier

2. Create a range of values that will represent the number of estimators to be used in the bagging classifier

estimator_range = [2,4,6,8,10,12,14,16,18,20]

3. To check the performance of the classifier with different numbers of estimators, we create a loop that iterates over the range of values and stores the result from each ensemble for later visualization.

models = []
scores = []

for n_estimators in estimator_range:
    classifier = BaggingClassifier(n_estimators=n_estimators, random_state=22)
    classifier.fit(x_train, y_train)
    models.append(classifier)
    scores.append(accuracy_score(y_test, classifier.predict(x_test)))

4. Get the accuracy and visualize the results

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
plt.plot(estimator_range, scores)
plt.show()

The vertical axis represents accuracy and the horizontal axis the number of estimators.

accuracy_score(y_test, classifier.predict(x_test))

By iterating through different values for the number of estimators, we can see the model's performance increase from 90% to 97%. After 3 estimators the accuracy remains roughly constant. If you set a different random_state, the values you see will vary; that is why it is best practice to use cross-validation to ensure stable results, as sketched below.
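A quick cross-validated check (my own addition, not from the article; n_estimators=14 is an arbitrary value from the flat part of the curve) could look like this:

from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# x and y are the full predictors and target loaded earlier.
bagged = BaggingClassifier(n_estimators=14, random_state=22)
cv_scores = cross_val_score(bagged, x, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))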

We saw that the accuracy of the Decision Tree algorithm was 94% and that it increased to 97.2% after using a bagging classifier instead.

I hope you found this blog helpful. Thanks!
