Model ensembling
Reading time: 10 min
In brief
Article summary
Model ensembling is a machine learning technique that combines multiple models to improve predictive performance and robustness. By aggregating the strengths of various models, it reduces errors and enhances generalization to unseen data.
Main takeaways
- Ensembling combines multiple models to improve accuracy and reduce errors.
- Common techniques include bagging, boosting, stacking, and voting.
- Diversity among models is key to successful ensembling.
- Ensembling may require more computational resources but offers significant performance gains.
Article contents
Model ensembling leverages the idea that no single model can capture all aspects of a dataset. By combining models with complementary strengths, we can achieve better results.
1 — Introduction to model ensembling
In machine learning, no single model can consistently deliver the best results for all datasets and tasks. This limitation arises because different models excel in capturing distinct patterns and characteristics of data. Model ensembling is a technique that addresses this by combining the predictions of multiple models to achieve better accuracy, robustness, and generalization than any individual model alone.
The core idea of ensembling is rooted in the wisdom of crowds: aggregating diverse perspectives (or in this case, model outputs) reduces errors and leads to more reliable predictions. This concept is especially valuable in real-world applications where datasets are noisy, complex, or imbalanced.
Ensembling is not a single method but a collection of approaches, ranging from simple aggregation techniques like voting to more sophisticated methods like stacking and boosting. It has been instrumental in winning machine learning competitions, such as those on Kaggle, and is widely used in industries ranging from finance to healthcare.
In this article, we will explore the key ensembling methods, their benefits, and their practical applications. We’ll also delve into advanced techniques to help you push the boundaries of model performance. Whether you’re working on a classification, regression, or clustering task, ensembling provides a versatile framework for building robust models.
2 — Common ensembling techniques
Here’s an in-depth look at the most common ensembling techniques:
2.1 — Bagging
Bagging, short for Bootstrap Aggregating, is an ensembling technique designed to reduce variance and improve model stability. It works by training multiple instances of the same model on different subsets of the training data, generated through bootstrapping—random sampling with replacement. Each model in the ensemble learns slightly different patterns from the data due to these varied training subsets. Once trained, their predictions are aggregated, either by averaging (for regression tasks) or majority voting (for classification tasks). The aggregation step smooths out the errors of individual models, resulting in improved generalization.
A prominent example of bagging is the Random Forest algorithm, which builds an ensemble of decision trees. Bagging is particularly effective for high-variance models like decision trees, as it helps counteract their tendency to overfit the training data while maintaining their ability to capture complex relationships.
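As a concrete illustration, here is a minimal bagging sketch using scikit-learn; the synthetic dataset and the number of estimators are illustrative assumptions rather than tuned choices.
```python
# Minimal bagging sketch (assumes scikit-learn is installed; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Bagging: 100 bootstrap samples, one tree per sample (BaggingClassifier's
# default base estimator is a decision tree).
bagging = BaggingClassifier(n_estimators=100, random_state=42)

# Compare a single high-variance tree against the bagged ensemble.
tree_score = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=5).mean()
bag_score = cross_val_score(bagging, X, y, cv=5).mean()
print(f"Single tree: {tree_score:.3f} | Bagged trees: {bag_score:.3f}")
```
The aggregation over bootstrap samples is what smooths out the individual trees' overfitting, which is why the bagged score is usually the higher of the two.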
2.2 — Boosting
Boosting is an ensembling technique that focuses on reducing bias by sequentially training models, where each subsequent model attempts to correct the errors made by its predecessors. Unlike bagging, boosting does not train models independently but builds them iteratively, giving more weight to misclassified or poorly predicted data points in each step. This process allows the ensemble to concentrate on the hardest-to-predict instances, progressively improving its accuracy. The final predictions are made by combining the outputs of all models, often using a weighted average or sum.
Algorithms like AdaBoost, Gradient Boosting Machines (GBM), and XGBoost are popular implementations of boosting. These techniques are particularly effective for handling complex datasets and achieving high predictive performance. However, boosting is more sensitive to overfitting, especially when working with noisy data, so careful regularization and hyperparameter tuning are essential.
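Below is a minimal boosting sketch using scikit-learn's GradientBoostingClassifier; the dataset is synthetic and the hyperparameters are illustrative assumptions, not tuned values.
```python
# Minimal gradient boosting sketch (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Trees are added sequentially; each new tree fits the errors of the current
# ensemble, and the learning rate shrinks each tree's contribution.
gbm = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,  # smaller values need more trees but help resist overfitting
    max_depth=3,
    random_state=0,
)
gbm.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, gbm.predict(X_test)))
```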
2.3 — Stacking
Stacking is an advanced ensembling technique that combines multiple models, referred to as base learners, by training a meta-model to integrate their predictions. Unlike bagging and boosting, where the focus is on improving individual models, stacking leverages the strengths of diverse models, regardless of their type or complexity. The base learners are trained independently on the dataset, and their predictions (or transformed outputs) are passed as features to the meta-model, which learns how to best combine them to make the final prediction.
For example, a stacking ensemble might combine the outputs of a k-Nearest Neighbors (kNN) classifier, a decision tree, and a support vector machine (SVM), with a logistic regression model as the meta-learner. This flexibility allows stacking to exploit the complementary strengths of different algorithms, often leading to superior performance. However, stacking requires careful consideration of overfitting, especially for the meta-model, which should not rely too heavily on a single base learner. Techniques like cross-validation can be used to ensure robust training and validation of the ensemble.
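The following sketch mirrors that example with scikit-learn's StackingClassifier, assuming a synthetic dataset; the base learners and meta-learner are the ones named above, with illustrative hyperparameters.
```python
# Minimal stacking sketch (assumes scikit-learn; data is synthetic).
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
        ("svm", SVC(probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,  # the meta-model is trained on out-of-fold predictions, limiting overfitting
)
print("Stacking CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```
Note that the internal cross-validation (cv=5) is exactly the safeguard mentioned above: the meta-model never sees predictions that the base learners made on their own training data.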
2.4 — Voting
Voting is one of the simplest and most intuitive ensembling techniques, where the predictions of multiple models are combined to make a final decision. It is particularly useful in classification tasks, where it can be implemented as either “hard voting” or “soft voting”. In hard voting, each model contributes a single vote for a class label, and the majority class is selected as the ensemble’s prediction. In soft voting, the models provide probabilities for each class, and the class with the highest average probability is chosen.
Voting ensembles work best when the base models are diverse and relatively accurate, as this reduces the likelihood of them making the same mistakes. The method is computationally efficient, since it requires no additional training beyond the base models. However, its simplicity can be a limitation when the base models' predictions have complex dependencies, in which case more sophisticated techniques like stacking may be more effective. Voting is a go-to technique for quickly improving performance with minimal added complexity; for regression tasks, the analogous approach is simply averaging the base models' predictions.
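Here is a minimal sketch contrasting hard and soft voting with scikit-learn's VotingClassifier, on an assumed synthetic dataset with illustrative base models.
```python
# Minimal voting sketch (assumes scikit-learn; data and models are illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=1)),
    ("nb", GaussianNB()),
]

hard_vote = VotingClassifier(estimators=estimators, voting="hard")  # majority class label
soft_vote = VotingClassifier(estimators=estimators, voting="soft")  # highest average probability

for name, model in [("hard", hard_vote), ("soft", soft_vote)]:
    print(name, "voting accuracy:", cross_val_score(model, X, y, cv=5).mean())
```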
To go further
3 — Advanced techniques in model ensembling
3.1 — Weighted ensembling
In weighted ensembling, the outputs of individual models are combined with specific weights assigned to each model, reflecting their importance or performance. This approach allows better-performing models to have a greater influence on the final prediction.
Key steps:
- Train multiple models independently.
- Evaluate their performance on a validation set.
- Assign weights to each model based on metrics such as accuracy, precision, recall, or F1-score.
- Combine predictions using the formula: $$P_{final} = \sum_{i=1}^{n} w_i P_i$$ where $w_i$ is the weight assigned to model $i$ (typically normalized so that $\sum_{i=1}^{n} w_i = 1$) and $P_i$ is the prediction of model $i$ (see the sketch after the example applications below).
Example applications:
- Financial forecasting, where some models may handle specific market conditions better than others.
- Image classification tasks where certain architectures excel at detecting specific features.
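The sketch below walks through these steps on an assumed synthetic dataset: validation accuracies are normalized into weights $w_i$, and predicted probabilities are combined as $P_{final} = \sum_i w_i P_i$. The models and split sizes are illustrative choices.
```python
# Weighted ensembling sketch (assumes scikit-learn; data is synthetic).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=7)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=7)

# Steps 1-2: train the models independently and evaluate them on the validation set.
models = [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=7)]
val_scores = []
for model in models:
    model.fit(X_train, y_train)
    val_scores.append(accuracy_score(y_val, model.predict(X_val)))

# Step 3: turn the validation accuracies into weights that sum to 1.
weights = np.array(val_scores) / np.sum(val_scores)

# Step 4: P_final = sum_i w_i * P_i, applied to predicted class probabilities.
# (Reusing the validation set keeps the sketch short; in practice the ensemble
# should be evaluated on a separate test set.)
probas = [model.predict_proba(X_val) for model in models]
weighted_proba = sum(w * p for w, p in zip(weights, probas))
ensemble_pred = weighted_proba.argmax(axis=1)
print("Weighted ensemble accuracy:", accuracy_score(y_val, ensemble_pred))
```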
3.2 — Blending
Blending is similar to stacking but uses a holdout set (a portion of the training data that is set aside) to train the meta-model instead of performing cross-validation. This reduces computational complexity but may result in less robust meta-model training.
Key steps:
- Split the training data into two parts: the majority for training base models, and a small holdout set for training the meta-model.
- Train base models on the larger subset.
- Use the holdout set to generate predictions from base models and train the meta-model on these predictions.
Advantages:
- Simpler than stacking since it doesn’t require cross-validation.
- Faster to train and implement.
Limitations:
- Relies on the quality and representativeness of the holdout dataset.
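A minimal blending sketch, assuming a synthetic dataset and arbitrarily chosen base models, could look like this.
```python
# Blending sketch (assumes scikit-learn; data, models and split sizes are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=3)
# Three-way split: base-model training set, holdout set (for the meta-model), test set.
X_base, X_rest, y_base, y_rest = train_test_split(X, y, test_size=0.4, random_state=3)
X_hold, X_test, y_hold, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=3)

base_models = [RandomForestClassifier(random_state=3), KNeighborsClassifier()]
for model in base_models:
    model.fit(X_base, y_base)

# Meta-features: each base model's predicted probability of the positive class.
meta_train = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])
meta_test = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])

meta_model = LogisticRegression()
meta_model.fit(meta_train, y_hold)
print("Blended accuracy:", accuracy_score(y_test, meta_model.predict(meta_test)))
```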
3.3 — Hybrid models
Hybrid models combine unsupervised techniques like clustering with supervised learning to leverage the strengths of both. For example, kMeans clustering can be used to preprocess or group the data, which is then fed into supervised models like kNN or decision trees.
Applications:
- Semi-supervised learning: Use unsupervised clustering to label data and supervised learning for final predictions.
- Data augmentation: Generate synthetic samples by modifying cluster centroids to balance class distributions.
- Feature engineering: Extract cluster information as new features to enrich the dataset.
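As a sketch of the kMeans-plus-decision-tree combination described above (synthetic data, illustrative parameters), the cluster assignment is simply appended as an extra engineered feature.
```python
# Hybrid model sketch (assumes scikit-learn; data and parameters are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=5)

# Unsupervised step: fit kMeans on the training features only, then use the
# cluster id as an additional feature for both splits.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=5).fit(X_train)
X_train_aug = np.column_stack([X_train, kmeans.predict(X_train)])
X_test_aug = np.column_stack([X_test, kmeans.predict(X_test)])

# Supervised step: a decision tree trained on the augmented feature set.
tree = DecisionTreeClassifier(max_depth=6, random_state=5).fit(X_train_aug, y_train)
print("Hybrid model accuracy:", accuracy_score(y_test, tree.predict(X_test_aug)))
```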
3.4 — Diversity maximization
The success of ensembling depends on the diversity of its base models: diverse models are more likely to capture different patterns in the data and to make different errors. Techniques for maximizing diversity include:
- Using models with different architectures (e.g., decision trees, neural networks, SVMs).
- Training models on different subsets of data (bagging).
- Applying different feature subsets to each model (feature bagging).
- Varying hyperparameters significantly between models.
Metrics for diversity:
- Error correlation: low correlation between the errors of different models indicates greater diversity (see the sketch after this list).
- Ensemble accuracy gain: Compare the ensemble’s accuracy to the average accuracy of individual models.
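The sketch below estimates error correlation on an assumed synthetic dataset: each model's mistakes on a test set are encoded as a binary vector, and the pairwise correlations between those vectors are reported.
```python
# Error-correlation sketch (assumes scikit-learn; data and models are illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=11)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=11)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(max_depth=5, random_state=11),
    "svm": SVC(random_state=11),
}

# Binary error vectors: 1 where a model misclassifies a test sample, 0 otherwise.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = (model.predict(X_test) != y_test).astype(int)

names = list(errors)
error_corr = np.corrcoef([errors[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"error correlation {names[i]} vs {names[j]}: {error_corr[i, j]:.2f}")
```
Pairs with low error correlation are the ones most worth combining, since their mistakes are least likely to coincide.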
3.5 — Specialized ensembles for deep learning
Deep learning models, being inherently complex, can also benefit from ensembling.
Techniques include:
- Snapshot Ensembling: Train a single deep learning model with cyclical learning rates, saving “snapshots” of the model at different points. Combine these snapshots as an ensemble.
- Model Distillation: Use a large ensemble to train a single simpler model (student) to achieve comparable performance with reduced computational costs.
- Ensemble Averaging: Combine predictions from multiple independently trained deep neural networks.
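As a minimal, framework-agnostic sketch of ensemble averaging, the arrays below stand in for softmax outputs of three independently trained networks (the values are made up for illustration).
```python
# Ensemble averaging sketch: average class probabilities across networks, then
# take the most likely class. The arrays are placeholders for real model outputs.
import numpy as np

# Hypothetical predicted probabilities for 4 samples over 3 classes, from 3 networks.
net_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]])
net_b = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.2, 0.5, 0.3], [0.4, 0.5, 0.1]])
net_c = np.array([[0.8, 0.1, 0.1], [0.1, 0.7, 0.2], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]])

mean_proba = np.mean([net_a, net_b, net_c], axis=0)
ensemble_classes = mean_proba.argmax(axis=1)
print(ensemble_classes)
```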
Applications:
- Computer vision (e.g., ensembling CNNs for image classification).
- Natural language processing (e.g., ensembling transformers like BERT or GPT).
To go beyond
- A Comprehensive Guide to Ensemble Learning (with Python codes).
A nice introduction to ensemble learning with practical examples in Python.