OOP and ML using sklearn

Reading time: 10 min

In brief

Article summary

This article explains how the sklearn library uses object-oriented programming to define classes that are well suited to machine learning.

Main takeaways

  • Sklearn implements an estimator class that is used as a parent class to define two other classes: predictors and transformers.

  • The sklearn predictor class is mostly used for supervised learning.

  • The sklearn transformer class is mostly used for unsupervised learning.

Article contents

1 — Introduction

How could we implement machine learning algorithms in a flexible and efficient way using object-oriented programming? We need to think about the different steps taken when using ML.

As we have studied in the session about standard algorithms for ML, there are two main types of machine learning approaches: supervised and unsupervised learning. While both of these types use some data as input for training, unsupervised learning does not need labels as input. In addition, after training, a supervised learning model can generate predictions of labels, a clustering method will assign clusters, and a decomposition technique such as Principal Component Analysis will transform the input data into a new coordinate system.

Therefore, we need an API that can ideally deal with all these cases.

2 — Estimators, predictors and transformers

The choice made by the sklearn library is to consider that all ML models are estimators. An estimator has a training phase that is implemented in a .fit() method. Then, depending on the type of ML approach, another class inheriting from the estimator class is used, defining a .fit() method with a different prototype:

  • For supervised learning, a predictor class implements .fit(X, y) to use data and labels.
  • For unsupervised learning, a transformer class implements .fit(X) to use data only.
  • In both supervised and unsupervised learning, the .fit() method must be called to train the model before it can be used (see the sketch after this list).
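
To make this concrete, here is a minimal sketch of the two .fit() prototypes using two standard sklearn estimators, LogisticRegression (a predictor) and PCA (a transformer); the toy data is made up for the example.

```python
# A minimal sketch (assuming sklearn and numpy are installed); any other
# predictor or transformer would be used the same way.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])  # data
y = np.array([0, 0, 1, 1])                                      # labels

# Supervised: a predictor is trained with data AND labels.
classifier = LogisticRegression()
classifier.fit(X, y)

# Unsupervised: a transformer is trained with data only.
pca = PCA(n_components=1)
pca.fit(X)
```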

After being trained, the model can be used. Here again, the method to call is slightly different depending on the type of ML approach (a short sketch follows the list):

  • For supervised learning, the predictor class implements a .predict(X) method to generate predictions of labels using data.
  • For unsupervised learning, the transformer class implements .transform(X) to put data into a new space (e.g. distances to clusters, or a new coordinate system). In the specific case of clustering techniques, a transformer class can also implement a .predict(X) method to assign cluster labels to input data.
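
Below is a minimal sketch of the usage phase, using the same toy data as above; KMeans is included because it is a clustering estimator that exposes both .transform() (distances to the cluster centres) and .predict() (assigned cluster labels).

```python
# A minimal sketch of using trained estimators; the data is the same toy
# example as in the previous sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
y = np.array([0, 0, 1, 1])

# Supervised: after fitting, the predictor generates label predictions.
y_pred = LogisticRegression().fit(X, y).predict(X)

# Unsupervised: after fitting, the transformer maps data to a new space.
X_reduced = PCA(n_components=1).fit(X).transform(X)

# Clustering: KMeans acts both as a transformer (distances to the cluster
# centres) and as a predictor (assigned cluster labels).
kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
distances = kmeans.transform(X)
clusters = kmeans.predict(X)
```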

In the practical, we will implement a generic predictor class for supervised learning, which we will name GenericClassifier, and then use the KNN code developed previously to implement the KNN algorithm with this class. A possible skeleton for such a class is sketched below.
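
The name GenericClassifier comes from the practical; the method bodies below are only illustrative placeholders (the real logic, e.g. KNN, is what the practical will fill in). The skeleton simply follows the .fit(X, y) / .predict(X) conventions described above.

```python
# A possible skeleton, for illustration only; the bodies of fit() and
# predict() are placeholder assumptions, not the actual practical's code.
import numpy as np

class GenericClassifier:
    """A generic supervised predictor: fit(X, y), then predict(X)."""

    def fit(self, X, y):
        # Training phase: here we simply store the data and labels.
        self.X_train_ = np.asarray(X)
        self.y_train_ = np.asarray(y)
        return self  # returning self allows chaining, as in sklearn

    def predict(self, X):
        # Prediction phase: a trivial placeholder that predicts the most
        # frequent training label for every sample.
        values, counts = np.unique(self.y_train_, return_counts=True)
        majority = values[np.argmax(counts)]
        return np.full(len(X), majority)

# Usage sketch:
# clf = GenericClassifier().fit(X_train, y_train)
# y_pred = clf.predict(X_test)
```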

To go further

3 — Examples of classes in sklearn

Here, we presented a simplified version of OOP applied to machine learning. A more detailed explanation of possible use cases, in particular in sklearn, can be found on this page. Examples of implementations of new classes following these conventions are given here. We encourage you to take the time to read these pages.

To go beyond

Looks like this section is empty!

Anything you would have liked to see here? Let us know on the Discord server! Maybe we can add it quickly. Otherwise, it will help us improve the course for next year!