Practical session

Duration: 1h30

Presentation & objectives

The goal of this session is to put object-oriented programming into practice in the context of artificial intelligence, in particular with supervised learning approaches.

Important

The aim of this session is to help you master important notions in computer science. An intelligent programming assistant such as GitHub Copilot, which you may have installed already, will be able to provide you with a solution to these exercises based on a wisely chosen file name alone.

For the sake of training, we advise you to disable such tools first.

At the end of the practical activity, we suggest you work on the exercises again with these tools activated. Following these two steps will improve your skills both fundamentally and practically.

We also provide the solutions to the exercises. Make sure to check them only after you have a solution of your own, for comparison purposes! Even if you are sure your solution is correct, please have a look at them, as they sometimes provide additional elements you may have missed.

1 — Implement a generic classifier using OOP

Implement a class that defines a generic classifier with methods fit, predict and score. The class must be defined in a file named classifiers.py and be named GenericClassifier.

The goal of this class is to define how a classifier should behave; we will use it in the next exercises to actually implement machine learning algorithms.

  • The class must have a constructor without parameters.
  • The method fit takes two parameters X and y (both of type np.ndarray) and will be used to train the model (here, it does not do anything).
  • The method fit modifies a private boolean attribute self._isfitted of the class to indicate that the model has been trained.
  • The method predict takes a single parameter X (of type np.ndarray) and returns predicted labels predictions.
  • The method score takes two parameters X and y (both of type np.ndarray) and returns the accuracy of the model on the given data X according to ground truth labels y.
  • Methods predict and score must raise an error if they are called while the model has not been trained yet.
Correction
import numpy as np

class GenericClassifier:
    def __init__(self):
        """Initialize the GenericClassifier with default attributes."""
        self._isfitted = False  # Private attribute to indicate if the model is trained
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Generic Method for training the classifier.
        Needs to be implemented in an actual classifier, here it does nothing.

        Parameters:
        - X: np.ndarray, training data.
        - y: np.ndarray, labels for training data.
        """
        # Set the internal flag to indicate the model is trained
        self._isfitted = True
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict labels for given data.

        Parameters:
        - X: np.ndarray, input data to predict labels for.

        Returns:
        - predictions: np.ndarray, predicted labels.

        Raises:
        - ValueError: If the model is called without being trained.
        """
        if not self._isfitted:
            raise ValueError("The model must be trained (call fit) before predictions can be made.")
        
        # Prediction (to be implemented in subclasses): here we just return zeros
        return np.zeros(X.shape[0], dtype=int)
    
    def score(self, X: np.ndarray, y: np.ndarray) -> float:
        """
        Calculate the accuracy of the model on the given data.

        Parameters:
        - X: np.ndarray, input data.
        - y: np.ndarray, ground truth labels.

        Returns:
        - accuracy: float, accuracy of the model on the data X given ground truth labels y.

        Raises:
        - ValueError: If the model is called without being trained.
        """
        if not self._isfitted:
            raise ValueError("The model must be trained (call fit) before scoring can be performed.")
        
        # Placeholder prediction logic
        predictions = self.predict(X)
        
        # Calculate and return accuracy
        return np.mean(predictions == y)

Then, create a test file named test_classifier.py that tests the three methods of the class individually.
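
Such a test file could look like the following minimal sketch (pytest as the test runner is an assumption; any assertion-based framework works):

import numpy as np
import pytest

from classifiers import GenericClassifier

def test_fit_sets_fitted_flag():
    clf = GenericClassifier()
    X, y = np.zeros((10, 4)), np.zeros(10, dtype=int)
    clf.fit(X, y)
    assert clf._isfitted

def test_predict_raises_before_fit():
    clf = GenericClassifier()
    with pytest.raises(ValueError):
        clf.predict(np.zeros((5, 4)))

def test_score_on_constant_labels():
    clf = GenericClassifier()
    X, y = np.zeros((10, 4)), np.zeros(10, dtype=int)
    clf.fit(X, y)
    # The generic predict returns zeros, so accuracy on all-zero labels is 1.0
    assert clf.score(X, y) == 1.0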

2 — Implement a $k$-NN classifier using the generic classifier

Implement the KNN classifier coded in session algoS6 as a class inheriting from GenericClassifier.

The class must be defined in a file named knn_classifier.py and be named KNNClassifier. The parameter k is specific to an instance of the class.

  • The class must have a constructor that calls the constructor of GenericClassifier and takes a single parameter k, which is the number of neighbors to consider.
  • The method fit takes two parameters X and y (both of type np.ndarray) and trains the model using the algorithm described in session algo 6.
  • The method predict takes a single parameter X (of type np.ndarray) and returns the predicted labels.
  • The method fit modifies a private boolean attribute self._isfitted of the class to indicate that the model has been trained.
  • The method predict raises an error if it is called while the model has not been trained yet.
  • You don’t have to implement the method score again, as its definition is inherited from GenericClassifier.

Then, create a file named main.py that uses the class to train a model on the digits dataset and print the accuracy of the model on the test set.

Here is a snippet of code to load the digits dataset:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the digits dataset
digits = load_digits()

# Split the data into a training and test set
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

Ideally, you also create a test file named test_knn_classifier.py that tests the three methods of the class individually.
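
For instance, a minimal sketch of such tests (assuming pytest and the constructor signature used in the correction below) could be:

import numpy as np
import pytest

from knn_classifier import KNNClassifier

def test_predict_raises_before_fit():
    clf = KNNClassifier(n_neighbors=1)
    with pytest.raises(ValueError):
        clf.predict(np.zeros((5, 4)))

def test_fit_sets_fitted_flag():
    clf = KNNClassifier(n_neighbors=1)
    clf.fit(np.zeros((3, 2)), np.zeros(3, dtype=int))
    assert clf._isfitted

def test_one_nn_memorizes_training_data():
    # With k=1 and distinct points, each training point is its own nearest neighbor
    X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
    y = np.array([0, 1, 0])
    clf = KNNClassifier(n_neighbors=1)
    clf.fit(X, y)
    assert np.array_equal(clf.predict(X), y)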

Correction
import numpy as np

from classifiers import GenericClassifier

class KNNClassifier(GenericClassifier):
    def __init__(self, n_neighbors: int = 3):
        """
        Initialize the KNNClassifier with the number of neighbors.
        
        Parameters:
        - n_neighbors: int, the number of nearest neighbors to consider.
        """
        super().__init__()
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Store the training data and mark the model as fitted.
        
        Parameters:
        - X: np.ndarray, training data.
        - y: np.ndarray, labels for training data.
        """
        self.X_train = X
        self.y_train = y
        super().fit(X, y)
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict labels for the given data using the KNN algorithm.
        
        Parameters:
        - X: np.ndarray, input data to predict labels for.
        
        Returns:
        - predictions: np.ndarray, predicted labels.
        """
        if not self._isfitted:
            raise ValueError("The model must be trained (call fit) before predictions can be made.")
        
        def euclidean_distance(x1, x2):
            """Compute the Euclidean distance between two points."""
            return np.sqrt(np.sum((x1 - x2) ** 2))

        # Predict labels for each test instance
        predictions = []
        for x_test in X:
            # Compute distances between the test point and all training points
            distances = [euclidean_distance(x_test, x_train) for x_train in self.X_train]
            
            # Get the indices of the k nearest neighbors
            k_indices = np.argsort(distances)[:self.n_neighbors]
            
            # Get the labels of the k nearest neighbors
            k_labels = np.array([self.y_train[i] for i in k_indices])
            
            # Perform majority vote with numpy
            unique_labels, counts = np.unique(k_labels, return_counts=True)
            most_common = unique_labels[np.argmax(counts)]
            predictions.append(most_common)
        
        return np.array(predictions)
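
The main.py requested above could then look like this minimal sketch (the value k=3 is an arbitrary choice):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

from knn_classifier import KNNClassifier

# Load and split the digits dataset
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

# Train the k-NN classifier and evaluate it on the test set
knn = KNNClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(f"Test accuracy: {knn.score(X_test, y_test):.3f}")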

3 — Implement model ensembling

Now that you have implemented these classifiers, you can use them to create an ensemble model. Code a class ModelEnsemble that inherits from GenericClassifier, takes as argument a list of trained classifiers, and performs a majority vote over all classifiers. The class must be defined in a file named model_ensemble.py and be named ModelEnsemble.

  • The class takes as argument a list of trained classifiers.
  • At initialization, the class raises an error if the provided list of classifiers is empty.
  • The class must implement a fit method that checks that all the classifiers in the list have already been trained.
  • The class must have a method predict that takes a single parameter X (of type np.ndarray) and returns the predicted labels. This method raises an error if the classifiers in the list have not been trained.

Then, create a file named main.py that uses the class to train an ensemble model on the digits dataset and print the accuracy of the model on the test set. As an ensemble model, you can use a list of KNNs with different values of K.

Correction
import numpy as np

from classifiers import GenericClassifier

class ModelEnsemble(GenericClassifier):
    def __init__(self, classifiers: list):
        """
        Initialize the ModelEnsemble with a list of trained classifiers.

        Parameters:
        - classifiers: list, a list of trained classifier objects.
        """
        super().__init__()
        if len(classifiers) == 0:
            raise ValueError("The list of classifiers cannot be empty.")
        self.classifiers = classifiers
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Check that all classifiers in the ensemble are already trained.

        Parameters:
        - X: np.ndarray, training data (not used, only to match GenericClassifier interface).
        - y: np.ndarray, labels for training data (not used, only to match GenericClassifier interface).
        
        Raises:
        - ValueError: If any classifier in the ensemble is not trained.
        """
        for clf in self.classifiers:
            if not clf._isfitted:
                raise ValueError("All classifiers in the ensemble must be trained before using the ensemble.")
        self._isfitted = True  # Mark the ensemble as ready to use
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Perform majority voting to predict labels for the given data.

        Parameters:
        - X: np.ndarray, input data to predict labels for.

        Returns:
        - predictions: np.ndarray, predicted labels by majority vote.
        
        Raises:
        - ValueError: If any classifier in the ensemble has not been trained.
        """
        # Collect predictions from all classifiers
        all_predictions = []
        for clf in self.classifiers:
            if not hasattr(clf, "_isfitted") or not clf._isfitted:
                raise ValueError("All classifiers must be trained before using the ensemble.")
            all_predictions.append(clf.predict(X))
        
        # Stack predictions into a 2D array
        all_predictions = np.stack(all_predictions, axis=1)
        
        # Perform majority voting
        unique_labels = np.unique(all_predictions)
        label_counts = np.zeros((all_predictions.shape[0], unique_labels.size), dtype=int)
        
        for i, label in enumerate(unique_labels):
            label_counts[:, i] = np.sum(all_predictions == label, axis=1)
        
        majority_votes = unique_labels[np.argmax(label_counts, axis=1)]
        
        return majority_votes
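
A minimal sketch of the requested main.py could be the following (the list of k values is an arbitrary choice):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

from knn_classifier import KNNClassifier
from model_ensemble import ModelEnsemble

# Load and split the digits dataset
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

# Train several k-NN classifiers with different values of k
classifiers = [KNNClassifier(n_neighbors=k) for k in (1, 3, 5)]
for clf in classifiers:
    clf.fit(X_train, y_train)

# Build the ensemble; fit only checks that all classifiers are trained
ensemble = ModelEnsemble(classifiers)
ensemble.fit(X_train, y_train)
print(f"Ensemble test accuracy: {ensemble.score(X_test, y_test):.3f}")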

4 — Cross-validated model ensembling

Analyze the performance of the ensemble model from the previous section. Accuracy should be very high, and the ensemble model is not significantly better than the individual models.

In order to better see the interest of ensemble models, and also to show a more realistic use case of model ensembling, we will simulate a situation in which each classifier sees a different split of the training data:

  • Keep the test set X_test, y_test the same as in the previous exercise.
  • Split the previous training set X_train, y_train into P different subsets.
  • Train P K-NN algorithms independently, one on each subset, all with the same value of K. As the splits are different, each K-NN will have a slightly different performance on the test set.
  • Evaluate the performance of an ensemble model combining all P K-NNs.

You can experiment with this framework by using different numbers of splits P and different values of K. We suggest starting with P=5 and K=1, but feel free to experiment.
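
A possible sketch of this experiment is given below (using np.array_split to produce the P subsets is one option among others; any disjoint split works):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

from knn_classifier import KNNClassifier
from model_ensemble import ModelEnsemble

P, K = 5, 1  # number of splits and number of neighbors

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, random_state=0)

# Split the training set into P disjoint subsets
X_splits = np.array_split(X_train, P)
y_splits = np.array_split(y_train, P)

# Train one k-NN per subset and report its individual accuracy
classifiers = []
for X_sub, y_sub in zip(X_splits, y_splits):
    clf = KNNClassifier(n_neighbors=K)
    clf.fit(X_sub, y_sub)
    print(f"Individual accuracy: {clf.score(X_test, y_test):.3f}")
    classifiers.append(clf)

# Evaluate the ensemble of the P classifiers
ensemble = ModelEnsemble(classifiers)
ensemble.fit(X_train, y_train)
print(f"Ensemble accuracy: {ensemble.score(X_test, y_test):.3f}")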

5 — Optimize your solutions

You can now use AI tools such as GitHub Copilot or ChatGPT, either to generate the solutions or to improve the first solution you came up with! Try to do this for all the exercises above, to see the differences with your own solutions.

To go further

Code a class CVGridSearch that takes a dataset (data and labels) and a range of a hyperparameter (e.g. a range of integer values of K), divides the dataset into a training and a validation set, and uses the validation set to find the best hyperparameter value. You can check its validity with your KNN implementation.
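
As a starting point, here is a minimal sketch (the interface, in particular the classifier_cls, param_name, param_range and val_ratio arguments, is one possible design among others):

import numpy as np

class CVGridSearch:
    def __init__(self, classifier_cls, param_name: str, param_range, val_ratio: float = 0.2):
        self.classifier_cls = classifier_cls
        self.param_name = param_name
        self.param_range = param_range
        self.val_ratio = val_ratio
        self.best_param = None
        self.best_score = None

    def search(self, X: np.ndarray, y: np.ndarray):
        # Shuffle the data, then split it into training and validation sets
        rng = np.random.default_rng(0)
        indices = rng.permutation(len(X))
        n_val = int(len(X) * self.val_ratio)
        val_idx, train_idx = indices[:n_val], indices[n_val:]

        # Train one classifier per hyperparameter value and keep the best one
        self.best_score = -np.inf
        for value in self.param_range:
            clf = self.classifier_cls(**{self.param_name: value})
            clf.fit(X[train_idx], y[train_idx])
            score = clf.score(X[val_idx], y[val_idx])
            if score > self.best_score:
                self.best_score, self.best_param = score, value
        return self.best_param

With the classes above, CVGridSearch(KNNClassifier, "n_neighbors", range(1, 11)).search(X_train, y_train) would return the value of k that performs best on the held-out validation split.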

To go beyond

7 — Multilayer perceptron

The multilayer perceptron (MLP) is a basic building block very commonly used in machine learning solutions based on Deep Learning. This page uses a class definition to build an MLP from scratch using numpy.
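
As a taste of what that page covers, here is a minimal sketch of a two-layer MLP forward pass in numpy (the layer structure, ReLU activation and random initialization are illustrative choices; training is deliberately left out):

import numpy as np

class MLP:
    def __init__(self, n_in: int, n_hidden: int, n_out: int):
        # Small random weights and zero biases for the two layers
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out))
        self.b2 = np.zeros(n_out)

    def forward(self, X: np.ndarray) -> np.ndarray:
        # Hidden layer with ReLU activation, followed by a linear output layer
        h = np.maximum(0, X @ self.W1 + self.b1)
        return h @ self.W2 + self.b2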

8 — Automatic differentiation

Another interesting (but complex) application of OO programming is given by pytorch, with the notion of automatic differentiation. In summary, pytorch performs operations on tensors just like numpy does, but it also automatically tracks the history and dependencies of all computations and their gradients. This enables a very straightforward implementation of Deep Learning architectures. More information can be found in the official pytorch documentation.
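
For example, the following short snippet (using only standard pytorch calls) computes a gradient automatically:

import torch

# Create a tensor and ask pytorch to track operations on it
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Build a computation: y = x_0^2 + x_1^2; the graph is recorded automatically
y = (x ** 2).sum()

# Backpropagate through the recorded graph: fills x.grad with dy/dx = 2x
y.backward()
print(x.grad)  # tensor([4., 6.])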

You can also find a full pytorch tutorial here.