Practical session

Duration: 1h40

Presentation & objectives

In this practical activity, we guide you through a first use of machine learning tools. We will rely on data in matrix form, and will implement both a $k$-NN and a $k$-means algorithm. Later, in programming session 5, we will rely on existing libraries for more efficiency.

Important

The aim of this session is to help you master important notions in computer science. An intelligent programming assistant such as GitHub Copilot, which you may have installed already, is able to provide you with a solution to these exercises based only on a wisely chosen file name.

For the sake of training, we advise you to disable such tools first.

At the end of the practical activity, we suggest you work on the exercises again with these tools activated. Following these two steps will improve your skills both fundamentally and practically.

Also, we provide you with the solutions to the exercises. Make sure to check them only after you have a solution of your own, for comparison purposes! Even if you are sure your solution is correct, please have a look at them, as they sometimes provide additional elements you may have missed.

Activity contents

1 — Data Preparation

The goal of this first exercise is to:

  • Familiarize yourself with dealing with multidimensional data.
  • Generate a synthetic dataset composed of clouds of points.
  • Split this dataset into a training set and a test set.

For this, we will use the numpy package in Python.

1.1 — How to create arrays

Numpy provides multiple functions to build arrays:

  • Create arrays with np.array(), np.zeros(), np.ones(), np.arange().
  • Use the np.shape() function on any numpy array to know its shape (number of rows, columns, etc.).

The np.array() function converts a Python list (or a list of lists) into a numpy array. You can also create arrays filled with zeros or ones using np.zeros() and np.ones().

Here are a few examples of how to define arrays in Python using numpy:

# Needed imports
import numpy as np



# Creating a 1D array from a list
array_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", array_1d)

# Creating a 2D array (list of lists)
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D Array:\n", array_2d)

# Creating a 1D array of zeros
zeros_1d = np.zeros(5)
print("1D Array of zeros:", zeros_1d)

# Creating a 2D array (3 rows, 4 columns) of zeros
zeros_2d = np.zeros((3, 4))
print("2D Array of zeros:\n", zeros_2d)

# Creating a 2D array (2 rows, 3 columns) of ones
ones_2d = np.ones((2, 3))
print("2D Array of ones:\n", ones_2d)
print(np.shape(ones_2d))

Once numpy arrays have been defined, you can access their values (or groups of values), using square brackets, like lists, but separating dimensions using commas (,). You can also use the colon (:) to select all elements of a dimension, and use negative integers to count backwards.

Illustration with a few useful indexing examples:

# Access values (or groups of values) using square brackets, separating dimensions with `,`
print("Array1D: " , array_1d)
print("Array2D: " , array_2d)

# Accessing elements of a 1D array
print("Element of array1d at index 0:", array_1d[0])
print("Element at array1d at index 1:", array_1d[1])

# Accessing elements of a 2D array
print("Element of array2d at 0, 0:", array_2d[0, 0])
print("Element at array2d 0, 1:", array_2d[0, 1])

# Accessing all elements of a row
print("All elements of array2d row 0:", array_2d[0])
print("All elements of array2d, column 0: ", array_2d[:,0])

# Accessing the last row
print("Elements of last row:", array_2d[-1])
print("Element at last row, second column:", array_2d[-1, 1])
print("Elements of second to last column:", array_2d[:,-2])

# It is also possible to use an array of booleans (`True` or `False`, or ones and zeros) in order to select specific rows / columns

# Select elements that are greater than 2, and get their indices
print(f"Elements greater than 2: {array_1d[array_1d > 2]}, at indices {np.where(array_1d > 2)}")

# Use a vector of indices to select specific elements
print("Elements at 0, 2, 4 of array_1d:", array_1d[[0, 2, 4]])

A more complete tutorial on indexing can be found here: numpy tutorial on indexing.

Finally, it is also possible to generate random arrays using numpy, with functions np.random.normal(), np.random.uniform(), np.random.randint().
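
For instance, here is a minimal illustration of these three generators (the parameters and sizes below are arbitrary choices for the example):

# Needed imports
import numpy as np



# 2x3 array of values drawn from a normal distribution with mean 0 and standard deviation 1
normal_array = np.random.normal(0, 1, size=(2, 3))
print("Normal:\n", normal_array)

# 1D array of 4 values drawn uniformly in [0, 1)
uniform_array = np.random.uniform(0, 1, size=4)
print("Uniform:", uniform_array)

# 2x2 array of random integers drawn in [0, 10)
int_array = np.random.randint(0, 10, size=(2, 2))
print("Integers:\n", int_array)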

1.2 — Synthetic data generation

Using these functions, generate a synthetic dataset:

  • cloudA and cloudB, two clouds (sets) of points in dimension 2, each following a normal distribution centered on one of two distinct coordinates (coord1_A or coord1_B), with standard deviation 1 (coord2_A or coord2_B).
  • The number of points to generate in each cloud is given: N_A = 500 and N_B = 200.
Hints

Use the np.random.normal() function to generate random points following a normal distribution. See the documentation for more details.

Correction
# Needed imports
import numpy as np



# Parameters of the first distribution
coord1_A, coord2_A = (1, 1)

# Number of points in the first cloud
N_A = 500

# Parameters of the second distribution
coord1_B, coord2_B = (-1, 1)

# Number of points in the second cloud
N_B = 200

# Generate random points
cloudA = np.random.normal(coord1_A, coord2_A, size=(N_A, 2))
cloudB = np.random.normal(coord1_B, coord2_B, size=(N_B, 2))

Here is a visualisation of the two clouds (assuming that you named the two clouds cloudA and cloudB) as a scatter plot using the matplotlib library.

# Needed imports
import matplotlib.pyplot as plt



# Plot the clouds
plt.scatter(cloudA[:, 0], cloudA[:, 1], c='r')
plt.scatter(cloudB[:, 0], cloudB[:, 1], c='b')

# Display all open figures
plt.show()

1.3 — Prepare the dataset and split into a training and test set

We will now prepare the dataset. The following steps are needed:

  • For each cloud, prepare a vector of labels, using a different integer value for each, for instance 0 for cloud A and 1 for cloud B.
  • Using the np.vstack and/or np.hstack functions (vertical/horizontal stack), concatenate the two clouds into one array X and the label vectors into a (1D) array y.
  • Generate a vector of permutations of the indices of the concatenated array using np.random.permutation().
  • Use this permutation to split X and y into X_train,y_train and X_test,y_test with ratios 80 percent for train / 20 percent for test.
  • Check the shapes of the generated train and test set.
Hints vstack, hstack

The np.vstack function can be used to concatenate arrays vertically. For example:

a = np.array([[1, 2], [3,4]])
b = np.array([[5,6], [7,8]])
c = np.vstack([a, b])
print(c)

outputs:

[[1 2]
 [3 4]
 [5 6]
 [7 8]]

The np.hstack function can be used to concatenate arrays horizontally. For example:

a = np.array([[1, 2], [3,4]])
b = np.array([[5,6], [7,8]])
c = np.hstack([a, b])
print(c)

outputs:

[[1 2 5 6]
 [3 4 7 8]]
Hints np.random.permutation

The np.random.permutation function can be used to generate a vector of permutations of the indices of an array. For example:

a = np.array([1, 2, 3, 4])
permuted = np.random.permutation(len(a))
print(permuted)

outputs:

[2 0 3 1]
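
The resulting vector can then be used as an index array to shuffle another array consistently. Continuing the example above, print(a[permuted]) would output [3 1 4 2].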
Correction
# Needed imports
import numpy as np



# For each cloud, prepare a vector of labels, using a different integer value for each.
y_A = np.zeros(N_A) # we will assign the value "0" to cloud A 
y_B = np.ones(N_B) # we will assign the value "1" to cloud B

# Using the `np.vstack` and `np.hstack` functions, concatenate the two clouds into one array `X` and the labels vector into a (1D) array `y`
X = np.vstack([cloudA, cloudB])
y = np.hstack([y_A, y_B])

# Generate a vector of permutations of the indices of the concatenated array using `np.random.permutation()`
permuted = np.random.permutation(N_A+N_B)

# Permute the rows of `X` and `y` using the generated permutation
X = X[permuted]
y = y[permuted]

# Split `X` and `y` into `X_train,y_train` and `X_test,y_test` with ratios 80 percent for train / 20 percent for test.
split_train = int(0.8*(N_A+N_B))
X_train = X[:split_train]
y_train = y[:split_train]
X_test = X[split_train:]
y_test = y[split_train:]

# Check the shapes of the resulting arrays
print(f"Number of samples in train set: {np.shape(y_train)}")
print(f"Number of samples in test set: {np.shape(y_test)}")

2 — Supervised learning using $k$-nearest neighbors

In this exercise, we will implement the prediction on a test dataset using a $k$-NN on a training dataset. You can have a look at the course material to understand the principle of the algorithm. The different steps of the $k$-NN are as follows.

For each sample in the test dataset:

  1. Compute the distances between the sample and all samples in the training dataset.
  2. Extract the labels corresponding to the $k$ smallest distances.
  3. Assign the label that is the most represented among these $k$ labels (majority vote).

We will use the Euclidean distance to measure distances between samples. You can test your function on the train / test dataset generated in the first part.
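
As a reminder, the Euclidean distance between two points $x$ and $x'$ in dimension $d$ is $\sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$.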

The function should have the following signature:

# Needed imports
import numpy as np
from typing import List


def kNN (k: int, X_train: np.array, y_train: List[int], X_test: np.array) -> np.array:
    
    """
        k-NN prediction on a test set using a training dataset.
        In: 
            * k:       Number of neighbours to consider. 
            * X_train: Training dataset, numpy array with N (samples) rows and d (features) columns.
            * y_train: List of N labels associated with each example in X_train.
            * X_test:  Test dataset, numpy array with M (samples) rows and d (features) columns.
        Out:
            * List of M labels associated with each example in X_test according to the k nearest neighbours with X_train.
    """

    pass

Test your function with k = 3 using the train and test datasets you generated in the previous part.

Hints
  • The np.argsort function can be used to get the indices of the k smallest elements of an array.
  • The np.unique function can be used to get the unique elements of an array, and the return_counts=True argument can be used to get the counts of each unique element.
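
For instance, a minimal illustration of both hints:

import numpy as np

d = np.array([0.5, 0.1, 0.4, 0.2])
print(np.argsort(d)[:2])  # indices of the 2 smallest distances: [1 3]

labels = np.array([0, 1, 1, 0, 1])
values, counts = np.unique(labels, return_counts=True)
print(values, counts)  # [0 1] [2 3]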
Correction
# Needed imports
import numpy as np
from typing import List



def kNN (k: int, X_train: np.array, y_train: List[int], X_test: np.array) -> np.array:
    
    """
        k-NN prediction on a test set using a training dataset.
        In: 
            * k:       Number of neighbours to consider. 
            * X_train: Training dataset, numpy array with N (samples) rows and d (features) columns.
            * y_train: List of N labels associated with each example in X_train.
            * X_test:  Test dataset, numpy array with M (samples) rows and d (features) columns.
        Out:
            * List of M labels associated with each example in X_test according to the k nearest neighbours with X_train.
    """

    # Inner function to compute the Euclidean distance between two points
    def euclidean_distance (x1, x2):
        """Compute the Euclidean distance between two points."""
        return np.sqrt(np.sum((x1 - x2) ** 2))

    # Predict for all data
    predictions = []
    for x_test in X_test:

        # Compute distances between the test point and all training points
        distances = [euclidean_distance(x_test, x_train) for x_train in X_train]

        # Get the indices of the k nearest neighbors
        k_indices = np.argsort(distances)[:k]

        # Get the labels of the k nearest neighbors
        k_labels = [y_train[i] for i in k_indices]

        # Determine the most common label (majority vote) using numpy
        unique_labels, counts = np.unique(k_labels, return_counts=True)
        most_common = unique_labels[np.argmax(counts)]
        predictions.append(most_common)
    
    # Return this in an array
    return np.array(predictions)



# Test the k-NN 
# Assuming that X_train, y_train, X_test have been defined in the previous exercise
k = 3
predictions = kNN(k, X_train, y_train, X_test)
print("Predicted labels:", predictions)

You can use the following snippet to check the accuracy of the trained $k$-NN. Accuracy corresponds to the proportion of correct answers.

# Percentage of correct predictions
accuracy = np.mean(predictions == y_test)
print(f"Accuracy of k-NN with k={k}: {accuracy}")

Using the parameters that generate the dataset from the correction above, you should obtain an accuracy of approximately 0.93.

You can also plot decision boundaries using this snippet. This shows the areas that, according to the positions of the training points, will be classified as class A or B.

# Needed imports
import matplotlib.pyplot as plt
import numpy as np



def plot_boundaries (k, X, Y, h=0.2):

    # Build a grid of points covering the data, with step h
    x0_min, x0_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x1_min, x1_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    x0, x1 = np.meshgrid(np.arange(x0_min, x0_max, h), np.arange(x1_min, x1_max, h))
    dataset = np.c_[x0.ravel(), x1.ravel()]

    # Classify every point of the grid with the k-NN
    Z = kNN(k, X, Y, dataset)

    # Put the result into a color plot
    Z = Z.reshape(x0.shape)
    plt.figure()
    plt.pcolormesh(x0, x1, Z)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolor='k', s=20)
    plt.xlim(x0.min(), x0.max())
    plt.ylim(x1.min(), x1.max())

    # Display the plot
    plt.show()


# Call the function with your training set
plot_boundaries(k, X_train, y_train)

3 — $k$-means clustering

In this exercise, we will implement the $k$-means algorithm from scratch. You can have a look at the course material to understand the principle of the algorithm.

The different steps of $k$-means are as follows:

  1. Randomly initialize the centroids.
  2. Compute the current assignment of points to clusters.
  3. Update the centroids.
  4. Compute the current goodness of fit (GoF).
  5. Repeat from step 2 until convergence (e.g., the GoF is small enough, or the centroids barely move).
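
A common choice for the goodness of fit is the within-cluster sum of squared distances, $\sum_{n} \min_{j} \lVert x_n - \mu_j \rVert^2$, where $\mu_j$ is the centroid of cluster $j$. Note that the correction below instead checks convergence through how much the centroids move between two iterations, controlled by the tol parameter of the function below.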

The function should have the following signature:

from typing import List, Tuple

def k_means (k: int, X_train: np.array, X_test: np.array, max_iters: int, tol: float) -> Tuple[np.array, np.array]:

    """
        Cluster assignments on a test set according to clusters defined using k-means clustering on a training dataset.
        In: 
            * k:         Number of clusters to consider.
            * X_train:   Training dataset, numpy array with N (samples) rows and d (features) columns.
            * X_test:    Test dataset, numpy array with M (samples) rows and d (features) columns.
            * max_iters: Maximum number of iterations to run the k-means algorithm.
            * tol:       Convergence tolerance for centroid changes.
        Out:
            * List of M cluster assignments associated with each example in X_test.
            * Cluster centroids as a numpy array.
    """

    pass

Test your function with k = 2 using the train and test datasets you generated.

Correction
# Needed imports
import numpy as np
from typing import List, Tuple


def k_means (k: int, X_train: np.array, X_test: np.array, max_iters: int, tol: float) -> Tuple[np.array, np.array]:

    """
        Cluster assignments on a test set according to clusters defined using k-means clustering on a training dataset.
        In: 
            * k:         Number of clusters to consider.
            * X_train:   Training dataset, numpy array with N (samples) rows and d (features) columns.
            * X_test:    Test dataset, numpy array with M (samples) rows and d (features) columns.
            * max_iters: Maximum number of iterations to run the k-means algorithm.
            * tol:       Convergence tolerance for centroid changes.
        Out:
            * List of M cluster assignments associated with each example in X_test.
            * Cluster centroids as a numpy array.
    """

    # Inner function for initialization of centroids
    def initialize_centroids (X: np.array, k: int) -> np.array:
        """Randomly initialize k centroids from the dataset."""
        indices = np.random.choice(X.shape[0], k, replace=False)
        return X[indices]

    # Inner function for assigning clusters to centroids
    def assign_clusters (X: np.array, centroids: np.array) -> np.array:
        """Assign each data point to the nearest centroid."""
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        return np.argmin(distances, axis=1)

    # Inner function for updating centroids
    def update_centroids(X: np.array, labels, k: int) -> np.array:
        """Compute new centroids as the mean of points in each cluster."""
        return np.array([X[labels == i].mean(axis=0) for i in range(k)])

    # Inner function for fitting k-means
    def fit_k_means (X: np.array, k: int, max_iters: int, tol: float) -> np.array:
        """Run the {k-means algorithm to find cluster centroids."""
        centroids = initialize_centroids(X, k)
        for _ in range(max_iters):
            labels = assign_clusters(X, centroids)
            new_centroids = update_centroids(X, labels, k)
            if np.linalg.norm(centroids - new_centroids) < tol:
                break
            centroids = new_centroids
        return centroids

    # Fit k-means on the training set
    centroids = fit_k_means(X_train, k, max_iters, tol)

    # Assign clusters to the test set
    test_labels = assign_clusters(X_test, centroids)
    return test_labels, centroids

# Test the k-means with the training set
k = 2 # Number of clusters
test_labels, centroids = k_means(k, X_train, X_test, 100, 1e-4)
print("Predicted labels:", test_labels)

Use this code to visualize the dataset together with the cluster centroids:

# Needed imports
import matplotlib.pyplot as plt

## Plot
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.scatter(centroids[:, 0], centroids[:, 1], c='r', marker='x')

# Display all open figures
plt.show()

4 — Optimize your solutions

You can now use AI tools such as GitHub Copilot or ChatGPT, either to generate a solution from scratch, or to improve the first solution you came up with! Try to do this for all the exercises above, to see how the results differ from your own solutions.

To go further

5 — Confusion matrix and performance metrics

You can further analyze the performance of a supervised learning algorithm using other tools:

  • The confusion matrix is a (normalized) count of samples according to all occurring combinations of predicted class and true class. It therefore provides details on misclassifications, showing which classes are harder to separate than others. You can estimate the confusion matrix using the confusion_matrix function from sklearn.

  • The precision score is the percentage of correctly classified examples among all examples predicted to belong to a given class.

  • The recall score is the percentage of correctly classified examples among all examples actually belonging to a given class.

  • The f1 score is the harmonic mean of precision and recall.
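
Denoting by $TP$, $FP$ and $FN$ the numbers of true positives, false positives and false negatives for a given class, we thus have precision $= TP / (TP + FP)$, recall $= TP / (TP + FN)$, and $f1 = 2 \cdot \text{precision} \cdot \text{recall} \, / \, (\text{precision} + \text{recall})$.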

These three metrics can be computed from the functions in the sklearn.metrics module. An easy way to show all of them is to print the classification report.
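
As an illustrative sketch, here is how these metrics could be obtained with sklearn, assuming y_test and the $k$-NN predictions from exercise 2 are still defined:

# Needed imports
from sklearn.metrics import confusion_matrix, classification_report



# Count of samples for each pair (true class, predicted class)
print(confusion_matrix(y_test, predictions))

# Precision, recall and f1 score of each class, gathered in a single report
print(classification_report(y_test, predictions))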

To go beyond

Looks like this section is empty!

Anything you would have liked to see here? Let us know on the Discord server! Maybe we can add it quickly. Otherwise, it will help us improve the course for next year!