Standard algorithms for machine learning

Reading time20 min

In brief

Article summary

In this page, we explain two standard algorithms for machine learning, $k$-nearest neighbor (for supervised learning), and $k$-means clustering (for unsupervised learning).

These algorithm are some of the simplest examples in the wild, but many others exist for supervised or unsupervised learning.

Main takeaways

  • $k$-nearest neighbor ($k$-NN) is a simple supervised learning method that consists in storing samples from a training set, and compute distances from a new (unseen) query example to all training set examples.

  • $k$-NN make a prediction on a new sample by considering the true labels associated with $k$-nearest training samples that are the closest to the query.

  • $k$-means clustering is a simple iterative clustering method that considers a fixed set of centroids that maximize inter-cluster variance.

Article contents

1 — Supervised learning

Supervised learning is a fundamental concept in machine learning where an algorithm learns a mapping from input data $X$ to output labels $y$, using a labeled dataset. The goal is to learn a function $f:X \to y$ that minimizes the prediction error on new, unseen data. This type of learning is “supervised” because the training data includes both the inputs and the corresponding correct outputs, guiding the learning process.

1.1 — Formalism

Given a dataset $\{(x_i, y_i)\}_{i=1}^N$, where $x_i \in \mathbb{R}^d$ are input features and $y_i \in \mathcal{Y}$ are the associated labels, supervised learning seeks to optimize a model $f_\theta(x)$, parameterized by $\theta$, to minimize a loss function $\mathcal{L}(y, f_\theta(x))$.

Here, the loss function is a measurement of the error the model performs on the training dataset.

The optimization problem can be formally expressed as: \[ \theta^* = \arg\min_\theta \frac{1}{N} \sum_{i=1}^N \mathcal{L}(y_i, f_\theta(x_i)). \]

1.2 — Types of supervised learning problems

There are many tasks to solve in supervised learning. Here are the most common two you may encounter:

  1. Classification:

    • The output $y$ is discrete, representing class labels.
    • Example – Predicting whether an email is “spam” or “not spam”.
    • Common loss functionCross-entropy loss for probabilistic outputs: \[ \mathcal{L}(y, \hat{y}) = -\frac{1}{N} \sum_{i=1}^N \sum_{c=1}^C y_{i,c} \log(\hat{y}_{i,c}), \] where $C$ is the number of classes, $y_{i,c}$ is a one-hot encoded label, and $\hat{y}_{i,c}$ is the predicted probability for class $c$.
  2. Regression:

    • The output $y$ is continuous, representing numerical values.
    • Example – Predicting house prices based on features like size and location.
    • Common loss functionMean Squared Error (MSE): \[ \mathcal{L}(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2. \]

1.3 — The $k$-nearest neighbors algorithm

The $k$-nearest neighbors ($k$-NN) algorithm is a parameter-free supervised learning method used for both classification and regression. It predicts the output for a given query point based on the $k$ nearest training samples in the feature space.

Here is a pseudo-code of the algorithm:

Input:
    - Training dataset: {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}
    - Query point: q
    - Number of neighbors: k
    - Distance metric: d(·, ·)

Output:
    - Predicted label (classification) or value (regression) for q

--

Steps:

    1. (compute distances):
        - For each training point (x_i, y_i), calculate d(q, x_i).

    2. (Find k nearest neighbors)
        - Identify the k training points with the smallest distances to q.

    3. (make a prediction):
        - If classification, perform a majority vote among the labels of the k nearest neighbors.
        - If regression, compute the average of the labels of the k nearest neighbors.

    4. (return the prediction)

1.4 — Remarks on $k$-NN

1.4.1 — Distance metric

The choice of the distance metric significantly impacts performance. Common choices include:

  • Euclidean distance: \[ d(q, x_i) = \sqrt{\sum_{j=1}^d (q_j - x_{i,j})^2}. \]

  • Manhattan distance: \[ d(q, x_i) = \sum_{j=1}^d |q_j - x_{i,j}|. \]

1.4.2 — Influence of $k$

A small $k$ (e.g., $k = 1$) can lead to overfitting. This means that the model becomes too specialized to the training set, but has poor generaliation performance to new data.

On the opposite, a large $k$ smooths predictions but may lose local details.

1.4.3 — Complexity

Training complexity is $O(1)$ since it involves storing the dataset (to use as reference points).

Prediction complexity is $O(Nd)$ per query point, making the algorithm computationally expensive for large datasets ($N$ distance computations between vectors of dimension $d$).

Information

Note that $k$ is a not a parameter learned by the algorithm, but needs to be given a priori. $k$-NN is actually paramater free, but still needs a training phase to store the samples from the training set.

$k$ is called an hyper-parameter (different from the parameters $\theta$ of $f_\theta$ that can be learned).

2 — Unsupervised learning

Unsupervised learning is a type of machine learning where the algorithm learns patterns or structures in data without labeled outputs. Unlike supervised learning, where the goal is to predict outcomes, unsupervised learning aims to uncover hidden structures, groupings, or representations in the input data. It is particularly useful for exploratory data analysis, feature extraction, and dimensionality reduction.

2.1 — Formalism

Given a dataset $\{x_i\}_{i=1}^N$, where $x_i \in \mathbb{R}^d$, unsupervised learning seeks to discover structures or representations that best describe the data. Depending on the task, this involves optimizing different types of objective functions. Broadly, these tasks fall into two main categories:

  1. Clustering – Partition the data into $k$ clusters, where similar data points are grouped together. Formally, the goal is to assign each data point $x_i$ to a cluster $C_k$ such that a similarity or distance metric (e.g., Euclidean distance) is minimized within clusters and maximized across clusters.

    Example objective function for clustering: \[ \text{Minimize } \sum_{k=1}^K \sum_{x_i \in C_k} \|x_i - \mu_k\|^2, \] where $\mu_k$ is the centroid of cluster $C_k$.

  2. Decomposition / Manifold Learning – Learn a lower-dimensional representation of the data that captures its intrinsic structure. This often involves approximating the data using a set of basis vectors or nonlinear mappings.

    Example objective function for dimensionality reduction: \[ \text{Minimize reconstruction error: } \sum_{i=1}^N \|x_i - \hat{x}_i\|^2, \] where $\hat{x}_i$ is the reconstructed input from the learned representation.

2.2 — $k$-means algorithm

The $k$-means algorithm aims to partition $N$ data points into $k$ clusters by minimizing the within-cluster variance.

Here is a pseudo-code:

Input:
    - Dataset: {x_1, x_2, ..., x_N}, where x_i ∈ ℝ^d
    - Number of clusters: K
    - Convergence threshold: ε (e.g., 10^-4)

Output:
    - K cluster centroids {μ_1, μ_2, ..., μ_K} and cluster assignments {C_1, C_2, ..., C_K}

--

Steps:

    1. (initialize centroids):
        - Randomly select K data points as the initial centroids {μ_1, μ_2, ..., μ_K}.

    2. (repeat until convergence):
    |
    |   a. (assign points to nearest centroid):
    |       - For each data point x_i:
    |       - Assign x_i to the cluster C_k where k = argmin_j ||x_i - μ_j||^2.
    |
    |   b. (update centroids):
    |       - For each cluster C_k, recompute the centroid:
    |       - Set μ_k = (1 / |C_k|) Σ_{x_i ∈ C_k} x_i.
    |
    |   c. Check for convergence:
    |       - Compute the change in centroids Δ = Σ_k ||μ_k^new - μ_k^old||.
    |       - If Δ < ε, stop the iteration.

    3. (return the final centroids and cluster assignments)

2.3 — Remarks on $k$-means

The algorithm can get stuck in local minima depending on the initialization of centroids. Techniques like “$k$-means++”" improve the initialization by selecting centroids that are spread out.

Computational complexity is $O(NKId)$, where $I$ is the number of iterations.

$k$-means works best when clusters are roughly spherical and separable in the feature space.

To go further

3 — Feature preprocessing

In general, features should be normalized or standardized if they have different scales to ensure fair contribution to distance calculations. This is called feature preprocessing, and is very standard when performing ML on a new dataset. Check out this set of techniques described on feature preprocessing using sklearn.

Another important step before applying ML algorithms is feature selection. A few example of simple approaches for feature selection:

  • Removal of features with low variance (e.g., remove features based on a variance threshold).
  • Selection of features using statistics on the dataset (e.g., selecting the features that have a significant linear relationship with the changes in labels).

4 — Other supervised learning algorithms

In addition to $k$-NN, here are a few common supervised learning algorithms. The list is obviously not exhaustive.

  1. Linear regression:

    • Used for regression tasks.
    • Learns a linear mapping from input features to continuous outputs.
  2. Logistic regression:

    • Used for binary classification.
    • Outputs probabilities for class membership using a sigmoid function.
  3. Support vector machines (SVMs):

    • Effective for both classification and regression.
    • Maximizes the margin between data points and a decision boundary.
  4. Decision trees / Random forest:

    • Constructs a tree-like model of decisions and their consequences for both classification and regression.
  5. Neural networks / Deep learning:

    • Composed of layers of interconnected nodes for capturing complex relationships in data.
    • Highly versatile, used in tasks ranging from image recognition to financial forecasting.
    • More details in the dedicated deep learning article.

5 — Other unsupervised learning algorithms

In addition to $k$-NN, here are a few common unsupervised learning algorithms. The list is obviously not exhaustive.

5.1 — Clustering

  • Hierarchical clustering:

    • Builds a tree-like structure (dendrogram) of nested clusters by successively merging or splitting clusters.
    • Can be agglomerative (bottom-up) or divisive (top-down).
  • Density-based clustering (DBSCAN):

    • Groups points into dense regions separated by sparser areas.
    • Does not require specifying the number of clusters in advance.

5.2 — Decomposition / Manifold learning

  • t-Distributed Stochastic Neighbor Embedding (t-SNE):

    • Projects high-dimensional data into a low-dimensional space (usually 2D or 3D) while preserving local neighborhood relationships.
    • Ideal for visualizing clusters.
  • Uniform Manifold Approximation and Projection (UMAP):

    • Similar to t-SNE but often faster and better at preserving global structure in the data.
    • The UMAP algorithm is detailed here
  • Autoencoders:

    • Deep Learning model trained to compress (encode) data into a lower-dimensional representation and then reconstruct it (decode).
    • Useful for nonlinear dimensionality reduction.

To go beyond

Looks like this section is empty!

Anything you would have liked to see here? Let us know on the Discord server! Maybe we can add it quickly. Otherwise, it will help us improve the course for next year!