Randomized algorithms
Reading time: 15 min
Article summary
In this article, we introduce randomized algorithms, which use randomness to make decisions, so that execution and output may vary across runs on the same input. They are classified into Monte-Carlo algorithms (which provide correct results with high probability) and Las-Vegas algorithms (which always provide correct results but with variable running time).
As an example, we will study Freivalds’ algorithm for verifying matrix products, which provides an efficient randomized approach with an expected time complexity of $O(n^2)$. We will then focus on another example, fingerprinting, a technique used to verify data equality with minimal information transfer, using random hash functions.
Random numbers, critical for randomized algorithms, are typically generated using pseudo-random methods, as true randomness is hard to achieve. We discuss the challenges of generating random numbers and the importance of properties like uniformity, independence, and replicability in pseudo-random number generation.
Main takeaways
- Randomized algorithms can be more efficient and simpler than deterministic algorithms, but they lack guaranteed correctness, providing only a probability of accuracy.
- Generating random numbers is complex. While generating truly random numbers is difficult, pseudo-random numbers are widely used in algorithms. Good pseudo-random number generators should balance speed, portability, uniformity and independence.
Article contents
1 — What is a randomized algorithm?
Let us start by giving the definition of a probabilistic algorithm:
A probabilistic algorithm is an algorithm that generates a random number $r \in \mathbb{R}$ and makes decisions based on $r$’s value.
Note that given the same input on different executions, a randomized algorithm may:
- Run a different number of steps.
- Produce a different output.
We can classify probabilistic algorithms in two classes:
- Monte-Carlo – These algorithms run in polynomial time, and their output is correct with high probability. They are used when time is critical, but exact correctness is not essential.
- Las-Vegas – These algorithms run in expected polynomial time, and their output is always correct. They are used in situations where correctness is non-negotiable, but performance may be variable.
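To make the distinction concrete, here is a hypothetical toy example (not from the article): an array of length $2n$ contains $n$ ones and $n$ zeros, and we want to find an index holding a 1. A Las-Vegas version keeps sampling until it succeeds (always correct, random running time), while a Monte-Carlo version stops after a fixed number of trials (bounded running time, small probability of failure). The function names below are ours, purely for illustration.

```python
import random

def find_one_las_vegas(arr):
    # Las-Vegas: always returns a correct index, but the number of
    # iterations is random (expected O(1) when half the entries are 1).
    while True:
        i = random.randrange(len(arr))
        if arr[i] == 1:
            return i

def find_one_monte_carlo(arr, trials=20):
    # Monte-Carlo: bounded running time, but may fail (return None)
    # with probability at most (1/2) ** trials.
    for _ in range(trials):
        i = random.randrange(len(arr))
        if arr[i] == 1:
            return i
    return None

arr = [0, 1] * 10
print(find_one_las_vegas(arr))    # always a valid index of a 1
print(find_one_monte_carlo(arr))  # almost always a valid index, rarely None
```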
1.1 — Why use probabilistic algorithms?
Pros:
- When well designed, they are more efficient on most executions.
- Conceptually simpler and more elegant.
Cons:
- Lack of correctness guarantees: most of the time, probabilistic algorithms provide a probability of correctness rather than guaranteeing the correct result on every run.
- Approximate solutions.
- Testing can be more challenging because two runs of the same algorithm with the same input may produce different outputs.
2 — A few examples
2.1 — Matrix product checker
Try to answer the following question:
What is the complexity of multiplying two $n\times n$ matrices $A$ and $B$?
Can we do better than the schoolbook $O(n^3)$ algorithm? More precisely, given a third matrix $C$, can we check whether $AB = C$ in less than $O(n^3)$ time?
Answer – Freivalds’ algorithm. It works as follows:
Choose a random binary vector $r:=[r_i]_{1\leq i \leq n}$, where $Pr[r_i = 1] = Pr[r_i = 0] = 1/2$ for all $i\in [[1,n]]$. The algorithm outputs true if $ABr = Cr$ and false otherwise.
What is the complexity of Freivalds’ algorithm? Computing $Br$, then $A(Br)$, and $Cr$ only involves matrix-vector products, so a single test runs in $O(n^2)$.
What can we expect from this algorithm?
- If $AB = C$, then $Pr[output=true] = 1$.
- If $AB \neq C$, then $Pr[output=true] \leq \frac{1}{2}$, i.e. the test detects the error with probability at least $\frac{1}{2}$.
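To make this concrete, here is a minimal Python sketch of Freivalds’ check (the function names are ours, not a standard API). Repeating the test $k$ times with independent random vectors drives the false-positive probability down to at most $2^{-k}$.

```python
import random

def freivalds(A, B, C, k=10):
    """Monte-Carlo check that A*B == C for n x n matrices (lists of lists).

    If AB == C the function always returns True; if AB != C it returns
    True with probability at most (1/2) ** k.
    """
    n = len(A)

    def mat_vec(M, v):
        # Matrix-vector product, O(n^2) operations.
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    for _ in range(k):
        r = [random.randint(0, 1) for _ in range(n)]  # random 0/1 vector
        # Compare A(Br) with Cr: three matrix-vector products, O(n^2) total.
        if mat_vec(A, mat_vec(B, r)) != mat_vec(C, r):
            return False
    return True

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]                      # the true product A*B
print(freivalds(A, B, C))                     # True
print(freivalds(A, B, [[19, 22], [43, 51]]))  # almost surely False
```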
2.2 — Fingerprinting
Here is a definition of fingerprinting:
A fingerprinting algorithm is a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string, its fingerprint, that uniquely identifies the original data for all practical purposes just as human fingerprints uniquely identify people for practical purposes.
Formally, fingerprinting consists in designing a family of functions $H$ (a “hash family”) such that, when a function $h \in H$ is selected at random: if $x = y$ then $h(x) = h(y)$, and if $x \neq y$ then $h(x) \neq h(y)$ with high probability, with the constraint that checking $h(x)=h(y)$ is faster than checking $x=y$ directly.
Application – Alice and Bob each have a large file consisting of $n$ ASCII characters (an alphabet of 128 possible characters). They need to determine whether their files are identical, with the constraint that they want to exchange as little information as possible about their respective files.
Solution – Reed-Solomon fingerprinting:
- Alice’s (resp. Bob’s) file is a sequence of characters $(a_1,\ldots, a_n)$ (resp. $(b_1, \ldots , b_n)$) where every character can take $m$ values.
- Fix a prime number $p > \max\{m, n^2\}$, and let $F_p$ denote the set of integers modulo $p$ (all arithmetic is done modulo $p$).
- For each $r \in F_p$, define $h_r(a_1,\ldots, a_n) = \sum_{i=1}^n a_i r^i$. The family $H$ of hash functions we will consider is $H = \{h_r : r \in F_p\}$. Each hash function treats $(a_1,\ldots , a_n)$ as the coefficients of a degree-$n$ polynomial evaluated at $r$.
- Alice picks a random element $r$ from $F_p$, computes $v = h_r(a)$, and sends $v$ and $r$ to Bob. Bob outputs true if $v = h_r(b)$, and outputs false otherwise.
You can see the similarity between Reed-Solomon fingerprinting and Freivalds’ algorithm: in both cases, we only evaluate a function at a random value. Moreover, since two distinct polynomials of degree at most $n$ over $F_p$ agree on at most $n$ points, the probability that Bob outputs true while the files differ is at most $n/p$.
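Here is a minimal Python sketch of the protocol under the assumptions above (the helper names such as `alice_sends` and the particular choice of prime are ours, purely for illustration).

```python
import random

P = 2**31 - 1  # a prime much larger than max(m, n^2) for small files (m = 128)

def fingerprint(chars, r, p=P):
    """Evaluate h_r(a_1, ..., a_n) = sum_i a_i * r^i modulo p."""
    value, power = 0, 1
    for a in chars:
        power = (power * r) % p          # power = r^i
        value = (value + a * power) % p
    return value

def alice_sends(a_chars, p=P):
    # Alice picks a random evaluation point r and sends (r, h_r(a)) to Bob.
    r = random.randrange(p)
    return r, fingerprint(a_chars, r, p)

def bob_checks(b_chars, r, v, p=P):
    # Bob outputs True iff his own fingerprint matches Alice's value.
    return fingerprint(b_chars, r, p) == v

alice_file = [ord(c) for c in "the quick brown fox"]
bob_file = [ord(c) for c in "the quick brown fox"]
r, v = alice_sends(alice_file)
print(bob_checks(bob_file, r, v))  # True; if the files differed, False w.p. >= 1 - n/p
```

Note that only $r$ and the fingerprint $v$ are exchanged, i.e. $O(\log p)$ bits instead of the whole file.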
3 — Random number generation
We have seen in the previous section that using randomness in algorithms can drastically improve their performance. This raises the question of how to generate random numbers on a computer.
A sequence of random numbers, $R_1,R_2,\ldots,$ must have two important statistical properties:
- Uniformity – $P(R_i\leq x)= x$, with $x\in[0,1]$.
- Independence – Each event is independent of any combination of other events.
In practice, it is very hard to generate truly random numbers, and the current algorithms generate pseudo-random numbers.
The generation of pseudo-random numbers can have the following flaws:
- Generated numbers may not be uniformly distributed.
- Generated numbers may be discrete-valued instead of continuous valued.
- Autocorrelation between numbers.
Despite these issues, pseudo-random numbers are still widely used to simulate randomness. When designing algorithms to generate pseudo-random numbers, you should always consider the following properties:
- The algorithm should be fast.
- The algorithm should be portable to different computers and ideally to different programming languages.
- The generated sequence should be replicable.
- The generated sequence should closely approximate uniformity and independence properties.
A classical example for generating random numbers is the linear congruential method. The linear congruential method, proposed by Lehmer, produces a sequence of integers $R_1,R_2,\ldots$ between zero and $m-1$ according to the following recursive relationship:
$$R_{i+1}=(aR_{i}+c)\text{ mod } m,\;i=0,1,2,\ldots \;,$$where:
- The initial value $R_0$ is called the seed.
- $a$ is called the constant multiplier.
- $c$ is the increment.
- $m$ is the modulus.
To generate random numbers between 0 and 1, you can use $\frac{R_i}{m}$.
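As an illustration, here is a minimal linear congruential generator in Python. The constants below ($a = 1664525$, $c = 1013904223$, $m = 2^{32}$) are one commonly cited choice, used here only as an example; real applications should rely on a well-tested generator from a standard library.

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield pseudo-random numbers in [0, 1) using the linear congruential method."""
    r = seed  # R_0, the seed
    while True:
        r = (a * r + c) % m   # R_{i+1} = (a * R_i + c) mod m
        yield r / m           # rescale to [0, 1)

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])  # replicable: same seed, same sequence
```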
To go further
Looks like this section is empty!
Anything you would have liked to see here? Let us know on the Discord server! Maybe we can add it quickly. Otherwise, it will help us improve the course for next year!
To go beyond
- The Power of Randomness: Fingerprinting and Freivalds’ Algorithm – performance proofs of fingerprinting algorithms.
- An interesting book chapter on random numbers.
- Design and Analysis of Algorithms – videos on randomized algorithms (6-7-8).