We looked at the basic k-means algorithm (Lloyd’s) in the last post. It’s one of the algorithms used by Mahout for clustering, both sequentially and as a mapreduction. We’ll talk about Mahout’s k-means implementation in a few posts.

Recall that the first step of every iteration in Lloyd’s is to classify each point (assign it to its nearest cluster). To do this, we need to calculate the distance between the current $d$-dimensional, vector $\mathbf{x}$ and each of the centroids $\mathbf{c_i}, \,\forall i=0..k$ using some distance metric, like the Euclidean distance $||\mathbf{x} - \mathbf{c_i}||_2$ or the cosine of the angle between the vectors, $\frac{\mathbf{x} \cdot \mathbf{c_i}}{||\mathbf{x}||\,||\mathbf{c_i}||}$.

In any case, calculating the distance to each center (of which there are $k$) and assigning it to the closest one takes $O(kd)$ per point (we also need to go through the $d$ dimensions of the vector), so $O(nkd)$ in total per Lloyd’s iteration.

We can get a faster algorithm by trading optimality (we no longer get the closest neighbor) for speed. One thing that adds quite a bit of overhead is the $d$ (the dimension of the vector) because we need to loop through all of them to calculate the distance. Random projections can really help with this. The idea is to sample a set of vectors (that are normally distributed, see [3] for a more detailed lemma) and project the existing vectors (that we want to search for the nearest neighbor in) onto them. Then, we project the vector we’re looking for, and the value of its projection will be close to  the value of the projections of its nearest neighbor. The scalar projection is 1D and we only need to compute the projections of the points we’re searching in once. We can get an idea of how close two $d$ vectors are to one another by looking at these values. Of course, since we’re colapsing $d$ dimensions into just one, we lose information, but it turns out that we only introduce false positives (points that were further apart are brought closer by projecting them).

Let’s look at a bit of math first. Suppose our projection vector is $\mathbf{u}$ and the vector we want to project is $\mathbf{v}$. If $v = ||\mathbf{v}||$ (the norm of the vector, a postive real number), the value of the scalar projection is $v \cos \angle(\mathbf{u}, \mathbf{v})$. Now, suppose we’re trying compute how close two vectors $\mathbf{v}$ and $\mathbf{w}$ and we’re using the Euclidean distance. We would compute $||\mathbf{v} - \mathbf{w}|| = \sqrt{u^2 + v^2 + 2 u v \cos \theta}$ where $\theta = \angle(\mathbf{v}, \mathbf{w})$. If instead, we project them on some vector $\mathbf{u}$, we would compare $v \cos \theta_1$ and $v \cos \theta_2$ where $\theta_1 = \angle(\mathbf{u}, \mathbf{v})$ and $\theta_2 = \angle(\mathbf{u}, \mathbf{w})$.

For Euclidean distance, we can skip extracting the square root altogether since $\sqrt x$ is a monotonically increasing function. Why is it that we neve get false negatives? What do we mean when we say that when two vectors are “close toghether”, their projects are also “close together”? Let’s compare $d_1 = ||\mathbf{v} - \mathbf{w}||^2 = u^2 + v^2 + 2 u v \cos \theta$ and $d_2 = (v \cos \theta_1 - w \cos \theta_2)^2$. Let’s see if $d_1 \ge d_2$, the actual distance is always be greater or equal than the estimate. Expanding, $d_2 = v^2 \cos^2 \theta_1 + w^2 \cos^2 \theta_2 - 2 v w \cos \theta_1 \cos \theta_2$. Additionally, the projection vector can either be between the two vectors, in which case $\theta = \theta_1 + \theta_2$, or on one side of the vectors, in which case $\theta = \theta_1 = \theta_2$ (or the other way around). For the two vectors to be “close together”, the angle between them, $\theta \rightarrow 0$, which means $\cos \theta \rightarrow 1$. Moreover, the angle $\theta$ drops down to zero (so it’s be a small positive integer). This also means that $\cos \theta \ge 0$ (for $\theta \le \frac{\pi}{2}$).
So, if $\theta = \theta_1 + \theta_2$, we know that $\cos (\theta_1 + \theta_2) = \cos \theta_1 \cos \theta_2 - \sin \theta_1 \sin \theta_2$, so $d_2 = u^2 \cos^2 \theta_1 + v^2 \cos^2 \theta_2 - 2 u v (\cos \theta + \sin \theta_1 \sin \theta_2) \le u^2 + v^2 - 2 u v \cos \theta)$ (since $\theta_1 \rightarrow 0, \theta_2 \rightarrow 0 \Rightarrow \sin \theta_1 \rightarrow 0, \sin \theta_2 \rightarrow 0$ and $a \cos^2 theta \le a$).
Otherwise, for $\theta = \theta_1 - \theta_2$ you add $\sin \theta_1 \sin \theta_2$ rather than substracting it, which makes the same argument less convincing. You can argue that the inequality still holds based on limits, but here’s a cool picture instead!

Heatmap of d1 – d2 for u = v = 1.

This is a heat map of the function $d_1 - d_2 = u^2 + v^2 + 2 u v \cos (\theta_1 - \theta_2) - u^2 \cos^2 \theta_1 - v^2 \cos^2 \theta_2 - 2 u v \cos \theta_1 \cos \theta_2$, where $u = v = 1$.
You can clearly see that worst case, $d_1 = d_2$, but most of the time, $d_1 > d_2$ and there are some awesome patterns.

So, how does this help search faster? So yes, projection only ever produces false positives! There are two main approaches that both involve multiple projection vectors. Just one is not precise enough.

## Projection Search

Suppose we use $m$ projection vectors (whose components are sampled from $\mathcal{N}(0, 1)$ for probability guarantees I don’t fully understand… look up the lemma in [3]).
For each random projection vector, get the scalar projections of the vectors that we get the nearest neighbor from (the centroids for k-means) and keep them sorted (either in an array, or a binary search tree, …).
For a given query, project it onto each of the projection vectors and look up its scalar projection in the corresponding sorted set of projections built for the initial vectors. Looking up a projection should take $O(\log k)$ steps. Get a ball of $l$ elements around the position and collect them (if we want $l$ overall, we can use a heap to keep track of the $l$ best ones).
Perform the complete distance calculation for the $m l$ points that were saved and select the closest one (or the closest $p$ ones).
So, the complexity of this algorithm (without the heap) is $O(m d) + O(m k d) + O(n m (d + \log k + l d))$ (generating the projections, projecting the centroids, for each query point, project it, look up the scalar projection, calculate the actual distance to the closest vectors around it). It turns that even for $d = 500$ dimensions, the probability of finding the nearest cluster reaches 80% by combining $m = 20$ projections [2]. The biggest deal is depending less on $k$ (the number of points that might be neighbors).

## Locality Sensitive Hashing

Also based on random projections, rather than look at the entire $d$ dimensions, how about generating a hash? The idea is similar to the Rabin-Karp string-matching algorithm, where expensive string comparisons are only done when required (i.e. the hashes match). Here however, if two vectors are close, their hashes should also be close (rather than uniformly distributed as required for a hash table).
So, instead of looking at the scalar projection, we can look at just its sign. If $v \cos \theta > 0$, $\theta \in (0, \frac{\pi}{2})$. Looking at the projection vector $\mathbf{u}$ as the normal vector of a hyperplane in $d - 1$ dimensional space, the sign of the scalar projection represents which side of space the vector $\mathbf{v}$ is in (“in front” or “in the back”). If we flip a bit (1 for a positive sign, 0 otherwise), and we have $m$ projection vectors, we can build an $m$-bit hash. In the end, we hash the $k$ vectors to search as well as our query vectors ($O((k + n) d)$). We go through the vectors but only perform the expensive comparison if the difference in hash functions between a candidate and the query vector is below a certain threshold (which could be dynamic).

Unfortunately, I’m not exactly sure how LSH is implemented. I mean, yes you hash the values, put them in a set and iterate over them checking to see whether they’re close enough and check the actual distances… but I read about what seemed like other ways of doing this in [1]. I should look more into that. Stay tuned!

Here are some more references:

[1] http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf (nice visuals, pretty terse)

[2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf (look for experimental results; also describes the new k-means algorithm I’m working on with Ted)

[3] http://yima.csl.illinois.edu/psfile/CVPR10-Hashing.pdf (contains a lemma about why the projection vectors need to be normally distributed)