I’ve been away for quite some time but am now ready to start what is probably the most important project I’ve worked on until now (excluding internships).

I’ll start work on my Bachelor’s Project soon which is going to be on top of Mahout.

Mahout is a scalable, open-source machine learning framework. Some of its algorithms are built on top of Hadoop. I realized that I really like maths and algorithms and I’d like to work on something that combines both ideally. So, Machine Learning it is!
No, I didn’t just pick this because of the hype, I swear!

I started by e-mailing the dev@mahout.apache.org list and introducing myself and asking for guidance. Nobody answered at first which was pretty disconcerting, but a few days later, when I ping’d the list again, someone was willing to help.

Ted Dunning to be exact. Ted has worked on Mahout for some time now and you can find some of his talks on YouTube here, here or here. The first one is a short interview about Hadoop from Strata 2012, and the second two are talks about Mahout (in particular new ideas for clustering algorithms) in Boston and LA.

Body

The latest thing he’s been working on and that I’ll be helping integrate into Mahout and benchmark is a fast single-pass k-means clustering algorithm.

A successful k-means clustering for k=3. The circles are centroids and the squares are normal data points. Copyright Wikipedia.

k-means clustering is an iterative method by which $n$ data points (in $d$ dimensional space) are clustered into $k$ clusters based on each point’s proximity to each cluster. Here’s a description the classic k-means algorithm, Lloyd’s algorithm:

First, the $k$ cluster centers are initialized randomly (there are better approaches); these new points are called centroids. They will (hopefully) have converged to the real centers by the end of the algorithm.

A clustering step first takes each of the $n$ points and computes the distance from it to the $k$ centroids assigning to the cluster whose centroid is closest to this point. After all the points are assigned a cluster, suppose the assignment is $\{k_1, k_2, \ldots, k_n\}$, where cluster $i$ has $k_i$ points, $\{\mathbf{x_{i1}},\mathbf{x_{i2}}, \ldots, \mathbf{x_{ik_i}}\}$. The new centroid of cluster $i$ is the average of these points, $\frac{1}{k_i} \sum_{j = 0}^{i} \mathbf{x_{ij}}$.

The clustering step is performed until there are no more changes in the cluster assignment or centroids.

While conceptually simple and quite straightforward to code, this algorithm suffers from a number of drawbacks.

• The problem of clustering the data points is NP-hard in general. This algorithm uses a heuristic approach and might get stuck at a local optimum depending on what points the centroids are initialized at. Supposed two centroids are initialized close to one another (in the same logical cluster); then, one of the clusters that we wanted to find will be split. So, multiple attempts should be made with different initial values for the centroids.

• There are are cases where this algorithm takes exponential time $2^{\Omega(n)}$ (see http://en.wikipedia.org/wiki/K-means_clustering#cite_note-5 and http://en.wikipedia.org/wiki/K-means_clustering#Complexity), but the runtime is polynomial for practical cases.
• Even if the runtime is polynomial, assuming that the number of steps it takes to converge is $log n$ (highly optimistic), the total complexity of the algorithm is $O(nkd log n)$ (calculating the minimum distance takes $O(kd)$ for one point). This is still too expensive for huge applications, especially since k-means is used as a step in more complicated algorithms (for example, in finding the $m$ nearest neighbors, rather than looking at all the points, if there are some $k$ clusters, find the adjacent clusters and look for neighbors only there).