## Quick update on StreamingKMeans

Haven’t written anything in a long time but have been slowly making progress.

So, now I have a `StreamingKMeans`

implementation (improved Ted’s) and I did lots of cleanups and bugfixes [1].

Also, there’s a new MapReduce version in `experimental/`

in both `main/`

and `test`

. `StreamingKMeansDriver`

implements the command line tool. The code is in a new branch appropriately titled `mapreduce`

where I’ll do most development from now on.

Finally, there’s a new `EvaluateClustering`

class that tries the data on a `SequenceFile`

of TF-IDF vectors generated using Mahout’s `seqdirectory`

and `seq2sparse`

tools [2].

[1] https://github.com/dfilimon/knn

[2] https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

## Speeding up k-means: Random projections

We looked at the basic k-means algorithm (Lloyd’s) in the last post. It’s one of the algorithms used by Mahout for clustering, both sequentially and as a mapreduction. We’ll talk about Mahout’s k-means implementation in a few posts.

Recall that the first step of every iteration in Lloyd’s is to classify each point (assign it to its nearest cluster). To do this, we need to calculate the distance between the current -dimensional, vector and each of the centroids using some distance metric, like the Euclidean distance or the cosine of the angle between the vectors, .

In any case, calculating the distance to each center (of which there are ) and assigning it to the closest one takes per point (we also need to go through the dimensions of the vector), so in total per Lloyd’s iteration.

We can get a faster algorithm by trading optimality (we no longer get the closest neighbor) for speed. One thing that adds quite a bit of overhead is the (the dimension of the vector) because we need to loop through all of them to calculate the distance. Random projections can really help with this. The idea is to sample a set of vectors (that are normally distributed, see [3] for a more detailed lemma) and project the existing vectors (that we want to search for the nearest neighbor in) onto them. Then, we project the vector we’re looking for, and the value of its projection will be close to the value of the projections of its nearest neighbor. The scalar projection is 1D and we only need to compute the projections of the points we’re searching in once. We can get an idea of how close two vectors are to one another by looking at these values. Of course, since we’re colapsing dimensions into just one, we lose information, but it turns out that we only introduce false positives (points that were further apart are brought closer by projecting them).

Let’s look at a bit of math first. Suppose our projection vector is and the vector we want to project is . If (the norm of the vector, a postive real number), the value of the scalar projection is . Now, suppose we’re trying compute how close two vectors and and we’re using the Euclidean distance. We would compute where . If instead, we project them on some vector , we would compare and where and .

For Euclidean distance, we can skip extracting the square root altogether since $\sqrt x$ is a monotonically increasing function. Why is it that we neve get false negatives? What do we mean when we say that when two vectors are “close toghether”, their projects are also “close together”? Let’s compare and . Let’s see if , the actual distance is always be greater or equal than the estimate. Expanding, . Additionally, the projection vector can either be between the two vectors, in which case , or on one side of the vectors, in which case (or the other way around). For the two vectors to be “close together”, the angle between them, , which means . Moreover, the angle drops down to zero (so it’s be a small positive integer). This also means that (for ).

So, if , we know that , so (since and ).

Otherwise, for you add rather than substracting it, which makes the same argument less convincing. You can argue that the inequality still holds based on limits, but here’s a cool picture instead!

This is a heat map of the function , where .

You can clearly see that worst case, , but most of the time, and there are some awesome patterns.

So, how does this help search faster? So yes, projection only ever produces false positives! There are two main approaches that both involve multiple projection vectors. Just one is not precise enough.

## Projection Search

Suppose we use projection vectors (whose components are sampled from for probability guarantees I don’t fully understand… look up the lemma in [3]).

For each random projection vector, get the scalar projections of the vectors that we get the nearest neighbor from (the centroids for k-means) and keep them sorted (either in an array, or a binary search tree, …).

For a given query, project it onto each of the projection vectors and look up its scalar projection in the corresponding sorted set of projections built for the initial vectors. Looking up a projection should take steps. Get a ball of elements around the position and collect them (if we want overall, we can use a heap to keep track of the best ones).

Perform the complete distance calculation for the points that were saved and select the closest one (or the closest ones).

So, the complexity of this algorithm (without the heap) is (generating the projections, projecting the centroids, for each query point, project it, look up the scalar projection, calculate the actual distance to the closest vectors around it). It turns that even for dimensions, the probability of finding the nearest cluster reaches 80% by combining projections [2]. The biggest deal is depending less on (the number of points that might be neighbors).

## Locality Sensitive Hashing

Also based on random projections, rather than look at the entire dimensions, how about generating a hash? The idea is similar to the Rabin-Karp string-matching algorithm, where expensive string comparisons are only done when required (i.e. the hashes match). Here however, if two vectors are close, their hashes should also be close (rather than uniformly distributed as required for a hash table).

So, instead of looking at the scalar projection, we can look at just its sign. If , . Looking at the projection vector as the normal vector of a hyperplane in dimensional space, the sign of the scalar projection represents which side of space the vector is in (“in front” or “in the back”). If we flip a bit (1 for a positive sign, 0 otherwise), and we have projection vectors, we can build an -bit hash. In the end, we hash the vectors to search as well as our query vectors (). We go through the vectors but only perform the expensive comparison if the difference in hash functions between a candidate and the query vector is below a certain threshold (which could be dynamic).

Unfortunately, I’m not exactly sure how LSH is implemented. I mean, yes you hash the values, put them in a set and iterate over them checking to see whether they’re close enough and check the actual distances… but I read about what seemed like other ways of doing this in [1]. I should look more into that. Stay tuned!

Here are some more references:

[1] http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf (nice visuals, pretty terse)

[2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf (look for experimental results; also describes the new k-means algorithm I’m working on with Ted)

[3] http://yima.csl.illinois.edu/psfile/CVPR10-Hashing.pdf (contains a lemma about why the projection vectors need to be normally distributed)

## Starting Work on Bachelor’s Project. k-means basics.

**Head, in which I talk about some background (skip to Body if not interested)**

I’ve been away for quite some time but am now ready to start what is probably the most important project I’ve worked on until now (excluding internships).

I’ll start work on my Bachelor’s Project soon which is going to be on top of Mahout.

Mahout is a scalable, open-source machine learning framework. Some of its algorithms are built on top of Hadoop. I realized that I really like maths and algorithms and I’d like to work on something that combines both ideally. So, Machine Learning it is!

No, I didn’t just pick this because of the hype, I swear!

I started by e-mailing the dev@mahout.apache.org list and introducing myself and asking for guidance. Nobody answered at first which was pretty disconcerting, but a few days later, when I ping’d the list again, someone was willing to help.

Ted Dunning to be exact. Ted has worked on Mahout for some time now and you can find some of his talks on YouTube here, here or here. The first one is a short interview about Hadoop from Strata 2012, and the second two are talks about Mahout (in particular new ideas for clustering algorithms) in Boston and LA.

**Body**

The latest thing he’s been working on and that I’ll be helping integrate into Mahout and benchmark is a fast single-pass k-means clustering algorithm.

k-means clustering is an iterative method by which data points (in dimensional space) are clustered into clusters based on each point’s proximity to each cluster. Here’s a description the classic k-means algorithm, Lloyd’s algorithm:

First, the cluster centers are initialized randomly (there are better approaches); these new points are called centroids. They will (hopefully) have converged to the real centers by the end of the algorithm.

A clustering step first takes each of the $n$ points and computes the distance from it to the centroids assigning to the cluster whose centroid is closest to this point. After all the points are assigned a cluster, suppose the assignment is , where cluster has points, . The new centroid of cluster is the average of these points, .

The clustering step is performed until there are no more changes in the cluster assignment or centroids.

While conceptually simple and quite straightforward to code, this algorithm suffers from a number of drawbacks.

- The problem of clustering the data points is NP-hard in general. This algorithm uses a heuristic approach and might get stuck at a local optimum depending on what points the centroids are initialized at. Supposed two centroids are initialized close to one another (in the same logical cluster); then, one of the clusters that we wanted to find will be split. So, multiple attempts should be made with different initial values for the centroids.

- There are are cases where this algorithm takes exponential time (see http://en.wikipedia.org/wiki/K-means_clustering#cite_note-5 and http://en.wikipedia.org/wiki/K-means_clustering#Complexity), but the runtime is polynomial for practical cases.
- Even if the runtime is polynomial, assuming that the number of steps it takes to converge is (highly optimistic), the total complexity of the algorithm is (calculating the minimum distance takes for one point). This is still too expensive for huge applications, especially since k-means is used as a step in more complicated algorithms (for example, in finding the nearest neighbors, rather than looking at all the points, if there are some clusters, find the adjacent clusters and look for neighbors only there).

There are lots of uses for k-means and I’ll be able to tell you more about them as well as improvements to this algorithm, or completely new ideas as I delve deeper into the project.

For now, Ted’s code lives in https://github.com/tdunning/knn/ and he describes the ideas used to implement new solutions in docs/scaling-k-means/scaling-k-means.pdf. I’ll be talking about more details soon!

## Bad Scheme

I am *so ashamed*😐

(define string-tokenize (lambda (s delims) (letrec ((string-tokenize-dirty (lambda (s delims) (cond ((null? s) (list '())) ((let ((tokens (string-tokenize-dirty (cdr s) delims))) (if (member (car s) delims) (if (not (null? (car tokens))) (cons '() tokens) tokens) (cons (cons (car s) (car tokens)) (cdr tokens))))))))) (let ((tokens (string-tokenize-dirty (string->list s) (string->list delims)))) (map list->string (if (null? (car tokens)) (cdr tokens) tokens))))))

> (string-tokenize " a few good men +b*c+d d " " *+")

("a" "few" "good" "men" "b" "c" "d" "d")

Any advice on how to make this tail-recursive?

Also, how can I manipulate strings like lists?

## Testing for Primes: Scheme

I cannot possibly look at parentheses the same way now. And that’s all thanks to Scheme. I can only see function application. Even when solving the matrix chain multiplication problem, I still had this odd idea that in fact is a function applied to some other function. Somehow that made sense.

But today, let’s talk about primes. How to test if a number is prime and how to generate the stream of primes. And which method works best. And most importantly, why does the method I thought should work best run out of memory?

I’ll try testing for primes in 3 ways: finding relevant divisors (**prime?**), the Fermat primality test (**fermat-prime?**) and the Rabin-Miller primality test (**rabinmiller-prime?** and **rabinmiller-true-prime?**). After that, we’ll try to generate the stream of prime numbers and see which of these algorithms perform better and also compare them with the tried and tested Erathosthenes’ Sieve method.

Let’s first have a look at everyone’s favorite primality test, search for divisors up to its square root:

(define divides? (lambda (a b) (= (remainder a b) 0))) (define prime? (lambda (n) (cond ((or (= n 1) (= n 0)) #f) ((= n 2) #t) ((even? n) #f) (else (let prime-test ( (d 3) ) (cond ((> (square d) n) #t) ((divides? n d) #f) (else (prime-test (+ d 2)))))))))

This is a really basic test that for a given odd number checks all odd numbers starting from 3 to see whether or not they divide . Obviously, a better approach would have been to do the test only for prime , but there’s no way of figuring that out without first having all prime numbers less than already. I remember implementing this in high-school.

Let’s get to something more interesting, the Fermat test, which is nothing but a restatement of Fermat’s Little Theorem: *If is a prime number and , then .*

So, everyone, open section 1.2 of the SICP and let’s have a look at exponentiation (fast-exponentiation to be exact). We’ll have this done in logarithmic time and we’ll also have the remainder calculated at each step to avoid the size of the numbers from getting out of hand:

(define expmod (lambda (base exp mod) (remainder (cond ((= exp 0) 1) ((even? exp) (square (expmod base (/ exp 2) mod))) (else (* base (expmod base (- exp 1) mod)))) mod)))

There are some issues with the Fermat primality test. Most importantly, it is **not conclusive**. We’ll set a fixed value of and test the condition in Fermat’s Little Theorem. If is prime, then the test is true for all values of , but testing all values of a would not be feasible.

Checking

(= (expmod a n n) a)

only proves that is *probably prime*, but there is no way of knowing for sure in only one test. Picking a few random values of and having the test succeed is a pretty good indicator of the primality of the number and is in fact what the PGP program uses to test for primes.

So, our Fermat test would look something like:

(define fermat-prime? (lambda (n) (let fermat-tests ((i 5)) (if (= i 0) #t (let ( (a (+ 1 (random (- n 1)))) ) (if (= (expmod a n n) a) (fermat-tests (- i 1)) #f))))))

I could have specified the number of tests to run as a parameter to **fermat-prime?** instead of 5. Finding a single failure is proof that the number is **not prime**. However, there are certain numbers that always verify this condition despite not being prime!

They are called Carmichael numbers and they are not detected by the **fermat-prime?** test:

1 ]=> (prime? 9746347772161)

;Value: #f

1 ]=> (fermat-prime? 9746347772161)

;Value: #t

In fact, .

This is where the **Miller-Rabin Test** comes in. It’s based on a slightly modified version of Fermat’s Little Theorem with a small twist that gets rid of Carmichael numbers as well. Instead of fixing an and checking to see if , it checks to see if .

The key difference is that when performing the squaring step in the **expmod** function we check to see if the number we’re supposed to be squaring is a nontrivial square root of 1 modulo n.

Let’s explain what that actually means (courtesy of the Wiki article on the Miller-Rabin test). Let be a prime number. Clearly , for any . Those are called the trivial square roots of 1. We’ll prove that if is prime, there can be no others: suppose there is another such that , so divides the product, therefore it must divide one of the factors, meaning either or which means that either or (in the field ).

If we at some point in the exponentiation get a partial result `(s (expmod2 base (/ exp 2) mod))`

meaning that and without being either or , then by what we’ve proven above, **cannot be prime** since we’ve just found a **non-trivial square root of 1**.

So, we can formulate a function that works with Carmichael numbers too:

(define sqrt-of-1? (lambda (s n) (if (or (= s 1) (= s (- n 1))) #f (= (remainder (* s s) n) 1)))) (define expmod2 (lambda (base exp mod) (remainder (cond ((= exp 0) 1) ((even? exp) (let ((s (expmod2 base (/ exp 2) mod))) (if (sqrt-of-1? s mod) 0 (* s s)))) (else (* base (expmod2 base (- exp 1) mod)))) mod))) (define rabinmiller-prime? (lambda (n) (if (or (= n 0) (= n 1)) #f (let rabinmiller-tests ((i 5)) (if (= i 0) #t (let ( (a (+ 1 (random (- n 1)))) ) (if (= (expmod2 a (- n 1) n) 1) (rabinmiller-tests (- i 1)) #f)))))))

Notice however that we still only test a number 5 times. There is a proven number of times the exponentiation needs to be done in order for the test to be deterministic. Chrid Caldwell’s Primality Proving website lists the following result:

Miller’s Test[Miller76]:If the extended Riemann hypothesis is true,then ifnis ana-SPRP for all integersawith 1 <a< 2(logn)^{2}, thennis prime.

So, in fact we could adjust the code to generate the range between 2 and :

(define rabinmiller-true-prime? (lambda (n) (cond ((or (= n 0) (= n 1)) #f) ((= n 2) #t) ((even? n) #f) (else (let rabinmiller-tests ( (test-range (range 2 (min (- n 1) (* 2 (square (floor (log n))))))) ) (if (null? test-range) #t (let ((a (car test-range))) (if (= (expmod2 a (- n 1) n) 1) (rabinmiller-tests (cdr test-range)) #f))))))))

Let’s try some primitive benchmarks. I’m going to try and get the first 5000 prime numbers by filtering the stream of all naturals and benchmarking (subtracting endtime and starttime) for each method above including a sieve approach.

For that, let’s introduce a couple of new constructs for manipulating streams:

(define naturals (let make-naturals ((i 0)) (cons-stream i (make-naturals (+ i 1))))) (define filter-stream (lambda (f s) (let ( (rest (delay (filter-stream f (force (cdr s))))) (head (car s)) ) (if (f head) (cons head rest) (force rest))))) (define take (lambda (n s) (if (= n 0) '() (let ( (rest (take (- n 1) (stream-cdr s))) (head (stream-car s)) ) (cons head rest)))))

The code for the sieve is very basic (done in a lab at school):

(define sieve (lambda (s) (cons (car s) (delay (sieve (filter-stream (lambda (x) (not (= (remainder x (car s)) 0))) s)))))) (define naturals-from-2 (force (cdr (force (cdr naturals))))) (define primes (sieve naturals-from-2))

The results are very interesting:

1 ]=> (benchmark (delay (take 5000 (filter-stream rabinmiller-prime? naturals))))

;Value: 7.3799999999999955

1 ]=> (benchmark (delay (take 5000 (filter-stream rabinmiller-true-prime? naturals))))

;Value: 97.03999999999999

1 ]=> (benchmark (delay (take 5000 (filter-stream prime? naturals))))

;Value: 1.8300000000000125

1 ]=> (benchmark (delay (take 5000 primes)))

;Aborting!: out of memory

;GC #32: took: 0.37 (51%) CPU time, 0.47 (57%) real time; free: 14882

;GC #33: took: 0.34 (100%) CPU time, 0.36 (99%) real time; free: 14938

Something is seriously wrong with the sieve based approach…

The interesting part is that it works great up about 2390:

1 ]=> (benchmark (delay (take 2390 primes)))

;Value: .00999999999999801

1 ]=> (benchmark (delay (take 2400 primes)))

;Aborting!: out of memory

;GC #58: took: 0.34 (51%) CPU time, 0.34 (50%) real time; free: 13773

;GC #59: took: 0.33 (100%) CPU time, 0.34 (99%) real time; free: 13806

This should be the fastest of all approaches but it just dies.

I’d really like some help here! Volunteers?

Also, it’s interesting to note that while the Rabin-Miller test performs better for testing huge numbers (only one at a time), but only by a small fraction (there is really not a big difference < 1sec for 32416190071, which is prime), both of these tests are ineffective against the simple divisor check fo smaller numbers as can be clearly seen from the benchmarks above. (disclaimer: I know that they are really really imprecise!)

Still, the **prime?** test beats **rabinmiller-prime?** by a factor of about **3** and **rabinmiller-true-prime?** by a factor of **over 10**.

What’s wrong? Clearly there are too many tests being done. For the biggest prime number I could fine on the Big Primes website, there are a total of 1250 = (* 2 (square (ceiling (log 32416190071)))) being performed.

You can find **the code** in my pp2011 github repo, here.

Update: I think however that there are **major issues** with the tests. Restarting Scheme makes a huge difference. Can anyone please confirm? I’m using MIT/GNU Scheme Release 9.0.1 under Mac OS X.

I would have liked to make a point about testing the primality of numbers in an efficient way. It turns out the point is: **I suck at it!😀**

## LaunchPad is here!

So, I’ve been busy with life these last weeks. But there are still interesting things going on at school — microcontrollers seem to be all the rage these days: we have 2 courses where we play with the critters.

Well, I now have my own (really cheap) board to play with: a Texas Instruments LaunchPad! I’m stepping through its temperature monitoring program as we speak

Fortunately, there are lots of tutorials on how to get started and if all goes well, we might have a small project to do with these dev kits. I leave you with some neat unpacking pics. I really wanted to capture my excitement (that wore off pretty quick) in getting my first hardware kit!

## FOSDEM 2011 Highlights: systemd

Lennart Pottering’s systemd is Fedora 15’s Upstart replacement, part of the current push to replace the aging SystemV init system. Different distributions have used various kind of init scripts, but the main problem has always been parallelization. The classic method of booting up a system is intrinsically serial.

This is what happens in a nutshell:

As the kernel first boots up, it launches its first process, with process id 1, called init. This special process is responsible for starting up every other process in the system, usually by executing specific scripts in /etc/init.d/ (or /etc/rc.d/ on BSD-like distros). The order in which services are started is vital as daemons such as crond are required for services like dbus to function.

The problem is *how to boot up the system faster? *This is especially important for servers (where uptime is vital) and embedded devices (which are expected to start instantly).

The *old* method, describing the dependencies between services and starting them according to a topological sort is inherently serial. You could start daemons that don’t depend on one another at the same time, but if two daemons both depend on dbus, dbus must start first. This creates inevitable bottlenecks which slow down the boot up.

In BSD-style systems (I’m actually about Arch Linux, it being the one I currently use), starting services serially is a significant problem. Particularly starting the network takes quite some time due to the DHCP daemon that takes a while to find a DHCP server. The order daemons are started is specified in /etc/rc.conf in a specific DAEMONS section and you can load daemons in the background for somewhat of an improvement, but it’s not clear which daemons *can* be configured to start in the background and not cause too much trouble

I’m not very familiar to comment on SysV style, but as far as I can tell (please correct me if I’m wrong), it just uses a different (more flexible) syntax to achieve the same thing.

**Upstart**, Ubuntu’s init replacement (one of the many to have popped up in recent years) does things differently. To cite the official description *“All jobs waiting on a particular event are normally started at the same time when that event occurs”. *This means that *all services* that depend on dbus are started immediately after dbus has finished starting. Although this certainly improves boot times by a fair margin (Ubuntu boots very quickly nowadays), the *bottleneck* is still there.

*dbus must finish loading before jobs which depend on it can start.*

Although a significant improvement, there is also a totally different way of doing things. Mac OS X for example, uses **launchd, **a system wide daemon and agent manager (agents are, in Mac OS X parlance daemons that run under specific user accounts) that starts everything at the same time.

**launchd** came up with the following very interesting observation: dependency between different processes is chiefly expressed through IPC. Simplifying, we’ll just talk about sockets. On Mac OS X, the system logger, **syslogd** receives log messages across UNIX sockets, kernel printf APIs etc. If the socket is available, another process (like **diskarbitrationd**, the daemon responsible for disk mounting, filesystems and disk access) can communicate with **syslogd**.

**launchd** and **systemd** first allocate sockets (actually Mach ports on Mac OS X I think) to all daemons that need to start and they actually start a daemon when its socket is actually needed.

What happens is something like (disclaimer: very sketchy understanding of how this *actually works*): for each possible daemon, allocate its resources (sockets, ports…), keep the mapping from daemons to resources and pretend everything has started (while listening on every socket). Actually trigger loading of every daemon at the same time. If a daemon requires a socket that belongs to a daemon that has not finished starting yet, the requests are buffered and ordered by the kernel and the requesting daemon blocks, waiting for an answer (could have a time-out in case of failure of course). Finally, when a daemon finishes starting up, it gets all the requests buffered by the kernel without any fuss, as if it had just received them.

This is called socket-based activation and DBus-based activation works in a similar way, except that DBus is responsible for the buffering, not the kernel.

**systemd** can therefore boot up a computer much faster than other approaches. Other features are the obligatory, daemon babysitting, which is really quite convenient especially since socket buffering is managed externally so reloading a faulty daemon doesn’t automatically clear its message queue (socket-daemon matchings are preserved), the on-demand starting of daemons (there’s no point in starting CUPS if there’s no printer in sight) etc.

For more information, check out **systemd**‘s home page, at freedesktop.org and a very well written description by Lennart Pottering himself of what **systemd** is and how it works.

You can also find more information about Upstart from Scott James Remnant’s blog (I linked to an overview of Upstart vs. launchd).

Finally, you can find out more about **launchd** from the MacOS forge website. **launchd** is also open source under the Apache license, even if written completely by Apple. There was a project to use it for FreeBSD but that apparently fizzled. Perhaps unsurprisingly, because launchd failed to gather meaningful contributions, the last commit in its trunk is from 15 months ago. You can also check out a presentation about **launchd** I held for the Use of Operating Systems course held in Fall 2009 at school (unfortunately, it’s in Romanian and the demo doesn’t work properly).

Are so many **init** replacements a good thing? There only used to be simple init-scripts and you either had BSD-style or SystemV scripts. Easy to use shell scripts. Nowadays, init replacements are vastly different among Unices.

BSDs and most Linux distros continue to use traditional init scripts (their specific flavors anyway), Ubuntu uses **Upstart**, as do Chrome OS and Fedora. Fedora will switch to **systemd** this may with its new release possibly followed by OpenSUSE. It seems that the division between deb-based and rpm-based distributions grows even wider. Mac OS X has never used conventional init-scripts, replacing them from the beginning with SystemStarter which has in turn been replaced as of 2004 (?) by **launchd**. Solaris (from version 10+) on the other hand, uses something *completely different:* the Service Management Facility, you can find out more about it from Oracle’s official documentation here.

Say hello to Fragmentation. Or is it just Healthy Differentiation?