So, now I have a `StreamingKMeans` implementation (improved Ted’s) and I did lots of cleanups and bugfixes [1]. Also, there’s a new MapReduce version in `experimental/`, in both `main/` and `test/`. `StreamingKMeansDriver` implements the command line tool. The code is in a new branch appropriately titled `mapreduce` where I’ll do most development from now on.

Finally, there’s a new `EvaluateClustering` class that tries the data on a `SequenceFile` of TF-IDF vectors generated using Mahout’s `seqdirectory` and `seq2sparse` tools [2].

[1] https://github.com/dfilimon/knn

[2] https://cwiki.apache.org/MAHOUT/quick-tour-of-text-analysis-using-the-mahout-command-line.html

]]>

Recall that the first step of every iteration in Lloyd’s is to classify each point (assign it to its nearest cluster). To do this, we need to calculate the distance between the current $d$-dimensional vector and each of the $k$ centroids using some distance metric, like the Euclidean distance $\lVert a - b \rVert$ or the cosine of the angle between the vectors, $\frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$.

In any case, calculating the distance to each center (of which there are $k$) and assigning the point to the closest one takes $O(kd)$ per point (we also need to go through the $d$ dimensions of the vector), so $O(nkd)$ in total per Lloyd’s iteration.

We can get a faster algorithm by trading optimality (we no longer always get the closest neighbor) for speed. One thing that adds quite a bit of overhead is $d$ (the dimension of the vectors), because we need to loop through all the dimensions to calculate a distance. Random projections can really help with this. The idea is to sample a set of projection vectors (with normally distributed components; see [3] for a more detailed lemma) and project the existing vectors (the ones we want to search for the nearest neighbor in) onto them. Then, we project the vector we’re looking for, and the value of its projection will be close to the values of the projections of its nearest neighbors. The scalar projection is one-dimensional, and we only need to compute the projections of the points we’re searching in once. We can get an idea of how close two vectors are to one another by looking at these values. Of course, since we’re collapsing $d$ dimensions into just one, we lose information, but it turns out that we only introduce false positives (points that were further apart are brought closer by projecting them).

Let’s look at a bit of math first. Suppose our projection vector is $u$ and the vector we want to project is $v$. If $\lVert u \rVert = 1$ (the norm of the vector, a positive real number), the value of the scalar projection is $u \cdot v$. Now, suppose we’re trying to compute how close two vectors $a$ and $b$ are, and we’re using the Euclidean distance. We would compute $\lVert a - b \rVert = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$. If instead we project them onto some vector $u$, we would compare $p_a$ and $p_b$, where $p_a = u \cdot a$ and $p_b = u \cdot b$.

For the Euclidean distance, we can skip extracting the square root altogether, since $\sqrt x$ is a monotonically increasing function. Why is it that we never get false negatives? What do we mean when we say that when two vectors are “close together”, their projections are also “close together”? Let’s compare $\lVert a - b \rVert$ and $|p_a - p_b|$, and see whether the actual distance is always greater than or equal to the estimate. Expanding, $|p_a - p_b| = |u \cdot (a - b)| = \lVert u \rVert \, \lVert a - b \rVert \, |\cos\theta| = \lVert a - b \rVert \, |\cos\theta|$, where $\theta$ is the angle between $u$ and $a - b$. For the projected values to faithfully reflect closeness, we want $|\cos\theta|$ to be large, i.e. $u$ to point roughly along $a - b$; as $\theta$ grows towards $90^\circ$, the estimate drops towards zero.

So, since $|\cos\theta| \le 1$, we know that $|p_a - p_b| \le \lVert a - b \rVert$: the estimated distance never exceeds the actual distance.

This means close points always project to close values, but far points can also project to close values when the angle is unfavorable, which makes the same argument less convincing in that case. You can argue that the inequality still holds based on limits, but here’s a cool picture instead!

This is a heat map comparing the projected distance against the actual distance as the angles involved vary.

You can clearly see that, in the worst case, the estimate matches the actual distance, but most of the time it is smaller, and there are some awesome patterns.
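The inequality is easy to check numerically. Here’s a small sketch (plain Python, with an invented dimension) that verifies that the projected distance never exceeds the actual Euclidean distance, i.e. that projection only ever produces false positives:

```python
import math
import random

random.seed(42)
d = 100  # arbitrary dimension for the experiment

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# A unit-norm projection vector with normally distributed components.
u = [random.gauss(0, 1) for _ in range(d)]
n = norm(u)
u = [x / n for x in u]

# For many random pairs (a, b), the estimate |u . (a - b)| is always
# bounded by the actual distance ||a - b||.
for _ in range(1000):
    a = [random.gauss(0, 1) for _ in range(d)]
    b = [random.gauss(0, 1) for _ in range(d)]
    actual = norm([x - y for x, y in zip(a, b)])
    estimate = abs(sum(ux * (x - y) for ux, x, y in zip(u, a, b)))
    assert estimate <= actual + 1e-9
```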

So yes, projection only ever produces false positives! How does this help us search faster? There are two main approaches, both of which involve multiple projection vectors; just one is not precise enough.

Suppose we use $p$ projection vectors (whose components are sampled from a normal distribution, for probability guarantees I don’t fully understand… look up the lemma in [3]).

For each random projection vector, compute the scalar projections of the vectors we’re searching among (the centroids, for k-means) and keep them sorted (either in an array, or a binary search tree, …).

For a given query, project it onto each of the projection vectors and look up its scalar projection in the corresponding sorted set of projections built for the initial vectors. Looking up a projection should take $O(\log k)$ steps. Get a ball of elements around that position and collect them (if we want the closest $s$ overall, we can use a heap to keep track of the best ones).

Perform the complete distance calculation for the points that were saved and select the closest one (or the closest ones).
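The three steps above can be sketched in a few lines of Python. This is a toy illustration with invented sizes, using a sorted array plus binary search in place of whatever structure a real implementation would use:

```python
import bisect
import random

random.seed(1)
d, k, num_proj, ball = 20, 50, 4, 5   # invented sizes

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

centroids = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
proj_vectors = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_proj)]

# Pre-compute, for each projection vector, the centroids sorted by
# their scalar projection (this only has to be done once).
sorted_projs = [sorted((dot(u, c), i) for i, c in enumerate(centroids))
                for u in proj_vectors]

def approx_nearest(q):
    # Binary-search each sorted list for the query's projection, collect a
    # small ball of candidates, then compute exact distances only for those.
    candidates = set()
    for u, pairs in zip(proj_vectors, sorted_projs):
        pos = bisect.bisect_left(pairs, (dot(u, q), -1))
        for _, i in pairs[max(0, pos - ball):pos + ball]:
            candidates.add(i)
    return min(candidates,
               key=lambda i: sum((x - y) ** 2
                                 for x, y in zip(q, centroids[i])))
```

The expensive exact distance is computed for at most `num_proj * 2 * ball` candidates instead of all $k$ centroids, which is where the speedup comes from.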

So, the complexity of this algorithm (without the heap) is: generate the projections, project the centroids once, and then, for each query point, project it, look up its scalar projections in $O(\log k)$ each, and calculate the actual distance only to the few vectors around it. It turns out that even for high dimensions, the probability of finding the nearest cluster reaches 80% by combining multiple projections [2]. The biggest deal is depending less on $k$ (the number of points that might be neighbors).

Also based on random projections: rather than looking at all $d$ dimensions, how about generating a hash? The idea is similar to the Rabin-Karp string-matching algorithm, where expensive string comparisons are only done when required (i.e. the hashes match). Here however, if two vectors are close, their hashes should also be close (rather than uniformly distributed, as required for a hash table).

So, instead of looking at the scalar projection, we can look at just its sign: a bit that is 1 if the projection is positive and 0 otherwise. Looking at the projection vector as the normal vector of a hyperplane in $d$-dimensional space, the sign of the scalar projection represents which side of the hyperplane the vector is on (“in front” or “behind”). If we flip a bit for each sign, and we have $p$ projection vectors, we can build a $p$-bit hash. In the end, we hash the vectors to search as well as our query vectors. We go through the vectors, but only perform the expensive comparison if the difference in hashes between a candidate and the query vector is below a certain threshold (which could be dynamic).
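Here’s a toy sketch of the sign-bit hash in Python (sizes invented), with the Hamming distance between hashes standing in for the cheap comparison:

```python
import random

random.seed(3)
d, num_bits = 30, 16   # invented sizes

# One random hyperplane (normal vector) per hash bit.
hyperplanes = [[random.gauss(0, 1) for _ in range(d)] for _ in range(num_bits)]

def lsh_hash(v):
    # Bit is 1 when v lies "in front" of the hyperplane, 0 otherwise.
    h = 0
    for u in hyperplanes:
        h = (h << 1) | (sum(x * y for x, y in zip(u, v)) >= 0)
    return h

def hamming(h1, h2):
    # Cheap comparison: how many hash bits differ.
    return bin(h1 ^ h2).count("1")

v = [random.gauss(0, 1) for _ in range(d)]
near = [x + 0.01 * random.gauss(0, 1) for x in v]   # tiny perturbation
far = [-x for x in v]                               # diametrically opposite
```

A vector and its small perturbation land on the same side of almost every hyperplane, so their hashes agree on almost every bit, while the opposite vector flips all of them; the full distance computation would only run when the Hamming distance is under the threshold.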

Unfortunately, I’m not exactly sure how LSH is implemented. I mean, yes: you hash the values, put them in a set, iterate over them checking whether they’re close enough, and then check the actual distances… but I read about what seemed like other ways of doing this in [1]. I should look more into that. Stay tuned!

Here are some more references:

[1] http://www.slaney.org/malcolm/yahoo/Slaney2008-LSHTutorial.pdf (nice visuals, pretty terse)

[2] http://web.engr.oregonstate.edu/~shindler/papers/FastKMeans_nips11.pdf (look for experimental results; also describes the new k-means algorithm I’m working on with Ted)

[3] http://yima.csl.illinois.edu/psfile/CVPR10-Hashing.pdf (contains a lemma about why the projection vectors need to be normally distributed)

]]>

I’ve been away for quite some time but am now ready to start what is probably the most important project I’ve worked on until now (excluding internships).

I’ll start work on my Bachelor’s Project soon which is going to be on top of Mahout.

Mahout is a scalable, open-source machine learning framework. Some of its algorithms are built on top of Hadoop. I realized that I really like maths and algorithms, and ideally I’d like to work on something that combines both. So, Machine Learning it is!

No, I didn’t just pick this because of the hype, I swear!

I started by e-mailing the dev@mahout.apache.org list, introducing myself and asking for guidance. Nobody answered at first, which was pretty disconcerting, but a few days later, when I pinged the list again, someone was willing to help.

Ted Dunning, to be exact. Ted has worked on Mahout for some time now and you can find some of his talks on YouTube here, here or here. The first one is a short interview about Hadoop from Strata 2012, and the other two are talks about Mahout (in particular, new ideas for clustering algorithms) in Boston and LA.


The latest thing he’s been working on and that I’ll be helping integrate into Mahout and benchmark is a fast single-pass k-means clustering algorithm.

k-means clustering is an iterative method by which $n$ data points (in $d$-dimensional space) are clustered into $k$ clusters based on each point’s proximity to each cluster. Here’s a description of the classic k-means algorithm, Lloyd’s algorithm:

First, the cluster centers are initialized randomly (there are better approaches); these new points are called centroids. They will (hopefully) have converged to the real centers by the end of the algorithm.

A clustering step first takes each of the $n$ points and computes its distance to each of the $k$ centroids, assigning it to the cluster whose centroid is closest to this point. After all the points are assigned a cluster, suppose cluster $i$ has the $m_i$ points $x_{i1}, \ldots, x_{im_i}$. The new centroid of cluster $i$ is the average of these points, $c_i = \frac{1}{m_i} \sum_{j=1}^{m_i} x_{ij}$.

The clustering step is performed until there are no more changes in the cluster assignment or centroids.
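The two steps above can be made concrete with a compact sketch of Lloyd’s algorithm in plain Python (no libraries; the fixed iteration count and toy data are mine, not Mahout’s):

```python
import random

def lloyd(points, k, iters=20, seed=0):
    """Plain-Python sketch of Lloyd's algorithm; points are tuples of floats."""
    rng = random.Random(seed)
    centroids = [tuple(p) for p in rng.sample(points, k)]  # naive random init
    for _ in range(iters):
        # Classification step: assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keeping the old centroid if the cluster ended up empty).
        centroids = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                     else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids
```

On two well-separated blobs this recovers the blob means; a real implementation would also check for convergence instead of running a fixed number of iterations.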

While conceptually simple and quite straightforward to code, this algorithm suffers from a number of drawbacks.

- The problem of clustering the data points is NP-hard in general. This algorithm uses a heuristic approach and might get stuck at a local optimum, depending on which points the centroids are initialized at. Suppose two centroids are initialized close to one another (in the same logical cluster); then, one of the clusters that we wanted to find will be split. So, multiple attempts should be made with different initial values for the centroids.

- There are cases where this algorithm takes exponential time (see http://en.wikipedia.org/wiki/K-means_clustering#cite_note-5 and http://en.wikipedia.org/wiki/K-means_clustering#Complexity), but the runtime is polynomial for practical cases.
- Even if the runtime is polynomial, assuming that the number of steps it takes to converge is constant (highly optimistic), the total complexity of the algorithm is $O(nkd)$ (calculating the minimum distance takes $O(kd)$ for one point). This is still too expensive for huge applications, especially since k-means is used as a step in more complicated algorithms (for example, in finding the nearest neighbors: rather than looking at all the points, if there are clusters, find the adjacent clusters and look for neighbors only there).

There are lots of uses for k-means and I’ll be able to tell you more about them as well as improvements to this algorithm, or completely new ideas as I delve deeper into the project.

For now, Ted’s code lives in https://github.com/tdunning/knn/ and he describes the ideas used to implement new solutions in docs/scaling-k-means/scaling-k-means.pdf. I’ll be talking about more details soon!

]]>

(define string-tokenize
  (lambda (s delims)
    (letrec ((string-tokenize-dirty
              (lambda (s delims)
                (cond ((null? s) (list '()))
                      ((let ((tokens (string-tokenize-dirty (cdr s) delims)))
                         (if (member (car s) delims)
                             (if (not (null? (car tokens)))
                                 (cons '() tokens)
                                 tokens)
                             (cons (cons (car s) (car tokens))
                                   (cdr tokens)))))))))
      (let ((tokens (string-tokenize-dirty (string->list s)
                                           (string->list delims))))
        (map list->string
             (if (null? (car tokens)) (cdr tokens) tokens))))))

> (string-tokenize " a few good men +b*c+d d " " *+")

("a" "few" "good" "men" "b" "c" "d" "d")

Any advice on how to make this tail-recursive?

Also, how can I manipulate strings like lists?

]]>

But today, let’s talk about primes. How to test if a number is prime and how to generate the stream of primes. And which method works best. And most importantly, why does the method I thought should work best run out of memory?

I’ll try testing for primes in 3 ways: finding relevant divisors (**prime?**), the Fermat primality test (**fermat-prime?**) and the Rabin-Miller primality test (**rabinmiller-prime?** and **rabinmiller-true-prime?**). After that, we’ll try to generate the stream of prime numbers, see which of these algorithms performs better, and also compare them with the tried and tested Sieve of Eratosthenes.

Let’s first have a look at everyone’s favorite primality test: searching for divisors up to the number’s square root:

(define divides?
  (lambda (a b)
    (= (remainder a b) 0)))

(define prime?
  (lambda (n)
    (cond ((or (= n 1) (= n 0)) #f)
          ((= n 2) #t)
          ((even? n) #f)
          (else
           (let prime-test ((d 3))
             (cond ((> (square d) n) #t)
                   ((divides? n d) #f)
                   (else (prime-test (+ d 2)))))))))

This is a really basic test that, for a given odd number $n$, checks all odd numbers $d$ starting from 3 to see whether or not they divide $n$. Obviously, a better approach would have been to do the test only for prime $d$, but there’s no way of figuring that out without first having all prime numbers less than $\sqrt n$ already. I remember implementing this in high school.

Let’s get to something more interesting, the Fermat test, which is nothing but a restatement of Fermat’s Little Theorem: *If $n$ is a prime number and $0 < a < n$, then $a^n \equiv a \pmod{n}$.*

So, everyone, open section 1.2 of the SICP and let’s have a look at exponentiation (fast-exponentiation to be exact). We’ll have this done in logarithmic time and we’ll also have the remainder calculated at each step to avoid the size of the numbers from getting out of hand:

(define expmod
  (lambda (base exp mod)
    (remainder
     (cond ((= exp 0) 1)
           ((even? exp) (square (expmod base (/ exp 2) mod)))
           (else (* base (expmod base (- exp 1) mod))))
     mod)))

There are some issues with the Fermat primality test. Most importantly, it is **not conclusive**. We’ll pick a fixed value of $a$ and test the condition in Fermat’s Little Theorem. If $n$ is prime, then the test is true for all values of $a$, but testing all values of $a$ would not be feasible.

Checking

(= (expmod a n n) a)

only proves that $n$ is *probably prime*, but there is no way of knowing for sure in only one test. Picking a few random values of $a$ and having the test succeed is a pretty good indicator of the primality of the number, and is in fact what the PGP program uses to test for primes.

So, our Fermat test would look something like:

(define fermat-prime?
  (lambda (n)
    (let fermat-tests ((i 5))
      (if (= i 0)
          #t
          (let ((a (+ 1 (random (- n 1)))))
            (if (= (expmod a n n) a)
                (fermat-tests (- i 1))
                #f))))))

I could have specified the number of tests to run as a parameter to **fermat-prime?** instead of 5. Finding a single failure is proof that the number is **not prime**. However, there are certain numbers that always verify this condition despite not being prime!

They are called Carmichael numbers and they are not detected by the **fermat-prime?** test:

1 ]=> (prime? 9746347772161)

;Value: #f

1 ]=> (fermat-prime? 9746347772161)

;Value: #t

In fact, 9746347772161 is composite: it’s a Carmichael number, so it fools the Fermat test even though it has nontrivial factors.

This is where the **Miller-Rabin test** comes in. It’s based on a slightly modified version of Fermat’s Little Theorem, with a small twist that gets rid of Carmichael numbers as well. Instead of fixing an $a$ and checking to see if $a^n \equiv a \pmod{n}$, it checks to see if $a^{n-1} \equiv 1 \pmod{n}$.

The key difference is that when performing the squaring step in the **expmod** function we check to see if the number we’re supposed to be squaring is a nontrivial square root of 1 modulo n.

Let’s explain what that actually means (courtesy of the Wiki article on the Miller-Rabin test). Let $n$ be a prime number. Clearly $1^2 \equiv (n-1)^2 \equiv 1 \pmod{n}$, for any $n$. Those are called the trivial square roots of 1. We’ll prove that if $n$ is prime, there can be no others: suppose there is another $s$ such that $s^2 \equiv 1 \pmod{n}$. Then $(s-1)(s+1) = s^2 - 1 \equiv 0 \pmod{n}$, so $n$ divides the product; being prime, it must divide one of the factors, meaning either $n \mid s - 1$ or $n \mid s + 1$, which means that either $s \equiv 1$ or $s \equiv -1$ (in the field $\mathbb{Z}_n$).

If we at some point in the exponentiation get a partial result `(s (expmod2 base (/ exp 2) mod))`

meaning that $s^2 \equiv 1 \pmod{n}$ without $s$ being either $1$ or $n - 1$, then by what we’ve proven above, $n$ **cannot be prime**, since we’ve just found a **non-trivial square root of 1**.

So, we can formulate a function that works with Carmichael numbers too:

(define sqrt-of-1?
  (lambda (s n)
    (if (or (= s 1) (= s (- n 1)))
        #f
        (= (remainder (* s s) n) 1))))

(define expmod2
  (lambda (base exp mod)
    (remainder
     (cond ((= exp 0) 1)
           ((even? exp)
            (let ((s (expmod2 base (/ exp 2) mod)))
              (if (sqrt-of-1? s mod)
                  0
                  (* s s))))
           (else (* base (expmod2 base (- exp 1) mod))))
     mod)))

(define rabinmiller-prime?
  (lambda (n)
    (if (or (= n 0) (= n 1))
        #f
        (let rabinmiller-tests ((i 5))
          (if (= i 0)
              #t
              (let ((a (+ 1 (random (- n 1)))))
                (if (= (expmod2 a (- n 1) n) 1)
                    (rabinmiller-tests (- i 1))
                    #f)))))))

Notice, however, that we still only test a number 5 times. There is a proven number of bases that need to be tested in order for the test to be deterministic. Chris Caldwell’s Primality Proving website lists the following result:

Miller’s Test [Miller76]: If the extended Riemann hypothesis is true, then if $n$ is an $a$-SPRP for all integers $a$ with $1 < a < 2(\log n)^2$, then $n$ is prime.

So, in fact, we could adjust the code to test the whole range between 2 and $2(\log n)^2$:

(define rabinmiller-true-prime?
  (lambda (n)
    (cond ((or (= n 0) (= n 1)) #f)
          ((= n 2) #t)
          ((even? n) #f)
          (else
           (let rabinmiller-tests
               ((test-range (range 2 (min (- n 1)
                                          (* 2 (square (floor (log n))))))))
             (if (null? test-range)
                 #t
                 (let ((a (car test-range)))
                   (if (= (expmod2 a (- n 1) n) 1)
                       (rabinmiller-tests (cdr test-range))
                       #f))))))))

Let’s try some primitive benchmarks. I’m going to try to get the first 5000 prime numbers by filtering the stream of all naturals, benchmarking (subtracting the start time from the end time) each method above, including a sieve approach.

For that, let’s introduce a couple of new constructs for manipulating streams:

(define naturals
  (let make-naturals ((i 0))
    (cons-stream i (make-naturals (+ i 1)))))

(define filter-stream
  (lambda (f s)
    (let ((rest (delay (filter-stream f (force (cdr s)))))
          (head (car s)))
      (if (f head)
          (cons head rest)
          (force rest)))))

(define take
  (lambda (n s)
    (if (= n 0)
        '()
        (let ((rest (take (- n 1) (stream-cdr s)))
              (head (stream-car s)))
          (cons head rest)))))

The code for the sieve is very basic (done in a lab at school):

(define sieve
  (lambda (s)
    (cons (car s)
          (delay (sieve (filter-stream
                         (lambda (x) (not (= (remainder x (car s)) 0)))
                         s))))))

(define naturals-from-2
  (force (cdr (force (cdr naturals)))))

(define primes (sieve naturals-from-2))

The results are very interesting:

1 ]=> (benchmark (delay (take 5000 (filter-stream rabinmiller-prime? naturals))))

;Value: 7.3799999999999955

1 ]=> (benchmark (delay (take 5000 (filter-stream rabinmiller-true-prime? naturals))))

;Value: 97.03999999999999

1 ]=> (benchmark (delay (take 5000 (filter-stream prime? naturals))))

;Value: 1.8300000000000125

1 ]=> (benchmark (delay (take 5000 primes)))

;Aborting!: out of memory

;GC #32: took: 0.37 (51%) CPU time, 0.47 (57%) real time; free: 14882

;GC #33: took: 0.34 (100%) CPU time, 0.36 (99%) real time; free: 14938

Something is seriously wrong with the sieve based approach…

The interesting part is that it works great up to about 2390:

1 ]=> (benchmark (delay (take 2390 primes)))

;Value: .00999999999999801

1 ]=> (benchmark (delay (take 2400 primes)))

;Aborting!: out of memory

;GC #58: took: 0.34 (51%) CPU time, 0.34 (50%) real time; free: 13773

;GC #59: took: 0.33 (100%) CPU time, 0.34 (99%) real time; free: 13806

This should be the fastest of all approaches but it just dies.

I’d really like some help here! Volunteers?

Also, it’s interesting to note that while the Rabin-Miller test performs better for testing huge numbers (only one at a time), it does so only by a small fraction (there is really not a big difference, < 1 sec, for 32416190071, which is prime). Both of these tests are ineffective against the simple divisor check for smaller numbers, as can be clearly seen from the benchmarks above. (Disclaimer: I know that they are really, really imprecise!)

Still, the **prime?** test beats **rabinmiller-prime?** by a factor of about **3** and **rabinmiller-true-prime?** by a factor of **over 10**.

What’s wrong? Clearly there are too many tests being done. For the biggest prime number I could find on the Big Primes website, there are a total of 1250 = `(* 2 (square (ceiling (log 32416190071))))` exponentiations being performed.

You can find **the code** in my pp2011 github repo, here.

Update: I think however that there are **major issues** with the tests. Restarting Scheme makes a huge difference. Can anyone please confirm? I’m using MIT/GNU Scheme Release 9.0.1 under Mac OS X.

I would have liked to make a point about testing the primality of numbers in an efficient way. It turns out the point is: **I suck at it!**

]]>

Well, I now have my own (really cheap) board to play with: a Texas Instruments LaunchPad! I’m stepping through its temperature monitoring program as we speak.

Fortunately, there are lots of tutorials on how to get started and, if all goes well, we might have a small project to do with these dev kits. I leave you with some neat unpacking pics. I really wanted to capture my excitement (which wore off pretty quickly) at getting my first hardware kit!

]]>

This is what happens in a nutshell:

As the kernel first boots up, it launches its first process, with process id 1, called init. This special process is responsible for starting up every other process in the system, usually by executing specific scripts in /etc/init.d/ (or /etc/rc.d/ on BSD-like distros). The order in which services are started is vital, as daemons such as dbus are required for other services to function.

The problem is: *how to boot up the system faster?* This is especially important for servers (where uptime is vital) and embedded devices (which are expected to start instantly).

The *old* method, describing the dependencies between services and starting them according to a topological sort is inherently serial. You could start daemons that don’t depend on one another at the same time, but if two daemons both depend on dbus, dbus must start first. This creates inevitable bottlenecks which slow down the boot up.

In BSD-style systems (I’m actually talking about Arch Linux, it being the one I currently use), starting services serially is a significant problem. In particular, starting the network takes quite some time, due to the DHCP daemon taking a while to find a DHCP server. The order in which daemons are started is specified in /etc/rc.conf, in a specific DAEMONS section, and you can load daemons in the background for somewhat of an improvement, but it’s not clear which daemons *can* be configured to start in the background without causing too much trouble.

I’m not familiar enough to comment on SysV style, but as far as I can tell (please correct me if I’m wrong), it just uses a different (more flexible) syntax to achieve the same thing.

**Upstart**, Ubuntu’s init replacement (one of the many to have popped up in recent years), does things differently. To cite the official description: *“All jobs waiting on a particular event are normally started at the same time when that event occurs.”* This means that *all services* that depend on dbus are started immediately after dbus has finished starting. Although this certainly improves boot times by a fair margin (Ubuntu boots very quickly nowadays), the *bottleneck* is still there.

*dbus must finish loading before jobs which depend on it can start.*

Although a significant improvement, there is also a totally different way of doing things. Mac OS X, for example, uses **launchd**, a system-wide daemon and agent manager (agents are, in Mac OS X parlance, daemons that run under specific user accounts) that starts everything at the same time.

**launchd** came up with the following very interesting observation: dependency between different processes is chiefly expressed through IPC. Simplifying, we’ll just talk about sockets. On Mac OS X, the system logger, **syslogd** receives log messages across UNIX sockets, kernel printf APIs etc. If the socket is available, another process (like **diskarbitrationd**, the daemon responsible for disk mounting, filesystems and disk access) can communicate with **syslogd**.

**launchd** and **systemd** first allocate sockets (actually Mach ports on Mac OS X, I think) for all daemons that need to start, and only actually start a daemon when its socket is needed.

What happens is something like this (disclaimer: very sketchy understanding of how this *actually works*): for each possible daemon, allocate its resources (sockets, ports…), keep the mapping from daemons to resources, and pretend everything has started (while listening on every socket). Then trigger the loading of every daemon at the same time. If a daemon requires a socket that belongs to a daemon that has not finished starting yet, the requests are buffered and ordered by the kernel, and the requesting daemon blocks, waiting for an answer (there could be a time-out in case of failure, of course). Finally, when a daemon finishes starting up, it gets all the requests buffered by the kernel without any fuss, as if it had just received them.

This is called socket-based activation and DBus-based activation works in a similar way, except that DBus is responsible for the buffering, not the kernel.
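The kernel-side buffering is easy to demonstrate with plain TCP sockets. In this sketch (Python, loopback only; a stand-in for what an init system does with real daemons), the listening socket exists before anyone accepts on it, a client connects and sends anyway, and the “daemon” picks the request up later without losing anything:

```python
import socket

# The "init system" creates and owns the listening socket first...
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
listener.listen(5)                # the kernel queues pending connections
port = listener.getsockname()[1]

# ...so a client can already connect and send, even though no daemon
# has called accept() yet; the kernel buffers the connection and the bytes.
client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"log me")

# Only now does the "daemon" start and accept the connection;
# the buffered request arrives as if it had just been sent.
conn, _ = listener.accept()
data = conn.recv(16)

conn.close()
client.close()
listener.close()
```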

**systemd** can therefore boot up a computer much faster than other approaches. Other features include the obligatory daemon babysitting, which is really quite convenient, especially since socket buffering is managed externally, so reloading a faulty daemon doesn’t automatically clear its message queue (socket-daemon matchings are preserved); the on-demand starting of daemons (there’s no point in starting CUPS if there’s no printer in sight); etc.

For more information, check out **systemd**‘s home page, at freedesktop.org and a very well written description by Lennart Pottering himself of what **systemd** is and how it works.

You can also find more information about Upstart from Scott James Remnant’s blog (I linked to an overview of Upstart vs. launchd).

Finally, you can find out more about **launchd** from the MacOS forge website. **launchd** is also open source under the Apache license, even if written completely by Apple. There was a project to use it for FreeBSD but that apparently fizzled. Perhaps unsurprisingly, because launchd failed to gather meaningful contributions, the last commit in its trunk is from 15 months ago. You can also check out a presentation about **launchd** I held for the Use of Operating Systems course held in Fall 2009 at school (unfortunately, it’s in Romanian and the demo doesn’t work properly).

Are so many **init** replacements a good thing? There only used to be simple init scripts, either BSD-style or SysV-style. Easy-to-use shell scripts. Nowadays, init replacements vary vastly among Unices.

BSDs and most Linux distros continue to use traditional init scripts (their specific flavors, anyway), and Ubuntu uses **Upstart**, as do Chrome OS and Fedora. Fedora will switch to **systemd** this May with its new release, possibly followed by OpenSUSE. It seems that the division between deb-based and rpm-based distributions grows even wider. Mac OS X has never used conventional init scripts, replacing them from the beginning with SystemStarter, which has in turn been replaced, as of 2004 (?), by **launchd**. Solaris (from version 10+), on the other hand, uses something *completely different*: the Service Management Facility; you can find out more about it from Oracle’s official documentation here.

Say hello to Fragmentation. Or is it just Healthy Differentiation?

]]>

Well, after passing some of the nastiest exams (or so I’ve been told), I’ve been in Brussels with the ROSEdu crew. Specifically, for FOSDEM, but everyone knows that was just a pretext to visit Belgium (for me at least).

Well, in trying to keep a *professional tone*, let me talk a bit about my favorite talks from this year’s FOSDEM. The keynotes are not available online as of today (February 12th), but I’ll try adding a link as soon as possible.

LLVM, although initially standing for Low Level Virtual Machine, is nowadays an umbrella project for an improved compiler infrastructure. Essentially, tools like gcc and gdb have significant portions of common functionality (parsing C/C++ code), and this work is done twice, using two different engines.

The talk, by Chris Lattner (mastermind of LLVM) himself, was introductory, and although it’s not online yet, here’s a link to an interview he gave for FOSDEM. The amusing part is that you won’t find an actual interview, only a compilation of information he himself provided (*“Because of his employer’s* [Apple] *policy, Chris couldn’t be interviewed…”*).

The core libraries are based around the *LLVM intermediate representation*, LLVM IR and can be targeted by various compilers (Clang being one of them). So, what happens is that code is compiled to LLVM IR which is then optimized and finally CPU-specific code is generated.

What a compiler does (very high level overview) is:

- parses the source language to an intermediate representation (like an abstract syntax tree)
- transforms the intermediate representation (possibly into other intermediate representations), improving performance and lowering the code to simplify machine code generation
- generates machine code from the intermediate representation

The first two steps are performed by the *front-end* and the last step by the *back-end*. The intermediate representation is the crux of the matter. Which one to choose so that parsing and machine code generation can be split? LLVM offers an intermediate representation that is well tested and quite mature (in development for over 10 years) and that can generate high quality machine code. So, with the back-end in place, there needs to be a front-end for each supported language.
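The parse / transform / generate split can be sketched in miniature. Here’s a toy in Python that abuses Python’s own `ast` module as the “IR” (purely illustrative; this is not how LLVM’s IR actually works):

```python
import ast

# Front-end: parse the source language into an intermediate representation
# (here, Python's own abstract syntax tree stands in for the IR).
tree = ast.parse("1 + 2 * (3 + 4)", mode="eval")

# Middle: transform the IR; this optimization pass folds constant
# sub-expressions, bottom-up.
class FoldConstants(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        if isinstance(node.left, ast.Constant) and isinstance(node.right, ast.Constant):
            ops = {ast.Add: lambda a, b: a + b, ast.Mult: lambda a, b: a * b}
            return ast.Constant(ops[type(node.op)](node.left.value, node.right.value))
        return node

folded = ast.fix_missing_locations(FoldConstants().visit(tree))

# Back-end: generate code from the transformed IR; here the whole
# expression has already been folded down to a single constant.
```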

This is where Clang comes in. Clang is a C/C++/Objective-C compiler that is faster than GCC, delivers faster code, and provides meaningful error messages. As wonderful as all that sounds (and I will try using it from now on — it’s not like my C/C++ code is *that* complex/exotic), there are cases where GCC and Clang behave differently.

For example, the keyword **inline** is treated differently in Clang than in GCC, but that’s because GCC doesn’t adhere to the C99 standard completely by default. Actually dealing with **inline** is surprisingly tricky apparently… although I never noticed before. I just thought it was a better way of doing macros. Have fun reading the rules here.

At any rate, there are lots of projects that want to compile to LLVM IR, like LDC (a compiler for the D language with an LLVM backend) or a GHC backend. The GHC backend is the one I find most interesting. You can read the thesis I linked to, by David Terei, who did all the work by replacing the Cmm language that GHC uses with LLVM IR. At least have a look for the details about GHC’s pipeline: starting with Haskell code, to HS (an in-memory representation of Haskell, with syntax represented, on which type checking is performed), Core (a version of lambda calculus with some extensions, just large enough to express Haskell), STG (the *Spineless Tagless G-Machine*, an abstract machine representation of Haskell that I don’t understand, but which has an awesome name :), and finally Cmm (a variation of the C-- language that represents a Haskell program in procedural form). This is the representation that is converted into LLVM IR.

In addition, LLDB uses libraries provided by LLVM and Clang to implement a more powerful debugger (faster and more memory efficient than GDB at loading symbols, according to the *hype*). Also, a new C++ standard library, libc++, is coming down the pipeline. A faster, less memory-hungry one, apparently.

Now, if this project is so dramatically better, everything is *so much faster*, why doesn’t everyone get to work on using it? Why don’t you hear about Ubuntu preparing to use it?

Well, as far as I can tell, although the LLVM project is under a less restrictive license than the GPL (i.e. it allows proprietary, binary-only extensions, which should be good, right?), it is still very much a work in progress. For one, C++0x support is unfinished, and only recently, on October 26th 2010 to be more specific, has Clang built a working Linux kernel. Other projects like llvm-gcc (a GCC 4.2 front-end) apparently work, whereas dragonegg (GCC 4.5, GPL3) is still buggy. And projects like libc++ exist (only?) because *“Mainline libstdc++ has switched to GPL3, a license which the developers of libc++ cannot use”*.

So, since Apple is the main sponsor of this project, it’s pretty clear that they don’t really appreciate the GPL3 (although I’m not sure whether or not it actually affects them — possibly by forcing them to open-source parts of Xcode?). As a consequence, all LLVM projects obviously work on Mac OS X, and LLVM has already been used successfully to convert some more advanced OpenGL functions not supported by Macs with Intel GMA chipsets into simpler subroutines to ensure correct operation. (Of course, one can also ask what in the world a shitty Intel GPU is doing inside a Mac; maybe next time they’ll not use piece-of-junk hardware?)

LLVM is featured most prominently in Xcode, Apple’s IDE for Mac OS X and iOS, and it’s quite clear that future versions of both OSes will no longer use GCC. In Xcode 4, there is support for LLVM 2.0, Fix-It (which basically detects possible errors at edit time thanks to some LLVM magic) and the LLDB debugger.

BSDs will probably follow, pouncing at the opportunity to no longer rely on a GPL3 compiler.

Whether or not this is good in the long-term is still debatable. Some may view Apple’s involvement in the LLVM project with suspicion and there might be some hesitation to switch entire Linux distributions to Clang. Whatever happens, LLVM is clearly here to stay. Hopefully cross-platform support gets better, although right now, it’s pretty clear that Mac OS X is definitely the priority. Here’s to better compilers for everyone!

Oh, and Brussels was a lot of fun!

]]>

In fact, in programming contests, where the pressure is always on winning, I would feel immensely stupid for not getting it. Anyway, after reading up a bit on it in CLR, I ended up understanding a bit more about how it works.

This post is about a problem given in this month’s USACO contest (Silver Division), called **divgold**, where you are tasked with solving the Balanced Partition problem. It’s a fairly well-known example, but I was unfamiliar with it. You can read all the problems by viewing the January contest on the USACO Contestgate. You need to register though.

You are given $N$ numbers, let’s call them $a_1, a_2, \ldots, a_N$, and you must find the number of ways to partition these numbers into two groups $P_1$ and $P_2$ such that:

- $P_1 \cup P_2 = \{a_1, \ldots, a_N\}$ and $P_1 \cap P_2 = \emptyset$;
- $|S_1 - S_2|$ is minimized, where $S_1 = \sum_{a \in P_1} a$ and $S_2 = \sum_{a \in P_2} a$.

And you must find the minimum difference itself, $m = \min |S_1 - S_2|$.

Let’s make some notes before actually talking about the solution (which – surprise! – involves dynamic programming). First of all, it is $NP-Complete$. This can of course be proven quite easily, and we will actually do it, since the proof contains the key idea that helps solve the problem.

**Warning! Theoretical stuff ahead. If you just want to learn how to solve the problem, read the official analysis!**

So, when proving that Balanced-Partitions $\in NP-Complete$, we’ll first prove that Balanced-Partitions $\in NP$ by devising a non-deterministic algorithm that solves it.

Note however that the problem wants to find the number of ways to partition the numbers. It is **not** a decision problem. We’ll instead redefine it to ask whether you can get a **difference of exactly $m$** between the two sets. So we’re completely ignoring the number of ways to get that difference; we’re only interested in whether getting it is possible or not. We’ll see how to actually compute the number of ways later.

We’ll use **choice**, **success** and **fail** to describe it. I think that the simplest way to build the actual partition is to generate all possible 0/1 assignments and get the sums of the two resulting sets of numbers (those who were given a 0, and those who were given a 1).

```
M-Balanced-Partition(A, N, m):
    S1 = S2 = 0
    for i = 1 .. N:
        if choice(0, 1) == 1:
            S1 += A[i]
        else:
            S2 += A[i]
    if abs(S1 - S2) == m:
        success
    fail
```

So, what this simple algorithm does is assign each element $a_i$ of $A$ a value of either 0 or 1. If the value is 1, it’s in the first set; otherwise, it’s in the second set. By generating all of the possibilities using **choice** we will clearly determine whether or not obtaining the difference we need is possible.

The actual complexity of the algorithm itself is $O(N)$, since we partition the array in one go (lines 3-7). Since $O(N)$ is polynomial, we conclude that Balanced-Partitions $\in NP$.
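Since **choice** doesn’t exist on real hardware, the non-deterministic algorithm above can be simulated deterministically by simply enumerating all $2^N$ 0/1 assignments as bit masks. A minimal C++ sketch (the function name is mine, not from the contest code):

```cpp
#include <cassert>
#include <vector>

// Deterministic simulation of M-Balanced-Partition: try every 0/1
// assignment (encoded as a bit mask) and report whether some
// assignment yields a difference of exactly m. Runs in O(2^N * N).
bool balancedPartitionBrute(const std::vector<int>& a, int m) {
    int n = (int)a.size();
    for (int mask = 0; mask < (1 << n); ++mask) {
        long long s1 = 0, s2 = 0;
        for (int i = 0; i < n; ++i) {
            if (mask >> i & 1) s1 += a[i];  // bit 1: first set
            else               s2 += a[i];  // bit 0: second set
        }
        long long d = s1 - s2;
        if (d < 0) d = -d;
        if (d == m) return true;  // "success"
    }
    return false;  // "fail"
}
```

Of course, this exponential blow-up is exactly what the dynamic programming solution avoids.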

Let’s suppose that we want to know whether we can get a certain difference, $d$, and let $S$ be the total sum of the numbers. First note that if the two sets are $P_1$ and $P_2$ (note: we’ll also use $S_1$ and $S_2$ when referring to the sums themselves!) and we call $S_1$ the smaller sum, then $S_1 + S_2 = S$ and $S_2 - S_1 = d$, and therefore $S_1 = (S - d)/2$ and $S_2 = (S + d)/2$. In fact, no matter what we call these two sets, if we can form a set of sum $(S - d)/2$, we’ll clearly have obtained the other set as well (it’s this set’s complement!).

So, in fact, finding whether or not Balanced-Partitions has a solution is kind of equivalent to finding whether we can get a certain sum, let’s call it $Q = (S - d)/2$. This, however, is the Q-Sums problem, which is *known* to be $NP-Complete$. So, we’ll try reducing Q-Sums to Balanced-Partitions: Q-Sums $\le_p$ Balanced-Partitions. If we prove this to be true, then Balanced-Partitions $\in NP-Hard$, because it means that for any problem $P \in NP$, $P \le_p$ Q-Sums (Q-Sums is also $NP-Hard$!). So, by the transitive nature of the polynomial reduction relationship $\le_p$ (that’s what the funny symbol is) it follows that $P \le_p$ Balanced-Partitions too.

**Warning! Possible rambling ahead. Feel free to skip the following paragraph (Also, the picture is only chuckle-worthy if visiting from Facebook!) : )**

It’s important to note that the key element is reducing a problem that is already known to be $NP-Hard$ **to** our problem. That means, intuitively, that our problem is **at least as hard as** the original problem. If we did it in the opposite direction, it would be meaningless! Why? Well, consider the problem (not a decision problem, I know, but I can’t think of a better example, maybe a helpful comment, someone?) of sorting an array of integers. This clearly has lots of great polynomial time algorithms that solve it (my favorite being quicksort with a randomized pivot). But we could equally well sort an array of numbers by generating all possible permutations of the numbers and outputting the first one we generate that is sorted. This would run in exponential time though, as generating the permutations of a set is in $O(N!)$ (to be taken with a grain of salt, since it’s not actually a decision problem). So, **reducing** a problem we know nothing about **to a difficult problem achieves nothing**.

Ok, so, if you’re still following at this point, let’s get on with it and try to do the reduction itself. We must prove that we can solve Q-Sums using Balanced-Partitions and that there can be no false positives, i.e. for an input $x \in I_Q$, $I_Q$ being the set of inputs for Q-Sums, there is a polynomial time algorithm/function $f : I_Q \to I_B$, where $I_B$ is the set of inputs for Balanced-Partitions, such that $x$ is a yes-instance of Q-Sums if and only if $f(x)$ is a yes-instance of Balanced-Partitions.

If we want to find a set of sum $Q$, we’ve seen that there existing a difference $d$ is equivalent to there being a set of sum $(S - d)/2$. So, we’ll just map $d = S - 2Q$ and solve Balanced-Partitions for that value of $d$. It’s pretty clear that the two problems are equivalent, I would say; to prove it, we just “go back and forth” through the relationship between $Q$ and $d$.
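To spell the mapping out in symbols (this is just my restatement of the relationship described above):

```latex
f(A, Q) = (A, d) \quad \text{where} \quad d = S - 2Q, \quad S = \sum_{i=1}^{N} a_i,
```

and conversely $Q = (S - d)/2$, so a subset of sum $Q$ exists if and only if a partition with difference $d$ exists.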

Now, we’re finally content. Balanced-Partition is both in $NP$ and in $NP-Hard$, and we can finally conclude that it is $NP-Complete$.

Why go through all the trouble of proving that it’s $NP-Complete$ anyway? Well, as a nifty exercise, for one, but it also provides the essential insight that helps solve the initial problem from the USACO contest, **divgold**. There, we were tasked with finding the minimum difference as well as the number of ways to obtain it. Well, this illustrates the connection between this kind of optimum problem and its associated decision problem pretty clearly. To find the minimum difference, call it $m$, we’ll essentially start with $d = 0$ and keep increasing the difference until we find a $d$ that is achievable.

How far are we supposed to go anyway? Well, for one, the difference between the two sets can be at most $S$, the total sum, for a very loose upper bound (but I really think it should be something much tighter, like perhaps the maximum value in $A$, but I didn’t manage to prove this, not at this late hour anyway; I’m not even sure whether or not it’s true, to be honest). Anyway, the point is we’ll stop at some point, in at most $S$ steps.

How do we find out whether we can get a particular difference, though? Well, here the insight from the NP-Hardness proof comes in handy, since the transformation we’ve done between $d$ and $Q$ is bijective, and that means intuitively that we can also use it to solve Balanced-Partition using Q-Sums. In English, we want to know whether it’s possible to obtain a sum of $(S - d)/2$. What would happen if $S - d$ is odd? Well, we’d certainly not be able to get a non-integer sum out of integers. That doesn’t bother us one bit though, as we’ll soon see.

This is where dynamic programming comes in. Q-Sums is a problem with a pseudo-polynomial time algorithm. That simply means that there exists an algorithm that finds the answer in time polynomial in the numeric value of the input. That numeric value is, however, exponential in the number of bits used to represent it!

(Neat fact: problems that can be solved by such algorithms are called *weakly NP-Complete*. These are, as far as I understand, the only NP-Complete problems where dynamic programming can be applied. If, however, the numbers in the array had been real numbers, such a solution would not have been possible. And by *real numbers* I mean actual real numbers, not IEEE 754 floating point values, which are in fact rational.)

To use dynamic programming, we need to define an optimal sub-solution’s structure. This structure will be used to determine the larger sub-solutions, until we finally have the whole solution! If I mess up the explanation, feel free to use the extremely helpful tutorials of Brian Dean. He actually has many more dynamic programming examples there that you might find interesting.

So, we’re going to solve Q-Sums. We need the following structure (usually a matrix that adequately describes what exactly a sub-problem is with a couple of indices):

$N[i][j]$ = the number of ways to get the sum $j$ using the first $i$ numbers. To actually solve it, we need a recurrence relation. We need to think how to get from one state to the next, either with a top-down or a bottom-up approach (i.e. either forward or backward). Let’s look “back”, at $N[i - 1][j]$. We’re at the $i$-th number, $a_i$, and we can either choose to add it to the sum or not.

If we don’t add it, then we need to get the same sum $j$ without the current number, $a_i$, so we add $N[i - 1][j]$. Otherwise, to get $j$ in total, we’ll need $N[i - 1][j - a_i]$:

Recursion formula – $N[i][j] = N[i - 1][j] + N[i - 1][j - a_i]$ (the second term only when $j \ge a_i$).

The base case being – $N[0][0] = 1$ and $N[0][j] = 0$ for $j > 0$.

Note that we actually count the number of ways to get the desired sum here. It could have been the same decision problem we talked about if we performed the same calculations modulo 2 (or used a Boolean data type and interpreted the ‘+’ operation to mean logical OR).

At any rate, for an array $A$ of length $n$ and total sum $s$, the answer to Q-Sums for a sum $q$ will be in $N[n][q]$. We want to check differences $j$ in a loop starting from $0$. Also, $s - j$ needs to be even.

```c
for (j = 0; (s - j) / 2 >= 0; ++j)
    if ((s - j) % 2 == 0 && N[n][(s - j) / 2] > 0)
        break;
```

And so, the minimum difference will be $m = j$ and the number of ways to get it will be $N[n][(s - j)/2]$.
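Putting the table, the recurrence and the final scan together, the whole **divgold** computation might be sketched in C++ like this (the names and exact interface are mine; following the text, the count returned is $N[n][(s - j)/2]$, with no special-casing):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Returns {minimum difference m, number of ways to reach it}.
// ways[i][j] is the N[i][j] table from the text: the number of
// ways to pick a subset with sum j from the first i numbers.
std::pair<int, long long> divgold(const std::vector<int>& a) {
    int n = (int)a.size(), s = 0;
    for (int x : a) s += x;
    std::vector<std::vector<long long>> ways(n + 1, std::vector<long long>(s + 1, 0));
    ways[0][0] = 1;  // base case: one way to make sum 0 with no numbers
    for (int i = 1; i <= n; ++i)
        for (int j = 0; j <= s; ++j) {
            ways[i][j] = ways[i - 1][j];                  // skip a[i-1]
            if (j >= a[i - 1])
                ways[i][j] += ways[i - 1][j - a[i - 1]];  // take a[i-1]
        }
    // Scan differences j = 0, 1, 2, ... until one is achievable.
    for (int j = 0; j <= s; ++j)
        if ((s - j) % 2 == 0 && ways[n][(s - j) / 2] > 0)
            return {j, ways[n][(s - j) / 2]};
    return {s, 1};  // never reached: j = s always succeeds (empty subset)
}
```

One caveat worth noting: for $j = 0$ this counts each partition twice, since a subset of sum $s/2$ and its complement are counted separately; the formula in the text does the same, so I’ve kept it.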

You can in fact optimize this solution space-wise a lot (although this completely escaped my mind during the contest and I didn’t get full marks for this problem). Notice that we don’t use all of the rows of the matrix at once, only the last two; and really, we only need one row, since we’re only interested in $N[n][\cdot]$ and all of the updates can be done in place.

So, we can finally write a much better recurrence:

$N[j] = N[j] + N[j - a_i]$, for each $i = 1 \ldots n$, with $j$ going from $s$ down to $a_i$ (downwards, so that each number is used at most once),

where $N[j]$ is the number of ways to get a subset of sum $j$.
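A sketch of this space-optimized version in C++ (again, the names are mine), with the single row updated in place:

```cpp
#include <cassert>
#include <vector>

// One-row version of the table: ways[j] = number of subsets with
// sum j among the numbers processed so far.
std::vector<long long> subsetSumCounts(const std::vector<int>& a) {
    int s = 0;
    for (int x : a) s += x;
    std::vector<long long> ways(s + 1, 0);
    ways[0] = 1;  // the empty subset
    for (int x : a)
        for (int j = s; j >= x; --j)  // downwards: don't reuse x in this pass
            ways[j] += ways[j - x];
    return ways;
}
```

The final row is exactly the $N[n][\cdot]$ row of the full matrix, so the same parity scan over differences works unchanged, with $O(s)$ memory instead of $O(n \cdot s)$.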

You can find all of the implementations here.

Also, *dear reader*, I need your help with a couple of questions:

- What happens (in the NP-Hardness proof) when Balanced-Partition is taken to require the partitioning into two subsets whose sums have an absolute difference of **at most** $m$? Basically, replace the equality with an inequality.

  I was thinking about calling Balanced-Partition twice, with $m$ and $m - 1$, to see if you get different answers. If you do, then you might not get the sum in the Q-Sums instance you were looking for. But does that mean it doesn’t exist?
- Is it true that the maximum difference between the two partitions in the optimum case can be no larger than the maximum number in the array $A$?

I feel that this is true, but can’t think of any good reason right now. Assistance appreciated!

]]>

It’s called the **Land of Lisp** and is about… well… Lisp! What makes it stand out from those dreadfully boring **Structure and Interpretation of Computer Programs** books they use(d) at MIT or Berkeley is that this one has tons of drawings and is lots of fun to read, even if only as a comic.

So, have a look at his website, which features the most hilarious promotional video ever. And an epic comic as well!

Oh, and be sure to also check out lisperati.com!

How can you resist something that is… made with secret alien technology?

It’s really such a shame that I’ll have my first final exam this Wednesday … I won’t be able to start reading the book.

I also promise to:

Write about the FDC [Free Development Course]… or Victor will have my head

]]>