Friday 17 October 2008

Coding case-study: (re-) writing a clustering tool

[photo by jimkster]

Over the summer, Rich has been writing a novel clustering algorithm for use with gene expression and other similar types of data.  There are some interesting facets to this (scientifically-motivated) project which illustrate some of the considerations the scientist-programmer can encounter when working on a software project.  This post is about what we did and why!

The task
To write a software tool that implements a particular probabilistic clustering algorithm.  The method is a neat fusion of computer science (a greedy, agglomerative algorithm) and probability theory (handling data uncertainty and giving us a natural way to choose the number of clusters), and it's going to get used on some pretty big data-sets.  The language of choice is C++, for reasons we've already discussed here.
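To make "greedy, agglomerative" concrete, here's a toy sketch of the loop at the heart of such an algorithm.  This is emphatically not our actual code: the mergeScore() function below is a stand-in invented for this post (the real score comes from the probabilistic data model), but the shape of the loop and the stopping rule, which is what chooses the number of clusters for us, are the essence of the approach.

    // Toy sketch of greedy agglomerative clustering.  mergeScore() is an
    // invented stand-in: positive means "merging these two clusters is
    // better than keeping them apart".
    #include <cmath>
    #include <iostream>
    #include <limits>
    #include <vector>

    struct Cluster { std::vector<double> points; };

    double mean(const Cluster& c) {
        double s = 0.0;
        for (double x : c.points) s += x;
        return s / c.points.size();
    }

    double mergeScore(const Cluster& a, const Cluster& b) {
        return 1.0 - std::fabs(mean(a) - mean(b));  // stand-in score
    }

    int main() {
        std::vector<double> data = {0.1, 0.2, 0.15, 5.0, 5.1, 4.9};
        std::vector<Cluster> clusters;
        for (double x : data) clusters.push_back(Cluster{{x}});

        // Repeatedly merge the best-scoring pair; stop when no merge
        // improves things -- this is how the number of clusters is chosen.
        while (clusters.size() > 1) {
            double best = -std::numeric_limits<double>::infinity();
            std::size_t bi = 0, bj = 1;
            for (std::size_t i = 0; i < clusters.size(); ++i)
                for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                    double s = mergeScore(clusters[i], clusters[j]);
                    if (s > best) { best = s; bi = i; bj = j; }
                }
            if (best <= 0.0) break;
            clusters[bi].points.insert(clusters[bi].points.end(),
                                       clusters[bj].points.begin(),
                                       clusters[bj].points.end());
            clusters.erase(clusters.begin() + bj);
        }
        std::cout << "Found " << clusters.size() << " clusters\n";
    }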


Starting point
A couple of implementations actually existed already for a special case of the data model in this method (for binomial data rather than multinomial, since you asked).  One is written in Matlab, which is too slow for our purposes but served as a useful reference.  The second is written in C++, making it much more useful to us.  It's fairly well written, although not object-oriented, and it's been tested and shown to work.  This, then, was a good place from which to start coding.


One other important point in this project is that Rich, despite being pretty experienced at scientific programming, hadn't written much C or C++ before.  So, it was very much a learning experience, and this affected how ambitious (or not) each step of the project could sensibly be.

Upgrading the data model
The first goal of the project was to produce a version of the code that had the more general data model.  This would make it immediately useful for doing some science, so it was a goal we wanted to reach promptly.  In addition to the data model, there were three important aspects of the code that it might be useful to change.  Firstly, the code structure used to handle the data was a bit inflexible and would need to change to accommodate the more general data type.  Secondly, the variable names weren't always very informative, and any subsequent coding would progress more quickly if this was corrected.  Finally, the code wasn't object-oriented, which had implications for how easy the code would be to change/improve/debug.  In particular, there were some longer-term extensions to the method that OO would help with a lot.


Changing the data structure was necessary to implement the new data model, so we had to do this.  We also decided to use query-replace to change the variable names to be more meaningful (this was actually very helpful in working with the unfamiliar code).  Object-orientation would be a huge change to the code, so the idea was noted down but left for now.

With our objectives decided, we could begin.  A couple of weeks of code-writing, closely allied to reading books and websites to learn more C++, led to a modified set of code.  This wasn't the most elegant code ever written, but we'd taken our time and were careful not to overstretch ourselves.  So, we thought we had code that would work.  But now we needed to know for sure, so we needed to test it.

Happily, the new data model is a super-set of the old one.  This is hugely useful here because it means we can run test data from the old model through our new code and we know what answer we should get (assuming the old code gives the correct answer, which it should, as it's been well-tested).  After this, we ran some tests with modified data that would use the full range of the new model.  Then finally we ran some real data that our collaborators had already analysed using some standard clustering methods and where the underlying science was well-understood.  This allowed us to examine the results to see if they were scientifically plausible (they were).
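The first of those checks can be automated very simply.  Here's a sketch of the sort of comparison utility we mean (the file names and format are made up for this post): because the greedy algorithm is deterministic, running the old and new code on the same binomial test data should produce identical cluster labels, so a line-by-line comparison of the two label files is a fair test.

    // Sketch: compare per-gene cluster labels from the old and new code.
    // Assumes each program writes one label per line, in the same gene order.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char* argv[]) {
        if (argc != 3) {
            std::cerr << "usage: compare_labels old.txt new.txt\n";
            return 2;
        }
        std::ifstream oldLabels(argv[1]), newLabels(argv[2]);
        std::string a, b;
        int line = 0, mismatches = 0;
        while (std::getline(oldLabels, a) && std::getline(newLabels, b)) {
            ++line;
            if (a != b) {
                std::cout << "mismatch at gene " << line << ": "
                          << a << " vs " << b << "\n";
                ++mismatches;
            }
        }
        std::cout << mismatches << " mismatches in " << line << " genes\n";
        return mismatches == 0 ? 0 : 1;
    }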

We had therefore produced a scientifically usable software tool.

We also do requests...
While our scientist-buddies were happily starting to use this tool we'd built, we realised that we already had a need for a version 2.0.  Specifically, there was another data model (time-series) that the biologists would find it really useful to be able to analyse.  This would be quite a bit different to handle in terms of the model itself, but the greedy algorithm is actually unchanged except for a single, data-dependent calculation that it needs to perform a number of times, for different data subsets.


This was what we needed object-orientation for.  By defining a series of DataSet classes (actually a super-class, with one sub-class for each type of data), we could build the whole algorithm once then just add a single new class for each new type of data and hence data model.  Rich was already familiar with object-orientation, so just had to learn the specifics of doing it in C++, meaning that we weren't being over-ambitious in considering this.
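To give a flavour of the design (with names invented for this post, not the actual class names), the structure looks something like the sketch below.  The greedy algorithm only ever talks to the base class, and the single data-dependent calculation is a virtual function, so a new data model really is just one new sub-class.

    // Sketch of the class structure.  Names and the stub bodies are
    // illustrative; the real calculations live in each data model.
    #include <iostream>
    #include <vector>

    class DataSet {
    public:
        virtual ~DataSet() = default;
        // The one data-dependent calculation the algorithm needs:
        // how probable is it that these items form a single cluster?
        virtual double logMarginalLikelihood(
            const std::vector<int>& items) const = 0;
    };

    class MultinomialDataSet : public DataSet {
    public:
        double logMarginalLikelihood(
            const std::vector<int>& items) const override {
            return 0.0;  // multinomial data model would go here
        }
    };

    class TimeSeriesDataSet : public DataSet {
    public:
        double logMarginalLikelihood(
            const std::vector<int>& items) const override {
            return 0.0;  // time-series data model would go here
        }
    };

    // The algorithm is written once, against the base class.
    double scoreMerge(const DataSet& data,
                      const std::vector<int>& a, const std::vector<int>& b) {
        std::vector<int> merged(a);
        merged.insert(merged.end(), b.begin(), b.end());
        return data.logMarginalLikelihood(merged)
             - data.logMarginalLikelihood(a)
             - data.logMarginalLikelihood(b);
    }

    int main() {
        TimeSeriesDataSet data;  // swap in any data model here
        std::cout << scoreMerge(data, {0, 1}, {2, 3}) << "\n";
    }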

Re-building the code was really building version 2.0.  We did reuse various chunks of our code, but the new code base was substantially different.  All told, this was several weeks more work, including the class for the new data model, but we decided that this would be worth our effort.  We now had a nicer code base and one that was flexible enough that we could extend it easily to any new data model we wanted.

The new data model turns out to be very CPU-intensive, so we needed to think about optimisation.  Some quick profiling showed that we weren't doing anything stupid, so we needed to think about algorithmic optimisations to get the speed-up we required (we ideally wanted several orders of magnitude of improvement, so we could handle really large data-sets).  A few days of maths found us a way to use particular structure in the data to make the code a couple of orders of magnitude faster, and some smarter numerical optimisation routines had a similar impact.  This still left the largest data-sets unwieldy (we have some more ideas we're working on), but the tool was now useful for many data-sets and we had made huge improvements.  We had also done so without making the code much less flexible, although it had taken a special new matrix class and about 200 lines of code to do so.
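To illustrate the flavour of algorithmic optimisation we mean (a standard trick for greedy agglomeration, offered as an example rather than the specific structure we exploited): at each step of the loop, only scores involving the newly-merged cluster change, so caching the score matrix and refreshing a single row avoids redoing all O(n^2) score calculations every iteration.

    // Sketch: cache pairwise merge scores and refresh only the row for the
    // cluster that just changed.  mergeScore() is an invented stub.
    #include <cstddef>
    #include <iostream>
    #include <vector>

    double mergeScore(const std::vector<int>& a, const std::vector<int>& b) {
        return -static_cast<double>(a.size() + b.size());  // stub
    }

    // After clusters[m] absorbs another cluster, only scores involving m
    // change, so we recompute n-1 entries instead of n*(n-1)/2.
    void refreshScores(std::vector<std::vector<double>>& score,
                       const std::vector<std::vector<int>>& clusters,
                       std::size_t m) {
        for (std::size_t k = 0; k < clusters.size(); ++k) {
            if (k == m) continue;
            score[m][k] = score[k][m] = mergeScore(clusters[m], clusters[k]);
        }
    }

    int main() {
        std::vector<std::vector<int>> clusters = {{0, 1}, {2}, {3}};
        std::vector<std::vector<double>> score(3, std::vector<double>(3));
        refreshScores(score, clusters, 0);  // as if cluster 0 just merged
        std::cout << score[0][1] << " " << score[0][2] << "\n";
    }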

Conclusion
This project is ongoing, as can often be the case, but we've already produced a tool that is being used to do some science, and the second version is close to being deployed.  Our users are starting to feed requests back to us for additional functions they'd like, for example for displaying the results once an analysis is complete.  This is great, because it means we're being led by scientific needs.  And we'd like to think we're being useful, because our biologist-friends have actually already asked us about a possible version 3.0.  Busy, busy...
