Monday, 16 March 2009

The oddities of being a scientist-programmer

Photo by jurvetson

The scientist-programmer is in many ways a peculiar being.  Most of what you do each day is programming, yet the reason for your (professional) existence is to produce good science.  We think it's worth just taking a moment to think about this.

Lots of prototyping

The nature of science means you'll probably be doing a lot of what is effectively prototyping.  The goal of the scientist-programmer is often to figure out the best way to solve a particular problem, but then they often must move on to the next problem to be solved.  Perhaps you need a good set of scripts to process a given data-set, or perhaps you're trying out ideas for new statistical models.  You may even just be looking for a way to efficiently implement known methods on a particularly large data-set.  All prototyping!

Lots of data processing
Scientists handle data.  Often an awful lot of it.  This is even more true for the scientist-programmer, as quite often you will have become one specifically because you have so much data that you need a computer to help you process it.

Particular skills are needed to write code that can handle large volumes of data and/or do complicated things to them.  You may well have to be frugal with your memory and/or CPU resources, because (for example) an extra copy of the data will fill your computer's RAM, making your code grind to a halt.  As a result, you can end up having to think seriously about ways to optimise your code.
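
As a minimal sketch of what being frugal with memory can look like, the Python below computes a mean and variance in a single pass over a stream of values, so the full data-set never has to sit in RAM at once.  It uses Welford's online algorithm; the function name `running_mean_variance` is our own illustrative choice, not part of any library.

```python
from typing import Iterable, Tuple


def running_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) in one pass (Welford's algorithm).

    Only three numbers are kept in memory, however long the input is.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    # Sample variance is undefined for fewer than two values; report 0.0 then.
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance
```

Passing a generator (for example, one that reads a file line by line) rather than a list is what keeps the memory use flat.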

There may also be mathematical operations that you use a lot.  Inverting a 10^4 x 10^4 matrix requires a lot of computing resources, so you need to pick a good way to do it (hint: use LAPACK, or a similar library).  Or perhaps you're taking Fourier transforms and need to know that Fast Fourier Transform (FFT) algorithms are faster when the size of your data array is 2^n (where 'n' is an integer).  These are things with which the scientist-programmer must be intimately familiar.
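
To make the FFT point concrete, here is a small sketch that finds the next power of two and zero-pads a list of samples up to that length before it is handed to an FFT routine.  The helper names are hypothetical, and whether zero-padding is appropriate depends on your analysis, so treat this as an illustration rather than a recipe.

```python
from typing import List


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def zero_pad_to_power_of_two(samples: List[float]) -> List[float]:
    """Return a copy of samples, zero-padded up to a power-of-two length."""
    target = next_power_of_two(len(samples))
    return samples + [0.0] * (target - len(samples))
```

For example, 1000 samples would be padded to a length of 1024, while 1024 samples would be left at 1024.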

And of course, you must be extra careful about "10% error" type bugs in your numerical code.  There are many places for such bugs to hide, and consider this:  do you really want to be stood in front of 200 eminent scientists at a conference when the world-leader in your field spots that your graph must be off by 10%?  We thought not.
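
One way to catch this kind of bug is to run your numerical code on a problem whose answer you can work out by hand.  The sketch below is an assumption-laden example, not a general test strategy: it checks a simple trapezoid-rule integrator (our own illustrative function) against the exact result that the integral of sin(x) from 0 to pi is 2.

```python
import math
from typing import Callable


def trapezoid_integral(f: Callable[[float], float], a: float, b: float, n: int) -> float:
    """Approximate the integral of f over [a, b] with n equal-width trapezoids."""
    if n < 1:
        raise ValueError("n must be at least 1")
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # the two end points carry half weight
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def check_against_known_answer() -> None:
    """Compare the numerical result with an integral we can do by hand."""
    # The integral of sin(x) from 0 to pi is exactly 2.
    estimate = trapezoid_integral(math.sin, 0.0, math.pi, 1000)
    assert math.isclose(estimate, 2.0, rel_tol=1e-5), estimate
```

A result that was off by 10% (say, from a wrong step size or a dropped end point) would fail this check immediately, long before it reached a conference slide.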

"Just get the science done"
The primary tension in the life of the scientist-programmer is to "just get the science done".  What we mean by this is the conflict between getting the coding done as quickly as possible so that you can move onto finishing the science, versus spending that extra week (or however long) making sure that your code is neat, tidy, well-tested and generally a glorious triumph of software engineering.

This can be hard for a number of reasons.  Firstly, you may yourself be impatient to get on with the science as that's your ultimate aim.  Fair enough.  Just make sure your results will pass the above "conference test".  By contrast, you might be a bit of a coding perfectionist (we can certainly relate to that) and not want to stop improving the code until it's perfect.  We suggest that this is admirable up to a point, but that you need to know when your code's good enough.

Finally, there's the issue of managing your manager and/or collaborators.  How do you convince them that your code needs more work, despite the fact it seems to be producing results?  This is a tough one and we suspect that ultimately you have to both strike a balance and also communicate well (and regularly) with the people with whom you work.  Explain that another week of testing means you can write a section in the paper showing that the code (and hence all your results) is solid.  If you'll be using the code many times, explain how two days of well-judged optimisation will save many days of run-time at later dates, so all your science will progress more quickly.  If your code can really use the extra time, it should probably be possible to present a case that's compelling to anyone who's reasonably open-minded about it.

On this subject, beware of people making statements like "oh, it only needs *blah* doing to it", implying your task is small and should be quick to complete.  This is often hard to refute without resorting to "you just don't know what you're talking about!", which isn't very polite, even if it's true :-)  Try to see it from the other person's perspective, and remember that estimating the time required for any project is very hard to do, so it's not anyone's fault.  But if you think it will take longer, say so!

Legacy code
We've talked before about surviving scientific legacy code.  It's been our experience that such code can be pretty horrific to deal with, but quite often the author will have been highly expert at the implemented method (even if their coding skills were deeply average), so it can be the case that your best course of action is to grin and bear it.  Sorry.

If you're using someone else's legacy code, access to the original author is very helpful.  It might even be vital.  Sometimes you can save hours, days or even more effort by taking half an hour to sit down with the author and have them explain their thinking for a particular part of the code.  Sometimes what looks like a bug or inefficient piece of code is in fact clever and subtle, so much so that you've not spotted what it's doing.  Try not to destroy these, if you can avoid it!

Coming back six months later
You may well put a project down for six months, then come back to it.  Perhaps you've finally heard back from the referees on a paper and need to make some revisions.  Maybe you just got sidetracked onto another project.  Either way, the scientist-programmer can have many projects on the go at once, so it pays to prepare accordingly.  Primarily, this means always leaving your code in a state that makes it easy to pick up again after a break.  Good practice here is vital.  If your code is self-documenting, then you'll find it a lot easier to remember what it is you were doing six months ago when you last worked on the code in question.  This is a case where a bit of investment of time at an early stage can save a much larger amount of time when you come to re-start a project.
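
As a small illustration of what we mean by self-documenting, compare the two functions in the sketch below.  They compute the same thing, but only the second tells your future self what the number is, what units are involved and what input is acceptable.  The names are our own, chosen purely for the example.

```python
# Hard to pick up again in six months: what is x, and what is that number?
def f(x):
    return 299792458.0 / x


# Easier to pick up again: the names and docstring carry the explanation.
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # exact, by definition of the metre


def photon_wavelength_m(frequency_hz: float) -> float:
    """Return the vacuum wavelength in metres of light of the given frequency in hertz."""
    if frequency_hz <= 0:
        raise ValueError("frequency_hz must be positive")
    return SPEED_OF_LIGHT_M_PER_S / frequency_hz
```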

Publishing your results
Your results will (hopefully) get published in a scientific paper, so it's worth bearing in mind what you might need in order to do this.  Often, the plots, graphs, tables and/or speed-trials that you might want to include in a paper are also great tests to prove that your code works, so trying to build these in as you go is a good idea.  You do test your code, right?

Making software tools
The scientist-programmer is typically not paid to write code.  That's a by-product of producing science, which is what they're paid to do.  This is not to say, however, that taking the time to turn your prototyped ideas into proper software tools is pointless.  Quite the opposite.

Taking the extra time to produce really awesome (publicly-available) software tools can be a really great thing to do.  It helps the scientific community, because they can benefit from all your hard work in developing the method in the first place.  You can also gain a good level of kudos if you do a good job.  It's a Good Thing to be well-regarded in your field and this is one way to achieve that.  We also think that it's a real shame that some really clever ideas never make it past the stage of being scripts in someone's home directory, because they've written the paper and moved on.  Look at something like the R programming language to see how powerful it can be when a whole community of scientists contribute software packages.

Of course you need to take responsibility for your code.  Be a good maintainer.  If people find bugs in the code, fix them as promptly as you can and thank them for their input; they're only trying to be helpful and everyone benefits as a result.  It's also very useful if there's a published paper that people can refer back to.  And of course, the scientist-programmer is in the business of writing good scientific papers, so it's win-win.


Scientist-programmers are both scientists and programmers.  The best ones work very hard to be world-class at both disciplines.  This isn't easy, but modern science needs people who are expert at both.  There's a lot of enjoyment to be had from the combination, and there's a lot of great science that you can do as a result.


  1. That nearly all scientific programming is really prototyping is an excellent and crucial point. An implication of this that is usually missed is that scientific programming should almost always be done using a programming language suitable for prototyping--a language like Python, Ruby, or Lisp. Production languages like C, C++, and FORTRAN should generally be avoided like the plague, used only to write small extensions in cases where the utmost in speed or space efficiency is required.

  2. Don't forget verifying unreliable input sources. That means pretty much anything you didn't create yourself. Add consistency checks based on how you think the data should act: that will catch both errors in the data and errors in your understanding of the data. I've learned this the hard way:

  3. Totally agree with you Mike. I believe that functional programming languages can be of good use for scientific applications (mathematical expressiveness, numerical tower, automatic memory management etc.)

    I wonder what would be the opinion of scientists on languages like Haskell, Clean and Fortress...

  4. The other oddity that always strikes me is that "feature creep" is always a feature of scientific code, rather than a problem. If your code works well then you frequently solve the problem with it and then move on to the general case/slightly different model/new dataset. That (should) make people think more carefully about extensibility when writing their code, and I think some of the most painful codes we have to work with are the ones that have grown organically from a base that wasn't designed for it.

  6. I did so much programming for my PhD level work that when I got a chance to become a software developer I took it. It was a no-brainer since there were no jobs for physics PhDs in the mid-1990s.

    It sounds like things haven't improved very much in the scientific programming field in 15 years. I was struck by how bad - terrible - most scientists' code was. The Numerical Recipes family of books is probably the largest collection of bad code in existence.

    As an experimentalist I saw many experiments ruined by poor technical work in the lab (bad work on vacuum systems, electronics, etc.). There is lots of bad numerical / statistical work done by scientists too.

    In order to "get the science" done the technical work underlying it has to be done correctly.

    One PI that I worked for had a full-time staff of two C programmers to write software for data acquisition on his space flight missions. He made no provision for data analysis, however. He left it up to his graduate students (like me). He told me once that "You should be able to just press one key on your workstation and the result should be graphed on the printer."

    I responded that that was true, but only after several man months of programming and proving the code. "Why do you guys think this stuff takes so much time?" he asked. He was very knowledgeable about electronics for space flight and knew that it took man years to write good data acquisition software, but thought that data analysis software was written by elves in the middle of the night.

    This was partly driven by funding. He could get multi-million dollar grants to build instrumentation for data acquisition, but NASA and the other funding agencies wouldn't give a dollar for data analysis. All data analysis software had to be piggy-backed on the hardware funding.

  7. Beware the "this is just a prototype" mentality. Often today's proof of concept prototype becomes tomorrow's legacy system; once features have been added and multiple scientists have made their changes, the code becomes a jumbled mess of indecipherable code.

    I've found that ensuring that the scientists understand some basic software engineering principles (encapsulation, abstraction and generalization, avoiding copy-and-paste, revision control, use good variable names rather than "d3x", create recurring unit tests, etc.) and having them pair program with a software engineer now and then really improves the quality of the code.

    Once poorly designed code grows to a certain size, making incremental changes to push the science becomes time-consuming, so a little effort in code quality ultimately enables scientists to develop the science faster.

  8. I once heard a comment, "Is your dissertation on physics or Linux?" I have written an operating system for nonprofessional programmers called "LoseThos". The good news is I improved the situation for programmers, but the bad news is it's not standard, so your libraries won't port very easily. You might check out the website and offer advice on how it can better fill the niche of scientist programmers.

  9. Interesting post. There is a community forming between software engineers and computational scientists and engineers. We are having a workshop at the annual International Conference on Software Engineering. I would really be interested in participation from some of those that read this post. If you are interested, the website is

  10. Hi guys, thanks for all the feedback. It's fascinating to see the range of views there are out there! I agree with Jeffrey, there does very much seem to be a community forming between software engineers and computational scientists/engineers. I think previous generation/s of scientists were able to get away with just regarding a computer as a kind of giant, automated calculator (and act accordingly). It's not like that any more, because one can do *so* much more with a computer.