21st Century Scientist: Examples of how to choose a programming language

[Image: photo by mpov]

Following on from our articles on choosing a programming language (I and II) and answering a request from one of our readers, this post gives a couple of concrete examples of the process of choosing programming languages. We've taken them from our own experiences, to illustrate the sorts of considerations one might face in the real world.

Science example: a clustering algorithm

The project is to implement a new class of algorithm that will group together (cluster) items of data that are deemed to be similar, for example the gene expression levels measured in a set of plants subjected to different experimental conditions. It's distinctive in that it uses Bayesian inference to define what "similar" means, and it'll need to handle large (gigabyte-scale) scientific data-sets. The mathematics involved is complex and under active development (this is a key part of the scientific project).

For this project, we actually chose a combination of three languages for various parts of the project. This adds complications, both because the code author (Rich, in this case) needs to know all three languages, and also because of things like translating software between languages and/or linking together code written in two or more languages. However, as we'll see, in this instance there are good reasons for using three languages.

The first language is Matlab. Because the underlying mathematics (and hence the computer algorithm) will potentially change a lot as the project develops, we realised that we would want to prototype a lot of our ideas, to try them out, and we'd want to do this as rapidly as possible. Matlab is ideal for this because it's well-suited to the sorts of mathematical operations we're likely to need, in particular manipulations of matrices, and its extensive built-in library support and general syntax mean that it's a quick language in which to code. It's also useful that there's a good level of local support for Matlab (people to ask for help, etc.) in Rich's department. The main downside of Matlab here is that it will be slow compared to fully compiled languages such as C or Fortran, which will be a problem when analysing our large data-sets. It is also a commercial language, and hence costs money to run, but this isn't an issue here as Rich's department already has Matlab licences.

This leads us to our second language, C++. This is an extension of C that adds object-oriented features, so it is fast to run while also letting us use the nice programming features of object-orientation. Our plan is that once we've completed prototyping to our satisfaction, we will re-code the software in C++, giving us a much faster version for use with our gigabyte-scale scientific data-sets. C++ was chosen because it gives great control over low-level details such as memory use. It is a mature and stable language with many robust scientific libraries and tools (in this case, we use the Emacs text editor and the GCC compiler). C++ is also a popular language with many online and offline resources. While it's quite a lot of effort to re-code the whole program in C++, doing so can reduce the run-time from minutes to seconds, or weeks to days, depending on the type of algorithm, which makes a big difference when analysing a data-set. And again there are some local experts, in case Rich gets stuck.

Finally, there is a third language that we have a use for: R. This is a statistical programming language, and the reason it's important for us is that the bioinformatics community, who make up a major part of the potential users for this type of algorithm, do a lot of their data analysis in R. This also forms part of the reason for choosing C++, because C++ code can be linked to R and turned into a package, which is an R library that can be loaded into the R environment. Packages are very easy to use and also allow the user to apply R's other functions, for example its nice plotting routines, to manipulate the results. We would like people to use our software, so it's up to us to make it easy for them to do so.
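
As a rough sketch of what linking C++ to R can look like, the function below follows the calling convention of R's .C() interface, in which every argument arrives as a pointer and results are written into a buffer supplied by the caller. The name pairwise_sq_dist and the use of a squared Euclidean distance are hypothetical, chosen only for illustration; this is not the Bayesian similarity measure used in the actual project.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: squared Euclidean distance between every pair of items.
// extern "C" switches off C++ name mangling so that R can find the symbol by name.
extern "C" void pairwise_sq_dist(double* data, int* n_items, int* n_dims, double* out)
{
    assert(data != nullptr && n_items != nullptr && n_dims != nullptr && out != nullptr);

    const std::size_t n = static_cast<std::size_t>(*n_items);
    const std::size_t d = static_cast<std::size_t>(*n_dims);

    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < d; ++k) {
                // R stores matrices column by column, so the value for
                // item i in dimension k lives at index i + k * n.
                const double diff = data[i + k * n] - data[j + k * n];
                sum += diff * diff;
            }
            // The result is an n-by-n matrix, also stored column by column.
            out[i + j * n] = sum;
        }
    }
}
```

Once compiled into a shared library (for example with R CMD SHLIB) and loaded with dyn.load(), a function like this could be called from R along the lines of .C("pairwise_sq_dist", as.double(x), as.integer(nrow(x)), as.integer(ncol(x)), out = double(nrow(x)^2)). Wrapping that call in an ordinary R function is what lets package users work entirely in R.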

From beginning to end, we have Matlab to prototype our various methods, C++ for the production code and finally R as a good way of giving the bioinformatics community access to the algorithm.

An industry example: Writing video games

C++ is used in the games industry because of the great control over the hardware, especially the memory systems, that it grants the programmer. Having control over how and where memory is allocated is very important, as the size of memory on a typical modern games console is ~500 megabytes but the amount of assets (graphics, sounds, animations etc.) is in the order of tens of gigabytes. It is obviously impossible to keep all the data loaded all the time, but the more assets you can have in memory at any one time, the richer the user experience. Therefore, it is important to be able to fit as much in as possible to maximise the impact of the game.

C++ is also chosen because professional games programmers are, to one degree or another, familiar with it, which provides a larger pool of programmers to draw from. Most console-based computer games are developed on the Windows platform because of the availability of high quality development software. This software is commercial but its power and ease of use make it well worth the money.
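
To give a flavour of the memory control described above, here is a minimal sketch of a fixed-budget "arena" allocator: one block of memory is reserved up front and assets are handed slices of it, so the total can never exceed the budget. The class name AssetArena and its interface are hypothetical, and this is just one common pattern, not necessarily how any particular studio manages memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed-budget arena: all asset memory comes out of one block
// reserved at start-up, so allocations can never exceed the budget.
class AssetArena {
public:
    explicit AssetArena(std::size_t budget_bytes) : buffer_(budget_bytes), used_(0) {}

    // Returns `bytes` of storage aligned to `alignment` (a power of two),
    // or nullptr if the request does not fit in the remaining budget.
    void* allocate(std::size_t bytes, std::size_t alignment = alignof(std::max_align_t)) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);

        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(buffer_.data());
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        // Round the next free address up to the requested alignment.
        const std::uintptr_t aligned = (base + used_ + mask) & ~mask;
        const std::size_t offset = static_cast<std::size_t>(aligned - base);

        if (offset > buffer_.size() || bytes > buffer_.size() - offset) {
            return nullptr; // over budget: the caller must unload something first
        }
        used_ = offset + bytes;
        return buffer_.data() + offset;
    }

    // Releases everything at once, for example when a level is unloaded.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }
    std::size_t capacity() const { return buffer_.size(); }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};
```

Because the arena never asks the operating system for more memory after construction, the programmer decides exactly how much each kind of asset may use, and running out shows up as a nullptr that the game can handle rather than as an unexpected failure.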

As well as the main game code, games studios invest a lot of time in creating software to convert data into the correct format, to create content, to edit important game variables and to carry out testing. All of these programs are known as tools, and they provide the link between the content being created by the artists and designers and the code being produced by the programmers.

We chose C# for tools development because it has a similar syntax to C++ and is therefore easier to switch back and forth between. On Microsoft Windows, C# has a powerful and well integrated development environment (Visual Studio), which is also used to develop the main C++ game code. This means developers only need a single tool, which reduces costs and increases productivity as a developer doesn't need to learn two sets of tools. Most of the tools we make have graphical user interfaces (GUIs), and Visual Studio, when used with C#, has powerful (if slightly flaky!) tools for rapidly making GUIs, which reduces our development time and hence costs.

C# is one of several languages that make use of a large framework (.NET). This provides a lot of commonly used functionality such as lists, I/O (Input/Output), networking etc., which means the programmer can concentrate on the problem rather than having to re-invent the wheel. C# is also a popular language that has a large community to draw help and experience from.

So in this case, C++ is chosen for its control of the hardware and its large pool of programmers, and C# for its rich functionality, powerful development environment and ease of use by C++ programmers.

Monday, 1 September 2008
