R is a free programming language designed to be good at statistics and graphics. It has lots of built-in functionality for maths, statistics and graphics operations. It also has an active community that develops new add-on packages (libraries) and is a good place to ask for help if you're really stuck with something. The R community is also a great way to get your software 'out there', if you feel you've developed something that people would find useful. R can be a lot slower to run than compiled languages like C and Fortran, except when using certain built-in library functions, but if you're not CPU-limited then it's a language with a lot to offer.
Statistics and data modelling
Statistical computing is where R shines. There are a large number of built-in functions, as well as libraries (called packages, in R parlance) that implement many of the standard statistical models that you might want to use. There is also a lot of functionality that supports the use of these statistical techniques.
For example, if you wanted to cluster some data and display the results as a dendrogram, you could simply do the following:
clusteringResults <- hclust(dist(data))
plot(clusteringResults)
This code takes your input 'data', computes the pairwise distances between its rows (dist), applies hierarchical clustering with the default settings to those distances (hclust) and stores the results in the 'clusteringResults' object. That object has a plotting method, called here via plot, which draws the dendrogram.
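As a slightly fuller sketch of the same idea, the example below uses USArrests, a data set that ships with R, as a stand-in for your own data (that choice, the scaling step and the number of groups are our assumptions, not requirements). Note that hclust works on a set of pairwise distances, which dist computes from the data.

```r
# USArrests ships with R; it stands in here for your own numeric data
scaledData <- scale(USArrests)            # put the columns on a comparable scale
distances <- dist(scaledData)             # pairwise Euclidean distances between rows

clusteringResults <- hclust(distances)    # default linkage method is "complete"
plot(clusteringResults)                   # draws the dendrogram

# Cut the tree into 4 groups (4 is an arbitrary choice for illustration)
groups <- cutree(clusteringResults, k = 4)
table(groups)                             # how many rows ended up in each group
```

Other linkage methods can be selected with the method argument of hclust, for example method = "average".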
An added advantage of R in this regard is that people are able to contribute packages to R, which can be optionally installed by the user. This means that the R community are continuously adding new statistical methods to the functionality of R.
Avoid FOR loops!
FOR loops are slow to execute in R. This is because R is an interpreted language, so loops don't benefit from the heavily optimised compilation that languages such as C, C++ and FORTRAN enjoy. This is less of a problem than you might think, but sometimes your software simply needs a big loop that will end up being noticeably slow. R does provide vectorised versions of many of its built-in functions, which removes the need for an explicit loop much of the time (for example, when multiplying all elements in an array by a common factor). It also has provision for linking to code written in languages like C and FORTRAN, which means that if there is one critical bottleneck in your code, you can consider re-coding that part in a much faster language.
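To make the vectorisation point concrete, here is a minimal sketch that multiplies every element of an array by a common factor, first with an explicit loop and then with a single vectorised expression. The variable names and the array size are ours, chosen only for illustration; the timings will depend on your machine and R version.

```r
values <- as.numeric(1:100000)
commonFactor <- 2.5

# Loop version: each element is handled one at a time by the interpreter
scaleWithLoop <- function(x, factor) {
  result <- numeric(length(x))            # pre-allocate the output
  for (i in seq_along(x)) {
    result[i] <- x[i] * factor
  }
  result
}

# Vectorised version: the whole array is handled in one call
scaleVectorised <- function(x, factor) {
  x * factor
}

scaledByLoop <- scaleWithLoop(values, commonFactor)
scaledByVector <- scaleVectorised(values, commonFactor)
identical(scaledByLoop, scaledByVector)   # both approaches give the same answer

# Compare the run times on your own machine
system.time(scaleWithLoop(values, commonFactor))
system.time(scaleVectorised(values, commonFactor))
```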
Plotting and graphics
The R project website has this to say about graphics in R.
"One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control."
We think this is a very reasonable statement to make. When you're producing analysis results, you just want to be able to type a one-line command and have the right plot appear on-screen. This is very often possible using R. This is convenient because it means you can keep focusing on the analysis problem rather than having to stop and begin thinking about how to write graphics code. It's also very nice when things just work!
This also means that R makes a decent graphics package for producing scientific plots, even if your results were generated outside R. Write them to a file, read them into R and away you go!
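The write-to-file, read-into-R workflow might look like the sketch below. The file is a temporary CSV that we create ourselves so the example is self-contained; the column names 'time' and 'signal' and the values in them are placeholders for whatever your other software produced.

```r
# Stand-in for a results file written by some other program
resultsFile <- tempfile(fileext = ".csv")
write.csv(data.frame(time = 0:10, signal = (0:10)^2),
          resultsFile, row.names = FALSE)

# Read the results into R
results <- read.csv(resultsFile)

# Plot straight to a PDF file, including a mathematical label
plotFile <- tempfile(fileext = ".pdf")
pdf(plotFile, width = 6, height = 4)
plot(results$time, results$signal, type = "b",
     xlab = "Time", ylab = expression(signal == time^2),
     main = "Results generated outside R")
dev.off()                                 # close the file so it is written to disk
```

Leaving out the pdf and dev.off calls sends the same plot to the screen instead.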
Use the libraries/packages
A real strength of R is the package system, because it allows users of R to contribute their own optional libraries to the R project and makes these packages easy for other users to install and use. This makes R a collaborative project that is constantly adding implementations of new statistical techniques. There are two huge advantages to this. The first is that you can often save a big chunk of time on your project simply because someone has already implemented the technique you require, so all you have to do is install their package. The second is that the package contributors are often experts in the relevant field and are often implementing a technique that has only recently been developed. This means that R is a great way to gain access to implementations of techniques from the cutting edge of statistics research.
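Installing and using a package typically takes only a couple of lines. The sketch below uses MASS purely as an example (it is usually bundled with R already, so the install step is normally skipped); the robust regression and the stackloss data set are likewise just our illustration.

```r
packageName <- "MASS"                     # example package, chosen for illustration

# Download and install from CRAN only if the package is not already present
if (!requireNamespace(packageName, quietly = TRUE)) {
  install.packages(packageName)
}

library(MASS)                             # load the package for this session

# Use a function the package provides: robust linear regression
# on stackloss, a data set that ships with R
robustFit <- rlm(stack.loss ~ ., data = stackloss)
coef(robustFit)
```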
Building your own packages
And of course you can also build your own R packages. This can be a very convenient way to distribute new software to your collaborators, as the package system includes a pretty good installation system. And you can also use this as a way to release your code to the wider community. Our experience is that building packages can be a bit tricky if you want to include compiled code, but once you get used to the wrinkles then it's certainly very useful to be able to do this!
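One way to get started is the package.skeleton function that comes with R, which writes out the basic directory layout of a package for you. This is only a sketch of the first step: the function 'scaleToUnit' and the package name 'demoPackage' are hypothetical, and the skeleton is written to a temporary directory so nothing in your working directory is touched.

```r
# A hypothetical function that we want to distribute,
# kept in its own environment so only it goes into the package
packageContents <- new.env()
packageContents$scaleToUnit <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Write the package skeleton into a temporary directory
packageParent <- tempdir()
package.skeleton(name = "demoPackage",
                 list = "scaleToUnit",
                 environment = packageContents,
                 path = packageParent,
                 force = TRUE)            # overwrite if the example is re-run

# Inspect what was generated (DESCRIPTION, NAMESPACE, R/, man/, ...)
packageDir <- file.path(packageParent, "demoPackage")
list.files(packageDir)
```

The generated DESCRIPTION and help files still need editing by hand, after which the package can be built and installed from the command line with R CMD build and R CMD INSTALL.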
Comprehensive R Archive Network (CRAN)
R is supported by CRAN, a website where you can download R, gain access to the online manuals and FAQs, download packages and generally find out information about R. This site is very good and will generally be your first stop for R-related information.
The Bioconductor project is a very powerful set of R packages for bioinformatics work. It includes things like data reduction tools for gene expression data. It also has a very active community, so it's well supported if you have a question you need answering.
Quirks of the language
Quirks in a language aren't a bad thing - every language has them. You just need to be aware of them, so that they don't slow you down or catch you out. Here's a quick list of some of R's quirks.
- 'Period' (.) is just a character. In some languages (e.g. C++), writing myObject.SomeMethod will access the method called 'SomeMethod' associated with the object 'myObject'. This doesn't have to be true in R (although people do use this kind of form), where myObject.SomeMethod is literally just a variable name that happens to include a period as one of its characters. It's perfectly valid in R to have variable names like 'data.input' and 'data.processed'.
- Array indexing starts at 1, but myArray[0] doesn't give an error! This is one to be very careful about. R will allow you to access out-of-range array elements (including element zero, which doesn't exist) and will return a zero-length array of the appropriate type (for element zero) or 'NA' if the subscript is larger than the length of the array.
- Array elements can be added by just writing myArray[5] = 1 (assuming myArray already exists). But you must be careful with this. For myArray[5] = 1, any elements in the range 1-4 that didn't exist will be created and assigned a value of 'NA'.
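The quirks listed above can be seen in a short session like the one below. The variable names and values are ours, chosen only for illustration.

```r
# A period is just part of the name: these are two unrelated variables
data.input <- c(3, 1, 2)
data.processed <- sort(data.input)

myArray <- c(10, 20, 30)

# Element zero doesn't exist, but asking for it is not an error
zeroElement <- myArray[0]
length(zeroElement)                       # 0: a zero-length numeric array

# Reading past the end of the array is not an error either
beyondEnd <- myArray[5]
is.na(beyondEnd)                          # TRUE

# Assigning past the end grows the array and fills the gap with NA
myArray[5] <- 1
myArray                                   # 10 20 30 NA 1
```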
R is a great language for getting statistics, data modelling and graphics done. It's fast to code in, plus it has a lot of functionality built-in and a large (and growing) number of optional packages. It will generally be slower to run than fully compiled languages like C or FORTRAN, but for many applications this is fine and, if it is a problem, there are work-arounds that may help. R is free and is supported by an active and helpful community of users, which is great if you've got a question or are stuck.