Monday 8 December 2008

The basics of...R

Photo by lizjones112

R is a free programming language designed for statistics and graphics.  It has lots of built-in functionality for maths, statistics and graphics operations.  It also has an active community that develops new add-on packages (libraries) and is a good place to ask for help if you're really stuck with something.  The R community is also a great way to get your software 'out there', if you feel you've developed something that people would find useful.  R can be a lot slower to run than compiled languages like C and Fortran, except when using certain built-in library functions, but if you're not CPU-limited then it's a language with a lot to offer.

Statistics and data modelling
Statistical computing is where R shines.  There are a large number of built-in functions, as well as libraries (called packages, in R parlance) that implement many of the standard statistical models that you might want to use.  There is also a lot of functionality that supports the use of these statistical techniques.


For example, if you wanted to cluster some data and display the results as a dendrogram, you could simply do the following:

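# hclust() works on a distance matrix, so compute pairwise distances with dist() first
clusteringResults <- hclust(dist(data))
plot(clusteringResults)


This code takes your input 'data', computes the pairwise distances between its rows (dist) and applies hierarchical clustering with the default settings (hclust).  The results are stored in the 'clusteringResults' object, which has a plotting method that can be called to make the dendrogram plot.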
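For a version that runs as-is, here is a minimal sketch using USArrests, a small data set that ships with R.  The choice of data set, the column scaling and the number of groups are our own assumptions for illustration, not part of the original example.

```r
# USArrests is a built-in data set (50 US states, 4 numeric columns).
# Scaling the columns first is an illustrative choice, not a requirement.
scaledData <- scale(USArrests)

# Pairwise distances between rows, then hierarchical clustering
# with hclust()'s default settings.
clusteringResults <- hclust(dist(scaledData))

# Draw the dendrogram.
plot(clusteringResults)

# Cut the tree into 4 groups (4 is an arbitrary number chosen for illustration).
groups <- cutree(clusteringResults, k = 4)
table(groups)
```

The cutree() call is optional; it simply shows that the clustering object can be reused after plotting.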

An added advantage of R in this regard is that people are able to contribute packages to R, which can be optionally installed by the user.  This means that the R community are continuously adding new statistical methods to the functionality of R.

Avoid FOR loops!

FOR loops are slow to execute in R.  This is because R is an interpreted language, so loops don't benefit from the heavily optimised compilation of languages such as C, C++ and FORTRAN.  This is less of a problem than you might think, but sometimes your software simply needs a big loop that will end up being noticeably slow.  R does provide vectorised versions of many of its built-in functions, which can redress this problem a lot of the time (for example, multiplying all elements in an array by a common factor).  It also has provision for linking to code written in languages like C and FORTRAN, which means that if there is one critical bottleneck in your code, you can consider re-coding that part in a much faster language.
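As a rough sketch of the difference, the snippet below multiplies every element of a vector by a common factor, first with an explicit loop and then with the vectorised form.  The variable names and the vector length are made up for illustration, and the timings you see will depend on your machine and R version.

```r
values <- as.numeric(1:100000)   # illustrative input vector
scaleFactor <- 2.5               # illustrative common factor

# Explicit loop: one interpreted step per element.
scaledLoop <- numeric(length(values))
loopTime <- system.time(
  for (i in seq_along(values)) {
    scaledLoop[i] <- values[i] * scaleFactor
  }
)

# Vectorised form: a single call that operates on the whole vector.
vectorTime <- system.time(
  scaledVector <- values * scaleFactor
)

# Both approaches give the same answer.
stopifnot(isTRUE(all.equal(scaledLoop, scaledVector)))

print(loopTime)
print(vectorTime)
```

The vectorised line is also shorter and easier to read, which is a good reason to prefer it even when speed isn't an issue.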


Plotting and graphics
The R project website has this to say about graphics in R.


"One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control."

We think this is a very reasonable statement to make.  When you're producing analysis results, you just want to be able to type a one-line command and have the right plot appear on-screen.  This is very often possible using R.  This is convenient because it means you can keep focusing on the analysis problem rather than having to stop and begin thinking about how to write graphics code.  It's also very nice when things just work!

This also means that R makes a decent graphics package for producing scientific plots, even if your results were generated outside R.  Write them to a file, read them into R and away you go!
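A minimal sketch of that workflow might look like the following.  The file, the column names 'time' and 'signal' and the numbers in it are all invented for illustration; here we write the file from R only so that the example is self-contained, whereas in practice it would come from your other software.

```r
# Stand-in for a results file produced outside R (hypothetical contents).
resultsFile <- tempfile(fileext = ".csv")
write.csv(data.frame(time = 1:10, signal = (1:10)^2),
          resultsFile, row.names = FALSE)

# Read the results into a data frame.
results <- read.csv(resultsFile)

# One plotting command, with labelled axes and a title.
plot(results$time, results$signal, type = "b",
     xlab = "Time", ylab = "Signal",
     main = "Results generated outside R")
```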

Use the libraries/packages
A real strength of R is the package system, because it allows users of R to contribute their own optional libraries to the R project and makes these packages easy for other users to install and use.  This makes R a collaborative project that is constantly adding implementations of new statistical techniques.  There are two huge advantages to this.  The first is that you can often save a big chunk of time on your project simply because someone has already implemented the technique you require, so all you have to do is install their package.  The second is that the package contributors are often experts in the relevant field and are often implementing a technique that has only recently been developed.  This means that R is a great way to gain access to implementations of techniques from the cutting edge of statistics research.
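To give a feel for how little work this takes, here is a small sketch of installing (if needed) and using a package.  We picked MASS and its lda() function purely as an example, and the CRAN mirror address is just one possible choice; substitute whichever package implements the technique you need.

```r
pkg <- "MASS"   # example package; usually already present in a standard R installation

# Install from CRAN only if the package isn't available yet (needs network access).
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Load the package so that its functions can be called directly.
library(MASS)

# Use a function from the package: linear discriminant analysis
# on the built-in iris data set.
fit <- lda(Species ~ ., data = iris)
print(fit)
```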

Building your own packages

And of course you can also build your own R packages.  This can be a very convenient way to distribute new software to your collaborators, as the package system includes a pretty good installation system.  And you can also use this as a way to release your code to the wider community.  Our experience is that building packages can be a bit tricky if you want to include compiled code, but once you get used to the wrinkles then it's certainly very useful to be able to do this!
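One way to get started is R's package.skeleton() function, which generates the directory layout for a new package from objects you have already defined.  The sketch below is only a starting point: the function 'addTwo' and the package name 'myFirstPackage' are hypothetical, and the skeleton is written to a temporary directory so that nothing in your working directory is touched.

```r
# A toy function to put in the package (hypothetical example).
addTwo <- function(x) {
  x + 2
}

# Create the package skeleton (DESCRIPTION, NAMESPACE, R/ and man/ folders).
package.skeleton(name = "myFirstPackage",
                 list = "addTwo",
                 path = tempdir(),
                 force = TRUE)

# Inspect what was generated.
packageDir <- file.path(tempdir(), "myFirstPackage")
print(list.files(packageDir, recursive = TRUE))
```

After filling in the generated DESCRIPTION and help files, the package can be built and installed from the command line with 'R CMD build' and 'R CMD INSTALL'.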

Comprehensive R Archive Network (CRAN)
R is supported by CRAN, a website where you can download R, gain access to the online manuals and FAQs, download packages and generally find out information about R.  This site is very good and will generally be your first stop for R-related information.


Bioconductor
The Bioconductor project is a very powerful set of R packages for bioinformatics work.  It includes things like data reduction tools for gene expression data.  It also has a very active community, so it's well supported if you have a question you need answering.


Quirks of the language
Quirks in a language aren't a bad thing - every language has them.  You just need to be aware of them, so that they don't slow you down or catch you out.  Here's a quick list of some of R's quirks.


  • 'Period' (.) is just a character.  In some languages (e.g. C++), writing myObject.SomeMethod will access the method called 'SomeMethod' associated with the object 'myObject'.  This doesn't have to be true in R (although people do use this kind of form), where myObject.SomeMethod is literally just a variable name that happens to include a period as one of its characters.  It's perfectly valid in R to have variable names like 'data.input' and 'data.processed'.

  • Array indexing starts at 1, but myArray[0] doesn't give an error! This is one to be very careful about.  R will allow you to access out-of-range array elements (including element zero, which doesn't exist).  It returns a zero-length array of the appropriate type for element zero, or 'NA' if the subscript is larger than the size of the array.

  • Array elements can be added by just writing myArray[5] = 1 (assuming myArray already exists). But you must be careful with this.  For myArray[5] = 1, any elements in the range 1-4 that didn't exist will be created and assigned a value of 'NA'.
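The three quirks above can be tried out with a few lines of R.  This is just a sketch; the variable names and values are made up for illustration.

```r
# 1. A period is just another character in a name.
data.input <- c(3, 1, 4)
data.processed <- sort(data.input)

# 2. Indexing starts at 1, and out-of-range subscripts do not raise an error.
myArray <- c(10, 20, 30)
zeroElement <- myArray[0]   # zero-length numeric vector
tooHigh <- myArray[7]       # NA
print(length(zeroElement))  # 0
print(is.na(tooHigh))       # TRUE

# 3. Assigning beyond the current end pads the gap with NA.
myArray[6] = 1
print(myArray)              # 10 20 30 NA NA  1
```

Checking length() before indexing, or testing results with is.na(), is a cheap way to catch these cases early.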



In conclusion

R is a great language for getting statistics, data modelling and graphics done.  It's fast to code in, plus it has a lot of functionality built-in and a large (and growing) number of optional packages.  It will generally be slower to run than fully compiled languages like C or FORTRAN, but for many applications this is fine and, if it is a problem, there are work-arounds that may help.  R is free and is supported by an active and helpful community of users, which is great if you've got a question or are stuck.

10 comments:

  1. Thanks for this, I am about to dive into R as a new graduate student, this provided some good perspective.

  2. Glad you found it useful. Please get back to us on how you get along and what you think we should add to the article.

  3. Thank you for the article.
    I am also a newbie to R and I should start studying it already.
    I didn't know of this 'for' loops problem. Are there any testing frameworks for R programs?

  4. There may well be, but I've not come across any (please let us know if you find some!).

    The FOR loops issue can be a real killer, but can be worked around (sometimes thinking in vector operations is a bit...challenging :-) ). I know from experience that Matlab and IDL suffer from exactly the same issue, which is why they also provide vectorised operations. That said, I have found all three languages extremely useful!

  5. [...] R. A statistical programming language.  It has loads of statistical libraries that are either built-in or downloadable.  R is open source and has a community that develops new code for it; for example, the Bioconductor toolbox for analysing gene expression data is very widely used.  It can be slow to run big jobs, unless you use the built-in functions (written in C and hence v. fast) or attach your own C or C++ code to speed up the critical bottlenecks.  Interfacing C/C++ to R in this way take a bit of care, but can be very powerful (See Basics of … R). [...]

  6. [...] Ben wrote an interesting post today onHere’s a quick excerpt [...]

  7. Nice article. Just two comments, both related to your "‘Period’ (.) is just a character." section:
    - as far as I know, it is not a convention to use data.input, data.processed, ...
    - the remark about object oriented programming is not exactly correct. Have a look at the R Language Definition, in particular Section 5.
    http://cran.r-project.org/doc/manuals/R-lang.html

  8. Hi Roland,

    Thanks for your comments! To address them:

    - yes, you're right. What we meant to say was that people often use variable/function names that look like "data.input".

    - Ah yes. What we meant was that if you encounter (in R) something called myObject.someMethod, it *could* be something other than the method "someMethod" associated with the object "myObject". The language definition doesn't stop you using confusing naming conventions for unrelated variables, functions etc.


    Thanks again for the feedback! Tweaks now applied to the article :-)

  9. Hi,

    A few notes for you and your readers:

    (1) I've (relatively) recently stumbled on this handy PDF called "The R Inferno", by Patrick Burns. It highlights many/some of the idiosyncrasies that might trip you up as you get accustomed to R (especially if you have experience w/ other programming languages):

    http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

    It's very helpful, and really an enjoyable read at the same time. Just take a look at its abstract: If you are using R and you think you're in hell, this is a map for you. :-)

    He has other tutorials on his site which might be worth checking out as well.

    (2) @gioby: If, by testing frameworks, you mean unit testing and the like, check out:

    * RUnit: http://cran.r-project.org/web/packages/RUnit/index.html
    * Some hints on debugging with R : http://www.stats.uwo.ca/faculty/murdoch/software/debuggingR/

  10. Thanks to Heather for leaving the following useful info on multi-threading in R, at one of our other posts:

    http://epub.ub.uni-muenchen.de/8991/1/parallelR_techRep.pdf

    (http://www.programming4scientists.com/2009/02/the-basics-ofidl/)
