Monday, 23 March 2009

The basics of ... Java

Java is an object-oriented language developed by Sun Microsystems. It was designed from scratch to be simple, secure and platform independent. It has a large and active community and recently most of the code has been made open source. This post covers the basics of Java as well as its suitability for use in scientific programming.

Java was developed internally by Sun Microsystems during the early 1990s, with version 1.0 being released to the public in 1995. The basic philosophy is 'write once, run anywhere', meaning that Java code written on one system can be run, without modification or even recompilation, on any other system that also supports Java. This portability is achieved by the compiler converting the program into bytecode, which is then interpreted by the Java Virtual Machine (JVM). Each platform has its own implementation of the JVM.

Since 1995 Java has been extended and expanded on numerous occasions. The evolution of Java is controlled by Sun but is partially led by users through the Java Community Process. Between 2006 and 2007 Sun released much of the core into the open source community under the terms of the GNU General Public License.

Java is an object-oriented language with a syntax that is similar to C/C++. It provides a fairly high degree of security through the use of sandboxed execution, and the JVM provides a way to catch, and potentially recover from, almost any kind of error. Java is a much purer object-oriented language than C++ as everything must be an object. This makes the code easier to compile but can make smaller programs more verbose, as extra code is needed to write procedural-style code (static methods and so on).

Managed memory - All memory is managed by the garbage collection system and there is no concept of pointers, as this would break the security model and make automatic memory management far, far harder. The upside is that, in most cases, it is impossible to leak memory. Leaks are a common mistake for programmers of practically all levels and make programs less stable. In programs that use a lot of memory or take a long time to run, a memory leak can cause a crash, losing data and costing time spent debugging and re-running the program.

Security and stability - All code is run inside the JVM, which controls what the code can and can't do to the underlying system. This 'sandbox' approach makes it very hard for the code to have unintended consequences, especially as the managed memory system prevents direct access to memory. The JVM can also catch almost all kinds of errors that would cause the program to crash in other languages. Java's extensive exception handling means that these errors can be dealt with in the code: the program can either recover from them or save its data before it terminates. It is therefore easier to write programs that make sure no data is lost if they encounter the unexpected.
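As a rough illustration of the exception handling described above, here is a minimal sketch. The class name SafeRun and its methods are made up for this example. It parses a list of readings and, when one of them is malformed, catches the error and keeps the results computed so far instead of crashing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of any library.
class SafeRun {
    // Parses each reading in turn; Double.parseDouble throws
    // NumberFormatException as soon as it meets malformed input.
    static void process(String[] raw, List<Double> results) {
        for (String s : raw) {
            results.add(Double.parseDouble(s));
        }
    }

    // Runs the processing step and returns whatever was computed
    // before any failure, so partial results are not lost.
    static List<Double> runAndKeepPartial(String[] raw) {
        List<Double> results = new ArrayList<>();
        try {
            process(raw, results);
        } catch (NumberFormatException e) {
            System.err.println("Bad reading, keeping " + results.size()
                    + " results: " + e.getMessage());
        }
        return results;
    }

    public static void main(String[] args) {
        String[] readings = {"1.5", "2.5", "oops", "4.0"};
        System.out.println(runAndKeepPartial(readings)); // prints [1.5, 2.5]
    }
}
```

In a real program the catch block would more likely write the partial results to disk before exiting; printing them keeps the sketch short.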

Portability - 'Write once, run anywhere' is the cornerstone of Java. Sun and the Java community have put a lot of work into making sure that programs run on different platforms will give the same result. For scientific programmers the biggest headaches of running code on different platforms come from floating point numbers. While there is a standard for floating point numbers (IEEE-754), different platforms may or may not implement it fully and may use different numbers of bits of precision. In Java there is a keyword (strictfp) that forces all platforms to behave the same, but this does come with some extra overhead.
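For reference, this is roughly how the strictfp keyword is applied. The class and method names are invented for this sketch. Note that from JDK 17 onwards strict evaluation is always on, so the keyword is redundant there and the compiler may print a warning about it.

```java
// Hypothetical example class showing where the strictfp modifier goes.
class StrictSum {
    // strictfp requests strict IEEE-754 evaluation for every floating
    // point expression inside this method, on every platform.
    static strictfp double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        System.out.println(mean(new double[] {1.0, 2.0, 3.0, 6.0})); // prints 3.0
    }
}
```

The modifier can also be placed on a class, in which case it applies to all of the methods in that class.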

Large standard library - Compared to C++, Java has a truly massive standard library, making the development of programs easier as almost anything you could need is already included. For example, the Java standard library includes code to handle XML, whereas in C++ you would need a third party library. That library might be buggy or might not compile on different platforms!
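To give a feel for the XML support mentioned above, here is a small sketch that uses only classes shipped with the JDK (javax.xml.parsers and org.w3c.dom). The class name XmlDemo and the sample document are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical example class; only standard library calls are used.
class XmlDemo {
    // Parses an XML string and counts the elements with the given tag name.
    static int countElements(String xml, String tag) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName(tag).getLength();
        } catch (Exception e) {
            // Wrap the checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<run><sample id=\"a\">1.2</sample>"
                + "<sample id=\"b\">3.4</sample></run>";
        System.out.println(countElements(xml, "sample")); // prints 2
    }
}
```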

Easy, simple testing - Everything you create in Java is a class and each public top-level class must live in its own file. If you give each class a public 'main' method then each class can be executed as a small standalone program. This trick means that it is very easy to create simple tests for each of your classes. Obviously, this lacks the power of a full unit testing suite (such as JUnit) but is a quick and easy way to add testing to your Java code.
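The 'main' method trick might look something like the sketch below. Vector2 is a made-up class for this example; running 'java Vector2' executes its self-test.

```java
// Hypothetical example class with a built-in self-test.
class Vector2 {
    final double x;
    final double y;

    Vector2(double x, double y) {
        this.x = x;
        this.y = y;
    }

    // Euclidean length of the vector.
    double length() {
        return Math.sqrt(x * x + y * y);
    }

    // Quick self-test: fails loudly if length() gives the wrong answer.
    public static void main(String[] args) {
        Vector2 v = new Vector2(3.0, 4.0);
        if (v.length() != 5.0) {
            throw new AssertionError("length() returned " + v.length());
        }
        System.out.println("Vector2 OK");
    }
}
```

The check uses an explicit throw rather than Java's assert statement, because assert is disabled unless the JVM is started with the -ea flag.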

Managed memory - The downside to managed memory is that you don't have as much control over what goes where. With no pointers there is no way to manipulate memory directly, which can make algorithms slower and makes it far harder to port algorithms from C/C++. The garbage collector also takes time to do its work and you can't always control when a garbage collection cycle will occur. If you are writing code that creates and discards lots of objects then the cost of the garbage collector may become noticeable. Managed memory also means that arrays take up more space.

No support for unsigned data types - Java doesn't support unsigned data types (possibly by accident and possibly because it makes life easier for the compiler). This makes receiving data from other sources a pain and means that more memory must be used to store values (i.e. you need to use a short rather than a byte to store the range 0-255). For scientific code this extra memory cost could be a problem, depending upon the amount of data being manipulated.
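To show what this looks like in practice, here is a small sketch (the class name UnsignedBytes is invented). It also shows one common workaround that is not covered above: keep the value in a byte and mask it with 0xFF whenever it is read, which widens it to an int in the range 0-255.

```java
// Hypothetical example class illustrating signed bytes in Java.
class UnsignedBytes {
    // Reinterprets the 8 bits of a signed byte as an unsigned value (0-255).
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte raw = (byte) 200;               // does not fit in a signed byte
        System.out.println(raw);             // prints -56
        System.out.println(toUnsigned(raw)); // prints 200
    }
}
```

The masking approach saves storage but means every read has to remember the conversion, which is the kind of pain the paragraph above refers to.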

No operator overloading - Many languages allow the programmer to redefine, either partially or wholly, the definition of operators (+, -, etc.). This can be extremely useful as it allows the programmer to use operators on new datatypes in a way that makes sense. For scientists the classic example is complex numbers. Most languages don't support complex numbers as intrinsic data types, but with operator overloading they can be used and the resulting code is clear to read. Java doesn't allow operator overloading, so the programmer must use methods, which makes the code more verbose and harder to read and maintain.
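The complex number case can be sketched as follows. This Complex class is a minimal illustration written for this post, not a class from the standard library; the method names plus and times are my own choice.

```java
// Hypothetical minimal complex number class.
final class Complex {
    final double re;
    final double im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    // Stands in for the + operator.
    Complex plus(Complex other) {
        return new Complex(re + other.re, im + other.im);
    }

    // Stands in for the * operator.
    Complex times(Complex other) {
        return new Complex(re * other.re - im * other.im,
                           re * other.im + im * other.re);
    }

    public static void main(String[] args) {
        Complex a = new Complex(1.0, 2.0);
        Complex b = new Complex(3.0, 4.0);
        // With operator overloading this could read: a * b + a
        Complex r = a.times(b).plus(a);
        System.out.println(r.re + " + " + r.im + "i"); // prints -4.0 + 12.0i
    }
}
```

For a two-term expression the method calls are still readable, but longer formulas quickly become harder to check against the maths they came from.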

'Pure' OO paradigm - This makes it harder for people to move from Fortran or C. The object-oriented paradigm can be hard to get your head around if you are from a purely procedural background. It can seem unnecessarily complex, and a lot more boilerplate code is needed to get simple programs up and running.
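As a small illustration of that boilerplate, here is a sketch of a purely procedural calculation in Java (the class and method names are invented). A single function still has to be wrapped in a class and declared static before it can be called from 'main'.

```java
// Hypothetical example: a class that exists only to hold one function.
class MeanOfSquares {
    // A plain function in Fortran or C; a static method in Java.
    static double meanOfSquares(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x * x;
        }
        return sum / xs.length;
    }

    public static void main(String[] args) {
        System.out.println(meanOfSquares(new double[] {3.0, 4.0})); // prints 12.5
    }
}
```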

Speed? - Java suffers from the impression that it is a slow language, and in the early versions the cost of running interpreted code was obvious. However, modern virtual machines have become much faster with the use of Just In Time (JIT) and HotSpot compilers. JIT compiling takes the bytecode and compiles it into platform-specific machine code, removing much of the overhead of the virtual machine. HotSpot compiling analyses the code as it is running and recompiles 'hot spots' to be more efficient. Scientific code generally consists of lots of loops over data and will therefore see a lot of benefit from JIT and HotSpot compilers. Is Java really 'slow'? No. Your code may take longer to run under Java than under C++, but then again, it might not!

Java is a popular, mature language that is highly portable and features a large standard library and automatic memory handling. On the downside, it's an interpreted language with no support for unsigned types, and the automatic memory management prevents fine control over memory use and makes porting of algorithms harder. Whether Java is a good choice for your scientific programming will depend upon which of these pros and cons matters most to you.


  1. Very nice article, thanks.
    If I may add, you could cite something about the state of libraries for bioinformatics in Java, e.g. BioJava, libraries for statistics and plotting.

    I have considered learning Java at some point, but the fact is that it seems harder to learn, with no immediate advantages compared, for example, to Python.
    I like the Python syntax for object-oriented programming a lot, since it is easy, has good support for tests and documentation, and is very easy to read.

    I don't know if Java is better for this than Python, but it seems too verbose a language.

  2. On "means that more memory must be used to store values", I think Java allocates the same size of memory (4 bytes) when you use a boolean, byte, short, or int. It just restricts what the range is.

  3. Great article series, nicely done. Minor nitpick though..

    "Java is a much purer object-oriented language than C++ as everything must be an object."

    The second half of this sentence is a bit misleading, as Java does have 'primitives' which are not objects. This is in contrast to other OO languages, like Ruby for example, where everything truly is an object.

  4. [...] to basic Java introduction for scientist, go no further than Programming for Scientists blog. This blog provides much more than this basic introduction. It also has some interesting [...]

  5. @P Warnes, Java does not allocate 4 bytes for all primitives (although it does allocate an entire byte for booleans).

    With regards to speed, well-written Java code generally isn't much slower than C++ from version 5 up. The main bottleneck, in my experience, is with the maths functionality - particularly trig - but this can often be circumvented with the use of precomputed lookup tables.