For decades computers have become faster by increasing the power of a single processing unit, but chip makers have recently hit limits, mainly cost and heat. The result: instead of one processor getting faster, more processors are being added. At the same time, the advent of cheap commodity computers and fast networking has given rise to clusters, grids and clouds. This article gives a quick overview of what is available when you need lots and lots of computing power but don't have lots and lots of money.
In 1965, Gordon Moore, co-founder of Intel, noted that the number of transistors that could be placed cheaply on a chip had doubled roughly every two years. What has since been called Moore's law has remained true for over four decades. Until 2005 these extra transistors went into making a single CPU more and more powerful, but due to problems with heat dissipation and the increasing gap between the speed of the CPU and the memory system, chip makers have started creating chips with more than one 'core'. Each core is a CPU, often with an associated memory cache, that can execute code independently of the other cores on the chip.
Multiple cores are found in another piece of commodity hardware: graphics chips. Each pixel on a screen is essentially independent of all the others, which makes graphics an easy target for parallel computing. In recent years the power of GPUs (graphics processing units) has become available for non-graphics-related programming too, potentially giving even more power to the programmer.
Even before the rise of multi-core machines, people were using commodity hardware to create more powerful machines. In 1994 two NASA scientists created a cluster of commodity computers that could work in parallel on a problem. They called their cluster 'Beowulf', and the name now applies to a whole class of parallel machines built from commodity parts. Special software is needed to divide the work into units, send the units off to individual computers and then bring the results back together. Several software stacks now exist that make setting up and running a cluster far easier.
Clusters are loosely defined by a few traits: the individual computers are the same, they have no screens or keyboards, a central computer controls them, and all the machines sit in the same physical space. Beowulf clusters are popular in academia because they are cheaper to build and run than a dedicated super-computer resource.
Grid computing got its name from the idea that computing power should be available as easily as electricity, and that you should be able to pay for what you use without having to bear the setup and maintenance cost of a large cluster or super-computer. It is related to the idea of computing as a utility that you tap into, just like power, gas and water.
Grid-style distributed computing for science was popularized by SETI@Home, which uses spare cycles on volunteers' machines to search for extra-terrestrial signals.
SETI@Home demonstrates the main differences between clusters and grids:
- Computers in a cluster are generally identical; grid computers can be very diverse
- Clusters are tightly controlled by a central unit; grids are more loosely controlled, depending on the capabilities or load of the individual machines
- Clusters are headless (without monitor or keyboard); grid computers are often regular machines that have other duties
- Clusters are found in a single location; grid computers are spread throughout an organization or even the whole world
- The machines in a cluster are trusted; the machines in a grid must be treated as untrusted because they are more loosely controlled
The rise of the Internet and the spread of personal computers mean that there is a very large number of machines that could be connected into a grid. Projects such as BOINC (of which SETI@Home is a part) and Folding@Home have shown that it is possible to gather truly enormous computing resources by relying solely on volunteer contributions.
Cloud computing is a recent development that takes a slightly different approach from grid computing. The two are similar in that both harness machines spread around the world, but cloud computing is more about creating virtual machines in data centres than about running programs on individual volunteer machines. Amazon popularized this approach with their Elastic Compute Cloud (EC2), which allows people to buy time on Amazon's servers. Instead of getting a physical machine, EC2 users get access to virtual machines that can be configured with whatever operating system and software they need. As more power is required, more instances of a machine can be spawned, and the user pays only for what they use.
Multiple cores, clusters, grids and clouds all give the programmer access to large amounts of processing power, but the code must be written to take advantage of it. This is a fundamental change in how computers are programmed, and it can be very difficult to bring the power to bear on your problem. Parallelism isn't easy to think about, and even the simplest programs can fall prey to subtle bugs arising from how two different threads read and write shared values (see race conditions and deadlocks).
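The shared-value hazard mentioned above can be shown in a few lines. This is a minimal sketch in Python: two threads increment one counter, and the lock is what keeps the read-modify-write of `count += 1` from interleaving (the names `worker` and `count` are illustrative, not from any particular library).

```python
import threading

count = 0
lock = threading.Lock()

def worker(iterations):
    """Increment the shared counter `iterations` times."""
    global count
    for _ in range(iterations):
        # `count += 1` is a read, an add and a write; without the lock
        # two threads can interleave those steps and lose updates.
        with lock:
            count += 1

threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(count)  # 200000 with the lock; unpredictable without it
```

Deleting the `with lock:` line turns this into a classic race condition: the program still runs, but the final total can silently come up short, which is exactly why such bugs are hard to find.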
The problems that make the jump to parallel computing most easily are those known as 'embarrassingly parallel'. These are problems where no dependency, and hence no communication, is needed between the different work units. Luckily science has a lot of these kinds of problems, usually in data analysis, and has therefore been at the forefront of parallel computing.