Monday, 5 January 2009

Order notation, or how to tell how good your algorithm is

[caption id="" align="alignleft" width="300" caption="Photo by Wallula Junction"]Photo by Wallula Junction[/caption]

How can you tell if your shiny new algorithm is better than the one you already have? How do you settle an argument between you and your colleague? Order notation will help you choose between algorithms and settle arguments.

Order notation, otherwise known as Big "O" Notation, is one way of looking at the limiting behaviour of algorithms and therefore provides a way to compare how different algorithms independent of their implementations. It is usually used to show how long an algorithm will take to produce an answer but it can also be used to show how much memory, or other resources, will be used to produce a result. Order notation is used to show the best, average and worst case costs of an algorithm and you should be careful of quoted values as this will refer to the average case and not to the worse case. See later for an example.

Working it out
Consider a sorted list of 100 numbers drawn at random from the numbers 1 to 1 million. Suppose we were given a random number from the same range and wanted to find if that number was in our list.
The simplest way of doing this is to go through the list and check each item in the list against the number we've been given. Since the list contains a very small subset of the total range most of the time the number won't be in the list but we'll have to check all 100 items each time. Occasionally, an item will be in the list but as this happens very infrequently we can ignore this behaviour when working out the average order of this algorithm. This kind of simplification goes on a lot for simple order calculations. If more precision is required then this edge case can be considered.
The speed of the algorithm depends on the size of the list and regardless of any other factors the bigger the list the slower the algorithm: if the list doubles in size the algorithm will take twice as long. This algorithm is therefore an O(n) algorithm where n is the size of the array. It is important to note that if we reduce the size of the source range, to numbers 1 to 200, so that the item is found in the list most of the time this becomes an O(n/2) algorithm.

Now consider looking for a match using a binary search, which we can use as the list is sorted, we will only have to check at most 7 different numbers to find if the given number is in the list as we are halving the search space with each test. Also note that if we double the size of the list we only need to do one more search. This means that the binary search algorithm is O(log n). Note that in computer science log means log2 rather than log10 or loge.

It is now possible to compare the two algorithms rather than comparing implementations of the algorithms. Clearly, as n, the size of the list, grows the binary search algorithm becomes better and better making it the clear winner for large n. However, the simple search is far easier to code as writing a working binary search algorithm is not trivial so it is worth keeping it simple until the speed increase is needed (see the post on Optimization).

Interestingly, the space usage of both algorithms is the same as they work on the list of numbers directly and allocate no new space of their own.
In order notation this is known as O(1) or O(k) as the cost is independent of the number of items.

When analysing more complex algorithms break them into simpler parts, work out the order of each part and put them back together. For instance an algorithm that searches an input for numbers in a range and then sorts that range would be O(n + m log m) where n is the size of the input and m is the size of the subset. Note that for very large inputs the searching part will come to dominate and therefore this can be considered an O(n) algorithm.

Different classes
Algorithms are generally divided into different classes depending on the largest power of n. Thus, if you have an algorithm that is O(n2 + m) it is in the class of O(n2) algorithms.

For the most part, when considering different algorithms, forget about space complexity. Time complexity is the important one. You can often trade off time for space, though, so be aware of how much you're using.

  • O(1), or constant time, is the best option. Accessing an array is O(1), so is using a hash table. Unfortunately, hash tables usually incur a space penalty as they need to have empty space to minimize hash collisions (different inputs generating the same hash). Also a poor hashing function will kill you as resolving a collision is an O(m) operation where m is the number of elements sharing the same hash.

  • O(log n) is the next best option. Binary search is pretty much the only thing that is O(log n). Useful to remember if you can maintain a sorted list and are doing lots of searching.

  • O(n) is not bad, if you have a small set of things to iterate through. O(n) is very easy to think about so these should be your "go to" algorithms.

  • O(n log n) is usually only encountered in sort as lots of sorting algorithms have this order. QuickSort is a particularly good sorting algorithm that has O(n log n) as it's average behaviour.

  • O(n²) is getting quite bad, and should only be used when there is no alternative or the set is very small. Sorting algorithms often have O(n²) as worse cases. Look at how to cut down the search space in at least one dimension to try and tame O(n²) algorithms.

  • O(n3) and higher (ie O(2n), O(n!) etc.) are bad but might be the only way to solve the problem. Do you need the exact solution or will a 'good' one be enough? Use a simpler, less accurate algorithm, or some kind of heuristic. See travelling salesman problem for examples of tacking a problem where the simple solution is O(n!).

Being able to analyze algorithms, rather than implementations of  algorithms, is an important tool in the programmers toolkit. It allows you to make informed choices as well as settle arguments. The next time you and a colleague are going head to head over whose algorithm is best break out the order notation and settle the argument once and for all.

1 comment: