## Introduction

Welcome to my first contribution to the thus far excellent personal blog.

On my travels around the games industry I have noticed that although many people know about the existence of the Radix Sort, most know that it’s typically quick (some even known it’s non-comparison based, and linear time). What a great number of individuals I’ve met don’t seem to know however, is the nuts and bolts of how it actually works! Given this, I thought I’d have a crack at explaining Radix Sort for us mere mortals. With any luck, if you’re scratching your head at the notion of a sort that doesn’t do any comparisons or wondering how one is able to break free from the shackles of that O(n log n) thing that your Comp Sci. professor told you about, then this post will help you through it.

The high-level concept of Radix Sort can be imagined by thinking about what you could do if you had an array of, say 128, unique 16bit integers that you wanted to sort. What is perhaps the most obvious way to do it, if you didn’t care about memory utilisation or locality of your results? A simple way would be to simply use each value in the array as an index into an array. Something akin to the following:

`for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {`

``` ```

`     uint32_t value= unsortedArray[currentValue];     sortedArray[value] = value; }`

Obviously this is pretty wasteful in this form, but this extremely basic concept gives rise to the foundation of how Radix Sort works, put another way, we simply aim to just put the values into the correct places straight away without any need to compare with any other values. Obviously in the above example there are numerous points of failure and problems that we need to deal with, but we’ll get to those shortly.

Okay, so we’ve got the concept of roughly what a Radix Sort aims to do; now how do we go about solving the problems with our naive first-pass implementation? Like all good programmers, we’ll begin by defining the problems we’re trying to solve. Perhaps the most fatal blow to the code above is the larger the values you have in the list of unsorted values, the greater amount of memory you will need to accommodate their placement directly into the correct spot in the results array. This is because of the way we’re placing the values, we just use the value itself to determine where to store it in the corresponding results array. Another problem is collisions. The word “collision” here simply means, “When two items map to the same location in memory in the results array,” for example, when we have the same value twice (or more) in the input array that we’re attempting to sort. For the acquainted, the concept is analogous to that of a collision in the context of a hash map. Another problem that we actually wouldn’t have with the above (due to it working on 16bit unsigned integers) is what to do with negative values or values of a different type, such as floating point numbers. It is not a stretch for the imagination to contrive use-cases in which our sort should be robust enough to deal with this type of data.

## Analysing Data Using Elementary Statistical Tools

One of the reasons I like Radix Sort is that the solutions to these seemingly death-dealing problems come in the form of some mathematical tools you were likely taught at primary school (for those of you that only speak American, read “elementary school”). By performing some rudimentary analysis on the data, we’re able to sort it robustly and efficiently. The analysis that I speak of is something called a histogram, a simple mathematical tool used to depict the distribution of a dataset. In its traditional form a histogram is just a bar chart where the height of the column represents the number of times a given value is present in a set. Check out the example below: To calculate a histogram for an arbitrary set of values we can simply iterate over all the values in our set. For each value in our array we maintain a running total of the number of times it has been encountered… That’s it! (My apologies if you were expecting something more complex). We have a histogram for our array, and we’re well on our way to conquering Radix Sort. Pause for a moment here, and actually take a look at the histogram for the dataset you’re sorting. It can be interesting and very illuminating to see just how your data is spread out and you may gain some interesting insight.

An astute reader may have spotted that I actually glossed over an important implementation detail here. How exactly do you keep track of a running total for each unique value? Ideally we’d like this to be nice and quick, so we don’t want to use some nefarious mapping data structure to keep track of the histogram. Besides, I did promise that I’d keep things simple so how about a nice, simple, flat array? The problem is if we’re sorting, for example, 32bit integers and want to use the values themselves as array indices denoting the counter for a specific value, we’re left with the rather inconvenient problem of requiring 232 elements just to store the histogram! Not to fear though, as this is where the concept of the radix enters the fray.

If we have a smaller radix, of say 8 bits, and multiple histograms (one for each byte of the 32bit integers we’re sorting, a total of four), then we can use a relatively small amount of memory for the histograms which is proportional to the assumable range of the radix. Calculation of multiple histograms can be done in parallel (or in the same loop) no matter how many histograms you seek to calculate for the dataset, you just shift and mask off the number of bits you’re interested in for each histogram. Here’s a quick example of what I mean, for an 8bit radix, you’d simply do: `(x & 0xff)`, `((x>>8)&0xff)`, `((x>>16)&0xff)` and `(x>>24)` to access each byte of the 32bit value individually. This type of bit shifting should be immediately familiar to any graphics coders out there as it is often used to access individual channel in a 32bit colour value. The code to calculate four histograms (one for each byte) from our 32bit integers ends up looking a little something like this:

` for(uint32_t currentItem=0; currentItem<maxCount; ++currentItem) {`

``` ```

`     const uint32_t value = unsortedArray[currentItem];     const uint32_t value0 = value & 0xff;     const uint32_t value1 = (value>>0x8) & 0xff;     const uint32_t value2 = (value>>0x10) & 0xff;     const uint32_t value3 = (value>>0x18);     histogram0[value0]++;     histogram1[value1]++;     histogram2[value2]++;     histogram3[value3]++; } `

At this point I’d like to stress that there is absolutely no reason why you must have an 8bit radix. It is common, yes, but 11bit, 16bit, or whatever you want will also work. When you’re actually implementing this algorithm you should probably try out a few different radix sizes to see which gives you the best results on your target hardware. It’s essentially a trade-off between cache performance accessing the supporting data structures and doing less passes over the input array. Increasingly from my experience the more cache optimal solution (i.e.: the smaller Radix size and hence histogram/offset table) tends to perform best as memory latency increases relative to CPU performance. It’s also worth noting that histograms for different subsets of a full dataset can be summed in order to produce the histogram for the entire set, this trick can be useful when doing this type of thing in parallel, but we’ll get to that in due course.

## Offset Calculation

If you cast your mind back to the beginning of this post, I stated that the guiding principle of Radix Sort was to place values at the correct places in the output array immediately without performing any comparisons. To do this we need to know at what offset into our output array we should use to start placing writing to for each unique value in our input array. The histogram was actually an intermediate step in calculating this list of offsets. To illustrate this I will use a worked example, consider the following list of unsorted numbers:

1, 2, 4, 3, 1, 1, 3, 1, 7, 6, 5

If we wanted to place the value ‘2’ directly into the output array at its correct place we would actually need to place it at index 4 (i.e.: the 5th slot). So the table we’re computing will simply tell us that for any values of ‘2’ begin placing them at index 4. We then increment the offset for the value we just placed as we go, so that any subsequently occurring values which are the same (assuming there are any) go just after the one we’ve placed. The offset table we want to calculate for the above example would look something like this:

0 – N/A (There are no 0′s in the set, so it doesn’t matter!)
1 – 0
2 – 4
3 – 5
4 – 7
5 – 8
6 – 9
7 – 10

So how do we calculate this offset table for a histogram? That’s easy; each location in the table is just the running total of each value in the histogram at that point. So in this example, the offset for ‘2’ would be 4, because we have no ‘0’s, but then four ‘1’′s. This unsurprisingly is a total of 4! The next offset, for ‘3’, is 5 because in our data set we only have one ‘2’, and 0+4+1 is 5. The technique of increasing the offset for a given value as you place it in the output array gives rise to a very subtle, but important property of Radix Sort that is vital for implementations which begin at the least significant byte (LSB Radix Sort) —the sort is stable. In the context of sorting algorithms, a sort is said to be stable if the ordering of like values in the unsorted list is preserved in the sorted one, we shall see why this is so important for LSB Radix Sort later on. Incidentally don’t worry about what LSB means just yet, we’ll get to that! A quick note at this point, the application of these types of analysis tricks can also be done offline to help you better design a good solution around the data you’re trying to process. I’m hoping that a certain Mr. Acton might be kind enough to share some the increasingly important statistical tools that have made their way into his bag of tricks at some point on this blog in the future, :).

## What Have We Got So Far?

The data flow chart below shows what we’ve got so far. We’ve successfully applied some elementary data analysis to our data, and in the process computed the only supporting data structure we need: a table of offsets for each radix which tells us where to begin inserting each datum in our output array. I find a fun and useful way to visualise what we’ve just done is to imagine taking the bars of the histogram for each number in our data set, rotate them 90 degrees, and then lay them end to end. You can imagine these bars come together to form a contiguous block of memory, with each bar being big enough to hold all the occurrences of the particular value it represents. The offset table is just the indices along this array where each bucket begins. Now comes the fun part, actually moving the data. Things are going to get a little more complicated here, but only very, very slightly I promise. There are actually two ways to approach data movement in Radix Sort. We can either start with the least significant bits or with the most significant bits (LSB or MSB). Most people with a serial programming mindset tend to go with LSB, so we’ll start there and then cover MSB later on as that has some advantages (predominately for parallel programming).

## Serial Offender!

Radix Sort is typically not an in-place sort, that is to say implementations typically requires some auxiliary storage which is proportional to the number of elements in the original, unsorted list in order to operate. To move the data around we need to do one pass through our unsorted array for each radix (the number of passes will change depending on the size of each datum being sorted and the size of your radix, of course). Each time we perform a pass, we are actually sorting the data with respect to the particular radix we are considering, and I’ll begin by discussing this for LSB and then discuss MSB.

The first pass will typically sort by the first byte, the second by the second (but respecting the positioning of those elements with respect to the first) and so on. Hopefully at this point you are able to see both why the stable property of LSB Radix Sort is so important and where it comes from. The data movement pass is just a single pass over the input array for each radix as mentioned. For each pass, you just read the value in the input array, mask off the bits that are relevant to the particular radix you are considering and then look-up the offset from the table of indices we computed earlier. This offset tells you where in the corresponding output array we should move the value to. When you do this, you need to be sure to increment the offset table so that the next value from the input array with the same binary signature will be written to the next element in the array (not on top of the one you just wrote!). That’s all there is to it really! You just do this once for each radix, something like this:

` for(uint32_t currentValue=0; currentValue<maxCount; ++currentValue) {`

``` ```

`    const uint32_t value = unsortedArray[currentValue];    const uint32_t offsetTableIndex = value & 0xff;    const uint32_t writePosition = offsetTable0[offsetTableIndex];    offsetTable0[offsetTableIndex]++;    auxiliaryArray[writePosition] = value; } `

So that’s our LSB Radix Sort for integers, but what if you want to sort floating point numbers, or indeed negative numbers? I’m sure if you’re reading this you’re probably aware of the common storage formats of floating point values (the most common of which being IEEE 754) or two’s complement for signed integers. How does Radix Sort deal with these things? The answer is perfectly well if you apply a simple transformation to the input data, in time-honoured fashion I’ll leave this as an exercise for the reader as I don’t have time to delve into this now. If you’re really struggling then please leave a comment or something and I’ll try and find time to write it up for you :)

## Making the Best of a Parallel Situation

So we’ve covered how to Radix Sort works when starting from the LSB, seems to work nicely, why would we want to complicate things any further? This part will talk about how sorting data beginning with the most significant digits changes the nature of the problem and how we can use this to our advantage in this increasingly parallel world.

If you think about what you actually have at each stage of the data movement passes, you will see that using MSB Radix gives rise to a particularly interesting property, the first pass over the data actually produces n independent sorting problems, where n is the assumable range of the radix. If you can’t see why the parallel programmer in you is now rubbing his/her hands together, don’t worry, here’s why!

The first pass over the data in the context of an MSB Radix Sort actually has the effect of roughly bucketing the data according it’s most significant byte (or whatever size radix you’re working with), to put it another way, you know that after each pass of MSB Radix Sort, none of the values in the buckets will ever need to be moved into other buckets (which is of course not the case with LSB Radix Sort). This important property actually opens up the potential for us to distribute the sorting of the different buckets over multiple processing elements, crucially with each element working on independent data, making synchronisation much simpler. To perform MSB in serial you just re-apply Radix sort to each bucket independently. Another interesting side effect of doing MSB Radix Sort is that each application of Radix Sort makes the data more and more cache coherent for subsequent passes.

## Wrapping it up…

So, there you have it. Radix Sort explained (hopefully well enough) for humans! I hope after reading this you are able to have a strong mental model of how Radix Sort works in practise and moves data around in memory, and importantly how it is able to break free of the O(n log n) lower limit on comparison-based sorting algorithms. Just a quick disclaimer, for some of my more performance minded readers: I make no guarantees as to the optimality of my implementation; the purpose here was to explain the concepts behind the algorithm, not to write an optimal implementation.