What makes a Sort utility Fast? Big Sort.
Just about anyone can write an efficient sort when the entire set of unsorted items can fit in memory. But what can be done to make a Big Sort fast – especially when it is too large to fit in memory. And what if it is too large to fit only the keys in memory (a tag sort). Well of course, you have to use disk files and that can take considerable amount of performance away from the sort utility.
Big Sort in memory?
Even the fastest disks are only a fraction of the speed at which memory to memory operations can be done. So the goal of a disk based “big sort” is to use the disk as little as possible. The absolute minimum it can be used is to read the unsorted records once and then to write them back to disk after they are sorted. Unfortunately, that would imply that the entire unsorted file can fit in memory, and that is not a Big Sort. We have to accept the fact that the disk will need to be used to hold temporary sub-sets of the original data during the sorting process.
Since the data will need to be written to disk as temporary work files anyway, then we need to make an attempt at writing as little data as possible. Each unsorted item will be “represented” in the work file by it’s key and it’s location in the original, unsorted file. This “proxy” item will be sorted instead of the original item. Then after it is sorted, during the last phase of the big sort, the original row will be reread via the location information on the proxy. This original data will be written in order to the output file.
When disk is needed to sort very big files the way to make it fast is to transfer only the minimum amount of data as few times as possible.
