This is an update on my Faster than the speed of binary search entry. Maybe it would be good to refresh your memory before proceeding, brave reader ...
Last time I promised some hard data for multi-GB files. Finally, I had some time to do the measurements. I was doing the usual space & time profiling (we are talking about 30 - 60 minute execution times here so each 10% helps a lot) and this was a natural extension of tests.
Sadly (from the viewpoint of a perspective author that introduced the improved binary search algorithm into the application) I have redesigned the most time-consuming part of our app and the biggest list (requiring most lookups) is not used anymore. Therefore, numbers are not so large as I expected three months ago. Data is still interesting, especially as the analysis exposed two anomalous cases.
Let's first take a look at a 'normal' case. The list had between 1.596,266 and 1.803,027 entries (lists are live and in this case entries were removed from the list during the execution) and program executed 171,191 searches. Standard binary search needed 20.8 rounds (bisection steps) to resolve search query on average while my 'bias' version needed only 6.7 rounds. A more detailed distribution of number of rounds is shown in the following graph.
All good and well then, but this was exactly the situation for which the algorithm was designed. Let's take a look at the first anomalous case, where biased search exhibits really bad behaviour.
As you can see, the list had from 0 to 37,300 entries (0 at beginning) and there were 37,301 searches. In fact, each addition to the list was followed by a search request which always found the result somewhere in last 30 entries. Standard binary search needed 13.9 rounds on average while biased version needed 13.5 rounds, which is almost the same (and is in fact slower due to the slower inner loop of the biased version).
The distribution of rounds in biased search is also weird. It must either be a bug in the algorithm or a weird data pattern. It turned out to be the latter. In this case, the list contained data in clusters with big differences between each cluster. Something like this list with 40 entries: 1, 2, 3 ... 20, 1.000,000 1.000,001 .. 1.000,019 (notice the big jump between 20 and 1.000,000). Even worse, the code tries to find smallest number (or a number very close to it) in last cluster (1.000,000 in this case). My biased version first sets lower and upper bound of the search and then calculates the bias (1000000 - 1)/(1000019 - 1), which is in this case 0.999981 or almost 1 [see the code in previous article]. That causes selected 'midpoint' to be equal to the higher bound - 1. Search loop is then repeated, bias is almost the same and upper bound is again reduced by 1. This repeats until upper bound reaches the element we are searching for.
As you've seen, the search degrades to a simple linear search (with complicated calculations behind) in this case. Which is good when we want to find the last element (that happens when last cluster has only one element) or a element very much near the end of the list (when last cluster is small), but degrades quite badly when last cluster is large.
The second anomalous case is actually a special case of the first one, but in this case the cluster was always just one element big. In other words, the code always searched for the largest value in the list. In this case, biased search almost always retrieved the correct number in the first attempt, which is significantly faster than the standard binary search (16.5 rounds on average).
So what can we do to improve the situation? One way is to add a special search mode for my 'weird' data, but I really dislike this idea, as I don't know what kind of data will be passed to my (fairly generic) classes in the future.
Another idea was to limit the maximum bias. In standard binary search, the bias is always 0.5 (midpoint between low and high bound) while in the problematic case the bias is almost 1. If we limit the maximum bias to, say, 0.9, the code never degenerates into linear search.
When such clamping is applied, first anomalous case behaves much better (average number of rounds has been reduced to 8.2).
That's fine, but how would that change affect the other two cases? The 'normal' case behaves worse than before (average number of rounds went from 6.7 to 9.6).
That could be acceptable, but in the second anomalous case the number of rounds increases almost by factor 6.
It's clear that this simple solution indeed improves the worst case but degrades all other searches by a noticeable amount.
I had other ideas on how to improve the search algorithm but at the end I left it as it is now (the 'bias' version). After all, in the worst case biased search still makes less rounds than standard search; in the normal case, biased search works as expected and in the best case, biased search works much much better than the standard search.
I hope that I made my point clear - always, but really always analyse the program behaviour on real data, not on test cases.
The other point that is not so obvious is - Excel 2007 charting capabilities rock :)