Friday, January 26, 2007

Fast binary searching, part 2: Hard data

This is an update on my Faster than the speed of binary search entry. Maybe it would be good to refresh your memory before proceeding, brave reader ...

Last time I promised some hard data for multi-GB files. Finally, I had some time to do the measurements. I was doing the usual space & time profiling (we are talking about 30 - 60 minute execution times here, so every 10% helps a lot) and this was a natural extension of those tests.

Sadly (from the viewpoint of the prospective author who introduced the improved binary search algorithm into the application), I have redesigned the most time-consuming part of our app and the biggest list (the one requiring the most lookups) is not used anymore. Therefore, the numbers are not as large as I expected three months ago. The data is still interesting, though, especially as the analysis exposed two anomalous cases.

Let's first take a look at a 'normal' case. The list had between 1,596,266 and 1,803,027 entries (the lists are live and in this case entries were removed from the list during the execution) and the program executed 171,191 searches. The standard binary search needed 20.8 rounds (bisection steps) on average to resolve a search query, while my 'biased' version needed only 6.7 rounds. A more detailed distribution of the number of rounds is shown in the following graph.

[Graph: standard vs. biased search]
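
To be precise about what is being counted: a 'round' here is one pass through the search loop. A minimal round-counting standard binary search - written purely as an illustration, this is not the code used in the measurements and the names are invented - could look like this:

    # Standard binary search over a sorted list, instrumented with a
    # round counter that reports how many bisection steps were needed.
    # Illustrative sketch only, not the measured production code.
    def standard_search(items, value):
        low, high = 0, len(items) - 1
        rounds = 0
        while low <= high:
            rounds += 1
            mid = (low + high) // 2       # always probe the middle
            if items[mid] == value:
                return mid, rounds
            elif items[mid] < value:
                low = mid + 1
            else:
                high = mid - 1
        return -1, rounds                 # not found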

All well and good then, but this was exactly the situation for which the algorithm was designed. Let's take a look at the first anomalous case, where the biased search exhibits really bad behaviour.

[Graph: anomalous behaviour, bad biased search]

As you can see, the list had from 0 to 37,300 entries (0 at the beginning) and there were 37,301 searches. In fact, each addition to the list was followed by a search request, which always found the result somewhere in the last 30 entries. The standard binary search needed 13.9 rounds on average while the biased version needed 13.5 rounds, which is almost the same (and in fact slower, due to the slower inner loop of the biased version).

The distribution of rounds in the biased search is also weird. It had to be either a bug in the algorithm or a weird data pattern, and it turned out to be the latter. In this case, the list contained data in clusters, with big gaps between the clusters. Something like this list with 40 entries: 1, 2, 3 ... 20, 1,000,000, 1,000,001 ... 1,000,019 (notice the big jump between 20 and 1,000,000). Even worse, the code tries to find the smallest number in the last cluster, or a number very close to it (1,000,000 in this case). My biased version first sets the lower and upper bounds of the search and then calculates the bias, (1,000,000 - 1)/(1,000,019 - 1), which in this case is 0.999981, or almost 1 [see the code in the previous article]. That causes the selected 'midpoint' to be equal to the upper bound minus 1. The search loop is then repeated, the bias is almost the same, and the upper bound is again reduced by 1. This repeats until the upper bound reaches the element we are searching for.
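
To make the mechanics concrete, here is a minimal Python sketch of such a biased search over a sorted integer list. It is not the code from the previous article - the names and the round counter are invented for this example - and the only real difference from a standard binary search is how the probe position is computed:

    # Biased (interpolation-style) search: instead of probing the middle,
    # probe where the value is expected to lie, judging by the values at
    # the current bounds. Illustrative sketch only.
    def biased_search(items, value):
        low, high = 0, len(items) - 1
        rounds = 0
        while low <= high:
            rounds += 1
            if items[high] == items[low]:
                mid = low                 # all remaining values are equal
            else:
                bias = (value - items[low]) / (items[high] - items[low])
                bias = min(max(bias, 0.0), 1.0)
                mid = low + int(bias * (high - low))
            if items[mid] == value:
                return mid, rounds
            elif items[mid] < value:
                low = mid + 1
            else:
                high = mid - 1
        return -1, rounds

Searching for 1,000,000 in the 40-entry example above makes every probe land right next to the upper bound, so each round shaves only an element or two off the search range.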

As you've seen, in this case the search degrades into a simple linear search (with complicated calculations behind it). That is fine when we want to find the last element (which happens when the last cluster has only one element) or an element very near the end of the list (when the last cluster is small), but it behaves quite badly when the last cluster is large.

The second anomalous case is actually a special case of the first one, except that the last cluster was always just one element big. In other words, the code always searched for the largest value in the list. Here the biased search almost always found the correct number on the first attempt, which is significantly faster than the standard binary search (16.5 rounds on average).
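
For illustration, running the sketch from above against a list where the searched value is the largest element shows the same effect: the bias evaluates to 1, the first probe lands on the last element, and the search finishes in a single round.

    data = list(range(1, 1_000_001))      # 1 .. 1,000,000
    print(biased_search(data, data[-1]))  # -> (999999, 1)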

[Graph: anomalous behaviour, good biased search]

So what can we do to improve the situation? One way is to add a special search mode for my 'weird' data, but I really dislike this idea, as I don't know what kind of data will be passed to my (fairly generic) classes in the future.

Another idea was to limit the maximum bias. In a standard binary search, the bias is always 0.5 (the midpoint between the low and high bounds), while in the problematic case the bias is almost 1. If we limit the maximum bias to, say, 0.9, the code never degenerates into a linear search.
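
In the sketch from above, the change amounts to a single line - clamp the computed bias before using it. The 0.9 limit is the value discussed here; everything else is still just the illustrative sketch, not the production code.

    # Same as biased_search, but with the bias clamped to at most max_bias,
    # so the probe never creeps along the upper bound: with the default 0.9
    # it always stays at least 10% of the current range away from it.
    def clamped_biased_search(items, value, max_bias=0.9):
        low, high = 0, len(items) - 1
        rounds = 0
        while low <= high:
            rounds += 1
            if items[high] == items[low]:
                mid = low
            else:
                bias = (value - items[low]) / (items[high] - items[low])
                bias = min(max(bias, 0.0), max_bias)   # the clamp
                mid = low + int(bias * (high - low))
            if items[mid] == value:
                return mid, rounds
            elif items[mid] < value:
                low = mid + 1
            else:
                high = mid - 1
        return -1, rounds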

When such clamping is applied, the first anomalous case behaves much better (the average number of rounds drops to 8.2).

[Graph: anomalous behaviour, bad biased search, with clipping]

That's fine, but how would that change affect the other two cases? The 'normal' case behaves worse than before (the average number of rounds went from 6.7 to 9.6).

[Graph: standard vs. biased vs. clipped biased search]

That could be acceptable, but in the second anomalous case the number of rounds increases by a factor of almost 6.

[Graph: anomalous behaviour, good biased search, clipped]

It's clear that this simple solution indeed improves the worst case but degrades all other searches by a noticeable amount.

I had other ideas on how to improve the search algorithm, but in the end I left it as it is (the 'biased' version). After all, in the worst case the biased search still makes fewer rounds than the standard search; in the normal case it works as expected; and in the best case it works much, much better than the standard search.

I hope that I made my point clear - always, but really always, analyse the program behaviour on real data, not just on test cases.

The other, not so obvious, point is that Excel 2007 charting capabilities rock :)

 

7 comments:

  1. Anonymous, 23:46

    I enjoyed reading your blog on improving the binary search, something which has always been dear to my heart.

    I have found the best way to improve the binary search is to find some way to limit what you are searching and plan in advance.

    For example, if I have to perform searches against a national database of addresses, it makes more sense to separate my files by SCF (the first 3 digits of a zip code) and then perform binary searches through that. To search by state, I just create a lookup table for each state which lists the SCF's in that state.

    Other performance gains can be made by keeping the search data in a separate file with only a record pointer to where the actual data is located. A simple index using this method then can be read in "chunks". Each chunk would have the high and low keys first followed by the rest of the chunk in order. The end result is less disk activity when the search finds the correct page.

  2. Yes, that is also a quite effective approach.

    Wasn't usable in my case, though.

  3. Anonymous, 17:46

    Fewer search steps stop being a good thing if they take longer than more search steps.

    Runtimes here would be more valuable than # of steps.

  4. I agree fully.

    Maybe I'll publish run times sometime - but that is harder to measure, as I must keep one computer free of all other tasks for one hour (times the number of different scenarios I want to measure).

  5. First of all, good hashing will most probably always give you better speed for this.

    To get you into some other areas within binary search: if your data don't change often, you could have vectored "waypoints" computed for your data, giving a better average of your section/subsection and in that way more precise jumps. Once your subsection involves fewer than 64 elements you could switch to ordinary binary search. Also, compute the average without "special case" items. Let's say you have 1, 3, 7, 9, 100. In this sample 100 is a "special case" and should be dropped.

    Replies
    1. Hashing is good but it cannot solve all problems. In my original problem, I barely had enough memory to store a very packed representation of the data. There was no way to also fit a hash of it in memory.

      Sometimes you also don't know what is in your data but you are still searching in it. If this is the case, linear search of any variety is much faster than building a hash of everything in your data.

    2. Size is of course a major player when the amount of data is very large. But you could gain some speed from using vectored "waypoints" computed while you sort your data.

      For the other point, unless your data must be sorted, you also gain a lot in this area using hashing due to the fact that you don't need to sort it.
