Friday, October 27, 2006

Faster than the speed of binary search

[Updated 2006-10-28, new text and code at the end.]

Everybody (at least everybody competent enough to be worth mentioning) know how to search in sorted array with the binary search algorithm. But everybody (or almost everybody) thinks that this is the fastest approach to searching. Is everybody (or almost everybody) correct regarding that topic? It depends.

On general, the answer is true - binary search is fastest. But that only holds if we know nothing about the data stored in the array - in other words, if we pretend that it is random (but sorted, of course).

Want an example? Take an array [1..99] of byte. Fill it with numbers in the range 1..100 but leave (random) one out. Use each number only once. Sort it. You'll get an array with first N numbers in the same position as the array index (A[i] = i for all i in 1..N) and the rest of it will be off by one (A[i] = i+1 for all i in N+1..99). Do you think that binary search is fastest possible search in that case? Of course no - to find a random number (say m) you only have to check A[m] and A[m-1]. Two lookups, that's all.

Are there more general guidelines to speeding up a binary search. Yes, it's quite simple - we only have to be able to guess the position of the value we're searching for better than with the "it's somewhere in the middle" approach. In a way, that's exactly what we did in the example above.

This approach is very simple to use if data in the array is distributed uniformly. In that case, we can take the value at low and high end of the interval (not the indices as in the binary search, but values at those indices) and use a linear interpolation to guess a possible position of the value.

Something similar happend to me last week. I was solving some DVB-related problem and built an array of video fragments (video PES, if you know anything about the DVB internals) in a video file (actually I only stored file offset for each fragment). Later, I needed to find (many times) a fragment closest to some file offset (every time a different offset). Of course, I could use a simple binary search, but it was such a nice opportunity to try out a better approach and run some tests ...

For test data I used a small, 40 MB video file. There were 51737 entries in the array and the code executed 3128 searches. With standard bisection, the code did 15.45 rounds through the bisection loop per search (on average). With interpolated search, this number was decreased to 4.13 rounds per search. That's better by a factor 3.7! And in the real life, the file will be some 50x larger - around 2 GB - and difference will be even bigger.

The number of rounds/number of searches distribution for both approaches:

We can see that standard binary search needed 16 rounds for majority of the searches and almost no searches were completed in less than 15 rounds.

On the other hand, interpolated binary search shows almost gaussian distribution from 2 to 6 rounds.

But as the code is more complicated (see below), the total speedup was not that impressive, only somewhere around 1.6 times. Still, it was a personal satisfaction to see the theoretically better code run faster in reality, too :)

function TGpTSPIDIndex.Find(transportStreamOffset: int64; var index: integer): boolean;
var
L, H, I, C: integer;
begin
{More work is needed to check for boundary conditions}
if Count = 0 then begin
index := 0;
Result := false;
end
else if transportStreamOffset <= Chains[0].TransportStreamOffset then begin
index := 0;
Result := (transportStreamOffset = Chains[0].TransportStreamOffset);
end
else if transportStreamOffset = Chains[Count-1].TransportStreamOffset then begin
index := Count-1;
Result := true;
end
else if transportStreamOffset >= Chains[Count-1].TransportStreamOffset then begin
index := Count;
Result := false;
end
else begin
Result := false;
L := 0;
H := Count - 1;
while Chains[L].TransportStreamOffset <= Chains[H].TransportStreamOffset do begin
//instead of
//I := (L + H) shr 1;
//we're using
if L = H then
I := L
else begin
I := Round((transportStreamOffset - Chains[L].TransportStreamOffset) /
(Chains[H].TransportStreamOffset - Chains[L].TransportStreamOffset) * (H-L)) + L;
if I < L then
I := L
else if I > H then
I := H;
end;
//offsets are int64 numbers
C := Int64Compare(Chains[I].TransportStreamOffset, transportStreamOffset);
if C < 0 then
L := I + 1
else begin
H := I - 1;
if C = 0 then begin
Result := true;
L := I;
end;
end;
end;
index := L;
end;
end; { TGpTSPIDIndex.Find }


[Updated 2006-10-28]


While thinking on the comment posted by the anonymous I looked at the code from a new view and suddenly I realized that I'm an idiot. The 'while' condition in the bisection loop in my version should not differ from the condition in standard binary search! When I put this right, I could suddenly remove all extra tests at the beginning as they were no longer needed. New code (tested to work the same as the old code and as the standard binary search - at least on my data) is posted below.


A word of warning: all this works only if there are no repetitive elements in the sorted array! If this condition is not satisfied, you'll soon get a 'division by 0' exception in the calculation of the weighted midpoint.

function TGpTSPIDIndex.Find(transportStreamOffset: int64; var index: integer): boolean;
var
L, H, I, C: integer;
begin
Result := false;
L := 0;
H := Count - 1;
while L <= H do begin
//standard binary search
//I := (L + H) shr 1;
//weighted version
if L = H then
I := L
else begin
I := Round((transportStreamOffset - Chains[L].TransportStreamOffset) /
(Chains[H].TransportStreamOffset - Chains[L].TransportStreamOffset) * (H-L)) + L;
if I < L then
I := L
else if I > H then
I := H;
end;
C := Int64Compare(Chains[I].TransportStreamOffset, transportStreamOffset);
if C < 0 then
L := I + 1
else begin
H := I - 1;
if C = 0 then begin
Result := true;
L := I;
end;
end;
end;
index := L;
end; { TGpTSPIDIndex.Find }

8 comments:

  1. Anonymous06:58

    Binary searchs are complete searches, where as your routine is a subset search. If you can be 100% sure that the data you seek will always be in that subset, it makes no sense to search elsewhere.

    If you have ANY doubt at all, stick with the complete search. IN fact, if you fail, you might want to fall back to complete search anyways just to make sure you don't have faulty assumption about your dataset.

    Data can be notioriously tricky that way.

    ReplyDelete
  2. Why do you think it is a subset search? It just selects a midpoint that is "slightly" off the center.

    But in general I do agree. However, tricks like this are a great way to quickly narrow down the search range and then use a binary search to find the data.

    ReplyDelete
  3. Anonymous17:40

    hy!

    have you to optimize this code?

    the most important things is to find if a value in "in the list", when not insert with with a new rangeitem or "expand" an already exists range.

    unit Unit_Int64List;

    interface

    uses
    Classes;

    type
    TInt64RangeItem = class
    private
    public
    StartIndex: Int64;
    Start: Int64;
    Stop: Int64;
    class function NewInstance: TObject; override;
    procedure FreeInstance; override;
    protected
    constructor Create(const _Start: Int64; const _Stop: Int64);
    published
    end;

    type
    TInt64List = class
    private
    FList: TList;
    FUpdateCount: Integer;
    FCount: Int64;
    function GetCount: Int64;
    function GetRangeCount: Integer;
    function FindRangeIndex(const Value: Int64; var Index: Integer): Boolean;
    procedure CalcStartIndex;
    function GetRange(Index: Integer): TInt64RangeItem;
    public
    constructor Create;
    destructor Destroy; override;
    function Add(const Value: Int64): Int64;
    procedure DeleteRange(const Index: Integer);
    function Find(const Value: Int64; var Index: Int64): Boolean;
    function FindGetPath(const Value: Int64): String;
    function IndexOf(const Value: Int64): Int64;
    procedure Clear;
    procedure Compact(const ReCalcStartIndex: Boolean = False);
    procedure SaveToStream(F: TStream);
    procedure LoadFromStream(F: TStream);
    //
    property Count: Int64 read GetCount;
    property RangeCount: Integer read GetRangeCount;
    property Ranges[Index: Integer]: TInt64RangeItem read GetRange;
    end;

    implementation

    uses
    SysUtils, Unit_Utils;

    const
    TIntegerSize = SizeOf(Integer);

    // -----------------------------------------------------------------------------
    class function TInt64RangeItem.NewInstance: TObject;
    type
    PClass = ^TClass;
    var
    P: Pointer;
    begin
    GetMem(P, Self.InstanceSize); //Allocate Memory for this object
    Result := TObject(P); // Set the result
    InitInstance(P); // Initialise the Instance
    PClass(Result)^ := Self; // Init VMT, don't init fields.
    end;

    procedure TInt64RangeItem.FreeInstance;
    begin
    CleanupInstance; // Clean Up
    FreeMem(Pointer(Self), Self.InstanceSize); // Free Memeory
    end;

    constructor TInt64RangeItem.Create(const _Start: Int64; const _Stop: Int64);
    begin
    inherited Create;
    Start := _Start;
    Stop := _Stop;
    end;

    // -----------------------------------------------------------------------------
    constructor TInt64List.Create;
    begin
    inherited Create;
    FList := TList.Create;
    FUpdateCount := 0;
    FCount := 0;
    end;

    destructor TInt64List.Destroy;
    begin
    Clear;
    FreeAndNil(FList);
    inherited Destroy;
    end;

    function TInt64List.Add(const Value: Int64): Int64;
    var
    R: TInt64RangeItem;
    RangePos: Integer;
    begin
    if (FindRangeIndex(Value, RangePos) = True) then
    begin
    R := FList.List^[RangePos];
    Result := R.StartIndex + (Value - R.Start);
    end else
    begin
    Result := 0;
    if (FList.Count > 0) then
    begin
    if (RangePos < FList.Count) then
    begin
    R := FList.List^[RangePos];
    if ((Value+1) = R.Start) then
    R.Start := Value
    else
    if (RangePos > 0) then
    begin
    R := FList.List^[RangePos-1];
    if ((R.Stop+1) = Value) then
    R.Stop := Value
    else
    FList.Insert(RangePos, TInt64RangeItem.Create(Value, Value));
    end else
    FList.Insert(0, TInt64RangeItem.Create(Value, Value));
    end else
    begin
    R := FList.List^[RangePos-1];
    if ((R.Stop+1) = Value) then
    R.Stop := Value
    else
    FList.Add(TInt64RangeItem.Create(Value, Value));
    end;
    end else
    FList.Add(TInt64RangeItem.Create(Value, Value));
    Inc(FCount);
    end;
    end;

    procedure TInt64List.DeleteRange(const Index: Integer);
    var
    R: TInt64RangeItem;
    begin
    if (Index >= 0 ) and
    (Index < FList.Count) then
    begin
    R := FList.List^[Index];
    FList.Delete(Index);
    FreeAndNil(R);
    end;
    end;

    function TInt64List.FindRangeIndex(const Value: Int64; var Index: Integer): Boolean;
    var
    Low,High,Mid: Integer;
    RangeCursor: TInt64RangeItem;
    begin
    Low := 0;
    High := FList.Count - 1;
    Mid := 0;
    Result := False;
    while not Result and ( Low <= High ) do
    begin
    Mid := (Low + High) shr 1;
    RangeCursor := FList.List^[Mid];
    if (RangeCursor.Start <= Value) and (Value <= RangeCursor.Stop) then
    Result := True
    else
    if (Value < RangeCursor.Start) then High := Mid - 1 else
    if (Value > RangeCursor.Stop ) then Low := Mid + 1;
    end;
    if Result then Index := Mid
    else Index := Low;
    end;

    function TInt64List.Find(const Value: Int64; var Index: Int64): Boolean;
    var
    R: TInt64RangeItem;
    RangePos: Integer;
    begin
    Index := -1;
    Result := FindRangeIndex(Value, RangePos);
    if (Result = True) then
    begin
    R := FList.List^[RangePos];
    Index := R.StartIndex + (Value - R.Start);
    end;
    end;

    function TInt64List.FindGetPath(const Value: Int64): String;
    var
    Low,High,Mid: Int64;
    RangeCursor: TInt64RangeItem;
    F: Boolean;
    V: Int64;
    begin
    Result := '';
    // Find range index ----------------------------------------------------------
    F := False;
    High := FList.Count - 1;
    Low := 0;
    Mid := 0;
    while not F and ( Low <= High ) do
    begin
    Mid := (Low + High) div 2;
    RangeCursor := FList.List^[Mid];
    if (RangeCursor.Start <= Value) and (Value <= RangeCursor.Stop) then
    begin
    Result := Result + '11';
    F := True;
    end else
    if (Value < RangeCursor.Start) then
    begin
    Result := Result + '10';
    High := Mid - 1;
    end else
    if (Value > RangeCursor.Stop ) then
    begin
    Result := Result + '01';
    Low := Mid + 1;
    end;
    end;
    if (F = True) then
    begin
    // Find index IN range -----------------------------------------------------
    Result := Result + '|';
    F := False;
    Low := 0;
    High := TInt64RangeItem(FList.List^[Mid]).Stop - TInt64RangeItem(FList.List^[Mid]).Start;
    while not F and ( Low <= High ) do
    begin
    Mid := (Low + High) div 2;
    V := TInt64RangeItem(FList.List^[Mid]).Start + Mid;
    if (Value = V) then
    begin
    Result := Result + '11';
    F := True;
    end else
    if (Value < V) then
    begin
    Result := Result + '10';
    High := Mid - 1;
    end else
    if (Value > V) then
    begin
    Result := Result + '01';
    Low := Mid + 1;
    end;
    end;
    end else
    Result := '|';
    end;

    function TInt64List.IndexOf(const Value: Int64): Int64;
    begin
    Find(Value, Result);
    end;

    procedure TInt64List.Clear;
    var
    I: Integer;
    begin
    if (FList.Count > 0) then
    for I := (FList.Count-1) downto 0 do
    begin
    TInt64RangeItem(FList.List^[I]).Free;
    //Dispose(FList.List^[I]);
    FList.Delete(I);
    end;
    FList.Pack;
    FList.Clear;
    end;

    procedure TInt64List.Compact(const ReCalcStartIndex: Boolean = False);
    var
    I: Integer;
    C,N: TInt64RangeItem;
    begin
    if (FList.Count > 1) then
    begin
    I := 0;
    repeat
    C := TInt64RangeItem(FList.List^[I ]);
    N := TInt64RangeItem(FList.List^[I+1]);
    //
    if (N.Start <= (C.Stop+1)) then
    begin
    if (N.Start <= C.Start) then
    C.Start := N.Start;
    C.Stop := N.Stop;
    DeleteRange(I+1);
    end else
    Inc(I);
    until (I = (FList.Count-1));
    if (RecalcStartIndex = True) then
    CalcStartIndex;
    end;
    end;

    function TInt64List.GetCount: Int64;
    begin
    Result := FCount;
    end;

    function TInt64List.GetRangeCount: Integer;
    begin
    Result := FList.Count;
    end;

    procedure TInt64List.CalcStartIndex;
    var
    I: Integer;
    R: TInt64RangeItem;
    begin
    FCount := 0;
    if (FList.Count > 0) then
    for I := 0 to (FList.Count-1) do
    begin
    R := FList.List^[I];
    R.StartIndex := FCount;
    Inc(FCount, (R.Stop - R.Start + 1));
    end;
    end;

    function TInt64List.GetRange(Index: Integer): TInt64RangeItem;
    begin
    Result := nil;
    if (0 <= Index) and (Index < FList.Count) then
    Result := FList.List^[Index];
    end;

    procedure TInt64List.SaveToStream(F: TStream);
    var
    I: Integer;
    R: TInt64RangeItem;
    V: packed record
    V1: Int64; // +8
    V2: Int64; // +8
    end;
    M: TMemoryStream;

    procedure FlushBuffer;
    begin
    M.Seek(0,0);
    F.CopyFrom(M, M.Size);
    M.Size := 0;
    end;

    begin
    Compact;
    if (FList.Count > 0) then
    begin
    M := TMemoryStream.Create;
    for I := 0 to (FList.Count-1) do
    begin
    R := FList.List^[I];
    V.V1 := R.Start;
    V.V2 := R.Stop;
    M.Write(V, 16);
    if (M.Size > 65536) then
    FlushBuffer;
    end;
    FlushBuffer;
    FreeAndNil(M);
    end;
    end;

    procedure TInt64List.LoadFromStream(F: TStream);
    var
    I,C: Integer;
    V: packed record
    V1: Int64; // +8
    V2: Int64; // +8
    end;
    M: TMemoryStream;
    begin
    Clear;
    M := TMemoryStream.Create;
    M.Size := F.Size;
    M.CopyFrom(F, F.Size);
    M.Seek(0,0);
    C := (M.Size div 16);
    if (C > 0) then
    begin
    FList.Capacity := C;
    for I := 1 to C do
    begin
    M.Read(V, 16);
    FList.Add(TInt64RangeItem.Create(V.V1, V.V2));
    end;
    end;
    FreeAndNil(M);
    CalcStartIndex;
    end;

    end.

    thx:
    xcluster

    ReplyDelete
  4. I just did the same thing, both binary and interpolation search are logn, but interpolation is about 3~4 times faster than binary search. Hard to beat the logn bound.

    ReplyDelete
  5. "This approach is very simple to use if data in the array is distributed uniformly."

    Here's an idea how to make any data uniformly distributed. Instead of using raw keys, use hashed keys. Hash functions like MD5 will convert any type of data to a random uniformly distributed one. Of course this adds some overhead as hash values need to be computed.

    ReplyDelete
  6. Anonymous20:35

    <> In that case, I think you mean 'check A[m] and A[m+1]'

    ReplyDelete
  7. A binary search is going to be O(log n), whereas a hash lookup will be O(1), amortized. You would have to have a pretty terrible hash function to get worse performance than a binary search.

    ReplyDelete
  8. Anonymous09:32

    Hi - Check-out NoChop/STree on www.agdresearch.com - its already 350% faster than binary-search since being recently discovered

    ReplyDelete