Tuesday, March 16, 2010

Speed comparison: Variant, TValue, and TOmniValue

When I read “TValue is very slow!” on the TURBU Tech blog earlier today, I immediately wondered how fast TOmniValue (the basic data-exchange type in the OmniThreadLibrary) is compared to Variant and TValue. What else could I do but write a benchmark?!

I chose to test the performance in a way that is slightly different from Mason’s approach. My test measures not only the store operation but also load and (in some instances) add. The framework is also slightly different and decouples the time-management code from the benchmark.

const
  CBenchResult = 100*1000*1000; //100 million

procedure TfrmBenchmark.Benchmark(const benchName: string;
  benchProc: TBenchProc);
var
  benchRes : integer;
  stopwatch: TStopWatch;
begin
  stopwatch := TStopWatch.StartNew;
  benchProc(benchRes);
  stopwatch.Stop;
  Assert(benchRes = CBenchResult);
  lbLog.Items.Add(Format('%s: %d ms',
    [benchName, stopwatch.ElapsedMilliseconds]));
  lbLog.Update;
end;

procedure TfrmBenchmark.btnBenchmarkClick(Sender: TObject);
begin
  Benchmark('Variant', TestVariant);
  Benchmark('TValue', TestTValue);
  Benchmark('TOmniValue', TestTOmniValue);
end;

procedure TfrmBenchmark.TestTOmniValue(var benchRes: integer);
var
  counter: TOmniValue;
  i      : integer;
begin
  counter := 0;
  for i := 1 to CBenchResult do
    counter := counter.AsInteger + 1;
  benchRes := counter;
end;

procedure TfrmBenchmark.TestTValue(var benchRes: integer);
var
  counter: TValue;
  i      : integer;
begin
  counter := 0;
  for i := 1 to CBenchResult do
    counter := counter.AsInteger + 1;
  benchRes := counter.AsInteger;
end;

procedure TfrmBenchmark.TestVariant(var benchRes: integer);
var
  counter: Variant;
  i      : integer;
begin
  counter := 0;
  for i := 1 to CBenchResult do
    counter := counter + 1;
  benchRes := counter;
end;

As you can see, all three tests are fairly similar. They count from 0 to 100,000,000 and the counter is stored in a Variant/TValue/TOmniValue. The Variant test uses the same syntax as if the counter variable were declared as an integer, while the TValue and TOmniValue tests need some help from the programmer to determine how the counter should be interpreted (AsInteger).

The results were interesting. TValue is about 5x slower than Variant, which in turn is 7x slower than TOmniValue.

[Image: benchmark results]

Of course, I was interested in where this speed difference comes from and I looked at the assembler code.

Digging into the assembler

Variant

Unit32.pas.87: counter := counter + 1;
004B1232 8D55F0           lea edx,[ebp-$10]
004B1235 8D45E0           lea eax,[ebp-$20]
004B1238 E817AAF6FF       call @VarCopy
004B123D 8D45D0           lea eax,[ebp-$30]
004B1240 BA01000000       mov edx,$00000001
004B1245 B101             mov cl,$01
004B1247 E8DCF8F6FF       call @VarFromInt
004B124C 8D55D0           lea edx,[ebp-$30]
004B124F 8D45E0           lea eax,[ebp-$20]
004B1252 E8F523F7FF       call @VarAdd
004B1257 8D55E0           lea edx,[ebp-$20]
004B125A 8D45F0           lea eax,[ebp-$10]
004B125D E8F2A9F6FF       call @VarCopy

Very straightforward code. The Variant is copied into a temporary location, the number 1 is converted into a Variant, the two variants are added, and the result is stored back into the counter variable. As you can see, Variant calculations are really clumsy. It would be much faster to convert the Variant to an integer, add one, and convert the result back. Like this:

procedure TfrmBenchmark.TestVariant2(var benchRes: integer);
var
counter: Variant;
i,j : integer;
begin
counter := 0;
for i := 1 to CBenchResult do begin
j := counter;
counter := j + 1;
end;
benchRes := counter;
end;

This modified version generates much faster code.

Unit32.pas.100: j := counter;
004B1355 8D45F0           lea eax,[ebp-$10]
004B1358 E863B2F6FF       call @VarToInteger
004B135D 8BF0             mov esi,eax
Unit32.pas.101: counter := j + 1;
004B135F 8D45F0           lea eax,[ebp-$10]
004B1362 8D5601           lea edx,[esi+$01]
004B1365 B1FC             mov cl,$fc
004B1367 E8BCF7F6FF       call @VarFromInt

Benchmarking proves my theory. The optimized version needed only 1220 ms to complete the test, which makes it almost 5x faster than the original Variant code.

TValue

Unit32.pas.76: counter := counter.AsInteger + 1;
004B11A1 8D45E8           lea eax,[ebp-$18]
004B11A4 E86B96FFFF       call TValue.AsInteger
004B11A9 40               inc eax
004B11AA 8D55D0           lea edx,[ebp-$30]
004B11AD E8A695FFFF       call TValue.&op_Implicit
004B11B2 8D55D0           lea edx,[ebp-$30]
004B11B5 8D45E8           lea eax,[ebp-$18]
004B11B8 8B0D4C9F4A00     mov ecx,[$004a9f4c]
004B11BE E8D567F5FF       call @CopyRecord

The TValue code is quite neat. The counter is converted to an integer, one is added, the result is converted into a temporary TValue, and this temporary TValue is copied back into the counter. Why, then, is the TValue version so much slower? We’ll have to look into the implementation to find the answer. But first, let’s find out why TOmniValue is so fast.

TOmniValue

Unit32.pas.65: counter := counter.AsInteger + 1;
004B10AA 8D45F3           lea eax,[ebp-$0d]
004B10AD E8FAF3FFFF       call TOmniValue.IsInteger
004B10B2 84C0             test al,al
004B10B4 740E             jz $004b10c4
004B10B6 8B45F3           mov eax,[ebp-$0d]
004B10B9 8945E8           mov [ebp-$18],eax
004B10BC 8B45F7           mov eax,[ebp-$09]
004B10BF 8945EC           mov [ebp-$14],eax
004B10C2 EB32             jmp $004b10f6
004B10C4 8D45F3           lea eax,[ebp-$0d]
004B10C7 E8D8F3FFFF       call TOmniValue.IsEmpty
004B10CC 84C0             test al,al
004B10CE 7410             jz $004b10e0
004B10D0 C745E800000000   mov [ebp-$18],$00000000
004B10D7 C745EC00000000   mov [ebp-$14],$00000000
004B10DE EB16             jmp $004b10f6
004B10E0 B94C114B00       mov ecx,$004b114c
004B10E5 B201             mov dl,$01
004B10E7 A16CD14000       mov eax,[$0040d16c]
004B10EC E82747F6FF       call Exception.Create
004B10F1 E8D247F5FF       call @RaiseExcept
004B10F6 8B45E8           mov eax,[ebp-$18]
004B10F9 8BF0             mov esi,eax
004B10FB 8D55F3           lea edx,[ebp-$0d]
004B10FE 8D4601           lea eax,[esi+$01]
004B1101 E8AEF3FFFF       call TOmniValue.&op_Implicit

Weird stuff, huh? The counter is converted to an integer, then a bunch of funny code is executed, and the result is converted back to a TOmniValue. The beginning and the end are easy to understand, but what’s going on in between?

The answer is – inlining. Much of the TOmniValue implementation is marked inline and what we are seeing here is the internal implementation of the AsInteger property.

I’ll return to this later, but first let’s check what happens if all these inline modifiers are removed.

Unit32.pas.65: counter := counter.AsInteger + 1;
004B10EF 8D45F3           lea eax,[ebp-$0d]
004B10F2 E865F4FFFF       call TOmniValue.GetAsInteger
004B10F7 40               inc eax
004B10F8 8D55E0           lea edx,[ebp-$20]
004B10FB E8A4F4FFFF       call TOmniValue.&op_Implicit
004B1100 8D55E0           lea edx,[ebp-$20]
004B1103 8D45F3           lea eax,[ebp-$0d]
004B1106 8B0D5CF84A00     mov ecx,[$004af85c]
004B110C E88768F5FF       call @CopyRecord

The generated code is now almost the same as in the TValue case; only the stack offsets differ. It is also much slower: instead of 839 ms, the code took 3119 ms to execute and was only twice as fast as the original Variant code (and much slower than the modified Variant code). Inlining AsInteger alone couldn’t make such a big difference. It looks like CopyRecord is the culprit for the slowdown. I didn’t verify this by measurement, but if you look at the _CopyRecord implementation in System.pas, it is obvious that record copying cannot be very fast.

The Delphi compiler team would do us a big favor if future versions of the compiler generated custom copying code adapted to each record type.
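
The CopyRecord suspicion could be checked with a small console test along the following lines. This is only a sketch under assumptions: TPlainRec, TManagedRec and CLoops are names made up for the illustration, the interface field is there only to turn the second record into a managed one (records with managed fields such as strings or interfaces are assigned through _CopyRecord), and the unit holding TStopwatch is Diagnostics in Delphi 2010 but System.Diagnostics in XE2 and later. I have not run it, so no timings are quoted.

program CopyRecordBench;

{$APPTYPE CONSOLE}

uses
  SysUtils,
  Diagnostics; // TStopwatch; System.Diagnostics in XE2 and later

const
  CLoops = 100*1000*1000; // hypothetical loop count, same as CBenchResult

type
  // No managed fields: assignment can be compiled to plain moves.
  TPlainRec = record
    Data: int64;
    Kind: byte;
  end;

  // The interface field makes this a managed record, so an
  // assignment is expected to go through System._CopyRecord.
  TManagedRec = record
    Data: int64;
    Intf: IInterface;
    Kind: byte;
  end;

procedure TestPlain;
var
  src, dst : TPlainRec;
  i        : integer;
  stopwatch: TStopwatch;
begin
  src.Kind := 1;
  stopwatch := TStopwatch.StartNew;
  for i := 1 to CLoops do begin
    src.Data := i;
    dst := src; // the record copy being measured
  end;
  stopwatch.Stop;
  Writeln(Format('plain record: %d ms (last value %d)',
    [stopwatch.ElapsedMilliseconds, dst.Data]));
end;

procedure TestManaged;
var
  src, dst : TManagedRec;
  i        : integer;
  stopwatch: TStopwatch;
begin
  src.Kind := 1;
  src.Intf := nil;
  stopwatch := TStopwatch.StartNew;
  for i := 1 to CLoops do begin
    src.Data := i;
    dst := src; // the record copy being measured
  end;
  stopwatch.Stop;
  Writeln(Format('managed record: %d ms (last value %d)',
    [stopwatch.ElapsedMilliseconds, dst.Data]));
end;

begin
  TestPlain;
  TestManaged;
end.

If the second loop turns out to be several times slower than the first, that would support the idea that the record copy, and not the getter, dominates the non-inlined TOmniValue timing.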

Use the source, Luke!

What’s left for me is to determine the reason for the big speed difference between TValue and TOmniValue. To find it, I had to dig into the implementation of both records. Of greatest interest to me were the AsInteger getter and the Implicit(from: integer) operator.

TOmniValue

TOmniValue lives in OtlCommon.pas. The AsInteger getter, GetAsInteger, simply forwards the call to the GetAsInt64 method. Similarly, Implicit maps to SetAsInt64.

type
  ovData: int64;
  ovType: (ovtNull, ovtBoolean, ovtInteger, ovtDouble, ovtExtended, 
           ovtString, ovtObject, ovtInterface, ovtVariant, 
           ovtWideString, ovtPointer);

function TOmniValue.GetAsInt64: int64;
begin
  if IsInteger then
    Result := ovData
  else if IsEmpty then
    Result := 0
  else
    raise Exception.Create('TOmniValue cannot be converted to int64');
end; { TOmniValue.GetAsInt64 }

procedure TOmniValue.SetAsInt64(const value: int64);
begin
  ovData := value;
  ovType := ovtInteger;
end; { TOmniValue.SetAsInt64 }

The code is quite straightforward. Some error checking is done in the getter, and the value is just stored away in the setter. Now the assembler code from the first TOmniValue example makes some sense: we were simply looking at the inlined implementation of GetAsInt64. (The Implicit operator was not inlined.)
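
To make the role of the inline directive concrete, here is a minimal sketch of a record built the same way. TMiniValue and TMiniKind are hypothetical stand-ins reduced to the integer case, not the real TOmniValue declaration; the point is only to show where the inline modifier sits and that the getter is reached through a property.

program InlineGetterDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TMiniKind = (mkEmpty, mkInteger);

  // Hypothetical, stripped-down stand-in for TOmniValue.
  TMiniValue = record
  private
    FData: int64;
    FKind: TMiniKind;
    // Removing "inline" here turns the getter into a regular call.
    function GetAsInt64: int64; inline;
  public
    class operator Implicit(const value: int64): TMiniValue;
    property AsInt64: int64 read GetAsInt64;
  end;

function TMiniValue.GetAsInt64: int64;
begin
  if FKind = mkInteger then
    Result := FData
  else if FKind = mkEmpty then
    Result := 0
  else
    raise Exception.Create('TMiniValue cannot be converted to int64');
end;

class operator TMiniValue.Implicit(const value: int64): TMiniValue;
begin
  Result.FData := value;
  Result.FKind := mkInteger;
end;

var
  counter: TMiniValue;
  i      : integer;
begin
  counter := 0;
  for i := 1 to 1000 do
    counter := counter.AsInt64 + 1; // same shape as the benchmark loop
  Writeln(counter.AsInt64);
end.

Comparing the CPU view for the loop body with and without the inline modifier should show the same kind of difference as the two TOmniValue listings above.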

TValue

The TValue record lives in RTTI.pas. The AsInteger getter is remapped to the generic version AsType<Integer>, which calls TryAsType<T>. In a slightly less roundabout manner, Implicit calls From<Integer>.

function TValue.TryAsType<T>(out AResult: T): Boolean;
var
  val: TValue;
begin
  Result := TryCast(System.TypeInfo(T), val);
  if Result then
    val.Get<T>(AResult);
end;

class function TValue.From<T>(const Value: T): TValue;
begin
  Make(@Value, System.TypeInfo(T), Result);
end;

It’s quite obvious that the TValue internals are not optimized for speed. Everything is mapped to generics and the RTTI system, which is fast, but not fast enough to be used in computationally intensive code.
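
For illustration, one step of the benchmark loop can be written with the generic calls spelled out. This is a sketch of what the short form maps to according to the description above, using the public TValue.From<T> and AsType<T> methods; the program name and variables are made up for the example.

program TValueRoundTrip;

{$APPTYPE CONSOLE}

uses
  Rtti;

var
  counter: TValue;
  n      : integer;
begin
  // Long form of "counter := 0"
  counter := TValue.From<Integer>(0);
  // Long form of "counter.AsInteger": cast via type info, then extract
  n := counter.AsType<Integer>;
  // Long form of "counter := n + 1": a new TValue is built from type info
  counter := TValue.From<Integer>(n + 1);
  Writeln(counter.AsInteger);
end.

Every pass through the loop therefore involves a type-info lookup and a cast in one direction and a Make call in the other, in addition to the record copy.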

Conclusion

  1. Don’t use TValue for counting. Heck, don’t even use Variant or TOmniValue for counting – they were not designed for that purpose!
  2. TValue may look slow but in fact it is not. It is able to count from 1 to over three million in one second. That’s not slow. It’s just not as fast as a register-based counter. But that’s OK as you should always remember rule 1.
  3. TValue is incredibly powerful. Just look at its implementation. Therefore, it can afford to be a tad slower than other multi-purpose storage mechanisms.
  4. TOmniValue is very fast, but most of its speed (compared to the Variant) comes from the inlining and the compiler being smart enough not to call CopyRecord in this case.
  5. The Delphi compiler should really be improved to generate custom CopyRecord code for each record type.
  6. Assembler code tells a lot. Source code tells even more.

P.S.

Using OtlCommon won’t bring in any other parts of the OTL library. It requires the following units to compile: DSiWin32, GpStuff, and GpStringHash. Nothing from those units will be linked in, as the TOmniValue implementation doesn’t depend on them. The simplest way to get them all is to download the latest stable OmniThreadLibrary release.

8 comments:

  1. Huh, as always great and in depth article. I only scratched the surface and didn't have time to dig further. Well now I won't have to :)

    I didn't know inlining contributes so much to the result.

  2. Anonymous, 23:31

    After having read your great article, I've tried to rewrite the _CopyRecord function of the system.pas unit, with speed in mind.

    Here is the resulting code, which should work from Delphi 7 up to 2009: http://blog.synopse.info/post/2010/03/23/CopyRecord-faster-proposal

  3. Nice work! I'd be curious to know how this test compares when run on Delphi XE. I'm using Variants a lot in my UI stuff and am curious to know if they have seen optimizations.

  4. @Eric: I've reconstructed my test under XE and I'm getting similar results, except that TOmniValue is only twice as fast as Variant.

  5. I think you are wrong about Variant type. You should change benchmark of Variant type by direct casting it to TVarData and using VInteger member of structure assuming that only integer values are involved.
    Here is my optimized version of Variant test, try it:
    procedure TfrmBenchmark.TestVariant3(var benchRes: Integer);
    var
      counter: Variant;
      i: Integer;
    begin
      counter := Integer(0);
      for i := 1 to CBenchResult do
        TVarData(counter).VInteger := TVarData(counter).VInteger + 1;
      benchRes := counter;
    end;

  6. Anonymous, 12:12

    Delphi Seattle, current version of libraries:

    Variants: 1263
    TValue: 2017
    TAnyValue: 1137
    TOmniValue: 4166
    TVariableRec: 798

  7. CPU: i5-4430
    Delphi Berlin
    TValue: 2928ms
    Variant: 3820ms

    Delphi 10.2
    TValue: 2908ms
    Variant: 3921ms
