Tuesday, September 26, 2006

GpStructuredStorage internals

[This is a second article in the 'embedded file system' series. If you missed the first part, you can read it here.]

GpStructuredStorage compound file is organized in 1 KB blocks. First block contains header, then the content alternates between a file allocation table (FAT) fragment block and 256 blocks managed by the preceeding FAT fragment. Each block can be represented by a number - header block is block #0, first fat fragment is block #1 and so on.

[header:HEADER:1024]
[fat entry:FATENTRY:1024]
256 x [block:FOLDER/FILE:1024]
[fat entry:FATENTRY:1024]
256 x [block:FOLDER/FILE:1024]
...
[fat entry:FATENTRY:1024]
<=256 x [block:FOLDER/FILE:1024]

Header starts with a signature, which must always be 'GpStructuredStorage file'#13#10#26#0.

HEADER
[signature:32] // PChar
[unused:964]
[storage attribute file:4]
[storage attribute file size:4]
[first FAT block:4]
[first unused block:4]
[root folder:4]
[root folder size:4]
[version:4] // storage system version
Header ends in:
  • block number and size of the internal file containing global storage attributes
  • block number of the first FAT block (always 1)
  • block number of the first unused file block
  • block number and size of the root folder
  • storage system version (at the moment $01000200, or 1.0.2.0)

Each FAT fragment contains 256 32-bit numbers, one for each of the 256 following file blocks. This number is either 0 if block is last in the FAT chain, or it is a number of the next file block in the chain. Each block is linked into exactly one chain. It can either belong to a file/folder or to a empty block list. Address of the first block in the empty list is stored in the header ([first unused block] entry).

FATENTRY
256*[next block:4] // next-pointers for this block; 0 = unused
FAT structure defines the limitations of my file format - last block must have number that is less than 4294967294. As block are 1 KB in size, total file size cannot exceed 4294967294 * 1 KB, which is slightly less than 4 TB. Enough for all practical purposes, I think.

Folders are simple - each folder is just a file. It contains file information records and is terminated with two 0 bytes.

FOLDER //potentially split over several blocks
[FILE_INFO]
[FILE_INFO]
...
[FILE_INFO]
[0:2]
File information record is a variable length record containing file name (up to 64 KB), attributes, length, and address of the first file block (additional blocks are reached by following the FAT chain).
FILE_INFO
[file name length:2]
[file name:1..65535]
[file attributes:ATTRIBUTES:4]
[file length:4] // 4 GB per file
[first file block:4]
At the moment, only two attributes are defined. One specifies that file is actually a subfolder, and another designates a special file containing file attributes (for discussion of attributes see the previous article).
ATTRIBUTES
$0001 = attrIsFolder
$0002 = attrIsAttributeFile
That's just about everything that is to tell about the compound file format. Armed with this knowledge, one can easily write a compound file browser/repair utility. 

11 comments:

  1. Brilliant. I found this in GExperts code and decided to use it in a NewsReader program i'm writing. These posts will surely come in handy.

    Hvala, pozdrav :)

    ReplyDelete
  2. Thanks for the thumbs up!

    (In hvala ;) )

    ReplyDelete
  3. good work, but i find a bug.

    try to increase file size in test:

    function TForm1.TestGSSBigAndSmall: boolean;
    var
    storage: IGpStructuredStorage;
    begin
    Result := false;
    Log('Testing big and small files');
    try
    storage := CreateStructuredStorage;
    storage.Initialize(CStorageFile, fmCreate);
    // 0 bytes
    TestFile(storage, '/small.dat', true, -1);
    TestFile(storage, '/small2.dat', true, -1);
    TestFile(storage, '/small2.dat', true);
    // cross the 257-block boundary
    TestFile(storage, '/large.dat', true, 64);
    storage := CreateStructuredStorage;
    storage.Initialize(CStorageFile, fmOpenRead);
    TestFile(storage, '/small.dat', false, -1);
    TestFile(storage, '/small2.dat', false);
    TestFile(storage, '/large.dat', false, 64);
    Result := true;
    except
    on E: Exception do
    Log(' '+E.Message);
    end;
    end; { TForm1.TestGSSBigAndSmall }


    Self test fail!
    I use delphi 2007 Version 11.0.2804.9245
    on XP

    ReplyDelete
  4. I can confirm the problem. I'll fix it as soon as I have a little spare time.

    Thanks for reporting this!

    ReplyDelete
  5. It turned out to be only a bug in the test suite.

    When you passed a value larger than 63 as a fourth parameter to the TestFile method, TestFile made some incorrect assumptions and compared value that was larger than $FFFF to another value that was truncated to two bytes. As a result of that, error was raised (incorrectly).

    I have updated test suite at http://gp.17slon.com/gp/files/testgpstructuredstorage_src.zip.

    ReplyDelete
  6. Anonymous13:43

    Hi;

    Does it possible to store an executable file (.exe) in this storage and
    execute from there?

    ReplyDelete
  7. Store, yes. Execute, no.

    ReplyDelete
  8. How can I estimate the size of a new GpStructuredFile BEFORE adding files to it?

    ReplyDelete
  9. Let's see ...

    Header takes 1 KB. Always.
    There is at least one FAT entry (1 KB) and it can address 256 x 1 KB blocks.
    There is at least the root folder (1 KB).

    A back of the envelope calculation:
    - Take your file sizes and round them up to the nearest 1 KB multiple.
    - Multiply by 1.004 (FAT blocks).
    - Add 1 KB for each 25 files (folder blocks).
    - Add 3 KB (header, root FAT and root folder).

    ReplyDelete
  10. Are there any limitations on :
    - number of files in root?
    - number of folders in root?
    - number of files in subfolders?

    And, why did you mention "1 KB for each 25 files"?

    Thanks.

    ReplyDelete
  11. a) All directories are infinitely extendable, even root.
    b) A folder is just a file so a) applies.
    c) See a).

    1 KB per 25 files was just an approximation. A file entry in the folder uses 14 bytes + length of the file name. If your files have 13-character names (all names are Unicode), 25 files would use 25 * (14 + 13*2) = 1000 bytes.

    ReplyDelete