Friday, September 01, 2017

Writing a Simple DSL Compiler with Delphi [3. Tokenizer]

This article describes the tokenizer used in my toy language project. If you are new to this series, I recommend starting with this post.

Please note that this article describes an initial implementation of the tokenizer. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.

With this article I'm moving into the gritty part of the project - the code which reads the source code and turns it into a beautiful abstract syntax tree. In other words, I'll be talking about the parser.

I must admit that I spent as little time on the parser as I could. After all, my main goal was to convert the AST into executable code, not to parse the input. Still, one cannot write a compiler without writing a parser, so ... here I am.

If you want to do anything with any data, you have to a) know what format it is written in and b) write some code that reads the input, decodes the format, and creates internal memory structures filled with data. As the first instinct of any good programmer is to create a library that does the hard work for her/him ;) it was not long before such tools also appeared in the area of compiler writing.

Programmers also understood very early on that parsing is actually a two-part process. In the first step you split the data into lexical tokens. Each token represents the smallest part of the input that has a specific meaning. For example, if we split the Pascal statement a := func(i+42); into separate tokens, we get: identifier:a, whitespace, becomes (":="), whitespace, identifier:func, left-paren, identifier:i, plus, number:42, right-paren, semicolon. Some characters are mapped to their own tokens (for example, ";" becomes a semicolon) and some are grouped into appropriate units ("42" becomes a number with value 42).

The second part - the parser - consumes the sequence of tokens produced by the tokenizer and tries to understand its semantics - that is, how the tokens fit together according to a formal language specification. That will be the topic of the next article.

Because of this separation, the supporting tools are also split into two areas - one supporting the creation of tokenizers and the other of parsers. In the Unix world, for example, we have Lex for lexical analysis, a.k.a. tokenization, and Yacc (Yet Another Compiler Compiler) for semantic analysis. In the Pascal world we also have our share of compiler-generating tools. There's a pretty decent port to Delphi, Plex+Pyacc for FreePascal, the quite old and unsupported (but free) Delphi Compiler Generator, and probably some other tools which I failed to find with my pretty superficial search.

Let's get back to the topic ...

In my case, the language is extremely simple and so is the tokenizer. Instead of using some specialized tool I just went ahead and coded the whole thing. As you'll see, it is pretty simple.

Tokens

Let's take a look at a demo program from part 1:

fib(i) {
  if i < 3 {
    return 1
  } else {
    return fib(i-2) + fib(i-1)
  }
}

From this example we can guess almost all tokens used in The Language: identifier ("fib", "i", "if", "return" ...), number ("1", "2", "3"), whitespace, left parenthesis ("("), right parenthesis (")"), left curly bracket ("{"), right curly bracket ("}"), less than ("<"), plus ("+") and minus ("-").

There are only two tokens this example fails to cover - a comma (",") and a semicolon (";"). The former separates parameters in function definitions and function calls, and the latter separates statements; it doesn't appear in the sample code because it is optional immediately before a right curly bracket.

The following type from unit SimpleDSL.Compiler.Tokenizer enumerates all the possibilities. In addition to the already mentioned token types, tkUnknown is used to represent unexpected input and tkEOF signals that the end of the input stream was reached.

type
  TTokenKind = (tkUnknown, tkWhitespace,
                tkIdent, tkNumber,
                tkLeftParen, tkRightParen, tkLeftCurly, tkRightCurly,
                tkLessThan, tkPlus, tkMinus,
                tkComma, tkSemicolon,
                tkEOF);

Interface

The tokenizer is accessed through a very simple interface:

ISimpleDSLTokenizer = interface
  function  CurrentLocation: TPoint;
  function  GetToken(var kind: TTokenKind; var identifier: string): boolean;
  procedure Initialize(const code: string);
  function  IsAtEnd: boolean;
end;

The most important function here is GetToken. It returns the next token from the source stream - kind contains the token kind and identifier contains the sequence of characters representing the token. The function returns False when the end of the stream is reached.
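
To make this concrete, here is a minimal sketch of a token-dumping loop. It assumes a factory function CreateSimpleDSLTokenizer that returns the interface (a hypothetical name - check the repository for how the tokenizer is actually created) and uses GetEnumName from System.TypInfo to print the token kind:

uses
  System.TypInfo;

procedure DumpTokens(const code: string);
var
  tokenizer : ISimpleDSLTokenizer;
  kind      : TTokenKind;
  identifier: string;
begin
  tokenizer := CreateSimpleDSLTokenizer; // hypothetical factory
  tokenizer.Initialize(code);
  // GetToken returns False (with kind = tkEOF) once the input is exhausted
  while tokenizer.GetToken(kind, identifier) do
    Writeln(GetEnumName(TypeInfo(TTokenKind), Ord(kind)), ' "', identifier, '"');
end;

For the input fib(i-1) + 42 this would print tkIdent "fib", tkLeftParen, tkIdent "i", tkMinus, tkNumber "1", tkRightParen, tkWhitespace, tkPlus, tkWhitespace, and tkNumber "42" (identifier stays empty for one-character tokens).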

The other functions are all just simple helpers. IsAtEnd returns True when the end of the stream is reached, CurrentLocation returns the current line and column (useful for error reporting), and Initialize initializes the tokenizer.

procedure TSimpleDSLTokenizer.Initialize(const code: string);
begin
  FProgram.Text := code;
  FNextLine := 0;
  FNextChar := 1;
  FLookahead := #0;
  FLastLine := FProgram.Count - 1;
  if FLastLine >= 0 then begin
    FLastLineLen := Length(FProgram[FLastLine]);
    FCurrentLine := FProgram[FNextLine];
  end;
end;

The program is internally stored in a string list, FProgram. The other fields track the current read position and are mostly used in the GetChar method.
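
For reference, here is roughly what the tokenizer's private state could look like, reconstructed from Initialize and GetChar in this article (a sketch - the exact declaration is in the repository). It also shows a plausible IsAtEnd, which explains why Initialize precomputes FLastLine and FLastLineLen:

type
  TSimpleDSLTokenizer = class(TInterfacedObject, ISimpleDSLTokenizer)
  strict private
    FProgram    : TStringList; // the source, split into lines
    FCurrentLine: string;      // cached copy of FProgram[FNextLine]
    FNextLine   : integer;     // 0-based index of the line being read
    FNextChar   : integer;     // 1-based position of the next character
    FLookahead  : char;        // one-character push-back buffer (#0 = empty)
    FLastLine   : integer;     // index of the last line
    FLastLineLen: integer;     // length of the last line
    // ... interface methods and private helpers ...
  end;

function TSimpleDSLTokenizer.IsAtEnd: boolean;
begin
  // the input is exhausted when the read position moves past the
  // last character of the last line
  Result := (FNextLine > FLastLine) or
            ((FNextLine = FLastLine) and (FNextChar > FLastLineLen));
end;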

Reading input

Let's take a look at the GetToken method.

function TSimpleDSLTokenizer.GetToken(var kind: TTokenKind; var identifier: string): boolean;
var
  ch: char;
begin
  identifier := '';
  Result := GetChar(ch);
  if not Result then begin
    kind := tkEOF;
    Exit;
  end;
  case ch of
    '(': kind := tkLeftParen;
    ')': kind := tkRightParen;
    '{': kind := tkLeftCurly;
    '}': kind := tkRightCurly;
    '+': kind := tkPlus;
    '-': kind := tkMinus;
    '<': kind := tkLessThan;
    ',': kind := tkComma;
    ';': kind := tkSemicolon;
    else if ch.IsLetter then begin
      kind := tkIdent;
      identifier := ch + GetIdent;
    end
    else if CharInSet(ch, ['0'..'9']) then begin
      kind := tkNumber;
      identifier := ch + GetNumber;
    end
    else if ch.IsWhiteSpace then begin
      kind := tkWhitespace;
      SkipWhitespace;
    end
    else
      kind := tkUnknown;
  end;
end;

First it reads the next character from the stream (via GetChar) and exits if there is no next character. Then it handles all one-character tokens internally. Finally, it delegates the specific jobs of reading identifiers, numbers, and whitespace to GetIdent, GetNumber, and SkipWhitespace, respectively.

GetIdent and GetNumber are very similar, so I'll only focus on one of them.

function TSimpleDSLTokenizer.GetIdent: string;
var
  ch: char;
begin
  Result := '';
  while GetChar(ch) do begin
    if ch.IsLetter or ch.IsNumber or (ch = '_') then
      Result := Result + ch
    else begin
      PushBack(ch);
      Exit;
    end;
  end;
end;

As GetToken has already read the first character of the identifier, GetIdent collects all following characters that are a letter, a number, or an underscore. (And as you can see, the code uses the character class helpers - IsLetter and IsNumber - so identifiers are actually Unicode-aware.)

When a non-identifier character appears, the simplest solution is just to say "Oooops, I've read too far, please push this last character back into the input."
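
For completeness, GetNumber presumably follows the same pattern, only with a digit test instead of the letter test (a sketch modelled on GetIdent above - see the repository for the actual implementation):

function TSimpleDSLTokenizer.GetNumber: string;
var
  ch: char;
begin
  Result := '';
  while GetChar(ch) do begin
    if CharInSet(ch, ['0'..'9']) then
      Result := Result + ch
    else begin
      // we have read one character too many - return it to the input
      PushBack(ch);
      Exit;
    end;
  end;
end;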

One-character buffer

As we only ever read one character too far, this "please undo the last GetChar" operation is handled with a simple one-character buffer, used in PushBack and GetChar.

procedure TSimpleDSLTokenizer.PushBack(ch: char);
begin
  Assert(FLookahead = #0, 'TSimpleDSLTokenizer: Lookahead buffer is not empty');
  FLookahead := ch;
end;
 
function TSimpleDSLTokenizer.GetChar(var ch: char): boolean;
begin
  if FLookahead <> #0 then begin
    ch := FLookahead;
    FLookahead := #0;
    Result := true;
  end
  else begin
    Result := not IsAtEnd;
    if Result then begin
      ch := FCurrentLine[FNextChar];
      Inc(FNextChar);
      if FNextChar > Length(FCurrentLine) then begin
        Inc(FNextLine);
        if FNextLine < FProgram.Count then
          FCurrentLine := FProgram[FNextLine];
        FNextChar := 1;
      end;
    end;
  end;
end;

PushBack just stores the character in the FLookahead buffer (and asserts that the buffer is empty - a full buffer at this point can only mean there's a bug in the tokenizer). GetChar has a bit more work to do: in addition to checking the FLookahead buffer, it must also handle the 'end of line' condition.

This 'push back' approach is also used in GetNumber and SkipWhitespace (for details see the code) and in the parser, as we'll see very soon.
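
Based on that description, SkipWhitespace would look much the same as GetIdent, except that it throws characters away instead of collecting them - again just a sketch:

procedure TSimpleDSLTokenizer.SkipWhitespace;
var
  ch: char;
begin
  // consume whitespace; the first non-whitespace character is pushed
  // back so that the next GetToken call starts with it
  while GetChar(ch) do
    if not ch.IsWhiteSpace then begin
      PushBack(ch);
      Exit;
    end;
end;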


3 comments:

  1. Nice article. Just some notes from my end:
    Your identifiers cannot start with an underscore. Is this intended?
    Your tokenizer actually treats whitespace as a token. I'd suggest, unless your language is whitespace-sensitive (sorry if I missed that somewhere), simply skipping whitespace until the next printable character comes. Otherwise whitespace handling is transferred to the parser.
    You fetch characters one by one. I'd suggest first determining the range of a token and then copying the range from the stream. Otherwise when an identifier has a length of 12, you (re)allocate 12 strings. Or if Delphi is smart it just reallocates the same one over and over again.

    Replies
    1. Underscores: intentional.
      Whitespace: I agree. Just didn't cross my mind.
      Fetch one by one: I was just lazy.

      PRs are welcome :)

    2. Funny thing is, by using IsLetter, your parser might be more "accurate" than the Delphi one. The Delphi compiler accepts anything which is > 255 and fits into UCS2 as a valid letter (which is something I preferred to do in my parsers :P)

      Left-to-right override, hearts and other funny stuff in identifiers :D
