Friday, September 29, 2017

Writing a Simple DSL Compiler with Delphi [4. Parser]

This article describes the parser used in my toy language project. If you are new to this series, I recommend starting with this post.

Please note that this article describes an initial implementation of the parser. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.

After a short break I'm returning to my "toy compiler" series. This time I'll describe the workings of the parser - the part of the code which reads the input (usually in tokenized form) and generates an internal representation of the program (in my case an abstract syntax tree - AST).

The goal of my project was to study the compilation step, and the parser was just a necessary evil that I had to deal with. That's why it is written in a fairly primitive manner, without any refinements such as a Pratt parser.


My parser is exposed through a very simple interface. It receives the code to be parsed (as a string), a reference to the tokenizer that should be used for reading the input, and a reference to the root node of the resulting AST. The function returns False if parsing fails, in which case the caller can cast the parser interface to ISimpleDSLErrorInfo to get more information about the error.

type
  ISimpleDSLParser = interface ['{73F3CBB3-3DEF-4573-B079-7EFB00631560}']
    function Parse(const code: string; const tokenizer: ISimpleDSLTokenizer;
      const ast: ISimpleDSLAST): boolean;
  end;
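For example, a caller might use the parser like this. This is only a sketch: the factory functions and the exact members of ISimpleDSLErrorInfo are not shown in this article, so CreateSimpleDSLParser, CreateSimpleDSLTokenizer, CreateSimpleDSLAST, and the ErrorInfo function are assumptions.

```delphi
var
  parser   : ISimpleDSLParser;
  ast      : ISimpleDSLAST;
  errorInfo: ISimpleDSLErrorInfo;
begin
  parser := CreateSimpleDSLParser;   // assumed factory function
  ast := CreateSimpleDSLAST;         // assumed factory function
  if not parser.Parse(code, CreateSimpleDSLTokenizer, ast) then begin
    // on failure, query the parser interface for extended error information
    if Supports(parser, ISimpleDSLErrorInfo, errorInfo) then
      Writeln('Parse error: ', errorInfo.ErrorInfo)  // hypothetical method name
    else
      Writeln('Parse error');
  end;
end;
```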

 
As a program in my language consists only of functions, the implementation of Parse is very simple. It sets up a few fields so that any part of the parser can access the relevant information, initializes the look-ahead buffer (FLookaheadIdent), and then parses function after function until parsing fails or the tokenizer reports that it has reached the end of the code.

function TSimpleDSLParser.Parse(const code: string;
  const tokenizer: ISimpleDSLTokenizer; const ast: ISimpleDSLAST): boolean;
begin
  Result := false;
  FTokenizer := tokenizer;
  FAST := ast;
  FLookaheadIdent := #0;
 
  tokenizer.Initialize(code);
  while not tokenizer.IsAtEnd do
    if not ParseFunction then
      Exit;
 
  Result := true;
end;

I'm not going to show all of the parser code in this short article, just the ParseFunction method. It expects to find an identifier (function name), followed by a left parenthesis, followed by a potentially empty list of comma-delimited identifiers (function parameters), followed by a right parenthesis and lastly followed by a block (function statements).
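For illustration, a function matching that shape might look like this (the exact statement syntax inside the block is a topic for a later part of the series, so treat the body as an assumption):

```
mult(a, b) {
  return a * b
}
```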

ParseFunction first grabs an identifier (the function name). Then it creates an entry in the global function table, because other parts of the parser may need it if the function calls itself recursively. For the same reason it also stores an interface reference to the function's AST node into FContext.CurrentFunc.

Next it checks for the tkLeftParen token. If it is present, the parser enters a loop which looks for either an identifier (a parameter name) or a right parenthesis (end of the parameter list) and stores all parameter names into the AST (func.ParamNames.Add).

If everything is fine, it calls ParseBlock to parse the function body and stores the result into the AST (func.Body := block).

function TSimpleDSLParser.ParseFunction: boolean;
var
  block   : IASTBlock;
  expected: TTokenKinds;
  func    : IASTFunction;
  funcName: string;
  ident   : string;
  token   : TTokenKind;
begin
  Result := false;
 
  /// function = identifier "(" [ identifier { "," identifier } ] ")" block
 
  // function name
  if not FetchToken([tkIdent], funcName, token) then
    Exit(token = tkEOF);
 
  func := AST.CreateFunction;
  func.Name := funcName;
  
  // we might need this function in the global table for recursive calls
  AST.Functions.Add(func);
 
  FContext.CurrentFunc := func;
  try
 
    // "("
    if not FetchToken([tkLeftParen]) then
      Exit;
 
    // parameter list, including ")"
    expected := [tkIdent, tkRightParen];
    repeat
      if not FetchToken(expected, ident, token) then
        Exit;
      if token = tkRightParen then
        break //repeat
      else if token = tkIdent then begin
        func.ParamNames.Add(ident);
        expected := expected - [tkIdent] + [tkComma, tkRightParen];
      end
      else if token = tkComma then
        expected := expected + [tkIdent] - [tkComma, tkRightParen]
      else begin
        LastError := 'Internal error in ParseFunction';
        Exit;
      end;
    until false;
 
    // function body
    if not ParseBlock(block) then
      Exit;
 
    func.Body := block;
    Result := true;
  finally
    FContext.CurrentFunc := nil;
  end;
end;

Tokens are always fetched with the FetchToken function, which accepts a set of valid tokens, skips all whitespace (the language is designed so that all whitespace can be ignored) and returns the found token/identifier pair (or sets error information, including the code location, if a wrong token is encountered).

function TSimpleDSLParser.FetchToken(allowed: TTokenKinds; var ident: string;
  var token: TTokenKind): boolean;
var
  loc: TPoint;
begin
  Result := false;
  while GetToken(token, ident) do
    if token in allowed then
      Exit(true)
    else if token = tkWhitespace then
      // do nothing
    else begin
      loc := FTokenizer.CurrentLocation;
      LastError := Format('Invalid syntax in line %d, character %d',
        [loc.X, loc.Y]);
      Exit;
    end;
end;
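Note that ParseFunction also calls FetchToken with only the set of allowed tokens (for example FetchToken([tkLeftParen])), which implies a convenience overload that discards the returned token and identifier. A minimal sketch of such an overload (its exact form in the repository may differ):

```delphi
function TSimpleDSLParser.FetchToken(allowed: TTokenKinds): boolean;
var
  ident: string;
  token: TTokenKind;
begin
  // delegate to the full version and throw away the results
  Result := FetchToken(allowed, ident, token);
end;
```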

Similarly to the tokenizer, the parser uses a 'push back' approach. When it finds that it has read one token too many, it can push the token back by calling PushBack. The GetToken function first looks into this one-token buffer and calls tokenizer.GetToken only if the buffer is empty.

procedure TSimpleDSLParser.PushBack(token: TTokenKind; const ident: string);
begin
  Assert(FLookaheadIdent = #0, 'TSimpleDSLParser: Lookahead buffer is not empty');
  FLookaheadToken := token;
  FLookaheadIdent := ident;
end;

function TSimpleDSLParser.GetToken(var token: TTokenKind; var ident: string): boolean;
begin
  if FLookaheadIdent <> #0 then begin
    token := FLookaheadToken;
    ident := FLookaheadIdent;
    FLookaheadIdent := #0;
    Result := true;
  end
  else
    Result := FTokenizer.GetToken(token, ident);
end;
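Together, GetToken and PushBack give the parser a single token of lookahead. If some part of the parser only needed to peek at the next token without consuming it, it could combine the two like this (a hypothetical helper, not part of the original code):

```delphi
function TSimpleDSLParser.PeekToken(var token: TTokenKind; var ident: string): boolean;
begin
  // read the next token, then immediately return it to the lookahead buffer
  Result := GetToken(token, ident);
  if Result then
    PushBack(token, ident);
end;
```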


As you can see, a parser is really a very dull piece of code, following the "this is the expected token, what comes next" approach. Of course, life gets complicated when dealing with a more complex syntax, where the parser sometimes cannot uniquely decide which path to follow and must try several, but that is a topic for a different time - and a different blog.


