Friday, September 29, 2017

Writing a Simple DSL Compiler with Delphi [4. Parser]

This article describes the parser used in my toy language project. If you are new to this series, I recommend starting with this post.

Please note that this article describes an initial implementation of the parser. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.

After a short break I'm returning to my "toy compiler" series. This time I'll describe the workings of the parser - the part of the code which reads the input (usually in tokenized form) and generates an internal representation of the program (in my case an abstract syntax tree - AST).

The goal of my project was to study the compilation step, and the parser was just a necessary evil that I had to deal with. That's why it is written in a fairly primitive manner, without any refinements such as a Pratt parser.


My parser is exposed through a very simple interface. It receives the code to be parsed (as a string), a reference to the tokenizer that should be used for reading the input, and a reference to the root node of the resulting AST. The function returns False if parsing fails, in which case the caller can cast the parser interface to ISimpleDSLErrorInfo to get more information about the error.

type
  ISimpleDSLParser = interface ['{73F3CBB3-3DEF-4573-B079-7EFB00631560}']
    function Parse(const code: string; const tokenizer: ISimpleDSLTokenizer;
      const ast: ISimpleDSLAST): boolean;
  end;
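For example, a caller might use the parser like this. This is only a sketch: the factory functions and the exact members of ISimpleDSLErrorInfo are not shown in this article, so CreateSimpleDSLParser, CreateSimpleDSLTokenizer, CreateSimpleDSLAST, and the ErrorInfo function are assumptions.

```delphi
var
  parser   : ISimpleDSLParser;
  ast      : ISimpleDSLAST;
  errorInfo: ISimpleDSLErrorInfo;
begin
  parser := CreateSimpleDSLParser;   // assumed factory function
  ast := CreateSimpleDSLAST;         // assumed factory function
  if not parser.Parse(code, CreateSimpleDSLTokenizer, ast) then begin
    // on failure, query the parser interface for extended error information
    if Supports(parser, ISimpleDSLErrorInfo, errorInfo) then
      Writeln('Parse error: ', errorInfo.ErrorInfo)  // hypothetical method name
    else
      Writeln('Parse error');
  end;
end;
```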

 
As a program in my language consists only of functions, the implementation of Parse is very simple. It sets up a few fields so that any part of the parser can access the relevant information, initializes the look-ahead buffer (FLookaheadIdent), and then parses function after function until parsing fails or the tokenizer reports that it has reached the end of the code.

function TSimpleDSLParser.Parse(const code: string;
  const tokenizer: ISimpleDSLTokenizer; const ast: ISimpleDSLAST): boolean;
begin
  Result := false;
  FTokenizer := tokenizer;
  FAST := ast;
  FLookaheadIdent := #0;
 
  tokenizer.Initialize(code);
  while not tokenizer.IsAtEnd do
    if not ParseFunction then
      Exit;
 
  Result := true;
end;

I'm not going to show all of the parser code in this short article, just the ParseFunction method. It expects to find an identifier (function name), followed by a left parenthesis, followed by a potentially empty list of comma-delimited identifiers (function parameters), followed by a right parenthesis and lastly followed by a block (function statements).
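For illustration, a function matching that shape might look like this (the exact statement syntax inside the block is a topic for a later part of the series, so treat the body as an assumption):

```
mult(a, b) {
  return a * b
}
```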

ParseFunction first grabs an identifier (the function name). Then it creates an entry in the global function table, because other parts of the parser may need it if the function calls itself recursively. For the same reason it also stores an interface reference to the function's AST node into FContext.CurrentFunc.

Next it checks for the tkLeftParen token. If it is present, the parser enters a loop which looks for either an identifier (a parameter name) or a right parenthesis (end of the parameter list) and stores all parameter names into the AST (func.ParamNames.Add).

If everything is fine, it calls ParseBlock to parse the function body and stores the result into the AST (func.Body := block).

function TSimpleDSLParser.ParseFunction: boolean;
var
  block   : IASTBlock;
  expected: TTokenKinds;
  func    : IASTFunction;
  funcName: string;
  ident   : string;
  token   : TTokenKind;
begin
  Result := false;
 
  /// function = identifier "(" [ identifier { "," identifier } ] ")" block
 
  // function name
  if not FetchToken([tkIdent], funcName, token) then
    Exit(token = tkEOF);
 
  func := AST.CreateFunction;
  func.Name := funcName;
  
  // we might need this function in the global table for recursive calls
  AST.Functions.Add(func);
 
  FContext.CurrentFunc := func;
  try
 
    // "("
    if not FetchToken([tkLeftParen]) then
      Exit;
 
    // parameter list, including ")"
    expected := [tkIdent, tkRightParen];
    repeat
      if not FetchToken(expected, ident, token) then
        Exit;
      if token = tkRightParen then
        break //repeat
      else if token = tkIdent then begin
        func.ParamNames.Add(ident);
        expected := expected - [tkIdent] + [tkComma, tkRightParen];
      end
      else if token = tkComma then
        expected := expected + [tkIdent] - [tkComma, tkRightParen]
      else begin
        LastError := 'Internal error in ParseFunction';
        Exit;
      end;
    until false;
 
    // function body
    if not ParseBlock(block) then
      Exit;
 
    func.Body := block;
    Result := true;
  finally
    FContext.CurrentFunc := nil;
  end;
end;

Tokens are always fetched with the FetchToken function, which accepts a set of valid tokens, skips all whitespace (the language is designed so that all whitespace can be ignored) and returns the found token/identifier pair (or sets error information, including the code location, if a wrong token is encountered).

function TSimpleDSLParser.FetchToken(allowed: TTokenKinds; var ident: string;
  var token: TTokenKind): boolean;
var
  loc: TPoint;
begin
  Result := false;
  while GetToken(token, ident) do
    if token in allowed then
      Exit(true)
    else if token = tkWhitespace then
      // do nothing
    else begin
      loc := FTokenizer.CurrentLocation;
      LastError := Format('Invalid syntax in line %d, character %d',
        [loc.X, loc.Y]);
      Exit;
    end;
end;
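Note that ParseFunction also calls FetchToken with only the set of allowed tokens (for example FetchToken([tkLeftParen])), which implies a convenience overload that discards the returned token and identifier. A minimal sketch of such an overload (its exact form in the repository may differ):

```delphi
function TSimpleDSLParser.FetchToken(allowed: TTokenKinds): boolean;
var
  ident: string;
  token: TTokenKind;
begin
  // delegate to the full version and throw away the results
  Result := FetchToken(allowed, ident, token);
end;
```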

Similarly to the tokenizer, the parser uses a 'push back' approach. When it finds that it has read one token too many, it can push the token back by calling PushBack. The GetToken function first looks into this one-token buffer and calls tokenizer.GetToken only if the buffer is empty.

procedure TSimpleDSLParser.PushBack(token: TTokenKind; const ident: string);
begin
  Assert(FLookaheadIdent = #0, 'TSimpleDSLParser: Lookahead buffer is not empty');
  FLookaheadToken := token;
  FLookaheadIdent := ident;
end;

function TSimpleDSLParser.GetToken(var token: TTokenKind; var ident: string): boolean;
begin
  if FLookaheadIdent <> #0 then begin
    token := FLookaheadToken;
    ident := FLookaheadIdent;
    FLookaheadIdent := #0;
    Result := true;
  end
  else
    Result := FTokenizer.GetToken(token, ident);
end;
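Together, GetToken and PushBack give the parser a single token of lookahead. If some part of the parser only needed to peek at the next token without consuming it, it could combine the two like this (a hypothetical helper, not part of the original code):

```delphi
function TSimpleDSLParser.PeekToken(var token: TTokenKind; var ident: string): boolean;
begin
  // read the next token, then immediately return it to the lookahead buffer
  Result := GetToken(token, ident);
  if Result then
    PushBack(token, ident);
end;
```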


As you can see, a parser is really a very dull piece of code, following the "this is the expected token, what comes next" approach. Of course, life gets complicated when dealing with a more complex syntax, where the parser sometimes cannot uniquely decide which path to follow and must try several, but that is a topic for a different time - and a different blog.


