Monday, August 28, 2017

Writing a Simple DSL Compiler with Delphi [2. Abstract Syntax Tree]

This article provides a description of an abstract syntax tree used to represent "The Language". If you are new to this series, I would recommend to start reading with this post. 

Please note that this article describes an initial implementation of the AST. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.

An abstract syntax tree (AST) is, simply put, a symbolic representation of the program in a form of a tree.

While the textual representation of a program is great for us, humans, computers have hard time dealing with it. Because of that, a special part of any interpreter/compiler, called parser, reads the input and converts it into a computer-readable format - AST. This tree can then be used for multiple purposes. We can, for example, feed to into an interpreter which will then run the program for us, or we can feed it into a compiler to generate an executable program or cross-compiler to generate an equivalent program in a different programming language.
In reality, this process is typically even more complex. A parser usually uses a specialized input code called tokenizer to read the input and compiler usually doesn't create executable directly, but emits multiple files which are linked into a final program.

In the case of my toy compiler project, parser uses a separate tokenizer, while all output processes (such as compiler) generate the code without any additional steps (linking).



It is pretty obvious that the AST is central to the whole project and that's why I decided to present it before the tokenizer and the parser.

In my case (and let me remind you again that all the following description applies to branch dsl_v1), AST for a program starts with a very simple interface ISimpleDSLAST. (I'll be showing all types in a shortened forms without getters and setters.)

IASTFunctions = interface
  function  Add(const func: IASTFunction): integer;
  function  Count: integer;
  function  IndexOf(const name: string): integer;
  property Items[idxFunction: integer]: IASTFunction read GetItems; default;
end; { IASTFunctions }
ISimpleDSLAST = interface
   property Functions: IASTFunctions read GetFunctions;
end;

A program in "The Language" is nothing more than a collection of functions and that interface reflects that.

Each function has a name, a list of parameters, and a body.

TParameterList = TList<string>;
IASTFunction = interface ['{FA4F603A-FE89-40D4-8F96-5607E4EBE511}']
   property Name: string read GetName write SetName;
   property ParamNames: TParameterList read GetParamNames;
   property Body: IASTBlock read GetBody write SetBody;
end;

A function body is nothing more than a collection of statements.

TStatementList = TList;
IASTBlock = interface
   property Statements: TStatementList read GetStatements;
end;

A statement is either an if statement or a return statement. Nothing else.

TASTStatementType = (stIf, stReturn);
IASTStatement = interface
end; { IASTStatement }

IASTStatement is just a parent interface for all statement interfaces and is never instantiated as such.

A return statement has only one part - an expression which has to be evaluated to calculate the value which is then returned as a function result.

IASTReturnStatement = interface(IASTStatement)
   property Expression: IASTExpression read GetExpression write SetExpression;
end;

An if statement is more complicated. It has a condition (which is in itself an expression) followed by then and else blocks (and, as we already know, a block is just collection of statements and so on).

IASTIfStatement = interface(IASTStatement)
   property Condition: IASTExpression read GetCondition write SetCondition;
   property ThenBlock: IASTBlock read GetThenBlock write SetThenBlock;
   property ElseBlock: IASTBlock read GetElseBlock write SetElseBlock;
end;

An expression can be either a simple term or a binary operation of two terms. Only three operations are supported at the moment: addition, subtraction, and comparison. "The Language" has no unary operands so you cannot write a statement such as return -3; but you have to use more convoluted form return 0-3;

TBinaryOperation = (opNone, opAdd, opSubtract, opCompareLess);
IASTExpression = interface
   property Term1: IASTTerm read GetTerm1 write SetTerm1;
   property Term2: IASTTerm read GetTerm2 write SetTerm2;
   property BinaryOp: TBinaryOperation read GetBinaryOp write SetBinaryOp;
end;

A term can be either a constant, a variable, or a function call.

TASTTermType = (termConstant, termVariable, termFunctionCall);

IASTTerm = interface
end;

Similarly to IASTStatement, an IASTTerm is a parent interface which is never instantiated.

A constant just contains a constant numeric value, calculated by the parser.

IASTTermConstant = interface(IASTTerm)
   property Value: integer read GetValue write SetValue;
end; 

A variable is actually a misnomer. The Language doesn't support variables as such. The code can only refer to a function parameter. As there is no support for nested functions, each parameter can be represented by its index in the parameter list (calculated again by the parser).

IASTTermVariable = interface(IASTTerm)
   property VariableIdx: integer read GetVariableIdx write SetVariableIdx;
end;

And the last part of our AST, a function call contains an index (into the ISimpleDSLAST.Functions property) of the function we are calling and a list of parameters we are passing to the function. Each parameter in turn is an expression.

TExpressionList = TList;
IASTTermFunctionCall = interface(IASTTerm)
   property FunctionIdx: integer read GetFunctionIdx write SetFunctionIdx;
   property Parameters: TExpressionList read GetParameters;
end;

And that, as they say, is all she wrote. There really is nothing more to the SimpleDSLCompiler.AST unit as a bunch of interfaces and very trivial object implementations for them. This information is enough to present the semantics of the original program (the formatting itself is lost during parsing) and can be used as a source for compiler etc.

No comments:

Post a Comment