Please note that this article describes an initial implementation of the tokenizer. If you want to browse the code while reading the article, make sure that you have switched to branch dsl_v1.
With this article I'm moving into the gritty part of the project - the code which reads the source code and turns it into a beautiful abstract syntax tree. In other words, I'll be talking about the parser.
I must admit that I spent as little time on the parser as I could. After all, my main goal was to convert the AST into executable code, not to parse input. Still, one cannot write a compiler without writing a parser, so ... here I am.
If you want to do anything with any data, you have to a) know in what format it is written and b) write some code that reads the input, decodes the format, and creates internal memory structures filled with data. As the first task of any good programmer is to create a library that will do the hard work instead of her/him ;) it was not long before such tools also appeared in the area of compiler writing.
Programmers also understood very soon that parsing is actually a two-part process. In the first step you want to split the data into lexical tokens. Each token represents the smallest part of the input that has a specific meaning. For example, if we split the Pascal statement a := func(i+42); into separate tokens, we get: identifier:a, whitespace, becomes, whitespace, identifier:func, left-paren, identifier:i, plus, number:42, right-paren, semicolon. Some characters are mapped to their own tokens (for example, ";" becomes a semicolon) and some are grouped into appropriate units ("42" becomes a number with value 42).
The second part - a parser - uses the sequence of tokens produced by the tokenizer and tries to understand its semantics - that is, how the tokens fit together according to a formal language specification. That will be the topic of the next article.
Because of this separation, the supporting tools are also split into two areas - one supports the creation of tokenizers and the other of parsers. In the Unix world, for example, we have Lex for lexical analysis, a.k.a. tokenization, and Yacc (Yet Another Compiler Compiler) for doing semantic analysis. In the Pascal world we also have our share of compiler-generating tools. There's a pretty decent port of these tools to Delphi, Plex+Pyacc for FreePascal, the quite old and unsupported (but free) Delphi Compiler Generator, and probably some other tools which I failed to find with my pretty superficial search.
Let's get back to the topic ...
In my case, the language is extremely simple and so is the tokenizer. Instead of using some specialized tool I just went ahead and coded the whole thing. As you'll see, it is pretty simple.
Tokens
Let's take a look at a demo program from part 1:

fib(i) {
  if i < 3 {
    return 1
  } else {
    return fib(i-2) + fib(i-1)
  }
}
From this example we can guess almost all tokens used in The Language: identifier ("fib", "i", "if", "return" ...), number ("1", "2", "3"), whitespace, left parenthesis ("("), right parenthesis (")"), left curly bracket ("{"), right curly bracket ("}"), less than ("<"), plus ("+") and minus ("-").
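To make this more concrete, the first line of the example, fib(i) {, would tokenize into: identifier:fib, left-paren, identifier:i, right-paren, whitespace, left-curly.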
There are only two tokens this example fails to cover - a comma (",") and a semicolon (";"). The former separates parameters in function definitions and function calls; the latter separates statements, but doesn't appear in the sample code because it is optional immediately before a right curly bracket.
The following type from unit SimpleDSL.Compiler.Tokenizer enumerates all possibilities. In addition to the already mentioned token types, tkUnknown is used to represent unexpected input and tkEOF is used to signal that the end of the input stream was reached.
type
  TTokenKind = (tkUnknown, tkWhitespace,
                tkIdent, tkNumber,
                tkLeftParen, tkRightParen, tkLeftCurly, tkRightCurly,
                tkLessThan, tkPlus, tkMinus,
                tkComma, tkSemicolon,
                tkEOF);
Interface
The tokenizer is accessed through a very simple interface:
ISimpleDSLTokenizer = interface
  function  CurrentLocation: TPoint;
  function  GetToken(var kind: TTokenKind; var identifier: string): boolean;
  procedure Initialize(const code: string);
  function  IsAtEnd: boolean;
end;
The most important function here is GetToken. It returns the next token from the source stream - kind contains the token kind and identifier contains the sequence of characters representing the token. The function returns False when the end of the stream is reached.
The other functions are all just simple helpers. IsAtEnd returns True when the end of the stream is reached, CurrentLocation returns the current line and column (which is useful for reporting errors), and Initialize initializes the tokenizer.
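To see how the interface is meant to be used, here is a minimal sketch of a console program that dumps the token stream. Note that the CreateSimpleDSLTokenizer factory function is my assumption - check the SimpleDSL.Compiler.Tokenizer unit for the actual way to construct a tokenizer instance:

program TokenizeDemo;

{$APPTYPE CONSOLE}

uses
  SimpleDSL.Compiler.Tokenizer;

var
  tokenizer: ISimpleDSLTokenizer;
  kind     : TTokenKind;
  ident    : string;

begin
  tokenizer := CreateSimpleDSLTokenizer; // assumed factory; see the unit
  tokenizer.Initialize('fib(i) { return fib(i-2) + fib(i-1) }');
  // GetToken returns False (and sets kind to tkEOF) once the input is exhausted.
  while tokenizer.GetToken(kind, ident) do
    Writeln(Ord(kind), ' ', ident);
end.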
Initialize itself is very simple:

procedure TSimpleDSLTokenizer.Initialize(const code: string);
begin
  FProgram.Text := code;
  FNextLine := 0;
  FNextChar := 1;
  FLookahead := #0;
  FLastLine := FProgram.Count - 1;
  if FLastLine >= 0 then begin
    FLastLineLen := Length(FProgram[FLastLine]);
    FCurrentLine := FProgram[FNextLine];
  end;
end;
The program is internally stored in a string list, FProgram. The other fields track the current location and are mostly used in the GetChar method.
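For reference, the private state used by these methods can be sketched as follows. This declaration is inferred from the code in this article and is not the verbatim source - the exact types and layout in the unit may differ:

type
  TSimpleDSLTokenizer = class(TInterfacedObject, ISimpleDSLTokenizer)
  strict private
    FProgram     : TStringList; // source code, split into lines
    FCurrentLine : string;      // cached copy of FProgram[FNextLine]
    FNextLine    : integer;     // 0-based index of the line being read
    FNextChar    : integer;     // 1-based index of the next character to read
    FLookahead   : char;        // one-character push-back buffer (#0 = empty)
    FLastLine    : integer;     // index of the last line in FProgram
    FLastLineLen : integer;     // length of that last line
  // ... plus the ISimpleDSLTokenizer methods and private helpers
  end;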
Reading input
Let's take a look at the GetToken method.
function TSimpleDSLTokenizer.GetToken(var kind: TTokenKind; var identifier: string): boolean;
var
  ch: char;
begin
  identifier := '';
  Result := GetChar(ch);
  if not Result then begin
    kind := tkEOF;
    Exit;
  end;
  case ch of
    '(': kind := tkLeftParen;
    ')': kind := tkRightParen;
    '{': kind := tkLeftCurly;
    '}': kind := tkRightCurly;
    '+': kind := tkPlus;
    '-': kind := tkMinus;
    '<': kind := tkLessThan;
    ',': kind := tkComma;
    ';': kind := tkSemicolon;
    else if ch.IsLetter then begin
      kind := tkIdent;
      identifier := ch + GetIdent;
    end
    else if CharInSet(ch, ['0'..'9']) then begin
      kind := tkNumber;
      identifier := ch + GetNumber;
    end
    else if ch.IsWhiteSpace then begin
      kind := tkWhitespace;
      SkipWhitespace;
    end
    else
      kind := tkUnknown;
  end;
end;
First it reads the next character from the stream (via GetChar) and exits if there is none. Then it handles all one-character tokens internally. Finally, it delegates the specific jobs of reading identifiers, numbers, and whitespace to GetIdent, GetNumber, and SkipWhitespace, respectively.
GetIdent and GetNumber are very similar, so I'll only focus on one of them.
function TSimpleDSLTokenizer.GetIdent: string;
var
  ch: char;
begin
  Result := '';
  while GetChar(ch) do begin
    if ch.IsLetter or ch.IsNumber or (ch = '_') then
      Result := Result + ch
    else begin
      PushBack(ch);
      Exit;
    end;
  end;
end;
As GetToken has already read the first character of the identifier, GetIdent collects all following characters that are either a letter, a number, or an underscore. (And as you can see, the code is using character class helpers - IsLetter and IsNumber - so identifiers are actually Unicode-aware.)
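For completeness, here's how GetNumber presumably looks - the same loop with a digit test instead of the identifier test (a sketch, not the verbatim source):

function TSimpleDSLTokenizer.GetNumber: string;
var
  ch: char;
begin
  Result := '';
  while GetChar(ch) do begin
    // collect digits; return the first non-digit to the lookahead buffer
    if CharInSet(ch, ['0'..'9']) then
      Result := Result + ch
    else begin
      PushBack(ch);
      Exit;
    end;
  end;
end;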
When a non-identifier character appears, the simplest solution is just to say "Oooops, I've read too far, please push this last read character back to the processing."
One-character buffer
As we only ever read one character too far, this "please undo the last GetChar" operation is handled with a simple one-character buffer used in PushBack and GetChar.
procedure TSimpleDSLTokenizer.PushBack(ch: char);
begin
  Assert(FLookahead = #0, 'TSimpleDSLTokenizer: Lookahead buffer is not empty');
  FLookahead := ch;
end;

function TSimpleDSLTokenizer.GetChar(var ch: char): boolean;
begin
  if FLookahead <> #0 then begin
    ch := FLookahead;
    FLookahead := #0;
    Result := true;
  end
  else begin
    Result := not IsAtEnd;
    if Result then begin
      ch := FCurrentLine[FNextChar];
      Inc(FNextChar);
      if FNextChar > Length(FCurrentLine) then begin
        Inc(FNextLine);
        if FNextLine < FProgram.Count then
          FCurrentLine := FProgram[FNextLine];
        FNextChar := 1;
      end;
    end;
  end;
end;
PushBack just stores the character in the FLookahead buffer (after asserting that the buffer is empty - it can only be non-empty if there's a bug in the tokenizer). GetChar has a bit more work to do, as in addition to checking the FLookahead buffer it must also handle the end-of-line condition, moving to the next line of the program when the current one is exhausted.
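GetChar relies on IsAtEnd to detect the end of the input. Its body isn't shown in this article; given the fields set up in Initialize, a plausible sketch would be:

function TSimpleDSLTokenizer.IsAtEnd: boolean;
begin
  // Either all lines were consumed, or we are positioned past
  // the last character of the last line.
  Result := (FNextLine > FLastLine)
            or ((FNextLine = FLastLine) and (FNextChar > FLastLineLen));
end;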
This 'push back' approach is also used in GetNumber and SkipWhitespace (for details see the code) and in the parser, as we'll see very soon.
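SkipWhitespace, for example, would read characters until a non-whitespace character appears and then return it to the buffer - a sketch along the lines of the actual implementation:

procedure TSimpleDSLTokenizer.SkipWhitespace;
var
  ch: char;
begin
  while GetChar(ch) do
    if not ch.IsWhiteSpace then begin
      PushBack(ch); // read one character too far - push it back
      Exit;
    end;
end;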
Comments
Nice article. Just some notes from my end:
Your identifiers cannot start with an underscore. Is this intended?
Your tokenizer actually treats whitespace as a token. I'd suggest, unless your language is whitespace-sensitive (sorry if I missed that somewhere), to simply skip whitespace until the next printable character comes. Otherwise whitespace handling is transferred to the parser.
You fetch characters one by one. I'd suggest to first determine the range of a token and then copy the range from the stream. Otherwise, when an identifier has a length of 12, you (re)allocate 12 strings. Or, if Delphi is smart, it just reallocates the same one over and over again.
Underscores: intentional.
Whitespace: I agree. Just didn't cross my mind.
Fetch one by one: I was just lazy.
PRs are welcome :)
Funny thing is, by using IsLetter, your parser might be more "accurate" than the Delphi one. The Delphi compiler accepts anything which is > 255 and fits into UCS2 as a valid letter (which is something I preferred to do in my parsers :P)
Left/Right-Override, hearts and other funny stuff in identifiers :D