November 25, 2020
39 min read
With the syntax, we could
split the input and replace
const respectively but everyone can do that. Let's try something else.
Don't get too scared, it will be a very small and tiny one. For simplicity, our compiler will only support
Different compilers work in different ways but break down to the three primary stages:
Parsing: takes the raw code and turning it into an abstract representation known as an Abstract Syntax Tree (AST)
Transformation: takes the abstract representation and transforms and modifies it into another abstract representation of the target language.
Code Generation: takes the transformed abstract representation and generates the new code based on the given abstract representation.
Parsing also gets broken down into two stages.
Lexical Analysis (lexing/ tokenizing) and
Lexical Analysis takes the raw code and turn each character it into a
token with the lexer/tokenizer. The tokenizer returns an array of all the tokens for a given syntax.
tokenizer will return the array below.
Tokens are an array of tiny little objects that describe an isolated piece of the syntax.
Each token is an object with a
value property. The
type holds the type of the current character or set of characters being passed.
value property stores the value of the character being passed.
Syntactic Analysis then takes the tokens and transforms them with a parser function to an abstract representation of the tokens in relation to each other. Usually, we would have two ASTs where one is from our language and the other is for the target language, but for simplicity again, we will build a single AST modify the same one to produce a different AST.
The parser will return the object below.
The next stage for our compiler is transformation. Taking the AST and transforming it into a totally new AST for any programming language or just modifying the same one. We won't generate a new AST, we will just modify it.
On our AST, we have at each level an object with a
type property. These are known as AST Node. These nodes have defined properties on them that describe one isolated part of the tree.
Fortunately for us, we are doing only one thing with our AST, that is Variable Declaration. Let's see how we will modify our AST.
VariableDeclaration node, we have a
kind property that contains the current keyword being used. So we will
traverse the tree and
visit each node until have a Node with
VariableDeclaration and set the
kind property to what keyword we want.
Now that we have our new AST, we can now generate our code. Our new AST has everything we need. The keyword, the variable name and the value assigned to the variable. The name and value can be found in the
Now that's it. A general idea of compilers and how they work. Not all compilers work like this but most certainly do. That's backbone and skeleton of our compiler. If our compiler was a website, all the above will be the HTML.
Let's write some code. 😋
We won't use any external libraries, we will write everything from scratch😺. Also, you must have Node.js installed on your local system. Use any text editor or IDE of your choice.
Create a new directory and run
In general, we will have 5 main functions in our code
We will first declare a
tokenizer function with a parameter of
input, the inital code we are going to pass to our compiler as a string. Then initialize a
current for the current location in the input and
tokens will be an array that will hold the tokens for each individual
token. Then we will add a
whitespace character to the end.
After the initial declarations in the
tokenizer, we come to the main part. We will have a
while loop that will loop over all the characters in the
input and while there is a character available, we will check for the type of the character and add it to a
token and add the
token to the
We now have check in place for semicolons and whitespaces but there are four more to go. Our compiler supports
null. We will now check for the following types. Remmember we are dealing with single characters so we willl need to put some checks in place else we will be pushing single characters as
Still in the while loop
Now that we have numbers underway, the next on our list is
null values. If we used the same approach for the semicolon and add a token for every character, we could face the same problem where we won't the full token value so we will a different approach similiar to the number check.
Strings will be easy to tackle with first. Each string starts and ends with a
" so based on the same approach for numbers, we check if a character is a
", If it is, we will add every value that comes after the quote(
") until we meet another quote indicating the end of the string.
The last check and we are done with our
tokenizer. The check for letters.
null and the keywords,
define all have characters that will test true for letters so we will use the same approach as the numbers. If the current character is a letter, we will add it to a new variable and check of the next character is also a letter until we meet a non-letter character then we will return.
At this point, we have our
letters value but we cannot add it to the
tokens array yet. Each token must have a
type and a
value but for letters, they could be different. Our letters could be
false which will have a type of
boolean or the letters could be
define which could have a type of
keyword, so we need another check to check the letters and assign thier token the respective type.
At this point, we are done checking but if the character isn't recognized the our
while loop will be stuck so we need some error checking in place and finally return the
tokens from the tokenizer.
We are done with the
tokenizer. All the code at this point can be found here.
Now that the heavy lifting has been done for us in the
tokenizer, we move to the
parser takes the
tokens produced by the
tokenizer and modifies them into an AST. Out parser will have a
walk function. The
walk function will take the current
token and return the AST Node for that specific
If we had a
The AST Node will be:
The code for our
walk function will be a recursive function. We first get the current
token, check the
type of the
token and return an AST Node based on the
We have checks for
number token types. Let's focus on the remaining ones,
ident will always have a value of
as so we won't need a node for it. We will just skip it.
semi also indicates the end of the code so we will ignore it too. We will focus on the
We are done with the
walk function, but the function is just declared in the
parser, it's not being used by the
parser so we have to use it.
There you have it, the
parser in the flesh. You can use the test case for the
tokenizer above and pass the tokens to the parser and log the results for yourself. You can get all the code up to this point here
It's time for our
traverser will take the
ast from the
parser and a
visitor will have objects with names of the various AST Node types and each object will have an
enter method. While traversing the AST, when we get to a node with a matching visitor object, we call the
enter method on that object.
traverser will have two main methods,
traverseArray will call
traverseNode on each node in a node array.
traverseNode will take an node and it's parent node and call the visitor method on the node if there is one.
Now that we have the
traverseArray, we can proceed to the main
That's it for our
traverser. You can get all the code up to this point here.
Next is our
transformer which will take the AST and modify the AST and return it. Our
transformer will have a
visitor object and it will traverse the AST passed as an argument with the visitor and return the modified AST
Since we are only dealing with Variable Declaration's, our visitor will have only one object,
VariableDeclaration and will change the value of the
kind to the respective equivalent.
That's it for our
visitor. Although we could have done more, like things not related to variable declaration. We could have added a
NumberLiteral object to multiply every number by 2 or another method to make every string in a
visitor is where the mutations and the modifications take place.
We are done with the
visitor but not the whole
transformer. We need to use the
visitor we created with the
traverser to modify our AST and return the modified AST
We are done with the
transformer, you can get all the code up to this point here.
You can test your
transformer with an ast generated by the
parser and compare the difference.
We are done with two phases of our compiler, Parsing and Transformation. It's left with the last phase, Code Generation. We will only have one function for this phase,
generator will be recursively call itself at each node till we get a giant string of all the values. At each node, we will either return a call to another child node or return a value if the node has no children.
Finally, we are done with our
generator and all the three stages. You can get all the code up till this point here.
Congratulations if you really made it this far. There's only one thing left to do. We need to link all the functions we created and combine it into one single function. We'll name it as the
We can now test our baby