

Introduction to Compilers

Warning: Creating a new computer language is very difficult, and usually a bad idea. However, in this article I will present the basic concepts for creating a new computer language. Learning this may help you better understand how existing languages work.

Formal grammar

In computer science a language is defined by a formal grammar. In formal language theory, a grammar is a set of production rules for strings that form a language. The rules describe how to form strings from the language alphabet that are valid according to the language syntax.

See also: Wikipedia, formal grammar

Grammar description

To explain a language grammar we can use a specific notation:

  1. We enclose identifier names in angle brackets <>;
  2. Identifiers start with a letter and contain letters or digits;
  3. We enclose optional elements in square brackets [];
  4. We represent alternatives (or) using the vertical bar "|";
  5. A repeating sequence of elements is represented using three dots: "…"
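To illustrate, here is how a rule for signed integers might be written in this notation (a hypothetical example, not part of the Level grammar):

```
<integer> = [<sign>] <digit> [<digit>…]
<sign>    = + | -
<digit>   = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
```

The square brackets mark the sign and the extra digits as optional, and the three dots say the digits may repeat.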

Mathematical Expressions

A mathematical expression is a finite combination of symbols that is well-formed according to rules that depend on the context. Mathematical symbols can designate numbers, elements, operators or tokens: {constants, variables, operations, functions, punctuation, grouping}, and some other aspects of logical syntax.

Well-formed (syntax)

A well-formed expression is a syntactic concept. A syntax is a set of rules, principles, and processes that govern the structure of a given language.

In a correct expression the operators must have the correct number of inputs in the correct places, and the characters that make up these inputs must be valid. Strings of symbols that violate the syntax rules are not well-formed and are not valid mathematical expressions.

For example, in the usual notation of arithmetic, the expression (2 + 3) is well-formed, while an arrangement such as (2 +) 3 is not.

Expression form (semantic)

Mathematical expressions include multiple forms: {arithmetic, polynomial, algebraic, closed-form, and analytical}. The form of an expression is a semantic concept. Semantics is the study of meaning. Formal semantics is about attaching meaning to expressions.

Expression evaluation

An expression may be used to designate a value, which might depend on variables occurring in the expression. The determination of this value depends on the semantics attached to the symbols of the expression.

Certain expressions do not designate any value, for instance when they involve division by 0; such expressions are said to have an undefined value, but they are well-formed expressions nonetheless.
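As a minimal sketch in Python (the language used for Level-PY), we can attach semantics to a small expression tree and see that a division by zero is still well-formed, only its value is undefined:

```python
# A tiny evaluator: the semantics attached to each symbol
# determines the value of the expression.
def evaluate(expr, env):
    """Evaluate a nested tuple like ('+', 'x', 2) using values from env."""
    if isinstance(expr, (int, float)):
        return expr
    if isinstance(expr, str):          # a variable: look up its value
        return env[expr]
    op, left, right = expr
    a, b = evaluate(left, env), evaluate(right, env)
    if op == '+':
        return a + b
    if op == '*':
        return a * b
    if op == '/':
        if b == 0:
            return None                # well-formed, but the value is undefined
        return a / b
    raise ValueError(f"unknown operator {op}")

print(evaluate(('+', 'x', 2), {'x': 3}))   # 5
print(evaluate(('/', 1, 0), {}))           # None: undefined value
```

The function names and tuple encoding here are my own illustration, not part of any standard library.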

Logical expressions

In general the meaning of expressions is not limited to designating values; for instance, an expression might designate a condition, or an equation that is to be solved, or it can be viewed as an object in its own right that can be manipulated according to certain rules.

How does a compiler work?

In order to build a compiler for Level, I started my research several weeks ago. So far I have understood some basic concepts and have started to implement the compiler components. Next I will present the compiler components created for project Level-PY, a Python compiler for Level-3.

Compiler Components

Most compilers use components to split the logic of compilation into steps. Every component has a job to perform and then passes control to the next component. Components work together and collaborate using messages: function calls or data structures.


Scanner

A scanner reads the source file one character at a time. It also keeps track of the line number and the character position within the line. A scanner usually reads forward one character at a time, but sometimes it can read backwards or look one character ahead.

For example, the line:

a = 1

will be broken by the scanner into characters:

a _ = _ 1 \n

(I have used "_" to show the spaces. The scanner returns all spaces and end-of-line characters. I have used \n for the new-line character.)

I have implemented the scanner using a closure: a function that returns a generator. The generator uses yield to return characters one at a time.

The Level scanner reads forward two characters: the current character and a look-ahead character. With this we can recognize digraph symbols like "--", "{*", ">=" and so on.
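A sketch of such a scanner, written as a closure that returns a generator with one character of look-ahead (my own illustration, not the actual Level-PY source):

```python
def make_scanner(text):
    """Closure that returns a generator yielding
    (line, column, character, lookahead) tuples."""
    def scan():
        line, col = 1, 0
        for i, ch in enumerate(text):
            col += 1
            look = text[i + 1] if i + 1 < len(text) else ''
            yield line, col, ch, look
            if ch == '\n':              # track position across lines
                line, col = line + 1, 0
    return scan()

# Usage: the look-ahead lets us spot digraphs like ">=".
for line, col, ch, look in make_scanner("a>=1\n"):
    if ch == '>' and look == '=':
        print("digraph >= at line", line, "column", col)
```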


Lexer

This component creates distinct words, symbols or groups of symbols named "tokens". Every token belongs to a token category. The lexer uses the scanner to get characters one by one and decides whether to add a new character to the current token or to finalize it and produce a new token.

For example, the line:

a = 1 + 2

will be split by the lexer into tokens:

a, =, 1, +, 2

The lexer determines the category of each token: keyword, symbol, identifier. Tokens are recognized because they are separated by spaces or symbols. Sometimes the lexer does not know what a symbol represents until the next character is received. For this reason the lexer can request the next character before deciding the type of the token.

Usually the lexer eliminates indentation, white space and comments. For some languages indentation is significant and must be analyzed by the parser. The Level lexer stores indentation information for every token:

  • Indent = -1 represents no indentation;
  • Indent = 0 is the beginning of a new line;
  • Indent > 0 is the number of spaces at the beginning of the line.

I have implemented the lexer using a closure as well: a function that returns a generator. The generator uses multiple yield statements to return tokens one at a time from a source file.

Sometimes a lexer uses multiple nested loops to create tokens. For Level I have used a single loop with decisions and continuation points inside. This makes the lexer program very short and easy to comprehend.
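A simplified single-loop lexer in the same closure-and-generator style (an illustrative sketch with made-up categories, not the real Level-PY lexer):

```python
def make_lexer(text):
    """Closure returning a generator that yields (category, token) pairs
    using a single loop with decision points inside."""
    def tokens():
        word = ''
        for ch in text + '\n':             # sentinel newline flushes the last token
            if ch.isalnum():
                word += ch                  # grow the current token
                continue
            if word:                        # a separator finalizes the token
                category = 'keyword' if word in {'if', 'else', 'while'} else \
                           'number' if word.isdigit() else 'identifier'
                yield category, word
                word = ''
            if not ch.isspace():
                yield 'symbol', ch
    return tokens()

print(list(make_lexer("if x1 >= 10")))
```

Note this toy version emits ">" and "=" as two separate symbols; recognizing ">=" as one token is exactly where the scanner's look-ahead character would be used.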


Parser

This component calls the lexer to get tokens one by one and uses them to verify the language syntax. For this the parser creates in-memory collections to organize the tokens into logical structures:

  • Symbol Table;
  • Abstract Syntax Tree.

A compiler can be created using a producer-consumer design pattern. The parser is the consumer, while the lexer is the producer. Every time a token is produced by the lexer, the parser consumes it and requests a new one.
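The producer-consumer relation can be sketched like this, with a generator standing in for the lexer and a hypothetical rule <identifier> = <number> (my own example, not the actual Level grammar):

```python
def parse_assignment(tokens):
    """Consumer: pull tokens one by one from the lexer (the producer)
    and check that they match the rule  <identifier> = <number> ."""
    cat, name = next(tokens)
    assert cat == 'identifier', f"expected identifier, got {name}"
    cat, sym = next(tokens)
    assert sym == '=', f"expected '=', got {sym}"
    cat, value = next(tokens)
    assert cat == 'number', f"expected number, got {value}"
    return ('assign', name, int(value))    # a tiny AST node

# Producer: an iterator standing in for the lexer's token stream.
token_stream = iter([('identifier', 'x'), ('symbol', '='), ('number', '42')])
print(parse_assignment(token_stream))      # ('assign', 'x', 42)
```

Each call to next() is one "request a new token" message from the parser to the lexer.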

Code Generator

This component uses the logical structures created by the parser to convert the program into another language. Some compilers create an intermediate language (byte-code), while others convert the code into machine-code instructions.
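As a toy illustration (my own sketch), a code generator can walk the tree built by the parser and emit text in another language:

```python
def generate(node):
    """Emit a Python expression string from a small AST
    built from nested tuples like ('+', ('*', 2, 3), 4)."""
    if isinstance(node, (int, float, str)):
        return str(node)                    # a leaf: number or variable name
    op, left, right = node
    return f"({generate(left)} {op} {generate(right)})"

source = generate(('+', ('*', 2, 3), 4))
print(source)                               # ((2 * 3) + 4)
```

A real code generator does the same walk, only the emitted text is byte-code or machine instructions instead of an expression string.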

Some compilers/interpreters are able to execute code in the parser. Before making a code generator I will try to execute the program directly from the AST and symbol tables in memory, without creating any intermediate code.

In the past, people used to be worried about computer memory. Compilers used to create intermediate files and use multiple passes to compress the code as much as possible. This is not one of my concerns: these days computer memory is large, and there should be space for most programs.

Main program

The parser must use the lexer, which in turn uses the scanner. Therefore the main program must call only the parser. The parser will have a "driver function" that prepares the lexer, and the lexer will have a "driver function" that prepares the scanner.
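The chain of driver functions might look like this (hypothetical names and a string instead of a file, to keep the sketch self-contained; the real Level-PY entry point may differ):

```python
def scan(source):
    """Scanner driver: yield characters one at a time."""
    yield from source

def lex(source):
    """Lexer driver: prepare the scanner, then group characters into tokens."""
    word = ''
    for ch in scan(source):
        if ch.isspace():
            if word:
                yield word
                word = ''
        else:
            word += ch
    if word:
        yield word

def parse(source):
    """Parser driver: prepare the lexer, then collect tokens
    (a stand-in for building the symbol table and the AST)."""
    return list(lex(source))

# The main program calls only the parser:
print(parse("let x = 1"))       # ['let', 'x', '=', '1']
```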

Symbols and categories

It is good design to organize symbols and tokens into categories. For every category you create a set of elements. Then you can check whether a token belongs to a particular category.
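In Python this maps naturally onto sets and membership tests (the category names and elements below are an illustrative sketch):

```python
# One set per category of symbols.
KEYWORDS  = {'if', 'else', 'while', 'def'}
OPERATORS = {'+', '-', '*', '/', '>=', '<='}

def category(token):
    """Return the category a token belongs to."""
    if token in KEYWORDS:
        return 'keyword'
    if token in OPERATORS:
        return 'operator'
    if token.isdigit():
        return 'number'
    return 'identifier'

print(category('if'), category('>='), category('42'), category('x'))
```

Set membership in Python is an O(1) hash lookup, so these checks stay cheap even with large categories.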

Research Projects

To learn more about compilers we have created two computer languages. On GitHub you can find many more open-source projects that can be cloned, or you can start your own compiler from scratch.

For these two languages we have created Wiki documentation. You can learn the syntax and the project structure. If you like the subject feel free to clone the repository and check the source code.