3.1 | ANTLR and code generation
ANTLR is a parser generator: from a grammar, it generates code (essentially a state machine) that builds a tree-like structure from input text, for example, from the source code of a programming language. To generate this code, we have to write a so-called ANTLR grammar. ANTLR grammars are written in their own language, and they consist of two parts: a lexer and a parser.
The lexer's input is the source code (essentially a string), and its output is a stream of tokens, the smallest building blocks of the programming language: keywords, identifiers, literals, etc. It uses regular expressions to find the tokens, and it can also skip parts of the code that are not important to the parser, for example, comments.
The parser's input is the lexer's output, the tokens, and its output is a tree-like structure whose leaves are the tokens. The parser describes a context-free grammar, and its goal is to group closely related tokens together. For example, an int type name, an i identifier, and a ; (all three of them tokens) make up a declaration statement, which can be a node of our tree. Then we can group statements together into a function, and so on, until at the end everything is grouped together under the root node. Once the tree is complete, we can inspect its nodes and collect data from them to build our own data structure.
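To make the lexer/parser split concrete, here is a minimal hand-rolled sketch in TypeScript. It is not the project's ANTLR-generated code; the token names, regular expressions, and the `declarationStatement` rule are simplified stand-ins chosen just to mirror the `int i;` example above.

```typescript
// Lexer: regular expressions turn the source string into tokens,
// skipping whitespace and comments so they never reach the parser.
interface Token { type: string; text: string; }

function tokenize(source: string): Token[] {
    // Rule order matters: KEYWORD must come before IDENTIFIER,
    // otherwise IDENTIFIER would also match 'int'.
    const rules: Array<[string, RegExp]> = [
        ['WS',         /^\s+/],               // skipped
        ['COMMENT',    /^\/\/[^\n]*/],        // skipped
        ['KEYWORD',    /^(int|float|void)\b/],
        ['IDENTIFIER', /^[A-Za-z_]\w*/],
        ['SEMICOLON',  /^;/],
    ];
    const tokens: Token[] = [];
    let rest = source;
    while (rest.length > 0) {
        let matched = false;
        for (const [type, regex] of rules) {
            const m = regex.exec(rest);
            if (m) {
                if (type !== 'WS' && type !== 'COMMENT') {
                    tokens.push({ type, text: m[0] });
                }
                rest = rest.slice(m[0].length);
                matched = true;
                break;
            }
        }
        if (!matched) throw new Error(`Unexpected input: ${rest}`);
    }
    return tokens;
}

// Parser: group closely related tokens under a tree node.
interface TreeNode { rule: string; children: (TreeNode | Token)[]; }

function parseDeclaration(tokens: Token[]): TreeNode {
    // Expects exactly: KEYWORD IDENTIFIER SEMICOLON
    const [type, name, semi] = tokens;
    if (type?.type !== 'KEYWORD' || name?.type !== 'IDENTIFIER' || semi?.type !== 'SEMICOLON') {
        throw new Error('Not a declaration statement');
    }
    return { rule: 'declarationStatement', children: [type, name, semi] };
}

const tree = parseDeclaration(tokenize('int i; // a comment'));
```

The comment in the input is consumed by the lexer and never reaches the parser, so `int i;` with or without a trailing comment produces the same three tokens and the same tree.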
The ANTLR grammar files, AntlrGlslLexer.g4 and AntlrGlslParser.g4, are located in the syntaxes folder. Whenever you change something in these files, you have to regenerate the code by running the compile-antlr-windows or the compile-antlr-linux script, depending on your operating system. The generated TypeScript code appears in the src/_generated folder, and other files necessary for the generation appear in the syntaxes/.antlr folder; you never have to touch these files manually.
In the lexer file, I capture the keywords, qualifiers, builtin types, literals, operators, and other characters, while skipping preprocessor lines, whitespace, and comments. Rules denoted by the fragment keyword are not complete tokens, only parts of tokens, which makes it easier to organize the rules. Keep in mind that the order of the rules is important.
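The following lexer fragment illustrates these three points (rule order, `fragment`, and skipping). It is an illustrative sketch, not an excerpt from AntlrGlslLexer.g4; the rule names are hypothetical.

```antlr
lexer grammar ExampleLexer;

// Order matters: if IDENTIFIER came first, it would also match 'int'.
KW_INT     : 'int';
IDENTIFIER : LETTER (LETTER | DIGIT)*;
INT_LITERAL: DIGIT+;

// Fragments are not complete tokens, only reusable building blocks.
fragment LETTER : [a-zA-Z_];
fragment DIGIT  : [0-9];

// Skipped input never reaches the parser.
WHITESPACE   : [ \t\r\n]+ -> skip;
LINE_COMMENT : '//' ~[\r\n]* -> skip;
```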
In the parser file, I group together tokens and parser rules to create GLSL language constructs like type declarations, while loops, or function calls. Uppercase words are lexer rules (tokens), while lowercase words are parser rules. Keep in mind that I could have created stricter parser rules to better follow the GLSL specification. For example, in the GLSL specification, there are only three qualifiers allowed for function return types: lowp, mediump, and highp. Yet, in the parser rules, I allow any number of all possible qualifiers. This is intentional, because our goal in the parser is not to validate the source code's correctness; that's the compiler's task. Our goal is to recognize that this is a function declaration, even if it's slightly incorrect with two qualifiers. You don't want to skip a language construct just because it's not perfect.
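A permissive rule of this kind might look as follows. This is an illustrative sketch, not the project's actual parser rules; the rule and token names are hypothetical.

```antlr
// qualifier* accepts any number of qualifiers, even though the GLSL
// specification is stricter. Validating the count is the compiler's
// job; the parser's job is just to recognize the construct.
functionPrototype
    : qualifier* typeName IDENTIFIER LROUND parameterList? RROUND
    ;

qualifier
    : KW_LOWP
    | KW_MEDIUMP
    | KW_HIGHP
    ;
```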
In the project's source code, DocumentInfo objects create the lexer and the parser. The tree traversal is done in the GlslVisitor class, using the visitor design pattern.
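The visitor pattern can be sketched as follows. This is a simplified stand-in, not the actual GlslVisitor or the ANTLR-generated base classes: the node types and the identifier-collecting visitor are hypothetical examples.

```typescript
// Each tree node accepts a visitor; the visitor has one method per
// node type, so traversal logic lives outside the tree classes.
interface Node { accept<T>(visitor: Visitor<T>): T; }

class DeclarationNode implements Node {
    constructor(public readonly typeName: string, public readonly name: string) {}
    accept<T>(visitor: Visitor<T>): T { return visitor.visitDeclaration(this); }
}

class FunctionNode implements Node {
    constructor(public readonly name: string, public readonly body: Node[]) {}
    accept<T>(visitor: Visitor<T>): T { return visitor.visitFunction(this); }
}

interface Visitor<T> {
    visitDeclaration(node: DeclarationNode): T;
    visitFunction(node: FunctionNode): T;
}

// Example visitor: collect every declared identifier from the tree.
class IdentifierCollector implements Visitor<string[]> {
    visitDeclaration(node: DeclarationNode): string[] {
        return [node.name];
    }
    visitFunction(node: FunctionNode): string[] {
        // Recurse into the children and flatten their results.
        return node.body.flatMap(child => child.accept(this));
    }
}

const root: Node = new FunctionNode('main', [
    new DeclarationNode('int', 'i'),
    new DeclarationNode('float', 'f'),
]);
const identifiers = root.accept(new IdentifierCollector());
// identifiers is ['i', 'f']
```

The advantage of this design is that adding a new traversal (for example, collecting data into your own data structure) means writing a new visitor class, without touching the tree node classes at all.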