Compiler restructuring #1

Jackojc · 2022-01-13T16:22:13Z

The compiler is in an incomplete state but can compile to the textual stack-based IR at the moment.

I've reworked most of the compiler in order to make it more maintainable and less complex. The new compiler has a text based IR which allows for a modular architecture. This will allow the user to plug in their own passes easily and choose the ordering of existing passes etc. Currently, there is no codegen for x86-64 implemented but this is just a foundation for the compiler going forward. Plans are to implement another IR in three-address-code form for backends to consume which should ease register allocation and lowering to assembly.

- New modular architecture
- Simplified implementation
- Stack based IR with textual format
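As a rough illustration of what "plug in their own passes and choose the ordering" could look like, here is a minimal sketch. The `IR` alias, the `Pass` type, `run_passes` and `strip_blank` are all hypothetical names invented for this example and are not taken from the klaxon sources; the IR is modelled as plain lines of text only because the real IR is textual.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for the textual IR: one instruction per entry (an assumption).
using IR = std::vector<std::string>;

// A pass reads one IR and produces a new one.
using Pass = std::function<IR(const IR&)>;

// Run the passes in the order the caller chose.
IR run_passes(IR ir, const std::vector<Pass>& passes) {
    for (const Pass& pass : passes)
        ir = pass(ir);
    return ir;
}

// Example user-supplied pass: drop empty lines.
IR strip_blank(const IR& in) {
    IR out;
    for (const std::string& line : in)
        if (!line.empty())
            out.push_back(line);
    return out;
}
```

A caller would pick the order simply by building the vector, for example `run_passes(ir, {strip_blank, some_other_pass})`.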

Renamed the core words to avoid naming conflicts and to describe their intent better. Added a very basic stdlib file which gives access to some common arithmetic and stack manipulation words. Updated the syntax highlighting file in accordance with the above and also added a new highlighting file for the klaxon IR format (KIR). Renamed the compile.sh script to klx; it now also runs the m4 preprocessor on the source file before passing it to klaxon, to enable use of include and macros. Fixed an issue when parsing type annotations. Previously, annotations which had no out values would cause the compiler to generate an error that it expected an identifier, but returning no values is valid. Added a dedicated locale string for type annotation errors. Removed an extraneous space being printed in the KIR serialiser.

Print formatting previously would not escape sequences of closing braces correctly. For example: `printlnfmt("{}}}", "foo");` _should_ have produced `foo}` but instead produced `foo`.
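The intended behaviour can be sketched as follows. This is not klaxon's `printlnfmt`; `format_braces` is a hypothetical helper written for this example, and it assumes that `{}` is a placeholder and that a doubled brace (`{{` or `}}`) stands for one literal brace.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: expands "{}" placeholders left to right and
// treats "{{" and "}}" as escaped literal braces.
std::string format_braces(std::string_view fmt, const std::vector<std::string>& args) {
    std::string out;
    std::size_t next_arg = 0;

    for (std::size_t i = 0; i < fmt.size(); ++i) {
        const char c = fmt[i];
        const bool has_next = i + 1 < fmt.size();

        if (c == '{' && has_next && fmt[i + 1] == '}') {
            // Placeholder: substitute the next argument if one remains.
            if (next_arg < args.size())
                out += args[next_arg++];
            ++i;  // skip the '}'
        }
        else if ((c == '{' || c == '}') && has_next && fmt[i + 1] == c) {
            // Escaped brace: emit one and skip its twin.
            out += c;
            ++i;
        }
        else {
            out += c;
        }
    }

    return out;
}
```

With this sketch, `format_braces("{}}}", {"foo"})` consumes `{}` as the placeholder and then `}}` as a single literal brace.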

Removed the arg and out instructions in favour of just keeping the cp, mv and rm instructions around for the backend to work with. Loops had incorrect code generation.

Added a new optimisation pass to collapse indirect jumps to a block which then unconditionally jumps to another block. This is very useful for heavily nested if-else chains. Give up when trying to do constant folding beyond a function call. The problem with trying to fold beyond the bounds of a call is that the function call itself may produce values that are only known at runtime, in which case there is nothing for the constant folder to reduce at compile time. Removed any notion of arg and out instructions.
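The jump-collapsing pass mentioned above could look roughly like this. The `Block` struct, `resolve_target` and `collapse_jumps` are hypothetical and heavily simplified: a block is reduced to a flag saying "this block contains nothing but an unconditional jump" plus the list of block indices its terminator can jump to.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified block representation.
struct Block {
    bool jump_only;                    // block is just one unconditional jump
    std::vector<std::size_t> targets;  // where the terminator can go
};

// Follow a chain of jump-only blocks to its final destination.
// The hop limit guards against cycles of jump-only blocks.
std::size_t resolve_target(const std::vector<Block>& blocks, std::size_t target) {
    for (std::size_t hops = 0; hops < blocks.size(); ++hops) {
        const Block& b = blocks[target];
        if (!b.jump_only || b.targets.size() != 1)
            break;
        target = b.targets[0];
    }
    return target;
}

// Retarget every jump so that it skips over jump-only blocks.
// The input is left untouched and a rewritten copy is returned.
std::vector<Block> collapse_jumps(const std::vector<Block>& in) {
    std::vector<Block> out = in;
    for (Block& b : out)
        for (std::size_t& t : b.targets)
            t = resolve_target(in, t);
    return out;
}
```

After this rewrite the jump-only blocks are no longer referenced, so a later dead-block cleanup could remove them; that step is not shown here.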

…rator

…he IR

Unified Tokens, Ops and IR Tokens into a single enum class so we no longer need to do pesky mappings between them. We can just use the same enum value right through from the lexer to the IR generation. Merged the lexer implementations for the IR and source representations into the same class; a templated flag now picks the implementation we want, which removes a lot of code duplication. Added some constructor overloads for Op so that we can construct instructions that need both a string view and an integer field. Blocks, calls and definitions now have a stack effect annotation in the IR to simplify consumption by a backend. Added instruction_block and instruction_end functions to simplify annotating blocks with their stack effect during code generation. Block numbers are now function local and start from zero instead of being globally numbered like before. Use more consistent naming for library functions and types. Renamed EOF to TERMINATOR to avoid conflicting with the standard macro of the same name.
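To illustrate the "one lexer class, templated flag" idea, here is a small sketch. Everything in it is an assumption made for the example: the `Kind`, `Mode`, `Token` and `Lexer` names, the whitespace-delimited tokens, the crude "starts with a digit" integer rule, and in particular the `#` line comments that only the source mode skips. None of this is taken from the actual klaxon lexer.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// One token enum shared by every stage (illustrative members only).
enum class Kind { Identifier, Integer, Terminator };

enum class Mode { Source, IR };

struct Token {
    Kind kind;
    std::string_view text;
};

// A single lexer; M selects source- or IR-specific rules at compile time.
template <Mode M>
class Lexer {
    std::string_view src;
    std::size_t pos = 0;

    static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
    static bool is_digit(char c) { return std::isdigit(static_cast<unsigned char>(c)) != 0; }

    void skip_space() {
        while (pos < src.size() && is_space(src[pos])) ++pos;
    }

public:
    explicit Lexer(std::string_view s) : src(s) {}

    Token next() {
        skip_space();

        // Assumption for this sketch: only the source mode has '#' comments.
        if constexpr (M == Mode::Source) {
            while (pos < src.size() && src[pos] == '#') {
                while (pos < src.size() && src[pos] != '\n') ++pos;
                skip_space();
            }
        }

        if (pos >= src.size())
            return {Kind::Terminator, {}};

        // Tokens are whitespace delimited; a leading digit means Integer.
        const std::size_t begin = pos;
        const bool number = is_digit(src[pos]);
        while (pos < src.size() && !is_space(src[pos])) ++pos;

        return {number ? Kind::Integer : Kind::Identifier, src.substr(begin, pos - begin)};
    }
};
```

Because the mode is a template parameter, `Lexer<Mode::Source>` and `Lexer<Mode::IR>` share all common code while the mode-specific branch is resolved at compile time.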

Function inlining and dead code elimination were previously broken and the code was awkward to work with due to having to try and preserve consistency in the same buffer. Three issues have been addressed in this commit:

1. Function inlining now renumbers blocks correctly
2. Dead code elimination now only retains functions with "main" as an ancestor
3. Iterators to the IR are now stable due to the use of a double buffer approach

Function inlining was broken previously due to not renumbering blocks after they were inlined. This meant that multiple calls to the same function which had been inlined would result in duplicate blocks, which would break the control flow of the program. The new inliner also doesn't count block/def/end/ret instructions.

Dead code elimination was previously broken due to preserving functions which were called but not by a common ancestor ("main" in this case). All it would take to preserve a function was to call it _anywhere_ in the program even if the parent function of that call was itself dead.

We now use a double-buffering-like approach to optimisation passes. The original IR is passed in and is supposed to remain unchanged, while the output IR is mutated and becomes the input buffer for the next pass. This gives us some rather nice properties like stability of reference which makes inlining in particular very easy. Constant folding and indirect branch elimination have yet to be moved over to the new architecture but should be fairly easy.
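The "only keep functions with main as an ancestor" rule amounts to a reachability walk over the call graph. A minimal sketch, assuming the program has already been summarised as a map from function name to callee names (the `CallGraph` alias, `reachable_from` and `eliminate_dead` are hypothetical names, not the real optimiser's API):

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical call graph: function name -> names of the functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Collect every function reachable from `root` by following call edges.
std::set<std::string> reachable_from(const CallGraph& calls, const std::string& root) {
    std::set<std::string> live;
    std::vector<std::string> worklist{root};

    while (!worklist.empty()) {
        std::string fn = std::move(worklist.back());
        worklist.pop_back();

        if (!live.insert(fn).second)
            continue;  // already visited

        auto it = calls.find(fn);
        if (it == calls.end())
            continue;

        for (const std::string& callee : it->second)
            worklist.push_back(callee);
    }

    return live;
}

// Double-buffer style: `in` is never modified, survivors are copied to `out`.
CallGraph eliminate_dead(const CallGraph& in) {
    const std::set<std::string> live = reachable_from(in, "main");

    CallGraph out;
    for (const auto& entry : in)
        if (live.count(entry.first) != 0)
            out.insert(entry);

    return out;
}
```

A function that is only called from another dead function is never reached by the walk, so it is dropped along with its caller.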

Updated the CFG generator to work with relative block numbers by concatenating the function name to the block ID. Also added weights to the nodes to try and make the generated graphs look a bit nicer.
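Since block numbers are now local to each function, two functions can both have a block 0, so the graph needs node names that include the function. A minimal sketch that emits Graphviz DOT, assuming an underscore separator and a flat edge list (the `Edge` struct, `node_name` and `to_dot` are hypothetical, and node weights are left out):

```cpp
#include <string>
#include <vector>

// Hypothetical edge between two function-local blocks.
struct Edge {
    std::string function;
    int from;
    int to;
};

// Build a globally unique node name from the function name and local block ID.
std::string node_name(const std::string& function, int block) {
    return function + "_" + std::to_string(block);
}

// Emit a DOT digraph; edges leave the south port and enter the north port.
std::string to_dot(const std::vector<Edge>& edges) {
    std::string out = "digraph cfg {\n";
    for (const Edge& e : edges) {
        out += "  \"" + node_name(e.function, e.from) + "\":s -> \""
             + node_name(e.function, e.to) + "\":n;\n";
    }
    out += "}\n";
    return out;
}
```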

Switched to a simpler architecture for the lexer which allows us to specify token strings in a single place and have them work everywhere. This is in contrast to the previous lexer, which required updating multiple unrelated pieces of the code in order to change tokens. Switched from the shorthand names `cp`, `mv` and `rm` to `copy`, `move` and `remove`. Fixed an issue where the IR parser would only accept non-keyword identifiers for user defined functions. This meant functions named "block", for example, would result in a parsing error. "copy", "move" and "remove" are now considered proper keywords and as such cannot be used as identifiers. Removed any hard-coding of strings in calls to error functions; the appropriate string representation of a token is now looked up instead.
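One way to keep token strings in a single place is a small table that both the keyword lookup and the error messages read from. The sketch below is an assumption about how that could be done, not the actual lexer: `Kind`, `TokenInfo`, `keywords`, `classify` and `spelling` are names made up for the example.

```cpp
#include <array>
#include <string_view>

enum class Kind { Identifier, Copy, Move, Remove };

struct TokenInfo {
    Kind kind;
    std::string_view text;
};

// Single source of truth for keyword spellings.
constexpr std::array<TokenInfo, 3> keywords{{
    {Kind::Copy, "copy"},
    {Kind::Move, "move"},
    {Kind::Remove, "remove"},
}};

// Lexer side: map a word to its token kind.
constexpr Kind classify(std::string_view word) {
    for (const TokenInfo& k : keywords)
        if (k.text == word)
            return k.kind;
    return Kind::Identifier;
}

// Error-reporting side: map a token kind back to its spelling,
// so no strings are hard-coded at the call site.
constexpr std::string_view spelling(Kind kind) {
    for (const TokenInfo& k : keywords)
        if (k.kind == kind)
            return k.text;
    return "identifier";
}
```

Renaming a keyword then means editing one table entry, and both the lexer and the diagnostics pick up the change.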

Jackojc added 6 commits January 13, 2022 16:04

Update README

763de2e

Update grammar and add IR grammar

42c237a

Update syntax highlighting for kakoune

9c443c1

Remove old includes

48124f4

Update examples

9d8132d

Jackojc self-assigned this Jan 13, 2022

Jackojc added 23 commits January 16, 2022 02:41

Add CFG visualiser

8b231c9

Update KIR highlighter

e1ee9f3

Fix a bug in print formatting

1c6f083

Print formatting previously would not escape sequences of closing braces correctly. For example: `printlnfmt("{}}}", "foo");` _should_ have produced `foo}` but instead produced `foo`.

Add new ops to stdlib

5f6fc82

Remove arg & out instructions and fixed loop code gen

856464d

Removed the arg and out instructions in favour of just keeping the cp, mv and rm instructions around for the backend to work with. Loops had incorrect code generation.

Update klx script

c9acdb6

Fix codegen for loops which contain branches

5d1540b

Always emit edges from south of a node to north of a node in CFG generator

41980ee

Update grammar files to reflect recent changes

ee4ba78

Always inline functions called exactly once

878c39e

Update Makefile to automatically discover and build source files

d6aca50

Kakoune highlighter better number matching

e088654

Restructuring optimiser

eab1444

Update klx runner script

9f28215

Rename main.cpp to klx.cpp

58d2857

Update control flow graph generator to use relative blocks

d233401

Updated the CFG generator to work with relative block numbers by concatenating the function name to the block ID. Also added weights to the nodes to try and make the generated graphs look a bit nicer.

Update klaxon highlighter for kakoune

37a28dc

Rename primitives in stdlib.klx

2684e78

Better symbol names

bd65f1a
