Compiler restructuring #1

Open
wants to merge 30 commits into
base: main

@Jackojc Jackojc commented Jan 13, 2022

The compiler is in an incomplete state but can compile to the textual stack-based IR at the moment.

I've reworked most of the compiler in order to make it more
maintainable and less complex.

The new compiler has a text based IR which allows for a
modular architecture. This will allow the user to plug
in their own passes easily and choose the ordering of
existing passes etc.

Currently, there is no codegen for x86-64 implemented but
this is just a foundation for the compiler going forward.

Plans are to implement another IR in three-address-code form
for backends to consume which should ease register allocation
and lowering to assembly.

- New modular architecture
- Simplified implementation
- Stack based IR with textual format
@Jackojc Jackojc self-assigned this Jan 13, 2022
Renamed the core words to try and avoid too many
naming conflicts but also to describe their intent
better.

Added a very basic stdlib file which gives access to some
common arithmetic and stack manipulation words.

Updated the syntax highlighting file in accordance
with the above and also added a new highlighting file
for the klaxon IR format (KIR).

Renamed the compile.sh script to klx. It now runs the m4
preprocessor on the source file before passing it to
klaxon, which enables the use of includes and macros.

Fixed an issue when parsing type annotations.
Previously, annotations which had no out values would
cause the compiler to report that it expected an
identifier, even though returning no values is valid.

Added a dedicated locale string for type annotation errors.

Removed extraneous space being printed in the KIR serialiser.
Print formatting previously would not escape
sequences of closing braces correctly.

For example: `printlnfmt("{}}}", "foo");` _should_
have produced `foo}` but instead produced `foo`.
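
The escaping rule can be sketched as follows. This is a minimal illustration and not the actual `printlnfmt` implementation: `format_one` is a hypothetical helper that takes a single argument, and it assumes `{}` is the placeholder while `{{` and `}}` escape literal braces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical single-argument formatter (not the project's printlnfmt).
// "{}" is replaced by `arg`; "{{" and "}}" produce one literal brace each.
inline std::string format_one(std::string_view fmt, std::string_view arg) {
    std::string out;
    for (std::size_t i = 0; i < fmt.size(); ++i) {
        const char c = fmt[i];
        const bool has_next = i + 1 < fmt.size();

        if (c == '{' && has_next && fmt[i + 1] == '}') {
            out.append(arg.data(), arg.size());  // placeholder
            ++i;                                 // consume the '}'
        } else if ((c == '{' || c == '}') && has_next && fmt[i + 1] == c) {
            out += c;                            // escaped brace
            ++i;                                 // consume the second brace
        } else {
            out += c;                            // ordinary character
        }
    }
    return out;
}
```

With this rule, `format_one("{}}}", "foo")` consumes `{}` as the placeholder and then `}}` as an escaped brace, giving `foo}`.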
Removed the arg and out instructions in favour of
just keeping the cp, mv and rm instructions around
for the backend to work with.

Loops had incorrect code generation.
Added a new optimisation pass to collapse indirect
jumps to a block which then unconditionally jumps
to another block.

This is very useful for heavily nested if-else chains.
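
A rough sketch of the idea, assuming a hypothetical `Forward` map from each block that contains nothing but an unconditional jump to the block it jumps to. The real pass works on the IR directly, so the names here are illustrative only.

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>

// Hypothetical representation: maps a block ID to the block it
// unconditionally jumps to, for blocks that contain only that jump.
using Forward = std::unordered_map<int, int>;

// Follow the chain of trivial blocks starting at `target` and return the
// final destination. The `seen` set stops us from looping on a cycle.
inline int resolve_target(const Forward& forward, int target) {
    std::unordered_set<int> seen;
    while (seen.insert(target).second) {
        auto it = forward.find(target);
        if (it == forward.end()) break;  // not a trivial block: stop here
        target = it->second;
    }
    return target;
}
```

Every jump in the function can then be rewritten to `resolve_target(forward, original_target)`, after which the trivial blocks are unreferenced and can be dropped.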

Give up when trying to do constant folding beyond
a function call. The problem with trying to fold
beyond the bounds of a call is that the function
call itself may produce some values that are only
known at runtime, in which case there is nothing
for the constant folder to reduce at compile time.
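
As an illustration, here is a small folder over a made-up three-instruction stack IR (`Push`, `Add`, `Call`). The instruction set and the `known` counter are assumptions for the sketch and not the actual KIR. The point is only that a call resets what the folder knows about the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature IR, for illustration only.
enum class Kind { Push, Add, Call };

struct Instr {
    Kind kind;
    long value = 0;  // literal for Push, unused otherwise
};

inline std::vector<Instr> fold_constants(const std::vector<Instr>& in) {
    std::vector<Instr> out;
    std::size_t known = 0;  // trailing Push instructions in `out` with known values

    for (const Instr& i : in) {
        switch (i.kind) {
            case Kind::Push:
                out.push_back(i);
                ++known;
                break;
            case Kind::Add:
                if (known >= 2) {
                    // Both operands are compile-time constants: fold them.
                    const long rhs = out.back().value; out.pop_back();
                    const long lhs = out.back().value; out.pop_back();
                    out.push_back({Kind::Push, lhs + rhs});
                    --known;
                } else {
                    out.push_back(i);
                    known = 0;  // the result is not known at compile time
                }
                break;
            case Kind::Call:
                // The callee may leave arbitrary values on the stack,
                // so give up on anything we thought we knew.
                out.push_back(i);
                known = 0;
                break;
        }
    }
    return out;
}
```

For `push 1, push 2, add, call, push 3, add` this folds the first addition to `push 3` but leaves the second `add` alone, since one of its operands comes from the call.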

Removed any notion of arg and out instructions.
…he IR

Unified Tokens, Ops and IR Tokens into a single enum class
so we no longer need to do pesky mappings between them. We
can just use the same enum value right through from the lexer
to the IR generation.

Merged the lexer implementation for the IR and source
representations into the same class and now just use
a templated flag to pick the implementation we want
which reduces a lot of code duplication.

Added some constructor overloads for Op so that
we can construct instructions that need both a
string view and integer field.

Blocks, calls and definitions now have a stack effect
annotation in the IR for simplifying consumption by
a backend.

Added instruction_block and instruction_end functions to
simplify annotating blocks with their stack effect during
code generation.

Block numbers are now function local and start from zero
instead of being globally numbered like before.

Use more consistent naming for library functions and types.

Renamed EOF to TERMINATOR to avoid conflicting with
the standard macro of the same name.
Function inlining and dead code elimination were previously
broken and the code was awkward to work with due to having
to try and preserve consistency in the same buffer.

Three issues have been addressed in this commit:
1. Function inlining now renumbers blocks correctly
2. Dead code elimination now only retains functions with "main" as an ancestor
3. Iterators to the IR are now stable due to the use of a double buffer approach

Function inlining was broken previously due to not renumbering blocks
after they were inlined. This would mean that multiple calls to the
same function which had been inlined would result in duplicate
blocks which would break the control flow of the program.
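
The renumbering can be sketched like this. `Ins`, `Op` and `inline_body` are hypothetical stand-ins for the real IR types, and the sketch assumes block IDs are function local and start from zero.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for the real IR instruction type.
enum class Op { Block, Jump, Other };

struct Ins {
    Op op;
    int block = 0;  // block ID for Block/Jump, unused otherwise
};

// Append a copy of `callee` to `out`, shifting every block ID by
// `next_block` so that repeated inlining never produces duplicate IDs.
// Returns the first block ID that is still unused afterwards.
inline int inline_body(const std::vector<Ins>& callee,
                       std::vector<Ins>& out,
                       int next_block) {
    int highest = -1;
    for (Ins i : callee) {
        if (i.op == Op::Block || i.op == Op::Jump) {
            if (i.block > highest) highest = i.block;
            i.block += next_block;
        }
        out.push_back(i);
    }
    return next_block + highest + 1;
}
```

Feeding the returned value back in as `next_block` for the next call site keeps the inlined copies disjoint.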

The new inliner also doesn't count block/def/end/ret instructions.

Dead code elimination was previously broken due to preserving
functions which were called but not by a common ancestor
("main" in this case). All it would take to preserve a function
was to call it _anywhere_ in the program even if the parent function
of that call was itself dead.
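
The fixed behaviour amounts to a reachability walk over the call graph starting at "main". A minimal sketch, assuming a hypothetical `CallGraph` map from each function to the functions it calls:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph: function name -> names of functions it calls.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Return every function reachable from `root`. Anything not in the
// result is dead, even if some other dead function calls it.
inline std::set<std::string> live_functions(const CallGraph& calls,
                                            const std::string& root = "main") {
    std::set<std::string> live;
    std::vector<std::string> work{root};

    while (!work.empty()) {
        const std::string fn = work.back();
        work.pop_back();
        if (!live.insert(fn).second) continue;  // already visited

        auto it = calls.find(fn);
        if (it == calls.end()) continue;        // leaf or unknown function
        for (const std::string& callee : it->second) work.push_back(callee);
    }
    return live;
}
```

A function that is only called from a dead parent never enters the worklist, so it is no longer preserved.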

We now use a double-buffering approach to optimisation passes.
Each pass receives the input IR, which remains unchanged, and
writes to an output IR, which becomes the input buffer for the
next pass. This gives us some rather nice properties like
stability of references, which makes inlining in particular
very easy.
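
In outline, the pass driver looks something like the following. `IR` is a stand-in type and `run_passes` is a hypothetical name. The important part is that each pass only reads `in` and only writes `out`.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Stand-in for the real IR container.
using IR = std::vector<int>;

// A pass reads the immutable input buffer and fills the output buffer.
using Pass = std::function<void(const IR& in, IR& out)>;

inline IR run_passes(IR ir, const std::vector<Pass>& passes) {
    IR next;
    for (const Pass& pass : passes) {
        next.clear();
        pass(ir, next);       // iterators into `ir` stay valid throughout
        std::swap(ir, next);  // the output becomes the next pass's input
    }
    return ir;
}
```

Because a pass never mutates the buffer it is iterating over, it can emit any number of instructions per input instruction without invalidating its own iterators.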

Constant folding and indirect branch elimination have yet to be
moved over to the new architecture but should be fairly easy.
Updated the CFG generator to work with relative block numbers
by concatenating the function name to the block ID.

Also added weights to the nodes to try and make the generated
graphs look a bit nicer.
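
A small sketch of the naming scheme, assuming Graphviz DOT output. The `_` separator and the helper names are assumptions, since only the concatenation itself is described above.

```cpp
#include <cassert>
#include <string>

// Block IDs are function local, so qualify them with the function name
// to get a node name that is unique across the whole graph.
inline std::string node_name(const std::string& fn, int block) {
    return fn + "_" + std::to_string(block);
}

// Emit one DOT edge between two blocks of the same function.
inline std::string dot_edge(const std::string& fn, int from, int to) {
    return "  \"" + node_name(fn, from) + "\" -> \"" +
           node_name(fn, to) + "\";\n";
}
```

Without the prefix, block 0 of every function would collapse into a single node in the generated graph.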
Switched to a simpler architecture for the lexer which allows
us to specify token strings in a single place and have it work
everywhere. This is in contrast to the previous lexer which
required updating multiple unrelated pieces of the code in
order to change tokens.
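
One common way to get a single source of truth for token strings is an X-macro table. The sketch below shows that technique and is not necessarily how the new lexer does it. `KLX_TOKENS`, the token list and the helper names are all hypothetical.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical token table: each token is listed exactly once.
#define KLX_TOKENS(X) \
    X(BLOCK, "block") \
    X(DEF, "def") \
    X(END, "end") \
    X(RET, "ret")

// The enum is generated from the table.
enum class Token {
    #define X(name, str) name,
    KLX_TOKENS(X)
    #undef X
};

// Token -> string, e.g. for error messages.
constexpr std::string_view to_string(Token t) {
    switch (t) {
        #define X(name, str) case Token::name: return str;
        KLX_TOKENS(X)
        #undef X
    }
    return "<unknown>";
}

// String -> token, e.g. for keyword recognition in the lexer.
inline bool from_string(std::string_view s, Token& out) {
    #define X(name, str) if (s == str) { out = Token::name; return true; }
    KLX_TOKENS(X)
    #undef X
    return false;
}
```

Adding or renaming a token then means touching one line of the table, and both lookups pick up the change.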

Switch from shorthand names `cp`, `mv` and `rm` to
`copy`, `move` and `remove`.

Fixed issue where the IR parser would only accept non-keyword
identifiers for user defined functions. This meant functions
named "block" for example would result in a parsing error.

"copy", "move" and "remove" are now considered proper keywords
and as such cannot be used as identifiers.

Removed any hard-coding of strings in calls to error functions;
we now look up the appropriate string representation of
a token instead.