Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Splitting the generated C++ Code into multiple files to improve compile-time #2215

Merged
merged 39 commits into from
Jun 16, 2022

Conversation

julienhenry
Copy link
Member

@julienhenry julienhenry commented Mar 14, 2022

On large souffle projects, the compile time of the generated C++ can be very long. Every single small change in datalog requires recompiling the entire C++ code, which is very inefficient.
This pull request is an attempt to have a more flexible synthesiser, that allows to generate the C++ code in multiple files: the code generation decomposes the code in several classes:

  • one class per specialized datastructure, as it was already the case before
  • one class per subroutine
  • one class for the "main" class, i.e. the one that is calling the subroutines in the right order.

The goal is to have all these classes as decorrelated as possible. In particular, we don't want to recompile a subroutine 'i' if we only changed the datalog code of subroutine 'j'. Consequently, subroutine classes are constructed by passing them a reference to the relations and user-defined functors that they need to use. A datalog code change that does not modify a given subroutine will produce the exact same C++ file for that subroutine.
Also, moving the implementation of the specialized data-structures in their own cpp files will prevent from needing to compile them each time we include their header.

When generating the code in multiple files, the first compilation takes longer than with a single file, but once this is done, and when changing only a few rules/relations of the datalog code and recompile, the recompilation is much faster than the single-file version.
To give an idea on our Souffle code :

  • single-file compilation time: ~15min
  • multiple-file compilation time (first time) and -j8 : ~10min
  • multiple-file incremental compilation time (after small edits in .dl files) and -j8 : 1min30s

@julienhenry julienhenry changed the title Refactoring of the synthesizer WIP Refactoring of the synthesizer Mar 14, 2022
@julienhenry julienhenry force-pushed the modular-synthesiser branch 2 times, most recently from d4bf910 to c2a081b Compare March 16, 2022 18:58
@codecov
Copy link

codecov bot commented Mar 21, 2022

Codecov Report

Merging #2215 (1cb26e6) into master (e745510) will increase coverage by 0.33%.
The diff coverage is 94.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2215      +/-   ##
==========================================
+ Coverage   76.91%   77.25%   +0.33%     
==========================================
  Files         455      458       +3     
  Lines       28652    29153     +501     
==========================================
+ Hits        22038    22521     +483     
- Misses       6614     6632      +18     
Impacted Files Coverage Δ
src/ast/analysis/IOType.h 93.75% <ø> (ø)
src/ast/analysis/UniqueKeys.h 100.00% <ø> (ø)
src/ast2ram/ClauseTranslator.h 100.00% <ø> (ø)
src/ast2ram/provenance/UnitTranslator.h 100.00% <ø> (ø)
src/ast2ram/seminaive/ClauseTranslator.cpp 97.95% <ø> (ø)
src/ast2ram/seminaive/ClauseTranslator.h 100.00% <ø> (ø)
src/ast2ram/seminaive/UnitTranslator.h 100.00% <ø> (ø)
src/ast2ram/utility/TranslatorContext.h 100.00% <ø> (ø)
src/include/souffle/RamTypes.h 100.00% <ø> (ø)
src/include/souffle/datastructure/BTreeUtil.h 97.05% <ø> (ø)
... and 31 more

@b-scholz
Copy link
Member

b-scholz commented Mar 31, 2022

That is great work. I just wonder about the gains w.r.t. compile-speed for small changes in programs.

We have typedefs for data-structures; we have strata with relations as member variables; we have rules that use either relations from the own strata or relations from already computed strata.

If I understand your WIP correctly, the gain is to compile the strata whose rules have changed and not the whole program.
By doing so, the compiler has only a subset of data-structures, relations, and rules to compile.

Can we do another PR for the new functor interface? That is better in terms of SWENG.

@julienhenry
Copy link
Member Author

Yes, I will discard the changes related to the functors interface and do a dedicated PR later.
Indeed, this PR was intended to split the generated code into several pieces to help with "incremental" compilations where only a small amount of relations/rules have changed.
The typical scenarios while developping a soufflé analysis where we want to avoid a full recompilation of the entire datalog code are:
(1) When I create a new relation / remove a relation
(2) When I write a new rule / remove a rule / slightly change an existing rule
(3) When I add/remove a new user-defined functor

Isolating each stratum in its own c++/header file is a good first step to address the above 3 points, because it ensures that any stratum that is not affected by some datalog changes will produce the exact same c++ code.
Strata classes will have as member variables a reference to:

  • relations that are used
  • relations that are populated
  • user-defined functors used

To make sure the generated C++ code of an un-modified stratum does not change, I had to get rid of some non-determinism at some places in Soufflé in particular:

  • avoid the use of std::set<Relation*> whose order depends on memory addresses
  • avoid naming c++ identifiers with a counter : rel_1_A, rel_2_B, rel_3_C ,etc. , since adding/removing a relation will change the name of many other relations => rel_1_A, rel_2_AA, rel_3_B, rel_4_C ...

@b-scholz
Copy link
Member

b-scholz commented Apr 4, 2022

Is it possible to keep the old functionality (source code in a single file) and to have the new one (splitting the C++ for each stratum) as well? Or is this too much of an effort?

@julienhenry
Copy link
Member Author

Is it possible to keep the old functionality (source code in a single file) and to have the new one (splitting the C++ for each stratum) as well? Or is this too much of an effort?

Yes, the idea is to have two modes: a single-file output and a multiple-files output. The single-file output should remain the default, while the other could be for more advanced users ?

@julienhenry julienhenry marked this pull request as draft April 12, 2022 15:30
@b-scholz
Copy link
Member

Is it done?

@julienhenry
Copy link
Member Author

Is it done?

Sorry, not yet.... I still need to clean up a bit the code generation of the data-structures.

@julienhenry julienhenry marked this pull request as ready for review May 16, 2022 12:53
@julienhenry
Copy link
Member Author

julienhenry commented May 16, 2022

@b-scholz I believe this pull request is now in a reasonably good state and can be reviewed

@b-scholz
Copy link
Member

Great job - I will have a look!

@b-scholz b-scholz changed the title WIP Refactoring of the synthesizer Splitting the generated C++ Code into multiple files to improve compile-time May 17, 2022
@b-scholz
Copy link
Member

I look at the changes now. Great work. I like the refactoring.

Have you done performance testing? I also do not understand why the interpreter is affected by this change.

@julienhenry
Copy link
Member Author

julienhenry commented Jun 13, 2022

The interpreter is affected because I renamed subroutines/strata to avoid using their SCC id number in their name, as the SCC id number changes too often when we modify the datalog code.
I did not notice any performance changes in our code, but I haven't done a proper performance evaluation to see how that affects the performance of the generated code... Intuitively I don't see why that would change anything (the generated code of subroutines has not changed)

Copy link
Member

@b-scholz b-scholz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

@b-scholz b-scholz merged commit 429e168 into souffle-lang:master Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants