Splitting the generated C++ Code into multiple files to improve compile-time #2215

julienhenry · 2022-03-14T19:31:55Z

On large Soufflé projects, compiling the generated C++ can take a very long time. Every small change to the Datalog source requires recompiling all of the generated C++ code, which is very inefficient.
This pull request is an attempt at a more flexible synthesiser that can emit the C++ code in multiple files. The code generation decomposes the code into several classes:

one class per specialized datastructure, as it was already the case before
one class per subroutine
one "main" class, i.e. the one that calls the subroutines in the right order.

The goal is to keep these classes as decoupled as possible. In particular, we don't want to recompile subroutine 'i' if only the Datalog code of subroutine 'j' changed. Consequently, each subroutine class is constructed with references to the relations and user-defined functors it needs. A Datalog change that does not modify a given subroutine will produce exactly the same C++ file for that subroutine.
Also, moving the implementation of the specialized data structures into their own .cpp files avoids recompiling them every time their header is included.
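
As a rough illustration of the decomposition described above, here is a minimal sketch. It is not the code the synthesiser emits: `EdgeRelation`, `Stratum_path` and `Program` are hypothetical names, the relation is a plain `std::set` rather than a specialized data structure, and the evaluation is a naive fixpoint. The point it tries to show is only that a subroutine class receives references to the relations it reads and writes, so its source text does not depend on anything else in the program.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>

// Hypothetical stand-in for one specialized relation data structure.
// In a multi-file layout, each such type would get its own header/.cpp pair.
struct EdgeRelation {
    std::set<std::pair<int, int>> tuples;
    bool insert(int a, int b) { return tuples.insert(std::make_pair(a, b)).second; }
    std::size_t size() const { return tuples.size(); }
};

// Hypothetical subroutine class: it only knows the relations passed to it.
class Stratum_path {
public:
    Stratum_path(const EdgeRelation& edgeRel, EdgeRelation& pathRel)
        : edge(edgeRel), path(pathRel) {}

    // Naive evaluation of:
    //   path(x,y) :- edge(x,y).
    //   path(x,z) :- path(x,y), edge(y,z).
    void run() {
        for (const auto& e : edge.tuples) {
            path.insert(e.first, e.second);
        }
        bool changed = true;
        while (changed) {
            changed = false;
            const auto snapshot = path.tuples;  // iterate over a copy while inserting
            for (const auto& p : snapshot) {
                for (const auto& e : edge.tuples) {
                    if (p.second == e.first && path.insert(p.first, e.second)) {
                        changed = true;
                    }
                }
            }
        }
        assert(path.size() >= edge.size());  // every edge is also a path
    }

private:
    const EdgeRelation& edge;  // relation that is read
    EdgeRelation& path;        // relation that is populated
};

// Hypothetical "main" class: owns the relations and calls the subroutines in order.
class Program {
public:
    Program() : stratumPath(edge, path) {}
    void run() { stratumPath.run(); }

    EdgeRelation edge;
    EdgeRelation path;

private:
    Stratum_path stratumPath;  // declared after the relations it refers to
};
```

With this shape, editing the rules of another stratum would leave the text of `Stratum_path` untouched, so a build system that compares file contents or timestamps could skip recompiling it.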

When generating the code in multiple files, the first compilation takes longer than with a single file, but once this is done, recompiling after changing only a few rules or relations in the Datalog code is much faster than with the single-file version.
To give an idea, on our Soufflé code:

single-file compilation time: ~15min
multiple-file compilation time (first time), with -j8: ~10min
multiple-file incremental compilation time (after small edits in .dl files), with -j8: 1min30s

codecov · 2022-03-21T10:40:33Z

Codecov Report

Merging #2215 (1cb26e6) into master (e745510) will increase coverage by 0.33%.
The diff coverage is 94.85%.

@@            Coverage Diff             @@
##           master    #2215      +/-   ##
==========================================
+ Coverage   76.91%   77.25%   +0.33%     
==========================================
  Files         455      458       +3     
  Lines       28652    29153     +501     
==========================================
+ Hits        22038    22521     +483     
- Misses       6614     6632      +18

Impacted Files	Coverage Δ
src/ast/analysis/IOType.h	`93.75% <ø> (ø)`
src/ast/analysis/UniqueKeys.h	`100.00% <ø> (ø)`
src/ast2ram/ClauseTranslator.h	`100.00% <ø> (ø)`
src/ast2ram/provenance/UnitTranslator.h	`100.00% <ø> (ø)`
src/ast2ram/seminaive/ClauseTranslator.cpp	`97.95% <ø> (ø)`
src/ast2ram/seminaive/ClauseTranslator.h	`100.00% <ø> (ø)`
src/ast2ram/seminaive/UnitTranslator.h	`100.00% <ø> (ø)`
src/ast2ram/utility/TranslatorContext.h	`100.00% <ø> (ø)`
src/include/souffle/RamTypes.h	`100.00% <ø> (ø)`
src/include/souffle/datastructure/BTreeUtil.h	`97.05% <ø> (ø)`
... and 31 more

b-scholz · 2022-03-31T11:31:31Z

That is great work. I just wonder about the gains w.r.t. compile-speed for small changes in programs.

We have typedefs for data-structures; we have strata with relations as member variables; we have rules that use either relations from their own stratum or relations from already computed strata.

If I understand your WIP correctly, the gain is to compile the strata whose rules have changed and not the whole program.
By doing so, the compiler has only a subset of data-structures, relations, and rules to compile.

Can we do another PR for the new functor interface? That is better in terms of SWENG.

julienhenry · 2022-04-04T12:24:00Z

Yes, I will discard the changes related to the functors interface and do a dedicated PR later.
Indeed, this PR is intended to split the generated code into several pieces to help with "incremental" compilations where only a small number of relations/rules have changed.
The typical scenarios while developing a Soufflé analysis where we want to avoid a full recompilation of the entire Datalog code are:
(1) When I create a new relation / remove a relation
(2) When I write a new rule / remove a rule / slightly change an existing rule
(3) When I add or remove a user-defined functor

Isolating each stratum in its own C++/header file is a good first step to address the above 3 points, because it ensures that any stratum that is not affected by a given Datalog change will produce exactly the same C++ code.
Stratum classes will hold, as member variables, references to:

relations that are used
relations that are populated
user-defined functors used

To make sure the generated C++ code of an unmodified stratum does not change, I had to get rid of non-determinism in a few places in Soufflé, in particular:

avoid the use of std::set<Relation*>, whose order depends on memory addresses
avoid naming C++ identifiers with a counter (rel_1_A, rel_2_B, rel_3_C, etc.), since adding or removing a relation changes the names of many other relations => rel_1_A, rel_2_AA, rel_3_B, rel_4_C ...
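
The two points above can be sketched as follows. This is only an illustration under stated assumptions, not the actual change made in Soufflé: `Relation` is a simplified hypothetical stand-in for the RAM relation type, `RelationNameLess`, `identifierFor` and `emittedIdentifiers` are made-up names, and relation names are assumed to be unique and already valid as C++ identifier suffixes.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a relation; only the name matters here.
struct Relation {
    std::string name;
};

// Orders relation pointers by name instead of by address, so that iterating
// over the set gives the same order on every run of the compiler.
struct RelationNameLess {
    bool operator()(const Relation* a, const Relation* b) const {
        return a->name < b->name;
    }
};

using RelationSet = std::set<const Relation*, RelationNameLess>;

// Derives the C++ identifier from the relation name alone (no running counter),
// so adding or removing another relation does not rename this one.
std::string identifierFor(const Relation& rel) {
    assert(!rel.name.empty());
    return "rel_" + rel.name;
}

// Collects the identifiers in the deterministic iteration order of the set.
std::vector<std::string> emittedIdentifiers(const RelationSet& rels) {
    std::vector<std::string> out;
    for (const Relation* r : rels) {
        out.push_back(identifierFor(*r));
    }
    return out;
}
```

For relations A, B and C this yields rel_A, rel_B, rel_C regardless of where the objects live in memory, and inserting a relation AA adds rel_AA without changing the other three identifiers.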

b-scholz · 2022-04-04T22:55:02Z

Is it possible to keep the old functionality (source code in a single file) and to have the new one (splitting the C++ for each stratum) as well? Or is this too much of an effort?

julienhenry · 2022-04-05T12:34:44Z

> Is it possible to keep the old functionality (source code in a single file) and to have the new one (splitting the C++ for each stratum) as well? Or is this too much of an effort?

Yes, the idea is to have two modes: a single-file output and a multiple-file output. The single-file output should remain the default, while the other could be aimed at more advanced users?

b-scholz · 2022-04-28T22:44:42Z

Is it done?

julienhenry · 2022-04-29T11:53:54Z

> Is it done?

Sorry, not yet. I still need to clean up the code generation of the data structures a bit.

julienhenry · 2022-05-16T12:54:38Z

@b-scholz I believe this pull request is now in a reasonably good state and can be reviewed.

b-scholz · 2022-05-16T13:07:22Z

Great job - I will have a look!

b-scholz · 2022-06-12T12:31:19Z

I am looking at the changes now. Great work. I like the refactoring.

Have you done performance testing? I also do not understand why the interpreter is affected by this change.

julienhenry · 2022-06-13T08:44:27Z

The interpreter is affected because I renamed subroutines/strata to avoid using their SCC id number in their name, as the SCC id number changes too often when we modify the Datalog code.
I did not notice any performance changes on our code, but I haven't done a proper evaluation of how this affects the performance of the generated code. Intuitively I don't see why it would change anything (the generated code of the subroutines has not changed).

b-scholz

Thanks for the contribution!

julienhenry changed the title ~~Refactoring of the synthesizer~~ WIP Refactoring of the synthesizer Mar 14, 2022

julienhenry added 13 commits March 15, 2022 05:53

refactoring generated code

4008db6

new synthesiser

4c72c86

functors in synthesiser

e286a40

functors_signatures

b099974

synthesiser

8f87a74

synthesiser

d6a821f

substr_wapper in synthesiser

f8f24c9

refactoring synthesiser

7d648a1

clang-format

f724d1f

copyrights on new files

8f1fa96

fixes

38d15a5

sythesize multiple files

7b0ad64

clang-format

c2a081b

julienhenry force-pushed the modular-synthesiser branch 2 times, most recently from d4bf910 to c2a081b Compare March 16, 2022 18:58

julienhenry added 4 commits March 21, 2022 05:12

fixing RAM_BIT_SHIFT_MASK

e277009

Merge branch 'master' into modular-synthesiser

84e8977

clang-format

aca3c65

fix

025376f

julienhenry added 9 commits March 25, 2022 11:29

clean up

e6854b7

deterministic ordering on ram::Relation sets

0337d7e

removed some non-determinism

13396c7

better file names for synthesiser

0510a26

added missing const

153f3c1

Merge branch 'master' into modular-synthesiser

b0389a6

turn on some automated tests with the sythesiser emitting multiple files

aa87484

clang-format

952560c

fixes for clang

dea35b2

julienhenry added 3 commits April 4, 2022 06:00

make sure tmp directory is unique

526bac3

Merge branch 'master' into modular-synthesiser

db23f98

clang format

f48bac1

fix msvc

f66c048

julienhenry added 2 commits April 5, 2022 07:57

extract datastructures from CompiledSouffle

cd2c38b

clang-format

9c2dd06

julienhenry added 3 commits April 5, 2022 11:41

moved CompiledOptions

d71449b

clang-format...

4b7bc62

remove code related to the functors API, will be in its own PR

e3e3d30

julienhenry marked this pull request as draft April 12, 2022 15:30

julienhenry added 2 commits April 25, 2022 10:48

proper output name

57d980c

fix exe name

8fb61bb

Julien Henry added 2 commits May 16, 2022 06:31

Merge branch 'master' into modular-synthesiser

ac0496c

format

1cb26e6

julienhenry marked this pull request as ready for review May 16, 2022 12:53

b-scholz changed the title ~~WIP Refactoring of the synthesizer~~ Splitting the generated C++ Code into multiple files to improve compile-time May 17, 2022

b-scholz approved these changes Jun 16, 2022

b-scholz merged commit 429e168 into souffle-lang:master Jun 16, 2022

adamjseitz mentioned this pull request Sep 8, 2022

Generated C++ compile time regression #2303

Open
