Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache regex compilation #2328

Merged
merged 5 commits into from
Oct 28, 2022

Conversation

strRM
Copy link
Contributor

@strRM strRM commented Oct 20, 2022

We have run into some issues with regex constraints. Our program needs to be make use of the regex constraint, but we quickly noticed that it led to quite a slowdown. The runtime of our program was around 100 seconds and I tracked it down to the fact that regexes are compiled each time a match is performed. We decided to cache the construction of the regexes, of which there are very few, and this reduced the runtime to less than one second.

This patch introduces a new data structure called ConcurrentCache, which a thin wrapper around the existing concurrent hashmap. Presently, the cache is always generated by the compiler, but I think that is something we can change later, possibly making it optional.

@b-scholz
Copy link
Member

Can you reformat the code with clang-format or apply the changes from here:
https://github.com/souffle-lang/souffle/actions/runs/3289357677/jobs/5421102088

@strRM strRM force-pushed the rm/783-cache-regex-compilation branch from d205579 to 7e4a101 Compare October 20, 2022 12:38
@strRM
Copy link
Contributor Author

strRM commented Oct 20, 2022

Sorry about that. I was using clang-format which is version 14 on my host, but it looks like github CI uses 11.

@strRM
Copy link
Contributor Author

strRM commented Oct 20, 2022

Looks like there may be an issue with noomp.

I fixed it and will push and another update shortly, after the tests finish.

@strRM strRM force-pushed the rm/783-cache-regex-compilation branch from 7e4a101 to 906e31c Compare October 20, 2022 18:48
@codecov
Copy link

codecov bot commented Oct 20, 2022

Codecov Report

Merging #2328 (906e31c) into master (9abb420) will increase coverage by 0.04%.
The diff coverage is 82.14%.

❗ Current head 906e31c differs from pull request most recent head 08a5fad. Consider uploading reports for the commit 08a5fad to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2328      +/-   ##
==========================================
+ Coverage   77.43%   77.48%   +0.04%     
==========================================
  Files         468      469       +1     
  Lines       29178    29183       +5     
==========================================
+ Hits        22594    22611      +17     
+ Misses       6584     6572      -12     
Impacted Files Coverage Δ
src/include/souffle/SouffleInterface.h 88.99% <ø> (ø)
src/interpreter/Engine.h 100.00% <ø> (ø)
...ouffle/datastructure/ConcurrentInsertOnlyHashMap.h 88.00% <60.00%> (+5.07%) ⬆️
src/interpreter/Engine.cpp 83.97% <60.00%> (-0.09%) ⬇️
...rc/include/souffle/datastructure/ConcurrentCache.h 92.85% <92.85%> (ø)
src/synthesiser/Synthesiser.cpp 84.70% <100.00%> (+0.02%) ⬆️
src/include/souffle/utility/ParallelUtil.h 83.72% <0.00%> (-1.56%) ⬇️
src/include/souffle/utility/SubProcess.h 81.57% <0.00%> (+2.63%) ⬆️
src/include/souffle/io/ReadStreamCSV.h 85.55% <0.00%> (+3.50%) ⬆️
... and 1 more

@quentin
Copy link
Member

quentin commented Oct 23, 2022

An alternative design that would avoid introducing a dedicated data structure would be:

  • for the interpreter, store the prepared regex in the appropriate interpreter::Node (maybe of a new kind), like it was done for FFI callbacks stored in the FunctorNode for instance.
  • for the synthesizer, store the prepared regex as a field of the generated class.

EDIT: Of course that's possible if you only have regexes known in advance.

…ertOnlyHashMap.

The function weakFind returns the value associated with a key and provides a bit more
use than just weakContains, which is now expressed in terms of weakFind.

The weakFind function allows us to implement a fast path for implementing a cache, because
we can lookup values without requiring any kind of memory allocation, which the get() function
requires.
The ConcurrentCache uses the ConcurrentInsertOnlyHashMap to implement
a generic cache. The cache will be used to implement a cache for
regular expressions.
Using the ConcurrentCache speeds up regex matches tremendously, because
we avoid the very expensive operation of compiling the same regex over and
over again. This commit introduces the cache in the interpreter only.
Update the synthesizer to use a cache for regexes.
One drawback is that we now need to #include <regex>,
but the structure of the synthesizer makes impossible
to generate the cache only when needed.
@strRM strRM force-pushed the rm/783-cache-regex-compilation branch from 906e31c to 9b9db6c Compare October 24, 2022 10:10
@strRM
Copy link
Contributor Author

strRM commented Oct 24, 2022

I added code along the lines you had suggested, but I left the cache alone. So, in aggregate, this update provides fast access to constant regexes (probably the majority) and a cache for fast evaluation of dynamic regexes.

I created a RegexConstant, which replaces a StringConstant when processing a MATCH/NOT_MATCH constraint. When interpreting MATCH/NOT_MATCH I check if the pattern is a regex, otherwise the existing code is executed.

I could not think of a better way to do this right now.

For constraints like match(".*", X), where the pattern is a
string constant, we can avoid using the regex cache. Avoiding the
cache is going to perform better in all cases.

I introduced the new node RegexConstant which only appears immediately
below a MATCH or NOT_MATCH node.

The behavior with regards to invalid regexes has not changed and bad
regexes will not lead to program errors.
@strRM strRM force-pushed the rm/783-cache-regex-compilation branch from 25c7abd to 08a5fad Compare October 24, 2022 20:06
Copy link
Member

@b-scholz b-scholz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! Great job!

@b-scholz b-scholz merged commit f2f9741 into souffle-lang:master Oct 28, 2022
@strRM strRM deleted the rm/783-cache-regex-compilation branch October 28, 2022 09:44
@strRM
Copy link
Contributor Author

strRM commented Oct 28, 2022

Thanks. This helps us a lot, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants