Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improved BDD Unicode table representation in NonBacktracking engine #61142

Merged
merged 5 commits into from
Nov 4, 2021

Conversation

veanes
Copy link
Contributor

@veanes veanes commented Nov 3, 2021

Main updates:

  • Updated BDD table serialization to be based on byte[] instead of long[] for saving serialization space used for these arrays. Overall this cut space requirements by at least half.
  • Removed the table for \w, instead deriving it from the 8 Unicode categories 0,1,2,3,4,5,8,18
  • Made the generation algorithm of the BDD tables for ignore-case at least 2x faster if this would be used dynamically -- further optimization are probably possible, this change was using direct improvements involving better use of BDD operations.
  • Limited CharSetSolver._charPredTable to ASCII only as it is almost never used for NonASCII but took up128kB space for all Unicode chars but essentially for no good reason.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 3, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added area-System.Text.RegularExpressions and removed community-contribution Indicates that the PR has been added by a community member labels Nov 3, 2021
@ghost
Copy link

ghost commented Nov 3, 2021

Tagging subscribers to this area: @eerhardt, @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Main updates:

  • Updated BDD table serialization to be based on byte[] instead of long[] for saving serialization space used for these arrays. Overall this cut space requirements by at least half.
  • Removed the table for \w, instead deriving it from the 8 Unicode categories 0,1,2,3,4,5,8,18
  • Made the generation algorithm of the BDD tables for ignore-case at least 2x faster if this would be used dynamically -- further optimization are probably possible, this change was using direct improvements involving better use of BDD operations.
  • Limited CharSetSolver._charPredTable to ASCII only as it is almost never used for NonASCII but took up128kB space for all Unicode chars but essentially for no good reason.
Author: veanes
Assignees: -
Labels:

area-System.Text.RegularExpressions

Milestone: -

@danmoseley
Copy link
Member

#58828

BDD bdd = BDD.True;
for (int k = 0; k < 16; k++)
{
bdd = (c & (1 << k)) == 0 ? GetOrCreateBDD(k, BDD.False, bdd) : GetOrCreateBDD(k, bdd, BDD.False);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there cheaper ways to build up a BDD? Maybe the caching involved helps, but it seems like otherwise this is going to incrementally build up the BDD by creating 15 intermediate ones that are then thrown away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are better ways, I think, but this would involve using e.g. a designated array and non-object base representation with own memory-management over that array.
However this incremental build only happens once per ASCII character, I think it is negligible.

@veanes veanes force-pushed the updateUnicodeBDDs branch from edd1b5a to f67d79c Compare November 3, 2021 18:56
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants