Adding Regex.EnumerateMatches #67794

Merged
merged 4 commits into from
Apr 13, 2022

Conversation

joperezr
Member

@joperezr joperezr commented Apr 9, 2022

Fixes #65011
Fixes #23602

Adding an EnumerateMatches method, which returns an enumerator that iterates over the matches in a passed-in span. The operation is amortized allocation-free.
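
A minimal usage sketch of the pattern described above, assuming .NET 7 or later (where `Regex.EnumerateMatches` and `ValueMatch` ship). The input string, the pattern, and the variable names are hypothetical and chosen only for illustration; they are not taken from the PR.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input and pattern, used only for illustration.
string text = "alpha Beta gamma";
Regex wordRegex = new Regex(@"\b\w+\b");

ReadOnlySpan<char> span = text.AsSpan();
int lowercaseWords = 0;

foreach (ValueMatch match in wordRegex.EnumerateMatches(span))
{
    // ValueMatch carries only Index and Length, so slice the input span
    // to look at the matched text without creating a string.
    ReadOnlySpan<char> word = span.Slice(match.Index, match.Length);
    if (word[0] >= 'a' && word[0] <= 'z')
        lowercaseWords++;
}

Console.WriteLine(lowercaseWords);
```

Because the enumerator yields positions rather than `Match` objects, the caller decides whether and when to materialize any text from the span.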

@dotnet-issue-labeler

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost assigned joperezr Apr 9, 2022
@ghost

ghost commented Apr 9, 2022

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: joperezr
Assignees: -
Labels:

area-System.Text.RegularExpressions, new-api-needs-documentation

Milestone: -

@joperezr
Member Author

joperezr commented Apr 9, 2022

Here is a quick benchmark I wrote to see how this compares with the existing way to iterate over a MatchCollection using Regex.Matches:

    // The regex pattern is "\b\w+\b" and the input (loremIpsum) is a five-paragraph lorem ipsum string.

    [Benchmark(Baseline = true)]
    public int MatchCollection()
    {
        int x = 0;
        for (int i = 0; i < 1000; i++)
        {
            foreach (Match match in regex.Matches(loremIpsum))
            {
                if (match.ValueSpan[0] >= 'a' && match.ValueSpan[0] <= 'z')
                    x++;
            }
        }

        return x;
    }

    [Benchmark]
    public int MatchEnuemrator()
    {
        int x = 0;
        ReadOnlySpan<char> span = loremIpsum.AsSpan();
        for (int i = 0; i < 1000; i++)
        {
            foreach (ValueMatch word in regex.EnumerateMatches(span))
            {
                if (span.Slice(word.Index, word.Length)[0] >= 'a' && span.Slice(word.Index, word.Length)[0] <= 'z')
                    x++;
            }
        }

        return x;
    }

And the results are:

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen 0 | Gen 1 | Allocated |
|---|---|---|---|---|---|---|---|---|
| MatchCollection | 208.5 ms | 3.97 ms | 4.72 ms | 1.00 | 0.00 | 43000.0000 | 1000.0000 | 273,544,480 B |
| MatchEnuemrator | 146.0 ms | 2.71 ms | 2.67 ms | 0.70 | 0.02 | - | - | - |
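
For readers who want a rough feel for the Allocated column without setting up BenchmarkDotNet, the sketch below compares the bytes allocated by one pass of each approach using `GC.GetAllocatedBytesForCurrentThread()`. It is an assumption-laden approximation, not the benchmark that produced the table: the input string and the helper names `CountWithMatches` and `CountWithEnumerate` are hypothetical, and the byte counts it prints will vary by runtime and machine. For scale, the table's 273,544,480 B over 1000 loop iterations works out to about 273,544 B per `Regex.Matches` pass.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input; the PR used a five-paragraph lorem ipsum string instead.
string input = "the quick brown fox jumps over the lazy dog";
Regex regex = new Regex(@"\b\w+\b");

// Warm up both paths so one-time setup costs are not attributed to either run.
_ = CountWithMatches(regex, input);
_ = CountWithEnumerate(regex, input);

long before = GC.GetAllocatedBytesForCurrentThread();
int viaMatches = CountWithMatches(regex, input);
long matchesBytes = GC.GetAllocatedBytesForCurrentThread() - before;

before = GC.GetAllocatedBytesForCurrentThread();
int viaEnumerate = CountWithEnumerate(regex, input);
long enumerateBytes = GC.GetAllocatedBytesForCurrentThread() - before;

Console.WriteLine($"Matches:          {viaMatches} matches, {matchesBytes} B allocated");
Console.WriteLine($"EnumerateMatches: {viaEnumerate} matches, {enumerateBytes} B allocated");

// Counts matches by walking a MatchCollection (allocates Match objects).
static int CountWithMatches(Regex r, string s)
{
    int count = 0;
    foreach (Match m in r.Matches(s))
        count++;
    return count;
}

// Counts matches with the span-based enumerator (yields ValueMatch structs).
static int CountWithEnumerate(Regex r, string s)
{
    int count = 0;
    foreach (ValueMatch m in r.EnumerateMatches(s.AsSpan()))
        count++;
    return count;
}
```

Both helpers should report the same number of matches; only the allocation figures are expected to differ.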

@stephentoub
Member

@olsaarik, in the current NonBacktracking code, it would benefit from knowing that indexes are needed but not captures. Will that still be the case after your upcoming fixes?

@joperezr joperezr force-pushed the AddRegexEnumerate branch from ad317d4 to aba9a54 Compare April 9, 2022 19:17
@joperezr joperezr force-pushed the AddRegexEnumerate branch from 876c140 to ac0ca12 Compare April 11, 2022 22:02
@stephentoub stephentoub merged commit 89daf96 into dotnet:main Apr 13, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 13, 2022