Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use IndexOf for .* in RegexInterpreter/Compiler #31930

Merged
merged 3 commits into from
Feb 8, 2020

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Feb 7, 2020

The first commit primarily cleans up style in RegexInterpreter. As part of that there were a few small substantive changes:

  • Stored the TextInfo rather than storing the CultureInfo and accessing the TextInfo virtual property on each call.
  • Coalesced duplicate case blocks.
  • Removed unnecessary resx string that should have been an assert.

The second commit uses IndexOf to implement Notoneloop{atomic} when we're left-to-right and case-sensitive (which is the typical case). This is most beneficial when someone uses a loop like .*, which will generally become [^\n]*, and thus we IndexOf('\n') to find the end of the greedy consumption. If the end is found within just a few characters, there's a tiny regression, due to the overhead of calling AsSpan(...).IndexOf, but when there's more than that, the gains can be substantial.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);

    [Params(RegexOptions.None, RegexOptions.Compiled)]
    public RegexOptions Options { get; set; }

    [Params(1, 2, 4, 8, 64, 128)]
    public int Length { get; set; }

    private Regex _regex;
    private string _matchInput;

    [GlobalSetup]
    public void Setup()
    {
        _regex = new Regex("\".*\"", Options);
        _matchInput = '"' + new string('a', Length) + '"';
    }

    [Benchmark] public bool IsMatch() => _regex.IsMatch(_matchInput);
}
Method Toolchain Options Length Mean Error StdDev Ratio
IsMatch \master\corerun.exe None 1 89.75 ns 0.332 ns 0.294 ns 1.00
IsMatch \pr\corerun.exe None 1 91.28 ns 0.366 ns 0.306 ns 1.02
IsMatch \master\corerun.exe None 2 90.46 ns 0.124 ns 0.110 ns 1.00
IsMatch \pr\corerun.exe None 2 92.76 ns 0.386 ns 0.301 ns 1.03
IsMatch \master\corerun.exe None 4 97.74 ns 0.102 ns 0.086 ns 1.00
IsMatch \pr\corerun.exe None 4 93.63 ns 0.523 ns 0.489 ns 0.96
IsMatch \master\corerun.exe None 8 105.08 ns 0.711 ns 0.665 ns 1.00
IsMatch \pr\corerun.exe None 8 93.50 ns 0.242 ns 0.226 ns 0.89
IsMatch \master\corerun.exe None 64 227.09 ns 0.396 ns 0.370 ns 1.00
IsMatch \pr\corerun.exe None 64 97.02 ns 0.129 ns 0.114 ns 0.43
IsMatch \master\corerun.exe None 128 365.66 ns 3.756 ns 2.933 ns 1.00
IsMatch \pr\corerun.exe None 128 93.92 ns 0.696 ns 0.651 ns 0.26

Contributes to #1349
cc: @danmosemsft, @eerhardt, @ViktorHofer

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

Almost entirely style.  A few substantive but small changes:
- Store the TextInfo rather than storing the CultureInfo and accessing the TextInfo virtual property on each call.
- Removed unnecessary resx string that should have been an assert
- Coalesced duplicate case blocks
This is primarily to improve the performance of .* loops.  We'll now use Span.IndexOf to search for the target character (e.g. \n), rather than the open-coded loop we currently have.
@stephentoub stephentoub merged commit 850b4a2 into dotnet:master Feb 8, 2020
@stephentoub stephentoub deleted the interpretertweaks branch February 8, 2020 13:52
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants