Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

Closed
EgorBo opened this issue Dec 4, 2020 · 21 comments
Closed
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@EgorBo
Copy link
Member

EgorBo commented Dec 4, 2020

It's a quite popular case when we compare a string against some predefined one using Ordinal/OrdinalIgnoreCase case (I found plenty of cases in dotnet/aspnetcore and dotnet/runtime repos). E.g. patterns like:

* string.Equals(str, "cns", StringComparison.OrdinalIgnoreCase); // or just Ordinal

* str.Equals("cns", StringComparison.OrdinalIgnoreCase); // or just Ordinal

* StringComparer.OrdinalIgnoreCase.Equals(str, "cns"); // or just Ordinal

So Roslyn/ILLink/JIT (perhaps, it should be JIT to handle more cases after inlining + Roslyn/ILLink know nothing about target arch) could optimize such comparisons by inlining a more optimized check, e.g. here is what we can do for small strings (1-4 chars) keeping in mind that strings are 8-bytes aligned (for simplicity I only cared about AMD64 arch):

[Benchmark(Baseline = true)]
[Arguments("true")]
[Arguments("TRUE")]
public bool StringEquals(string str)
{
    return string.Equals(str, "True", StringComparison.OrdinalIgnoreCase);
}

[Benchmark]
[Arguments("true")]
[Arguments("TRUE")]
public bool Inlined(string str)
{
    return object.ReferenceEquals(str, "True") || 
           (str.Length == 4 &&

            // string's content fits into 64bit register, the following code looks awful
            // but it's a simple 'or + cmp' in the codegen
            (Unsafe.ReadUnaligned<ulong>(ref Unsafe.As<char, byte>(
                   ref MemoryMarshal.GetReference(str.AsSpan()))) | 0x0020002000200020 /*'to upper' since the const arg doesn't contain anything but [a..Z]*/) == 0x65007500720074);
}
|       Method |  str |      Mean |     Error |    StdDev | Ratio |
|------------- |----- |----------:|----------:|----------:|------:|
| StringEquals | TRUE | 4.1723 ns | 0.0023 ns | 0.0020 ns |  1.00 |
|      Inlined | TRUE | 0.3416 ns | 0.0024 ns | 0.0022 ns |  0.08 |
|              |      |           |           |           |       |
| StringEquals | true | 4.1718 ns | 0.0023 ns | 0.0021 ns |  1.00 |
|      Inlined | true | 0.3406 ns | 0.0019 ns | 0.0016 ns |  0.08 |

We also can inline SIMD stuff for longer strings, e.g. here I check that an http header is "Proxy-Authenticate" (18 chars) using two AVX2 vectors: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44

|           Method |         headerName |      Mean |     Error |    StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------- |------------------- |----------:|----------:|----------:|------:|------:|------:|------:|----------:|
|     StringEqauls | PROXY-AUTHENTICATE | 10.296 ns | 0.0087 ns | 0.0078 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | PROXY-AUTHENTICATE |  2.560 ns | 0.0042 ns | 0.0038 ns |  0.25 |     - |     - |     - |         - |
|                  |                    |           |           |           |       |       |       |       |           |
|     StringEqauls | proxy-authenticate | 10.298 ns | 0.0078 ns | 0.0065 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | proxy-authenticate |  2.563 ns | 0.0071 ns | 0.0066 ns |  0.25 |     - |     - |     - |         - |

when the input string is let's say 30 bytes the results are even better:

|           Method |           headerName |      Mean |     Error |    StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------- |--------------------- |----------:|----------:|----------:|------:|------:|------:|------:|----------:|
|     StringEqauls | PROXY(...)WORLD [30] | 15.396 ns | 0.0141 ns | 0.0117 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | PROXY(...)WORLD [30] |  2.558 ns | 0.0010 ns | 0.0008 ns |  0.17 |     - |     - |     - |         - |
|                  |                      |           |           |           |       |       |       |       |           |
|     StringEqauls | proxy(...)World [30] | 14.997 ns | 0.0089 ns | 0.0079 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | proxy(...)World [30] |  2.550 ns | 0.0038 ns | 0.0030 ns |  0.17 |     - |     - |     - |         - |

So for [0..32] chars (string.Length) we can emit an inlined super-fast comparison:
[0..4]: using a single 64bit GP register
[5..8]: using two 64bit GP registers
[9..16]: using two 128bit vectors
[17..32]: using two 256bit vectors
[33...]: leave as is.

/cc @GrabYourPitchforks @benaadams @jkotas @stephentoub

@EgorBo EgorBo added the tenet-performance Performance related issue label Dec 4, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 4, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Szer
Copy link

Szer commented Dec 4, 2020

It could be very beneficial for headers parsing in web servers.

@GrabYourPitchforks
Copy link
Member

Somewhat related to #11484. If the JIT has knowledge that one of the parameters is constant, it can tell us to go down a separate code path, or it can even redirect the call to a different method.

@GrabYourPitchforks GrabYourPitchforks added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 4, 2020
@EgorBo
Copy link
Member Author

EgorBo commented Dec 4, 2020

@GrabYourPitchforks Nice idea, reminds me this LLVM intrinsic: https://llvm.org/docs/LangRef.html#llvm-is-constant-intrinsic

Thanks to #37226 and #1378 the following code is computed as a const! 🎉

public ulong Test() => ConstStringTo64BitInt("True");

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong ConstStringTo64BitInt(string str)
{
    Debug.Assert(str.Length <= 4);
    ulong v = 0;
    if (str.Length == 4) v |= ((ulong)str[3] << 48);
    if (str.Length >= 3) v |= ((ulong)str[2] << 32);
    if (str.Length >= 2) v |= ((ulong)str[1] << 16);
    if (str.Length >= 1) v |= ((ulong)str[0]);
    if (str.Length == 0) return 0;
    return v;
}

Codegen for Test:

       48B85400720075006500 mov      rax, 0x65007500720054
       C3                   ret

A sort of JitHelpers.IsConstString() should be easy to implement. I can make a prototype.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 4, 2020

bool JitHelpers.IsConstAsciiString(string) prototype: EgorBo@86ca130

@jkotas
Copy link
Member

jkotas commented Dec 5, 2020

There are many APIs that can be replaced with more efficient call-site specific equivalent when the input is in a particular shape. Would source generators be better way to handle that? It would be much more powerful solution than IsConstAsciiString.

@Szer
Copy link

Szer commented Dec 5, 2020

Would source generators be better way to handle that? It would be much more powerful solution than IsConstAsciiString

Does it mean that optimized path in language agnostic BCL could only be used in C#?

@EgorBo
Copy link
Member Author

EgorBo commented Dec 5, 2020

I think at least for the

string.Equals(str, "cns", StringComparison.OrdinalIgnoreCase); // Also, for just `Ordinal`

for str.Length <= 4 (or 2 in case of 32bit) small strings we can introduce the optimization directly in the JIT, it shouldn't be difficult to implement - a couple of GT_QMARK, GT_EQ and an indirect load. We can also rely on PGO and don't do it in the cold blocks. But I'd honestly also implement if for longer strings (up to 32 symbols) using SSE/AVX, it shouldn't be a lot of code as well. So the existing code and F# could benefit from it (and handle more cases after inlining) in contrast with the SourceGen-based approach

Inspired by LLVM: https://godbolt.org/z/z3MKa9

UPD: example:

public static bool Test(string str)
{
    return string.Equals(str, "true", StringComparison.Ordinal);
}

is optimized into:

public static bool Test(string str)
{
    return object.ReferenceEquals(str, "true") || // <-- perhaps we can omit this one since the comparison is fast any way
            (str != null &&
            str.Length == 4 &&
            Unsafe.ReadUnaligned<ulong>(
                ref Unsafe.As<char, byte>(ref str._firstChar)) == 0x65007500720054);
}

Codegen:

       48B8B8310090AA010000 mov      rax, 0x1AA900031B8
       483B08               cmp      rcx, gword ptr [rax]
       7426                 je       SHORT G_M59561_IG07
       4885C9               test     rcx, rcx
       741E                 je       SHORT G_M59561_IG05
       83790804             cmp      dword ptr [rcx+8], 4
       7518                 jne      SHORT G_M59561_IG05
       4883C10C             add      rcx, 12
       48B85400720075006500 mov      rax, 0x65007500720054
       483901               cmp      qword ptr [rcx], rax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
       C3                   ret
G_M59561_IG05:              
       33C0                 xor      eax, eax
       C3                   ret
G_M59561_IG07:              
       B801000000           mov      eax, 1
       C3                   ret

; Total bytes of code 59

Without object.ReferenceEquals:

       4885C9               test     rcx, rcx
       741E                 je       SHORT G_M59561_IG05
       83790804             cmp      dword ptr [rcx+8], 4
       7518                 jne      SHORT G_M59561_IG05
       4883C10C             add      rcx, 12
       48B85400720075006500 mov      rax, 0x65007500720054
       483901               cmp      qword ptr [rcx], rax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
       C3                   ret
G_M59561_IG05:              
       33C0                 xor      eax, eax
       C3                   ret

; Total bytes of code 38
\--*  NE        int
   +--*  LCL_VAR   ref    V00 arg0
   \--*  CNS_INT   ref    null

qmark 

\--*  EQ        int
   +--*  ARR_LENGTH int
   |  \--*  LCL_VAR   ref    V00 arg0
   \--*  CNS_INT   int    4

qmark

\--*  EQ        int
   +--*  IND       long
       \--*  ADD       byref
          +--*  LCL_VAR   ref    V00
    \--*  CNS_INT   long   0x65007500720054

else

\--*  RETURN       int
   \--*  CNS_INT   int    0

master:

       4883EC28             sub      rsp, 40
       48BAF831FFAE39010000 mov      rdx, 0x139AEFF31F8
       488B12               mov      rdx, gword ptr [rdx]
       41B805000000         mov      r8d, 5
       E81CA1FDFF           call     System.String:Equals(System.String,System.String,int):bool
       90                   nop      
       4883C428             add      rsp, 40
       C3                   ret      

; Total bytes of code: 34

@EgorBo
Copy link
Member Author

EgorBo commented Dec 5, 2020

Prototype for ^ in JIT: EgorBo@5883362

@jkotas
Copy link
Member

jkotas commented Dec 5, 2020

It could be very beneficial for headers parsing in web servers.

Could you please share links to the webserver code where you think this would be very beneficial?

Prototype

I think it make sense to start with the simpler memcpy-like and memcmp-like cases before going to more esoteric cases like globalization.

Also, we need to maintain parity between string and Spans (the same optimization has to be implemented for both). We have been promoting Spans as the high performance solution. We do not want to be in situation where strings have extra optimizations that are missing for Spans.

@Szer
Copy link

Szer commented Dec 5, 2020

Could you please share links to the webserver code where you think this would be very beneficial?

I'm working in Azure Gateway team (reverse proxy for everything in Azure), our header parsing code taking 3.5% CPU time.
Code looks like this (header names were removed):

        private static bool ShouldIgnore(string headerName)
        {
            switch (headerName.Length)
            {
                case 13:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 17:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 22:
                    return
                        headerName.Equals("foo", StringComparison.OrdinalIgnoreCase) ||
                        headerName.Equals("foo", StringComparison.OrdinalIgnoreCase) ||
                        headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 25:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 28:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                default:
                    return false;
            }
        }

I'm pretty sure ASP Net team also will be pleased to optimize header parsing.
Example 1

Example2 (this is basically codegen)

@Tornhoof
Copy link
Contributor

Tornhoof commented Dec 5, 2020

There are many APIs that can be replaced with more efficient call-site specific equivalent when the input is in a particular shape. Would source generators be better way to handle that? It would be much more powerful solution than IsConstAsciiString.

The first customer of such a source-gen could be System.Text.Json's property matching, I use pretty much the same approach (without avx and only ordinal comparison, not case-insensitive) in SpanJson and up to ~32 bytes it's a lot faster than string.Equals.

@jkotas
Copy link
Member

jkotas commented Dec 5, 2020

Yep, this can make naively written parsers better. There are other tricks that high-performance parsers use as the ASP.NET example shows and that this won't handle. I doubt that ASP.NET will want to replace their fine-tuned parser even if this optimization gets implemented.

@gfoidl
Copy link
Member

gfoidl commented Dec 5, 2020

...and a lot of fast parsers operate with utf-8 or ascii bytes and try to avoid creating strings as much as possible.

Maybe also consider Equals with ROS in this (general) deliberation.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 5, 2020

@jkotas

I think it make sense to start with the simpler memcpy-like and memcmp-like cases before going to more esoteric cases like globalization.

I'm not sure I follow, can you please post an example of these memcpy/memcmp usages?

My idea was to optimize cases when we compare a string against some predefined one in the most convenient way. E.g. even a simple search shows many cases in dotnet/runtime:
image
and it doesn't even cover cases when the constant string is a named constant.

I agree it'd be nice to also handle ReadOnlySpan as well but it's more complicated to implement because of that const string -> to Span conversion in the middle, it definitely won't be a single statement in the import phase.

@jkotas
Copy link
Member

jkotas commented Dec 6, 2020

I'm not sure I follow, can you please post an example of these memcpy/memcmp usages?

memcpy/memcmp primitives are exposed via number of different methods: CopyTo, TryCopyTo, Equals, SequenceEquals, StartsWith, ... .

Example for memcpy:

https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Boolean.cs#L102

This would be more naturaly written as "True".AsSpan().CopyTo(destination) and compiled into a sigle instruction that writes all 8 bytes in one shot.

Example for memcmp:

else if (value.AsSpan(idx).StartsWith("persist="))

My idea was to optimize cases when we compare a string against some predefined one in the most convenient way

I agree with the idea of enabling people to write natural code. I would add that the optimizations in this space need to work well with our other best practices for writing high-performance code, that means Spans. Optimizations that only kick in for code that is likely slow for other reason (e.g. allocates a lot of strings) won't move the needle.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 6, 2020

@jkotas Thanks for the examples!
Looks like thse StartsWith/EndsWith/SequenceCompareTo methods for Spans have AggressiveInlining on them and all three use SpanHelpers.SequenceEqual under the hood so in theory it should be beneficial to try to intrinsify it for const args. The problem that even this simple method:

static bool Test(ReadOnlySpan<char> span)
{
    return span.StartsWith("test");
}

Emits a freaking forest of GenTree 😢 https://gist.github.com/EgorBo/7dd7537b8a8cb4341b47bc9499f60e7a
(26 basic-blocks!)

So the only way is to intrinsify each of them separately I guess. Or rely on that source-gen approach.

@jkotas
Copy link
Member

jkotas commented Dec 6, 2020

Or apply the optimization later in the pipeline once this forest of GenTrees gets simplified.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 6, 2020

Or apply the optimization later in the pipeline once this forest of GenTrees gets simplified.

Still contains a lot of temps - https://gist.github.com/EgorBo/c6eb8edd8de19934506a3a9f3c859f46
Ideally this code:

ref char p = ref MemoryMarshal.GetReference("hello".AsSpan()); // after inlining

should be optimized into just

ref char p = ref "hello"._firstChar;

Perhaps, spans should have some special operators in JIT's IR or something like that

@JulieLeeMSFT JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Dec 6, 2020
@EgorBo
Copy link
Member Author

EgorBo commented Dec 25, 2020

Ah, I didn't notice it was assigned

@sandreenko
Copy link
Contributor

Ah, I didn't notice it was assigned

Don't worry, my task was just to do a further triage, I have not started the implementation :-)

@sandreenko sandreenko removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Dec 25, 2020
@sandreenko sandreenko removed their assignment Dec 25, 2020
@sandreenko sandreenko added this to the 6.0.0 milestone Dec 25, 2020
@EgorBo EgorBo modified the milestones: 6.0.0, Future Mar 8, 2021
@EgorBo EgorBo closed this as completed Sep 6, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Oct 7, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

9 participants