-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S.IO.StringReader: Use ReadOnlySpan.IndexOfAny in ReadLine() for performance #60463
S.IO.StringReader: Use ReadOnlySpan.IndexOfAny in ReadLine() for performance #60463
Conversation
Tagging subscribers to this area: @dotnet/area-system-io Issue DetailsHave not done benchmarks of this yet, as I wanted to know first if this could be an acceptable change. The premise is
|
It sounds reasonable as long as there is no regression for relatively short lines (20 characters?) that might be common. |
You might consider reviewing/augmenting the coverage for this in dotnet/performance first. Then using that to evaluate it. |
@adamsitnik code size will of course increase if you count @danmoseley unfortunately I cannot find any benchmarks of Can you guys confirm this? I guess I will add a benchmark for this under MicroBenchmarks in that case. |
Guessing location of test should be something like |
@nietras yes, it looks like there are none (I didn't check earlier as I was on my phone). Yes, that would be the place to put some. Seems like they need not be that elaborate. |
Added benchmark in dotnet/performance#2083 and repeating results here. Even for small line lengths [1, 8] there is a minor speedup (4%) and hence no regressions. For empty lines e.g. just new lines, there is a 40% regression, though. For longer lines we see a nice 2-3x speed up. It might be possible to do something about the empty line regression, but perhaps that is not so important? BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-BIEWJM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-IFXICU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
ReadLine code has the following: public override string? ReadLine()
{
if (_s == null)
{
throw new ObjectDisposedException(null, SR.ObjectDisposed_ReaderClosed);
} would it be of any benefit to use |
I'm not surprised by line length 0. I am surprised by the line length 1. Do we have an explanation for why that gets faster? |
Me neither on line length 0. Although I have improved that a bit. The 1 to 8 line length case still has som lines of 8 + 2 = 10 chars or 20 bytes < 16 bytes length, so we still hit 128-bit paths and SSE2 in some percentage right. Additionally, the Probably, 1 character only would be slower too. I can run a test for that if you like? |
Improved regression a little bit by special casing line length 0 to avoid Substring overhead. Also added ThrowHelper. Additionally, added line length 1 case. 6% regression. No doubt still dominated by Substring and new string here, so overhead of calling BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-BZJQTB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-PFAYGN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
Using BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-LBAYRI : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-QZQWHF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
Yes, when I commented I hadn't realized your comment about length 1 was about random selections between 1 and 8. I think this change is still worth taking: it simplifies the code, and improves what's expected to be the majority case. I just want to make sure we understand the regressions before doing so. (And I do think 0 is worth special-casing if it avoids regressing that case while not measurably impacting others; I haven't looked at the change yet, though) |
@stephentoub replacing Since we have context here creating the span for 1 char line could side step the checks of the Span.Slice... BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-VAAVCB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-XGNHNV : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
Latest benchmark results. No significant changes. Line lengths 0 (+1.2ns) and 1 (+2ns) still have ~30% regressions, all others improvements, going up to 3x for [129, 1024] with -220ns. BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-ZPUFRH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-ZTWPGG : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much!
@nietras You and I had some discussion on this issue regarding potential improvements to Thanks again for this! It looks great. :) |
@stephentoub any remaining feedback or can we merge? the failure is in HTTP tests which I will investigate. |
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Stephen Toub <[email protected]>
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Stephen Toub <[email protected]>
src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs
Outdated
Show resolved
Hide resolved
Co-authored-by: Stephen Toub <[email protected]>
Co-authored-by: Stephen Toub <[email protected]>
Co-authored-by: Stephen Toub <[email protected]>
@stephentoub hope comments have been resolved, PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
@GrabYourPitchforks regarding Substring, there are only a limited number of things one can try to improve this, one of which would be to try and optimize the guard clauses. This would perhaps mean changing the exception messages (not types), since we would like to do multiple checks in one go for example, is that acceptable? Other than that one might specialize for short substrings... if that would improve anything. @adamsitnik the existing performance Benchmarks do not look that useful, at least in my runs they are highly volatile and seem to be based on JIT compiling away all checks... maybe... would adding other tests be acceptable? Perhaps you can take a look and tell me if I'm wrong, but use of Is there any guidance anywhere on getting a good dev inner loop for looking at corelib asm? How to use BDN in that case? A bit rusty here. |
do you mean these benchmarks?
yes, of course! The existing
For managed part of the corelib you should be able to use disassembly diagnoser with corerun: https://github.com/dotnet/performance/blob/797425cdc4dcbc407f4546fd83e00d79b431ea09/docs/benchmarkdotnet.md#disassembly You can also use VTune: https://github.com/dotnet/performance/blob/5d6fc749333ce7d7e34e6d0ac311df67c5d2047c/docs/profiling-workflow-dotnet-runtime.md#vtune or use one of tools provided by the Jit Team. @kunalspathak @EgorBo @AndyAyersMS what is your preferred way of getting the disassembly? |
I would use vTune to profile and also see the instructions in the disassembly that are hot. If you want to use JitDisasm and JitDump flags, then you can set those environment variables with respective method names and use |
@kunalspathak @adamsitnik thanks both for the pointers, I'll try to see if I can figure this out! 😅
Yes, an example run below where BDN notes one of these is multi-modal. Additionally, If you have any suggestions on how to make these more stable I am all ears. One thing I want is better coverage of short substrings e.g. sizes 0-8 or similar.
BenchmarkDotNet=v0.13.1.1620-nightly, OS=Windows 10.0.19044.1348 (21H2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
Job-ZIOPOT : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-ESNSIZ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
Improvement on ubuntu-x64 dotnet/perf-autofiling-issues#2670 |
Have not done benchmarks of this yet, as I wanted to know first if this could be an acceptable change. The premise is
IndexOfAny
is highly optimized and uses vectorization if possible. Also untested.