S.IO.StringReader: Use ReadOnlySpan.IndexOfAny in ReadLine() for performance #60463

Merged

Conversation

nietras
Contributor

@nietras nietras commented Oct 15, 2021

Have not done benchmarks of this yet, as I wanted to know first if this could be an acceptable change. The premise is IndexOfAny is highly optimized and uses vectorization if possible. Also untested.
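A minimal sketch of the idea, assuming a hypothetical free-standing helper `ReadLineSketch` rather than the actual `StringReader` change in this PR: locate the next `\r` or `\n` with a single `IndexOfAny` call over the remaining span instead of inspecting one character at a time.

```csharp
#nullable enable
using System;

// Hypothetical helper, not the code from this PR: returns the next line of `s`
// starting at `pos`, or null at end of input, and advances `pos` past the
// terminator ("\r", "\n", or "\r\n").
static string? ReadLineSketch(string s, ref int pos)
{
    if (pos >= s.Length)
    {
        return null;
    }

    ReadOnlySpan<char> remaining = s.AsSpan(pos);
    int lineLength = remaining.IndexOfAny('\r', '\n');
    if (lineLength < 0)
    {
        // No terminator left: the rest of the text is the last line.
        string last = s.Substring(pos);
        pos = s.Length;
        return last;
    }

    string line = s.Substring(pos, lineLength);
    pos += lineLength + 1;

    // Treat "\r\n" as a single terminator.
    if (remaining[lineLength] == '\r' && pos < s.Length && s[pos] == '\n')
    {
        pos++;
    }

    return line;
}

int position = 0;
string text = "alpha\r\nbeta\n\ngamma";
while (ReadLineSketch(text, ref position) is string current)
{
    Console.WriteLine($"[{current}]");
}
```

The sketch still allocates through `Substring`, so any gain would come only from the search itself, which is consistent with the benchmarks later in this thread showing identical allocations for both toolchains.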

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 15, 2021
@ghost

ghost commented Oct 15, 2021

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Author: nietras
Assignees: -
Labels:

area-System.IO, community-contribution

Milestone: -

@adamsitnik adamsitnik added the tenet-performance Performance related issue label Oct 15, 2021
@adamsitnik
Member

It sounds reasonable as long as there is no regression for relatively short lines (20 characters?) that might be common.

@danmoseley
Member

You might consider reviewing/augmenting the coverage for this in dotnet/performance first. Then using that to evaluate it.

@nietras
Contributor Author

nietras commented Oct 15, 2021

@adamsitnik code size will of course increase if you count IndexOfAny, but I agree that benchmarks should show any perf differences for small line lengths.

@danmoseley unfortunately I cannot find any benchmarks of StringReader in https://github.com/dotnet/performance/

Can you guys confirm this? I guess I will add a benchmark for this under MicroBenchmarks in that case.

@nietras
Contributor Author

nietras commented Oct 15, 2021

Guessing location of test should be something like C:\git\oss\performance\src\benchmarks\micro\libraries\System.IO e.g. StringReaderReadLineTests.cs.

@danmoseley
Member

@nietras yes, it looks like there are none (I didn't check earlier as I was on my phone). Yes, that would be the place to put some. Seems like they need not be that elaborate.

@nietras
Contributor Author

nietras commented Oct 17, 2021

Added benchmark in dotnet/performance#2083 and repeating results here.

Even for small line lengths [1, 8] there is a minor speedup (4%) and hence no regressions. For empty lines e.g. just new lines, there is a 40% regression, though.

For longer lines we see a nice 2-3x speed up.

It might be possible to do something about the empty line regression, but perhaps that is not so important?

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-BIEWJM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-IFXICU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | LineLengthRange | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReadLine | Job-BIEWJM | runtime-m | [0, 0] | 7.208 ns | 0.0556 ns | 0.0434 ns | 7.194 ns | 7.165 ns | 7.288 ns | 1.00 | 0.00 | - | - |
| ReadLine | Job-IFXICU | runtime-pr | [0, 0] | 10.112 ns | 0.1781 ns | 0.1666 ns | 10.105 ns | 9.875 ns | 10.438 ns | 1.41 | 0.02 | - | - |
| ReadLine | Job-BIEWJM | runtime-m | [1, 8] | 17.751 ns | 0.2287 ns | 0.2028 ns | 17.686 ns | 17.552 ns | 18.192 ns | 1.00 | 0.00 | 0.0020 | 33 B |
| ReadLine | Job-IFXICU | runtime-pr | [1, 8] | 16.998 ns | 0.1011 ns | 0.0946 ns | 16.992 ns | 16.872 ns | 17.212 ns | 0.96 | 0.01 | 0.0020 | 33 B |
| ReadLine | Job-BIEWJM | runtime-m | [9, 32] | 26.336 ns | 0.1205 ns | 0.1127 ns | 26.273 ns | 26.200 ns | 26.510 ns | 1.00 | 0.00 | 0.0039 | 65 B |
| ReadLine | Job-IFXICU | runtime-pr | [9, 32] | 19.896 ns | 0.0740 ns | 0.0618 ns | 19.879 ns | 19.801 ns | 20.046 ns | 0.76 | 0.00 | 0.0038 | 65 B |
| ReadLine | Job-BIEWJM | runtime-m | [33, 128] | 62.594 ns | 0.2100 ns | 0.1861 ns | 62.584 ns | 62.222 ns | 62.928 ns | 1.00 | 0.00 | 0.0108 | 185 B |
| ReadLine | Job-IFXICU | runtime-pr | [33, 128] | 31.021 ns | 0.0902 ns | 0.0800 ns | 31.015 ns | 30.912 ns | 31.200 ns | 0.50 | 0.00 | 0.0110 | 185 B |
| ReadLine | Job-BIEWJM | runtime-m | [129, 1024] | 319.244 ns | 0.6595 ns | 0.6169 ns | 319.302 ns | 318.088 ns | 320.391 ns | 1.00 | 0.00 | 0.0697 | 1,181 B |
| ReadLine | Job-IFXICU | runtime-pr | [129, 1024] | 98.174 ns | 0.2733 ns | 0.2282 ns | 98.123 ns | 97.705 ns | 98.567 ns | 0.31 | 0.00 | 0.0705 | 1,181 B |

@nietras
Contributor Author

nietras commented Oct 17, 2021

ReadLine code has the following:

        public override string? ReadLine()
        {
            if (_s == null)
            {
                throw new ObjectDisposedException(null, SR.ObjectDisposed_ReaderClosed);
            }

Would it be of any benefit to use ThrowHelper here? Or is that a moot point now?
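For readers unfamiliar with the pattern being asked about, here is a minimal sketch of a throw helper. All names (`ThrowReaderClosed`, `RemainingLength`) and the message text are hypothetical; this illustrates the general technique, not the runtime's internal `ThrowHelper` class. The `throw` is moved into a separate method so that the caller's hot path contains only a null check and a call.

```csharp
#nullable enable
using System;
using System.Diagnostics.CodeAnalysis;

// Hypothetical helper: keeps exception construction out of the caller's body.
[DoesNotReturn]
static void ThrowReaderClosed() =>
    throw new ObjectDisposedException(null, "Cannot read from a closed reader.");

// Hypothetical caller: the only cold-path code left here is a call instruction.
static int RemainingLength(string? source, int pos)
{
    if (source is null)
    {
        ThrowReaderClosed();
    }

    return source.Length - pos;
}

Console.WriteLine(RemainingLength("hello", 2));
try
{
    RemainingLength(null, 0);
}
catch (ObjectDisposedException ex)
{
    Console.WriteLine(ex.GetType().Name);
}
```

Whether this helps in a given method depends on how the JIT already treats throwing blocks, which is what the question above is really asking.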

@stephentoub
Member

Even for small line lengths [1, 8] there is a minor speedup (4%) and hence no regressions. For empty lines e.g. just new lines, there is a 40% regression, though.

I'm not surprised by line length 0. I am surprised by the line length 1. Do we have an explanation for why that gets faster?

@nietras
Contributor Author

nietras commented Oct 17, 2021

I'm not surprised by line length 0. I am surprised by the line length 1. Do we have an explanation for why that gets faster?

Me neither on line length 0, although I have improved that a bit. The 1 to 8 line length case still has some lines of 8 + 2 = 10 chars, or 20 bytes >= 16 bytes in length, so we still hit the 128-bit SSE2 paths in some percentage of cases, right? Additionally, IndexOfAny has optimized code for >= 4 chars, which is more than 50% of cases.
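The arithmetic in that comment can be checked with a short sketch. It only restates the sizes involved (a `char` is 2 bytes, a 128-bit vector is 16 bytes) and makes no claim about which path `IndexOfAny` actually takes for a given input; the variable names are illustrative.

```csharp
using System;
using System.Runtime.Intrinsics;

// A 128-bit vector holds 16 bytes, i.e. 8 UTF-16 code units.
int charsPerVector = Vector128<ushort>.Count;

// Longest line in the [1, 8] bucket plus a two-character "\r\n" terminator.
int longestLineChars = 8 + 2;
int longestLineBytes = longestLineChars * sizeof(char);

Console.WriteLine($"chars per 128-bit vector: {charsPerVector}");
Console.WriteLine($"longest [1, 8] line: {longestLineChars} chars = {longestLineBytes} bytes");
Console.WriteLine($"at least one full vector: {longestLineChars >= charsPerVector}");
```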

Probably, 1 character only would be slower too. I can run a test for that if you like?

@nietras
Contributor Author

nietras commented Oct 17, 2021

Improved regression a little bit by special casing line length 0 to avoid Substring overhead. Also added ThrowHelper.

Additionally, added line length 1 case. 6% regression. No doubt still dominated by Substring and new string here, so overhead of calling IndexOfAny not that big an issue, perhaps.
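One way to read "special casing line length 0", written as a hypothetical helper (the name `SliceLine` is an assumption, and this is not necessarily how the PR does it): return `string.Empty` directly so the empty-line path never enters `Substring`.

```csharp
using System;

// Hypothetical helper: short-circuits the empty-line case before calling Substring.
static string SliceLine(string s, int start, int length) =>
    length == 0 ? string.Empty : s.Substring(start, length);

Console.WriteLine(SliceLine("abc\n\ndef", 4, 0).Length);
Console.WriteLine(SliceLine("abc\n\ndef", 0, 3));
```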

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-BZJQTB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-PFAYGN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | LineLengthRange | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReadLine | Job-BZJQTB | m | [0, 0] | 7.371 ns | 0.0943 ns | 0.0883 ns | 7.402 ns | 7.212 ns | 7.498 ns | 1.00 | - | - |
| ReadLine | Job-PFAYGN | pr | [0, 0] | 9.472 ns | 0.1117 ns | 0.0990 ns | 9.471 ns | 9.350 ns | 9.608 ns | 1.28 | - | - |
| ReadLine | Job-BZJQTB | m | [0, 1024] | 292.841 ns | 1.4690 ns | 1.3741 ns | 292.241 ns | 291.296 ns | 295.981 ns | 1.00 | 0.0617 | 1,049 B |
| ReadLine | Job-PFAYGN | pr | [0, 1024] | 90.232 ns | 0.5357 ns | 0.5011 ns | 90.073 ns | 89.500 ns | 91.097 ns | 0.31 | 0.0625 | 1,049 B |
| ReadLine | Job-BZJQTB | m | [1, 1] | 13.278 ns | 0.1174 ns | 0.1098 ns | 13.291 ns | 13.068 ns | 13.433 ns | 1.00 | 0.0014 | 24 B |
| ReadLine | Job-PFAYGN | pr | [1, 1] | 14.013 ns | 0.0911 ns | 0.0853 ns | 13.998 ns | 13.897 ns | 14.180 ns | 1.06 | 0.0014 | 24 B |
| ReadLine | Job-BZJQTB | m | [1, 8] | 18.093 ns | 0.1064 ns | 0.0996 ns | 18.079 ns | 17.968 ns | 18.277 ns | 1.00 | 0.0019 | 33 B |
| ReadLine | Job-PFAYGN | pr | [1, 8] | 16.986 ns | 0.0912 ns | 0.0853 ns | 16.950 ns | 16.884 ns | 17.136 ns | 0.94 | 0.0020 | 33 B |
| ReadLine | Job-BZJQTB | m | [9, 32] | 27.359 ns | 0.1525 ns | 0.1427 ns | 27.381 ns | 27.056 ns | 27.582 ns | 1.00 | 0.0039 | 65 B |
| ReadLine | Job-PFAYGN | pr | [9, 32] | 20.399 ns | 0.1372 ns | 0.1216 ns | 20.407 ns | 20.208 ns | 20.590 ns | 0.75 | 0.0038 | 65 B |
| ReadLine | Job-BZJQTB | m | [33, 128] | 64.000 ns | 0.3802 ns | 0.3557 ns | 64.061 ns | 63.476 ns | 64.639 ns | 1.00 | 0.0110 | 185 B |
| ReadLine | Job-PFAYGN | pr | [33, 128] | 31.459 ns | 0.2157 ns | 0.2018 ns | 31.485 ns | 31.102 ns | 31.758 ns | 0.49 | 0.0110 | 185 B |
| ReadLine | Job-BZJQTB | m | [129, 1024] | 326.246 ns | 1.2682 ns | 1.1863 ns | 325.828 ns | 324.743 ns | 328.814 ns | 1.00 | 0.0704 | 1,181 B |
| ReadLine | Job-PFAYGN | pr | [129, 1024] | 98.230 ns | 0.7595 ns | 0.7104 ns | 97.891 ns | 97.538 ns | 99.812 ns | 0.30 | 0.0704 | 1,181 B |

@nietras
Contributor Author

nietras commented Oct 17, 2021

Now using Environment.NewLine. Line length 1 sees a larger regression now, in line with length 0, so my previous comment about new string might be incorrect.

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-LBAYRI : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-QZQWHF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | LineLengthRange | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReadLine | Job-LBAYRI | m | [0, 0] | 3.602 ns | 0.0187 ns | 0.0175 ns | 3.601 ns | 3.578 ns | 3.636 ns | 1.00 | 0.00 | - | - |
| ReadLine | Job-QZQWHF | pr | [0, 0] | 4.538 ns | 0.0529 ns | 0.0469 ns | 4.538 ns | 4.399 ns | 4.593 ns | 1.26 | 0.01 | - | - |
| ReadLine | Job-LBAYRI | m | [0, 1024] | 290.351 ns | 0.8590 ns | 0.8035 ns | 290.511 ns | 289.016 ns | 291.381 ns | 1.00 | 0.00 | 0.0621 | 1,045 B |
| ReadLine | Job-QZQWHF | pr | [0, 1024] | 86.508 ns | 0.4516 ns | 0.3771 ns | 86.576 ns | 85.995 ns | 87.273 ns | 0.30 | 0.00 | 0.0624 | 1,045 B |
| ReadLine | Job-LBAYRI | m | [1, 1] | 8.032 ns | 0.1276 ns | 0.1194 ns | 8.067 ns | 7.786 ns | 8.233 ns | 1.00 | 0.00 | 0.0014 | 24 B |
| ReadLine | Job-QZQWHF | pr | [1, 1] | 10.258 ns | 0.0778 ns | 0.0689 ns | 10.254 ns | 10.153 ns | 10.393 ns | 1.28 | 0.02 | 0.0014 | 24 B |
| ReadLine | Job-LBAYRI | m | [1, 8] | 16.362 ns | 0.1832 ns | 0.1714 ns | 16.300 ns | 16.151 ns | 16.691 ns | 1.00 | 0.00 | 0.0019 | 33 B |
| ReadLine | Job-QZQWHF | pr | [1, 8] | 14.091 ns | 0.1842 ns | 0.1633 ns | 14.075 ns | 13.780 ns | 14.408 ns | 0.86 | 0.01 | 0.0020 | 33 B |
| ReadLine | Job-LBAYRI | m | [9, 32] | 25.329 ns | 0.1646 ns | 0.1540 ns | 25.380 ns | 25.096 ns | 25.568 ns | 1.00 | 0.00 | 0.0038 | 65 B |
| ReadLine | Job-QZQWHF | pr | [9, 32] | 17.268 ns | 0.0570 ns | 0.0533 ns | 17.280 ns | 17.174 ns | 17.356 ns | 0.68 | 0.00 | 0.0038 | 65 B |
| ReadLine | Job-LBAYRI | m | [33, 128] | 61.464 ns | 0.1807 ns | 0.1602 ns | 61.445 ns | 61.241 ns | 61.771 ns | 1.00 | 0.00 | 0.0108 | 185 B |
| ReadLine | Job-QZQWHF | pr | [33, 128] | 28.177 ns | 0.1248 ns | 0.1167 ns | 28.221 ns | 27.878 ns | 28.306 ns | 0.46 | 0.00 | 0.0110 | 185 B |
| ReadLine | Job-LBAYRI | m | [129, 1024] | 321.899 ns | 0.9247 ns | 0.8650 ns | 321.595 ns | 320.682 ns | 323.301 ns | 1.00 | 0.00 | 0.0702 | 1,175 B |
| ReadLine | Job-QZQWHF | pr | [129, 1024] | 96.303 ns | 0.5141 ns | 0.4557 ns | 96.215 ns | 95.688 ns | 97.331 ns | 0.30 | 0.00 | 0.0699 | 1,175 B |

@stephentoub
Member

stephentoub commented Oct 17, 2021

The 1 to 8 line length case still has some lines of 8 + 2 = 10 chars, or 20 bytes >= 16 bytes in length, so we still hit the 128-bit SSE2 paths in some percentage of cases, right? Additionally, IndexOfAny has optimized code for >= 4 chars, which is more than 50% of cases. Probably, 1 character only would be slower too. I can run a test for that if you like?

Yes, when I commented I hadn't realized your comment about length 1 was about random selections between 1 and 8.

I think this change is still worth taking: it simplifies the code, and improves what's expected to be the majority case. I just want to make sure we understand the regressions before doing so. (And I do think 0 is worth special-casing if it avoids regressing that case while not measurably impacting others; I haven't looked at the change yet, though)

@nietras
Contributor Author

nietras commented Oct 17, 2021

@stephentoub replacing Substring with new string(ReadOnlySpan<char>) does not appear to help on tiny lines, though I thought it might, given that Substring does a bunch of checks. I'd probably still use this, as I assume code size should be smaller.

Since we have context here creating the span for 1 char line could side step the checks of the Span.Slice...
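The two ways of materializing the line that are being compared look roughly like this. It is a small illustrative sketch with made-up input, not code from the PR; both calls copy the same characters into a new string and differ only in which argument checks run first.

```csharp
using System;

string source = "first\nsecond";
int lineLength = source.AsSpan().IndexOf('\n');

// Option 1: Substring validates start and length against the string itself.
string viaSubstring = source.Substring(0, lineLength);

// Option 2: slice a span first, then copy it with the span-based constructor.
string viaSpan = new string(source.AsSpan(0, lineLength));

Console.WriteLine(viaSubstring == viaSpan);
```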

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-VAAVCB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XGNHNV : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | LineLengthRange | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReadLine | Job-VAAVCB | m | [0, 0] | 3.377 ns | 0.0260 ns | 0.0231 ns | 3.367 ns | 3.348 ns | 3.418 ns | 1.00 | 0.00 | - | - |
| ReadLine | Job-XGNHNV | pr | [0, 0] | 4.503 ns | 0.0447 ns | 0.0418 ns | 4.503 ns | 4.419 ns | 4.568 ns | 1.33 | 0.02 | - | - |
| ReadLine | Job-VAAVCB | m | [0, 1024] | 284.428 ns | 2.1638 ns | 1.8068 ns | 283.843 ns | 282.711 ns | 288.775 ns | 1.00 | 0.00 | 0.0619 | 1,045 B |
| ReadLine | Job-XGNHNV | pr | [0, 1024] | 84.528 ns | 0.6573 ns | 0.6149 ns | 84.244 ns | 83.892 ns | 85.629 ns | 0.30 | 0.00 | 0.0624 | 1,045 B |
| ReadLine | Job-VAAVCB | m | [1, 1] | 7.780 ns | 0.1162 ns | 0.1087 ns | 7.819 ns | 7.601 ns | 7.940 ns | 1.00 | 0.00 | 0.0014 | 24 B |
| ReadLine | Job-XGNHNV | pr | [1, 1] | 9.962 ns | 0.0544 ns | 0.0424 ns | 9.972 ns | 9.835 ns | 9.996 ns | 1.27 | 0.01 | 0.0014 | 24 B |
| ReadLine | Job-VAAVCB | m | [1, 8] | 16.026 ns | 0.1693 ns | 0.1501 ns | 16.019 ns | 15.709 ns | 16.328 ns | 1.00 | 0.00 | 0.0019 | 33 B |
| ReadLine | Job-XGNHNV | pr | [1, 8] | 13.556 ns | 0.0395 ns | 0.0308 ns | 13.558 ns | 13.497 ns | 13.612 ns | 0.84 | 0.01 | 0.0019 | 33 B |
| ReadLine | Job-VAAVCB | m | [9, 32] | 25.054 ns | 0.0973 ns | 0.0910 ns | 25.013 ns | 24.947 ns | 25.238 ns | 1.00 | 0.00 | 0.0039 | 65 B |
| ReadLine | Job-XGNHNV | pr | [9, 32] | 16.762 ns | 0.1132 ns | 0.1059 ns | 16.761 ns | 16.624 ns | 16.922 ns | 0.67 | 0.00 | 0.0039 | 65 B |
| ReadLine | Job-VAAVCB | m | [33, 128] | 60.175 ns | 0.2383 ns | 0.2112 ns | 60.153 ns | 59.796 ns | 60.608 ns | 1.00 | 0.00 | 0.0110 | 185 B |
| ReadLine | Job-XGNHNV | pr | [33, 128] | 27.256 ns | 0.2427 ns | 0.2027 ns | 27.213 ns | 26.944 ns | 27.614 ns | 0.45 | 0.00 | 0.0110 | 185 B |
| ReadLine | Job-VAAVCB | m | [129, 1024] | 326.478 ns | 3.8631 ns | 3.6135 ns | 327.742 ns | 318.213 ns | 330.983 ns | 1.00 | 0.00 | 0.0700 | 1,175 B |
| ReadLine | Job-XGNHNV | pr | [129, 1024] | 92.774 ns | 0.5438 ns | 0.5087 ns | 92.504 ns | 92.264 ns | 93.762 ns | 0.28 | 0.00 | 0.0700 | 1,175 B |

@nietras
Contributor Author

nietras commented Oct 18, 2021

Latest benchmark results. No significant changes. Line lengths 0 (+1.2ns) and 1 (+2ns) still have ~30% regressions, all others improvements, going up to 3x for [129, 1024] with -220ns.

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-ZPUFRH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-ZTWPGG : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | LineLengthRange | Mean | Error | StdDev | Median | Min | Max | Ratio | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ReadLine | Job-ZPUFRH | m | [0, 0] | 3.667 ns | 0.0277 ns | 0.0245 ns | 3.665 ns | 3.637 ns | 3.723 ns | 1.00 | - | - |
| ReadLine | Job-ZTWPGG | pr | [0, 0] | 4.862 ns | 0.0306 ns | 0.0271 ns | 4.852 ns | 4.822 ns | 4.912 ns | 1.33 | - | - |
| ReadLine | Job-ZPUFRH | m | [0, 1024] | 279.798 ns | 1.6451 ns | 1.5388 ns | 279.602 ns | 276.440 ns | 282.489 ns | 1.00 | 0.0621 | 1,045 B |
| ReadLine | Job-ZTWPGG | pr | [0, 1024] | 85.326 ns | 0.4640 ns | 0.4113 ns | 85.207 ns | 84.769 ns | 86.083 ns | 0.30 | 0.0622 | 1,045 B |
| ReadLine | Job-ZPUFRH | m | [1, 1] | 7.437 ns | 0.0750 ns | 0.0665 ns | 7.435 ns | 7.271 ns | 7.572 ns | 1.00 | 0.0014 | 24 B |
| ReadLine | Job-ZTWPGG | pr | [1, 1] | 9.588 ns | 0.0712 ns | 0.0666 ns | 9.597 ns | 9.451 ns | 9.710 ns | 1.29 | 0.0014 | 24 B |
| ReadLine | Job-ZPUFRH | m | [1, 8] | 15.502 ns | 0.1362 ns | 0.1207 ns | 15.523 ns | 15.278 ns | 15.717 ns | 1.00 | 0.0019 | 33 B |
| ReadLine | Job-ZTWPGG | pr | [1, 8] | 13.196 ns | 0.0994 ns | 0.0881 ns | 13.193 ns | 13.058 ns | 13.361 ns | 0.85 | 0.0019 | 33 B |
| ReadLine | Job-ZPUFRH | m | [9, 32] | 23.798 ns | 0.1131 ns | 0.1002 ns | 23.804 ns | 23.660 ns | 23.962 ns | 1.00 | 0.0038 | 65 B |
| ReadLine | Job-ZTWPGG | pr | [9, 32] | 16.729 ns | 0.1849 ns | 0.1639 ns | 16.760 ns | 16.508 ns | 17.046 ns | 0.70 | 0.0039 | 65 B |
| ReadLine | Job-ZPUFRH | m | [33, 128] | 58.583 ns | 0.2611 ns | 0.2442 ns | 58.465 ns | 58.337 ns | 59.034 ns | 1.00 | 0.0110 | 185 B |
| ReadLine | Job-ZTWPGG | pr | [33, 128] | 27.529 ns | 0.0886 ns | 0.0829 ns | 27.545 ns | 27.332 ns | 27.625 ns | 0.47 | 0.0110 | 185 B |
| ReadLine | Job-ZPUFRH | m | [129, 1024] | 313.442 ns | 2.0088 ns | 1.8790 ns | 314.089 ns | 308.542 ns | 315.778 ns | 1.00 | 0.0691 | 1,175 B |
| ReadLine | Job-ZTWPGG | pr | [129, 1024] | 94.201 ns | 0.8138 ns | 0.7612 ns | 93.789 ns | 93.577 ns | 95.612 ns | 0.30 | 0.0700 | 1,175 B |

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Thanks so much!

@GrabYourPitchforks
Member

@nietras You and I had some discussion on this issue regarding potential improvements to string.Substring. If you want to try taking those on as a new PR, please feel free. Otherwise we can add it to our own backlog of future improvements.

Thanks again for this! It looks great. :)

@danmoseley
Member

@stephentoub any remaining feedback, or can we merge? The failure is in the HTTP tests, which I will investigate.

nietras and others added 4 commits December 7, 2021 23:03
Co-authored-by: Stephen Toub
@nietras
Contributor Author

nietras commented Dec 7, 2021

@stephentoub hope comments have been resolved, PTAL.

Member

@stephentoub stephentoub left a comment

Thanks!

@adamsitnik adamsitnik added this to the 7.0.0 milestone Dec 8, 2021
@adamsitnik adamsitnik merged commit 836f2c5 into dotnet:main Dec 8, 2021
@nietras
Contributor Author

nietras commented Dec 8, 2021

@GrabYourPitchforks regarding Substring, there are only a limited number of things one can try to improve this, one of which would be to optimize the guard clauses. This would perhaps mean changing the exception messages (not the types), since we would like to do multiple checks in one go. Is that acceptable?

Other than that one might specialize for short substrings... if that would improve anything.
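To make "multiple checks in one go" concrete, here is a hedged sketch of one well-known way to fold range checks into a single comparison using unsigned arithmetic. The function names are hypothetical and this is not a proposal for the actual `Substring` code. It also shows why the messages would have to change: when the combined check fails, it does not say which argument was at fault.

```csharp
using System;

// Separate guards: each failure can be reported against a specific argument.
static bool IsValidRangeSeparate(int start, int length, int totalLength) =>
    start >= 0 && length >= 0 && start <= totalLength && length <= totalLength - start;

// Combined guard: a negative int becomes a value of at least 2^31 once cast to
// uint, so one 64-bit comparison rejects negative and out-of-range arguments
// together. Assumes totalLength is non-negative.
static bool IsValidRangeCombined(int start, int length, int totalLength) =>
    (ulong)(uint)start + (ulong)(uint)length <= (ulong)(uint)totalLength;

Console.WriteLine(IsValidRangeCombined(2, 3, 5));   // True
Console.WriteLine(IsValidRangeCombined(-1, 3, 5));  // False
Console.WriteLine(IsValidRangeCombined(4, 2, 5));   // False
```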

@adamsitnik the existing performance Benchmarks do not look that useful, at least in my runs they are highly volatile and seem to be based on JIT compiling away all checks... maybe... would adding other tests be acceptable? Perhaps you can take a look and tell me if I'm wrong, but use of Arguments seems problematic.

Is there any guidance anywhere on getting a good dev inner loop for looking at corelib asm? How to use BDN in that case? A bit rusty here.

@adamsitnik
Member

the existing performance Benchmarks do not look that useful,

do you mean these benchmarks?

would adding other tests be acceptable?

Yes, of course! The existing string benchmarks are far from perfect, so please feel free to add more.

Is there any guidance anywhere on getting a good dev inner loop for looking at corelib asm? How to use BDN in that case?

For managed part of the corelib you should be able to use disassembly diagnoser with corerun: https://github.com/dotnet/performance/blob/797425cdc4dcbc407f4546fd83e00d79b431ea09/docs/benchmarkdotnet.md#disassembly

You can also use VTune: https://github.com/dotnet/performance/blob/5d6fc749333ce7d7e34e6d0ac311df67c5d2047c/docs/profiling-workflow-dotnet-runtime.md#vtune

or use one of tools provided by the Jit Team. @kunalspathak @EgorBo @AndyAyersMS what is your preferred way of getting the disassembly?

@kunalspathak
Member

I would use VTune to profile and to see which instructions in the disassembly are hot. If you want to use the JitDisasm and JitDump flags, you can set those environment variables with the respective method names and run dotnet MicroBenchmarks.dll with --corerun path\to\checked\corerun.

@nietras nietras deleted the stringreader-use-indexofany-in-readline branch December 9, 2021 13:17
@nietras
Contributor Author

nietras commented Dec 9, 2021

@kunalspathak @adamsitnik thanks both for the pointers, I'll try to see if I can figure this out! 😅

do you mean these benchmarks?

Yes, see the example run below, where m and pr are the exact same commit. Here there is up to a 50% difference, and I have observed divergences of up to 80%.

BDN notes one of these is multi-modal.

Additionally, Substring(0) is reported to take 1 ns which seems a bit suspicious to me, that's 5 cycles. Yes it only has to return the same string, but still. Is the call inlined and the JIT fully compiles things away since the start is "hard-coded"?
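The "only has to return the same string" part can be observed directly. The snippet below is a small check, not a benchmark, and it says nothing about whether the JIT inlines the call in the benchmark above; it only shows that `Substring(0)` hands back the original instance while a non-zero start allocates a new string.

```csharp
using System;

string s = "dzsdzsDDZSDZSDZSddsz";

// Substring(0) returns the original string unchanged, so no allocation occurs.
Console.WriteLine(ReferenceEquals(s, s.Substring(0)));

// A non-zero start has to copy the tail into a new string.
Console.WriteLine(ReferenceEquals(s, s.Substring(7)));
```

That matches the `Allocated` column for the `Substring_Int` rows with `i = 0`, which report no allocation.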

If you have any suggestions on how to make these more stable I am all ears. One thing I want is better coverage of short substrings e.g. sizes 0-8 or similar.

// * Warnings *
MultimodalDistribution
  Perf_String.Substring_Int: PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog, Toolchain=\runtime-pr\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe, IterationTime=250.0000 ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1 -> It seems that the distribution can have several modes (mValue = 3)
BenchmarkDotNet=v0.13.1.1620-nightly, OS=Windows 10.0.19044.1348 (21H2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-ZIOPOT : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-ESNSIZ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
| Method | Job | Toolchain | s | i1 | i2 | i | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Substring_IntInt | Job-ZIOPOT | m | dzsdzsDDZSDZSDZSddsz | 0 | 8 | ? | 13.046 ns | 0.6042 ns | 0.6958 ns | 13.142 ns | 10.866 ns | 14.260 ns | 1.00 | 0.00 | 0.0024 | 40 B |
| Substring_IntInt | Job-ESNSIZ | pr | dzsdzsDDZSDZSDZSddsz | 0 | 8 | ? | 12.091 ns | 0.7554 ns | 0.8700 ns | 12.326 ns | 9.890 ns | 13.647 ns | 0.93 | 0.09 | 0.0024 | 40 B |
| Substring_Int | Job-ZIOPOT | m | dzsdzsDDZSDZSDZSddsz | ? | ? | 0 | 1.092 ns | 0.0024 ns | 0.0023 ns | 1.091 ns | 1.088 ns | 1.096 ns | 1.00 | 0.00 | - | - |
| Substring_Int | Job-ESNSIZ | pr | dzsdzsDDZSDZSDZSddsz | ? | ? | 0 | 1.094 ns | 0.0066 ns | 0.0061 ns | 1.094 ns | 1.085 ns | 1.105 ns | 1.00 | 0.01 | - | - |
| Substring_IntInt | Job-ZIOPOT | m | dzsdzsDDZSDZSDZSddsz | 7 | 4 | ? | 9.119 ns | 0.7003 ns | 0.8065 ns | 9.210 ns | 7.828 ns | 10.840 ns | 1.00 | 0.00 | 0.0019 | 32 B |
| Substring_IntInt | Job-ESNSIZ | pr | dzsdzsDDZSDZSDZSddsz | 7 | 4 | ? | 8.439 ns | 0.5301 ns | 0.6104 ns | 8.404 ns | 7.452 ns | 9.690 ns | 0.93 | 0.09 | 0.0019 | 32 B |
| Substring_Int | Job-ZIOPOT | m | dzsdzsDDZSDZSDZSddsz | ? | ? | 7 | 8.947 ns | 0.5383 ns | 0.5983 ns | 8.964 ns | 7.750 ns | 10.128 ns | 1.00 | 0.00 | 0.0029 | 48 B |
| Substring_Int | Job-ESNSIZ | pr | dzsdzsDDZSDZSDZSddsz | ? | ? | 7 | 9.384 ns | 0.7757 ns | 0.8934 ns | 9.538 ns | 7.507 ns | 10.890 ns | 1.04 | 0.09 | 0.0028 | 48 B |
| Substring_IntInt | Job-ZIOPOT | m | dzsdzsDDZSDZSDZSddsz | 10 | 1 | ? | 5.125 ns | 0.1145 ns | 0.1071 ns | 5.120 ns | 4.950 ns | 5.324 ns | 1.00 | 0.00 | 0.0014 | 24 B |
| Substring_IntInt | Job-ESNSIZ | pr | dzsdzsDDZSDZSDZSddsz | 10 | 1 | ? | 7.813 ns | 0.5706 ns | 0.6105 ns | 7.760 ns | 6.692 ns | 9.071 ns | 1.53 | 0.14 | 0.0014 | 24 B |
| Substring_Int | Job-ZIOPOT | m | dzsdzsDDZSDZSDZSddsz | ? | ? | 10 | 11.729 ns | 0.9097 ns | 1.0112 ns | 11.659 ns | 8.977 ns | 13.597 ns | 1.00 | 0.00 | 0.0028 | 48 B |
| Substring_Int | Job-ESNSIZ | pr | dzsdzsDDZSDZSDZSddsz | ? | ? | 10 | 9.645 ns | 0.6790 ns | 0.7819 ns | 9.837 ns | 8.141 ns | 10.912 ns | 0.83 | 0.12 | 0.0028 | 48 B |

@ghost ghost locked as resolved and limited conversation to collaborators Jan 8, 2022
@EgorBo
Member

EgorBo commented Jan 25, 2022

Improvement on ubuntu-x64 dotnet/perf-autofiling-issues#2670

Labels
area-System.IO community-contribution Indicates that the PR has been added by a community member tenet-performance Performance related issue
7 participants