S.IO.StringReader: Use ReadOnlySpan.IndexOfAny in ReadLine() for performance #60463

nietras · 2021-10-15T09:37:56Z

Have not done benchmarks of this yet, as I wanted to know first if this could be an acceptable change. The premise is IndexOfAny is highly optimized and uses vectorization if possible. Also untested.
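As a rough illustration of the premise, here is a minimal sketch of a line reader that finds the terminator with a single `IndexOfAny` call instead of a char-by-char loop. `SpanLineReader` is a hypothetical stand-in written for this example, not the actual `StringReader` code from the PR; the handling of `"\r\n"` as one line break follows the usual `ReadLine` convention.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for StringReader, for illustration only.
public sealed class SpanLineReader
{
    private readonly string _s;
    private int _pos;

    public SpanLineReader(string s) => _s = s ?? throw new ArgumentNullException(nameof(s));

    public string? ReadLine()
    {
        if (_pos >= _s.Length)
        {
            return null; // end of input
        }

        ReadOnlySpan<char> remaining = _s.AsSpan(_pos);

        // One search for either terminator; the span helper can vectorize this.
        int foundLineLength = remaining.IndexOfAny('\r', '\n');
        if (foundLineLength < 0)
        {
            // No terminator left: the rest of the string is the last line.
            _pos = _s.Length;
            return remaining.ToString();
        }

        string line = remaining.Slice(0, foundLineLength).ToString();

        // Skip the terminator, treating "\r\n" as a single line break.
        char terminator = remaining[foundLineLength];
        _pos += foundLineLength + 1;
        if (terminator == '\r' && _pos < _s.Length && _s[_pos] == '\n')
        {
            _pos++;
        }

        return line;
    }
}
```

Calling `ReadLine()` repeatedly on `new SpanLineReader("a\r\nbc\n\nd")` yields the lines one at a time and then `null` once the input is exhausted.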

ghost · 2021-10-15T09:38:06Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Author:	nietras
Assignees:	-
Labels:	`area-System.IO`, `community-contribution`
Milestone:	-

adamsitnik · 2021-10-15T14:13:16Z

It sounds reasonable as long as there is no regression for relatively short lines (20 characters?) that might be common.

danmoseley · 2021-10-15T14:35:54Z

You might consider reviewing/augmenting the coverage for this in dotnet/performance first. Then using that to evaluate it.

nietras · 2021-10-15T18:54:27Z

@adamsitnik code size will of course increase if you count IndexOfAny, but agree tests should show small line length perf differences.

@danmoseley unfortunately I cannot find any benchmarks of StringReader in https://github.com/dotnet/performance/

Can you guys confirm this? I guess I will add a benchmark for this under MicroBenchmarks in that case.

nietras · 2021-10-15T18:57:00Z

Guessing location of test should be something like C:\git\oss\performance\src\benchmarks\micro\libraries\System.IO e.g. StringReaderReadLineTests.cs.

danmoseley · 2021-10-15T19:04:04Z

@nietras yes, it looks like there are none (I didn't check earlier as I was on my phone). Yes, that would be the place to put some. Seems like they need not be that elaborate.

nietras · 2021-10-17T11:16:17Z

Added benchmark in dotnet/performance#2083 and repeating results here.

Even for small line lengths [1, 8] there is a minor speedup (4%) and hence no regressions. For empty lines e.g. just new lines, there is a 40% regression, though.

For longer lines we see a nice 2-3x speed up.

It might be possible to do something about the empty line regression, but perhaps that is not so important?

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-BIEWJM : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-IFXICU : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	LineLengthRange	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Allocated
ReadLine	Job-BIEWJM	runtime-m	[ 0, 0]	7.208 ns	0.0556 ns	0.0434 ns	7.194 ns	7.165 ns	7.288 ns	1.00	0.00	-	-
ReadLine	Job-IFXICU	runtime-pr	[ 0, 0]	10.112 ns	0.1781 ns	0.1666 ns	10.105 ns	9.875 ns	10.438 ns	1.41	0.02	-	-

ReadLine	Job-BIEWJM	runtime-m	[ 1, 8]	17.751 ns	0.2287 ns	0.2028 ns	17.686 ns	17.552 ns	18.192 ns	1.00	0.00	0.0020	33 B
ReadLine	Job-IFXICU	runtime-pr	[ 1, 8]	16.998 ns	0.1011 ns	0.0946 ns	16.992 ns	16.872 ns	17.212 ns	0.96	0.01	0.0020	33 B

ReadLine	Job-BIEWJM	runtime-m	[ 9, 32]	26.336 ns	0.1205 ns	0.1127 ns	26.273 ns	26.200 ns	26.510 ns	1.00	0.00	0.0039	65 B
ReadLine	Job-IFXICU	runtime-pr	[ 9, 32]	19.896 ns	0.0740 ns	0.0618 ns	19.879 ns	19.801 ns	20.046 ns	0.76	0.00	0.0038	65 B

ReadLine	Job-BIEWJM	runtime-m	[ 33, 128]	62.594 ns	0.2100 ns	0.1861 ns	62.584 ns	62.222 ns	62.928 ns	1.00	0.00	0.0108	185 B
ReadLine	Job-IFXICU	runtime-pr	[ 33, 128]	31.021 ns	0.0902 ns	0.0800 ns	31.015 ns	30.912 ns	31.200 ns	0.50	0.00	0.0110	185 B

ReadLine	Job-BIEWJM	runtime-m	[ 129,1024]	319.244 ns	0.6595 ns	0.6169 ns	319.302 ns	318.088 ns	320.391 ns	1.00	0.00	0.0697	1,181 B
ReadLine	Job-IFXICU	runtime-pr	[ 129,1024]	98.174 ns	0.2733 ns	0.2282 ns	98.123 ns	97.705 ns	98.567 ns	0.31	0.00	0.0705	1,181 B

nietras · 2021-10-17T11:21:24Z

ReadLine code has the following:

        public override string? ReadLine()
        {
            if (_s == null)
            {
                throw new ObjectDisposedException(null, SR.ObjectDisposed_ReaderClosed);
            }

Would it be of any benefit to use ThrowHelper here? Or is that a moot point now?

stephentoub · 2021-10-17T11:24:23Z

> Even for small line lengths [1, 8] there is a minor speedup (4%) and hence no regressions. For empty lines e.g. just new lines, there is a 40% regression, though.

I'm not surprised by line length 0. I am surprised by the line length 1. Do we have an explanation for why that gets faster?

nietras · 2021-10-17T11:39:42Z

> I'm not surprised by line length 0. I am surprised by the line length 1. Do we have an explanation for why that gets faster?

Me neither on line length 0, although I have improved that a bit. The 1 to 8 line length case still has some lines of 8 + 2 = 10 chars, i.e. 20 bytes, which is more than 16 bytes, so we still hit the 128-bit SSE2 paths in some percentage of cases, right? Additionally, IndexOfAny has optimized code for >= 4 chars, which is more than 50% of cases.

Probably, 1 character only would be slower too. I can run a test for that if you like?

nietras · 2021-10-17T11:49:35Z

Improved regression a little bit by special casing line length 0 to avoid Substring overhead. Also added ThrowHelper.

Additionally, added line length 1 case. 6% regression. No doubt still dominated by Substring and new string here, so overhead of calling IndexOfAny not that big an issue, perhaps.
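A minimal sketch of the two tweaks mentioned above, under stated assumptions: `ReaderHelpers`, `SliceLine`, and `ThrowReaderClosed` are hypothetical names for this example, and the exception message is a placeholder for the real resource string. The idea is that the throw lives in a separate method so the hot path stays small, and an empty line returns `string.Empty` without calling `Substring` at all.

```csharp
#nullable enable
using System;
using System.Diagnostics.CodeAnalysis;

// Hypothetical helper type, for illustration only.
public static class ReaderHelpers
{
    // Throw-helper pattern: the throw is moved out of the hot method.
    // The message text is a placeholder, not the actual resource string.
    [DoesNotReturn]
    private static void ThrowReaderClosed() =>
        throw new ObjectDisposedException(null, "Cannot read from a closed reader.");

    // Returns the line of the given length starting at pos.
    public static string SliceLine(string? s, int pos, int foundLineLength)
    {
        if (s == null)
        {
            ThrowReaderClosed();
        }

        // Special case: an empty line needs no Substring call.
        return foundLineLength == 0 ? string.Empty : s.Substring(pos, foundLineLength);
    }
}
```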

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-BZJQTB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-PFAYGN : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	LineLengthRange	Mean	Error	StdDev	Median	Min	Max	Ratio	Gen 0	Allocated
ReadLine	Job-BZJQTB	m	[ 0, 0]	7.371 ns	0.0943 ns	0.0883 ns	7.402 ns	7.212 ns	7.498 ns	1.00	-	-
ReadLine	Job-PFAYGN	pr	[ 0, 0]	9.472 ns	0.1117 ns	0.0990 ns	9.471 ns	9.350 ns	9.608 ns	1.28	-	-

ReadLine	Job-BZJQTB	m	[ 0, 1024]	292.841 ns	1.4690 ns	1.3741 ns	292.241 ns	291.296 ns	295.981 ns	1.00	0.0617	1,049 B
ReadLine	Job-PFAYGN	pr	[ 0, 1024]	90.232 ns	0.5357 ns	0.5011 ns	90.073 ns	89.500 ns	91.097 ns	0.31	0.0625	1,049 B

ReadLine	Job-BZJQTB	m	[ 1, 1]	13.278 ns	0.1174 ns	0.1098 ns	13.291 ns	13.068 ns	13.433 ns	1.00	0.0014	24 B
ReadLine	Job-PFAYGN	pr	[ 1, 1]	14.013 ns	0.0911 ns	0.0853 ns	13.998 ns	13.897 ns	14.180 ns	1.06	0.0014	24 B

ReadLine	Job-BZJQTB	m	[ 1, 8]	18.093 ns	0.1064 ns	0.0996 ns	18.079 ns	17.968 ns	18.277 ns	1.00	0.0019	33 B
ReadLine	Job-PFAYGN	pr	[ 1, 8]	16.986 ns	0.0912 ns	0.0853 ns	16.950 ns	16.884 ns	17.136 ns	0.94	0.0020	33 B

ReadLine	Job-BZJQTB	m	[ 9, 32]	27.359 ns	0.1525 ns	0.1427 ns	27.381 ns	27.056 ns	27.582 ns	1.00	0.0039	65 B
ReadLine	Job-PFAYGN	pr	[ 9, 32]	20.399 ns	0.1372 ns	0.1216 ns	20.407 ns	20.208 ns	20.590 ns	0.75	0.0038	65 B

ReadLine	Job-BZJQTB	m	[ 33, 128]	64.000 ns	0.3802 ns	0.3557 ns	64.061 ns	63.476 ns	64.639 ns	1.00	0.0110	185 B
ReadLine	Job-PFAYGN	pr	[ 33, 128]	31.459 ns	0.2157 ns	0.2018 ns	31.485 ns	31.102 ns	31.758 ns	0.49	0.0110	185 B

ReadLine	Job-BZJQTB	m	[ 129, 1024]	326.246 ns	1.2682 ns	1.1863 ns	325.828 ns	324.743 ns	328.814 ns	1.00	0.0704	1,181 B
ReadLine	Job-PFAYGN	pr	[ 129, 1024]	98.230 ns	0.7595 ns	0.7104 ns	97.891 ns	97.538 ns	99.812 ns	0.30	0.0704	1,181 B

nietras · 2021-10-17T12:08:39Z

Using Environment.NewLine. Line length 1 sees a larger regression now, in line with length 0, so my previous comment about new string might be incorrect.

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-LBAYRI : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-QZQWHF : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	LineLengthRange	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Allocated
ReadLine	Job-LBAYRI	m	[ 0, 0]	3.602 ns	0.0187 ns	0.0175 ns	3.601 ns	3.578 ns	3.636 ns	1.00	0.00	-	-
ReadLine	Job-QZQWHF	pr	[ 0, 0]	4.538 ns	0.0529 ns	0.0469 ns	4.538 ns	4.399 ns	4.593 ns	1.26	0.01	-	-

ReadLine	Job-LBAYRI	m	[ 0, 1024]	290.351 ns	0.8590 ns	0.8035 ns	290.511 ns	289.016 ns	291.381 ns	1.00	0.00	0.0621	1,045 B
ReadLine	Job-QZQWHF	pr	[ 0, 1024]	86.508 ns	0.4516 ns	0.3771 ns	86.576 ns	85.995 ns	87.273 ns	0.30	0.00	0.0624	1,045 B

ReadLine	Job-LBAYRI	m	[ 1, 1]	8.032 ns	0.1276 ns	0.1194 ns	8.067 ns	7.786 ns	8.233 ns	1.00	0.00	0.0014	24 B
ReadLine	Job-QZQWHF	pr	[ 1, 1]	10.258 ns	0.0778 ns	0.0689 ns	10.254 ns	10.153 ns	10.393 ns	1.28	0.02	0.0014	24 B

ReadLine	Job-LBAYRI	m	[ 1, 8]	16.362 ns	0.1832 ns	0.1714 ns	16.300 ns	16.151 ns	16.691 ns	1.00	0.00	0.0019	33 B
ReadLine	Job-QZQWHF	pr	[ 1, 8]	14.091 ns	0.1842 ns	0.1633 ns	14.075 ns	13.780 ns	14.408 ns	0.86	0.01	0.0020	33 B

ReadLine	Job-LBAYRI	m	[ 9, 32]	25.329 ns	0.1646 ns	0.1540 ns	25.380 ns	25.096 ns	25.568 ns	1.00	0.00	0.0038	65 B
ReadLine	Job-QZQWHF	pr	[ 9, 32]	17.268 ns	0.0570 ns	0.0533 ns	17.280 ns	17.174 ns	17.356 ns	0.68	0.00	0.0038	65 B

ReadLine	Job-LBAYRI	m	[ 33, 128]	61.464 ns	0.1807 ns	0.1602 ns	61.445 ns	61.241 ns	61.771 ns	1.00	0.00	0.0108	185 B
ReadLine	Job-QZQWHF	pr	[ 33, 128]	28.177 ns	0.1248 ns	0.1167 ns	28.221 ns	27.878 ns	28.306 ns	0.46	0.00	0.0110	185 B

ReadLine	Job-LBAYRI	m	[ 129, 1024]	321.899 ns	0.9247 ns	0.8650 ns	321.595 ns	320.682 ns	323.301 ns	1.00	0.00	0.0702	1,175 B
ReadLine	Job-QZQWHF	pr	[ 129, 1024]	96.303 ns	0.5141 ns	0.4557 ns	96.215 ns	95.688 ns	97.331 ns	0.30	0.00	0.0699	1,175 B

stephentoub · 2021-10-17T13:47:16Z

> The 1 to 8 line length case still has some lines of 8 + 2 = 10 chars, i.e. 20 bytes, which is more than 16 bytes, so we still hit the 128-bit SSE2 paths in some percentage of cases, right? Additionally, IndexOfAny has optimized code for >= 4 chars, which is more than 50% of cases. Probably, 1 character only would be slower too. I can run a test for that if you like?

Yes, when I commented I hadn't realized your comment about length 1 was about random selections between 1 and 8.

I think this change is still worth taking: it simplifies the code, and improves what's expected to be the majority case. I just want to make sure we understand the regressions before doing so. (And I do think 0 is worth special-casing if it avoids regressing that case while not measurably impacting others; I haven't looked at the change yet, though)

nietras · 2021-10-17T14:11:24Z

@stephentoub replacing Substring with new string(ReadOnlySpan<char>) does not appear to help on tiny lines, though I thought it might, given that Substring does a bunch of checks. I'd probably still use this, as I assume code size should be smaller.

Since we have the context here, creating the span for a 1 char line directly could sidestep the checks in Span.Slice...
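For reference, the two ways of copying a line out of the source string that are being compared can be sketched as follows. `LineCopy` and its method names are hypothetical and exist only for this example; both methods return a string with the same contents, and the benchmark question is only which one has less overhead for tiny lines.

```csharp
using System;

// Hypothetical helper type, for illustration only.
public static class LineCopy
{
    // Original approach: Substring validates its arguments, then copies.
    public static string ViaSubstring(string s, int start, int length) =>
        s.Substring(start, length);

    // Alternative from this comment: build the string from a span slice.
    public static string ViaSpan(string s, int start, int length) =>
        new string(s.AsSpan(start, length));
}
```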

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-VAAVCB : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XGNHNV : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	LineLengthRange	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Allocated
ReadLine	Job-VAAVCB	m	[ 0, 0]	3.377 ns	0.0260 ns	0.0231 ns	3.367 ns	3.348 ns	3.418 ns	1.00	0.00	-	-
ReadLine	Job-XGNHNV	pr	[ 0, 0]	4.503 ns	0.0447 ns	0.0418 ns	4.503 ns	4.419 ns	4.568 ns	1.33	0.02	-	-

ReadLine	Job-VAAVCB	m	[ 0, 1024]	284.428 ns	2.1638 ns	1.8068 ns	283.843 ns	282.711 ns	288.775 ns	1.00	0.00	0.0619	1,045 B
ReadLine	Job-XGNHNV	pr	[ 0, 1024]	84.528 ns	0.6573 ns	0.6149 ns	84.244 ns	83.892 ns	85.629 ns	0.30	0.00	0.0624	1,045 B

ReadLine	Job-VAAVCB	m	[ 1, 1]	7.780 ns	0.1162 ns	0.1087 ns	7.819 ns	7.601 ns	7.940 ns	1.00	0.00	0.0014	24 B
ReadLine	Job-XGNHNV	pr	[ 1, 1]	9.962 ns	0.0544 ns	0.0424 ns	9.972 ns	9.835 ns	9.996 ns	1.27	0.01	0.0014	24 B

ReadLine	Job-VAAVCB	m	[ 1, 8]	16.026 ns	0.1693 ns	0.1501 ns	16.019 ns	15.709 ns	16.328 ns	1.00	0.00	0.0019	33 B
ReadLine	Job-XGNHNV	pr	[ 1, 8]	13.556 ns	0.0395 ns	0.0308 ns	13.558 ns	13.497 ns	13.612 ns	0.84	0.01	0.0019	33 B

ReadLine	Job-VAAVCB	m	[ 9, 32]	25.054 ns	0.0973 ns	0.0910 ns	25.013 ns	24.947 ns	25.238 ns	1.00	0.00	0.0039	65 B
ReadLine	Job-XGNHNV	pr	[ 9, 32]	16.762 ns	0.1132 ns	0.1059 ns	16.761 ns	16.624 ns	16.922 ns	0.67	0.00	0.0039	65 B

ReadLine	Job-VAAVCB	m	[ 33, 128]	60.175 ns	0.2383 ns	0.2112 ns	60.153 ns	59.796 ns	60.608 ns	1.00	0.00	0.0110	185 B
ReadLine	Job-XGNHNV	pr	[ 33, 128]	27.256 ns	0.2427 ns	0.2027 ns	27.213 ns	26.944 ns	27.614 ns	0.45	0.00	0.0110	185 B

ReadLine	Job-VAAVCB	m	[ 129, 1024]	326.478 ns	3.8631 ns	3.6135 ns	327.742 ns	318.213 ns	330.983 ns	1.00	0.00	0.0700	1,175 B
ReadLine	Job-XGNHNV	pr	[ 129, 1024]	92.774 ns	0.5438 ns	0.5087 ns	92.504 ns	92.264 ns	93.762 ns	0.28	0.00	0.0700	1,175 B

nietras · 2021-10-18T14:24:40Z

Latest benchmark results. No significant changes. Line lengths 0 (+1.2ns) and 1 (+2ns) still have ~30% regressions, all others improvements, going up to 3x for [129, 1024] with -220ns.

BenchmarkDotNet=v0.13.1.1611-nightly, OS=Windows 10.0.19043.1266 (21H1/May2021Update)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=6.0.100-rc.2.21505.57
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-ZPUFRH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-ZTWPGG : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	LineLengthRange	Mean	Error	StdDev	Median	Min	Max	Ratio	Gen 0	Allocated
ReadLine	Job-ZPUFRH	m	[ 0, 0]	3.667 ns	0.0277 ns	0.0245 ns	3.665 ns	3.637 ns	3.723 ns	1.00	-	-
ReadLine	Job-ZTWPGG	pr	[ 0, 0]	4.862 ns	0.0306 ns	0.0271 ns	4.852 ns	4.822 ns	4.912 ns	1.33	-	-

ReadLine	Job-ZPUFRH	m	[ 0, 1024]	279.798 ns	1.6451 ns	1.5388 ns	279.602 ns	276.440 ns	282.489 ns	1.00	0.0621	1,045 B
ReadLine	Job-ZTWPGG	pr	[ 0, 1024]	85.326 ns	0.4640 ns	0.4113 ns	85.207 ns	84.769 ns	86.083 ns	0.30	0.0622	1,045 B

ReadLine	Job-ZPUFRH	m	[ 1, 1]	7.437 ns	0.0750 ns	0.0665 ns	7.435 ns	7.271 ns	7.572 ns	1.00	0.0014	24 B
ReadLine	Job-ZTWPGG	pr	[ 1, 1]	9.588 ns	0.0712 ns	0.0666 ns	9.597 ns	9.451 ns	9.710 ns	1.29	0.0014	24 B

ReadLine	Job-ZPUFRH	m	[ 1, 8]	15.502 ns	0.1362 ns	0.1207 ns	15.523 ns	15.278 ns	15.717 ns	1.00	0.0019	33 B
ReadLine	Job-ZTWPGG	pr	[ 1, 8]	13.196 ns	0.0994 ns	0.0881 ns	13.193 ns	13.058 ns	13.361 ns	0.85	0.0019	33 B

ReadLine	Job-ZPUFRH	m	[ 9, 32]	23.798 ns	0.1131 ns	0.1002 ns	23.804 ns	23.660 ns	23.962 ns	1.00	0.0038	65 B
ReadLine	Job-ZTWPGG	pr	[ 9, 32]	16.729 ns	0.1849 ns	0.1639 ns	16.760 ns	16.508 ns	17.046 ns	0.70	0.0039	65 B

ReadLine	Job-ZPUFRH	m	[ 33, 128]	58.583 ns	0.2611 ns	0.2442 ns	58.465 ns	58.337 ns	59.034 ns	1.00	0.0110	185 B
ReadLine	Job-ZTWPGG	pr	[ 33, 128]	27.529 ns	0.0886 ns	0.0829 ns	27.545 ns	27.332 ns	27.625 ns	0.47	0.0110	185 B

ReadLine	Job-ZPUFRH	m	[ 129, 1024]	313.442 ns	2.0088 ns	1.8790 ns	314.089 ns	308.542 ns	315.778 ns	1.00	0.0691	1,175 B
ReadLine	Job-ZTWPGG	pr	[ 129, 1024]	94.201 ns	0.8138 ns	0.7612 ns	93.789 ns	93.577 ns	95.612 ns	0.30	0.0700	1,175 B

GrabYourPitchforks

Thanks so much!

GrabYourPitchforks · 2021-12-07T21:32:46Z

@nietras You and I had some discussion on this issue regarding potential improvements to string.Substring. If you want to try taking those on as a new PR, please feel free. Otherwise we can add it to our own backlog of future improvements.

Thanks again for this! It looks great. :)

danmoseley · 2021-12-07T21:49:04Z

@stephentoub any remaining feedback or can we merge? the failure is in HTTP tests which I will investigate.

src/libraries/System.Private.CoreLib/src/System/IO/StringReader.cs

Co-authored-by: Stephen Toub <[email protected]>

nietras · 2021-12-07T22:17:14Z

@stephentoub hope comments have been resolved, PTAL.

stephentoub

Thanks!

nietras · 2021-12-08T14:19:36Z

@GrabYourPitchforks regarding Substring, there are only a limited number of things one can try to improve this, one of which would be to try and optimize the guard clauses. This would perhaps mean changing the exception messages (not the types), since we would want to do multiple checks in one go, for example. Is that acceptable?

Other than that one might specialize for short substrings... if that would improve anything.
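To make the "multiple checks in one go" idea concrete, here is a sketch under stated assumptions. `RangeGuards` and its members are hypothetical names for this example, and the combined check is the general unsigned-arithmetic idiom for range validation, not necessarily what `string.Substring` does or would do. It shows why the messages would have to change: a single combined test can no longer tell which argument was at fault.

```csharp
using System;

// Hypothetical helper type, for illustration only.
public static class RangeGuards
{
    // Four separate checks; each could carry its own exception message.
    public static bool IsValidSeparate(int totalLength, int startIndex, int length)
    {
        if (startIndex < 0) return false;
        if (startIndex > totalLength) return false;
        if (length < 0) return false;
        if (startIndex > totalLength - length) return false;
        return true;
    }

    // One combined check: a negative int reinterpreted as uint is at least 2^31,
    // so the 64-bit sum exceeds any valid (non-negative) totalLength.
    public static bool IsValidCombined(int totalLength, int startIndex, int length) =>
        (ulong)(uint)startIndex + (ulong)(uint)length <= (ulong)(uint)totalLength;

    // With a single check there is only one, more generic, message left.
    public static string SubstringCombined(string s, int startIndex, int length)
    {
        if (!IsValidCombined(s.Length, startIndex, length))
        {
            throw new ArgumentOutOfRangeException(
                nameof(startIndex),
                "startIndex and length must refer to a location within the string.");
        }

        return s.Substring(startIndex, length);
    }
}
```

The exception type stays `ArgumentOutOfRangeException` in both variants; only the ability to give a per-argument message is lost.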

@adamsitnik the existing performance Benchmarks do not look that useful, at least in my runs they are highly volatile and seem to be based on JIT compiling away all checks... maybe... would adding other tests be acceptable? Perhaps you can take a look and tell me if I'm wrong, but use of Arguments seems problematic.

Is there any guidance anywhere on getting a good dev inner loop for looking at corelib asm? How to use BDN in that case? A bit rusty here.

adamsitnik · 2021-12-09T06:52:01Z

> the existing performance Benchmarks do not look that useful,

do you mean these benchmarks?

> would adding other tests be acceptable?

Yes, of course! The existing string benchmarks are far from perfect, please feel free to add more.

> Is there any guidance anywhere on getting a good dev inner loop for looking at corelib asm? How to use BDN in that case?

For managed part of the corelib you should be able to use disassembly diagnoser with corerun: https://github.com/dotnet/performance/blob/797425cdc4dcbc407f4546fd83e00d79b431ea09/docs/benchmarkdotnet.md#disassembly

You can also use VTune: https://github.com/dotnet/performance/blob/5d6fc749333ce7d7e34e6d0ac311df67c5d2047c/docs/profiling-workflow-dotnet-runtime.md#vtune

or use one of tools provided by the Jit Team. @kunalspathak @EgorBo @AndyAyersMS what is your preferred way of getting the disassembly?

kunalspathak · 2021-12-09T06:55:29Z

I would use vTune to profile and also see the instructions in the disassembly that are hot. If you want to use JitDisasm and JitDump flags, then you can set those environment variables with respective method names and use dotnet MicroBenchmarks.dll with --corerun path\to\checked\corerun.

nietras · 2021-12-09T13:24:46Z

@kunalspathak @adamsitnik thanks both for the pointers, I'll try to see if I can figure this out! 😅

> do you mean these benchmarks?

Yes, an example run is below, where m and pr are the exact same commit. Here there is up to a 50% difference. I have observed divergences of up to 80%.

BDN notes one of these is multi-modal.

Additionally, Substring(0) is reported to take 1 ns which seems a bit suspicious to me, that's 5 cycles. Yes it only has to return the same string, but still. Is the call inlined and the JIT fully compiles things away since the start is "hard-coded"?
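A quick way to see why the `Substring(0)` case does so little work is to check whether it hands back the same object. The sketch below assumes the documented behavior that a start index of zero returns the original string unchanged; `SubstringZero` is a hypothetical name for this example, and the check says nothing about whether the JIT additionally folds the call away in the benchmark.

```csharp
using System;

// Hypothetical helper type, for illustration only.
public static class SubstringZero
{
    // True when Substring(0) returns the very same string instance,
    // i.e. no allocation and no copy took place.
    public static bool ReturnsSameInstance(string s) =>
        ReferenceEquals(s, s.Substring(0));
}
```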

If you have any suggestions on how to make these more stable I am all ears. One thing I want is better coverage of short substrings e.g. sizes 0-8 or similar.

// * Warnings *
MultimodalDistribution
  Perf_String.Substring_Int: PowerPlanMode=00000000-0000-0000-0000-000000000000, Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog, Toolchain=\runtime-pr\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe, IterationTime=250.0000 ms, MaxIterationCount=20, MinIterationCount=15, WarmupCount=1 -> It seems that the distribution can have several modes (mValue = 3)

BenchmarkDotNet=v0.13.1.1620-nightly, OS=Windows 10.0.19044.1348 (21H2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
  [Host]     : .NET 6.0.0 (6.0.21.48005), X64 RyuJIT
  Job-ZIOPOT : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-ESNSIZ : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable,-bl:benchmarkdotnet.binlog  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Method	Job	Toolchain	s	i1	i2	i	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Allocated
Substring_IntInt	Job-ZIOPOT	m	dzsdzsDDZSDZSDZSddsz	0	8	?	13.046 ns	0.6042 ns	0.6958 ns	13.142 ns	10.866 ns	14.260 ns	1.00	0.00	0.0024	40 B
Substring_IntInt	Job-ESNSIZ	pr	dzsdzsDDZSDZSDZSddsz	0	8	?	12.091 ns	0.7554 ns	0.8700 ns	12.326 ns	9.890 ns	13.647 ns	0.93	0.09	0.0024	40 B

Substring_Int	Job-ZIOPOT	m	dzsdzsDDZSDZSDZSddsz	?	?	0	1.092 ns	0.0024 ns	0.0023 ns	1.091 ns	1.088 ns	1.096 ns	1.00	0.00	-	-
Substring_Int	Job-ESNSIZ	pr	dzsdzsDDZSDZSDZSddsz	?	?	0	1.094 ns	0.0066 ns	0.0061 ns	1.094 ns	1.085 ns	1.105 ns	1.00	0.01	-	-

Substring_IntInt	Job-ZIOPOT	m	dzsdzsDDZSDZSDZSddsz	7	4	?	9.119 ns	0.7003 ns	0.8065 ns	9.210 ns	7.828 ns	10.840 ns	1.00	0.00	0.0019	32 B
Substring_IntInt	Job-ESNSIZ	pr	dzsdzsDDZSDZSDZSddsz	7	4	?	8.439 ns	0.5301 ns	0.6104 ns	8.404 ns	7.452 ns	9.690 ns	0.93	0.09	0.0019	32 B

Substring_Int	Job-ZIOPOT	m	dzsdzsDDZSDZSDZSddsz	?	?	7	8.947 ns	0.5383 ns	0.5983 ns	8.964 ns	7.750 ns	10.128 ns	1.00	0.00	0.0029	48 B
Substring_Int	Job-ESNSIZ	pr	dzsdzsDDZSDZSDZSddsz	?	?	7	9.384 ns	0.7757 ns	0.8934 ns	9.538 ns	7.507 ns	10.890 ns	1.04	0.09	0.0028	48 B

Substring_IntInt	Job-ZIOPOT	m	dzsdzsDDZSDZSDZSddsz	10	1	?	5.125 ns	0.1145 ns	0.1071 ns	5.120 ns	4.950 ns	5.324 ns	1.00	0.00	0.0014	24 B
Substring_IntInt	Job-ESNSIZ	pr	dzsdzsDDZSDZSDZSddsz	10	1	?	7.813 ns	0.5706 ns	0.6105 ns	7.760 ns	6.692 ns	9.071 ns	1.53	0.14	0.0014	24 B

Substring_Int	Job-ZIOPOT	m	dzsdzsDDZSDZSDZSddsz	?	?	10	11.729 ns	0.9097 ns	1.0112 ns	11.659 ns	8.977 ns	13.597 ns	1.00	0.00	0.0028	48 B
Substring_Int	Job-ESNSIZ	pr	dzsdzsDDZSDZSDZSddsz	?	?	10	9.645 ns	0.6790 ns	0.7819 ns	9.837 ns	8.141 ns	10.912 ns	0.83	0.12	0.0028	48 B

EgorBo · 2022-01-25T16:47:54Z

Improvement on ubuntu-x64 dotnet/perf-autofiling-issues#2670

S.IO.StringReader: Use ReadOnlySpan.IndexOfAny in ReadLine() for performance

05214aa

ghost added the community-contribution Indicates that the PR has been added by a community member label Oct 15, 2021

dotnet-issue-labeler bot added the area-System.IO label Oct 15, 2021

adamsitnik added the tenet-performance Performance related issue label Oct 15, 2021

nietras mentioned this pull request Oct 17, 2021

Add StringReaderReadLineTests dotnet/performance#2083

Merged

Use throwhelper, add found line length 0 special case

e160a7c

replace Substring with new string(ReadOnlySpan<char>...)

5e915c5
