GZipStream.ReadByte 10x slower than on Desktop #39233
It would be interesting to know the 5.0 numbers.
About the same as 3.1.
I had a look into this to see if it was anything I could diagnose and/or fix. The answer seems to be no, but what I found may be of use to someone more knowledgeable. Using managed profiling tools it looks like it's just spending a lot of time reading and spinning, which isn't helpful. Using VTune I think it's clear that it's spending a lot of time in window_output_flush, which is memcpying. Here's what I see:
It would be interesting to know whether we are calling zlib in the same pattern. Maybe we are asking it to do more work. I also wonder whether it would be difficult to swap in the older zlib library to remove that variable.
I focused on the part that stands out as taking more time between the two. It looks like the netcore version is doing byte reads from the stream: it reads a byte, and then I'd guess the native decompression shifts the rest of the contents back one byte and reads another byte to keep the buffer full, which would explain the memcpy to me. Netfx is doing array-based reads. I'd suggest that the msbuild library might want to add a buffered reader around the binary reader.
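A minimal sketch of that suggestion (the method name and buffer size here are illustrative, not msbuild's actual reader code): wrap the decompression stream in a BufferedStream before handing it to the BinaryReader, so the byte-at-a-time reads hit a managed buffer instead of the native inflater.

```csharp
using System.IO;
using System.IO.Compression;

static class BufferedBinlogReaderSketch
{
    // Hypothetical helper: open a gzip-compressed binlog for reading.
    static BinaryReader OpenBinlog(string path)
    {
        var file = File.OpenRead(path);
        var gzip = new GZipStream(file, CompressionMode.Decompress);
        // The 32 KB buffer absorbs the single-byte reads BinaryReader issues,
        // so a native inflate call happens once per buffer fill, not per byte.
        var buffered = new BufferedStream(gzip, 32 * 1024);
        return new BinaryReader(buffered);
    }
}
```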
I've tried wrapping the GZipStream in a BufferedStream, but against expectations it didn't change the picture much at all. There must be some real regression in the native zlib library. For easier investigation we should take the binlogs and the BinLogReader out of the picture and just write a simple console app that uses ReadByte() to scan a large gzip stream (~30 MB compressed). I'm sure we'll see a similar ~10x perf regression there as well.
A repro with direct code rather than using a library would make things easier to track down. The use of the channel for events also complicates the traces a lot, because you see a lot of waiting. I suspect you'll see this behaviour any time you read individual bytes from the netcore stream.
@stephentoub FYI reading binlogs on Core is 10x slower than on Desktop |
Are you sure about that? I made a smaller repro. Download this and add your msbuild.binlog to it. Run it on .NET Framework and Core and observe the difference. Then remove the buffered stream and observe. Not meant to be a proper benchmark, but it illustrates the point.

```csharp
var bytes = File.ReadAllBytes("msbuild.binlog");
using (var ms = new MemoryStream(bytes))
using (var gzs = new GZipStream(ms, CompressionMode.Decompress))
using (var bs = new BufferedStream(gzs))
{
    long position = 0, last = 0;
    const long mb = 1024 * 1024;
    var sw = Stopwatch.StartNew();
    while (bs.ReadByte() != -1)
    {
        if (++position - last > 100 * mb)
            Console.WriteLine($"{(last = position) / mb} MB");
    }
    Console.WriteLine($"Time: {sw.Elapsed}");
}
```

Here's what I'm seeing:
Without the BufferedStream:
Wild thought: looking at the traces here, could it be an interop issue? #39233 (comment)
Hmm, now that I'm re-reading this paragraph from @Wraith2 earlier in the thread, it starts making sense to me:
You could test whether peak throughput occurs when the buffered reader's buffer size equals the native buffer size; in that situation no memcpy is needed to refill the buffer, only a full read on the native side.
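A quick way to run that test might look like the following sketch. The set of buffer sizes is a guess meant to bracket the native window size; nothing here is a verified figure.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

class BufferSizeSweep
{
    static void Main()
    {
        // Assumes a large gzip-compressed file is available locally.
        byte[] compressed = File.ReadAllBytes("msbuild.binlog");
        foreach (int size in new[] { 4096, 16384, 32768, 65536, 131072 })
        {
            using var ms = new MemoryStream(compressed);
            using var gzs = new GZipStream(ms, CompressionMode.Decompress);
            using var bs = new BufferedStream(gzs, size);
            var sw = Stopwatch.StartNew();
            while (bs.ReadByte() != -1) { }
            // If the theory holds, throughput should peak when 'size'
            // matches the native window and refill memcpys disappear.
            Console.WriteLine($"buffer {size,7}: {sw.Elapsed}");
        }
    }
}
```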
See here for details: dotnet/runtime#39233 (comment)
My assumption when I investigated originally was that the native implementation had been optimized for throughput, and that small reads for headers followed by mostly large full-buffer reads was the intended use case. As always, optimizing for all use cases would be problematic. On the other hand, since the small reads degraded so badly and that was accepted, it doesn't seem there's any particular perf being measured for the native side. Perhaps @adamsitnik, the perf guru, may have some insight on that and whether it might be useful.
This looks like it might be the culprit. It is called on every ReadByte, and it does appear to be shifting the state buffer in zlib: runtime/src/libraries/Native/Windows/System.IO.Compression.Native/zlib-intel/inflate.h, line 187 (commit 99f3f55).
@jtkukunas, any thoughts on this issue? It seems like the memmove may have been introduced as part of |
@jtkukunas, any thoughts? |
Should we add this to the 7.0 milestone? It seems important enough to me that it shouldn't get lost in the Future milestone.
It would be good to identify the desired behaviour and then find a way to make the native lib support it. At the moment the optimization is for peak large-read throughput, but that causes small reads to suffer high buffer-consolidation overhead on the native side.
Perhaps we should open an issue against zlib directly?
Possibly, but I don't know if or by how much a standard version of zlib would regress performance relative to @jtkukunas's fork. It looks like the specific change for this is only in his fork.
I wasn't clear, I meant https://github.com/jtkukunas/zlib. Yes, we only use https://github.com/madler/zlib for Arm/Arm64. |
In which case, yes, I think opening an issue with a simple repro would probably be the next action.
I have a minimal repro of this issue in case it helps. With code like this:

```csharp
var localPath = Path.GetTempFileName();
new WebClient().DownloadFile(@"https://github.com/dotnet/runtime/archive/refs/tags/v6.0.5.zip", localPath);
var numBytes = 0;
var watch = Stopwatch.StartNew();
using (var archive = new ZipArchive(File.OpenRead(localPath), ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        using (var stream = entry.Open())
        {
            int b = 0;
            while ((b = stream.ReadByte()) != -1 && numBytes < 100_000_000)
                numBytes++;
        }
    }
}
watch.Stop();
Console.WriteLine($"Read {numBytes} bytes");
Console.WriteLine($"Time elapsed: {watch.Elapsed}");
```

When running this on .NET Framework 4.6.2, I get numbers like this:
When running in .NET 6, I get numbers like this:
So it's ~10-11x slower here. In some cases in a larger codebase I've seen it be 13-14x slower. @stephentoub, any thoughts about fixing this in .NET 7?
What would the fix be? (We still haven't heard back from @jtkukunas.) |
Could we insert a buffer into the stream? When I tested putting a BufferedStream atop the DeflateStream, it substantially improved perf (by 50x), and it looks like .NET Framework essentially had a buffering mechanism in this codepath.
If someone wants a buffer, they can add a BufferedStream themselves; BufferedStream exists because most streams don't implicitly buffer themselves, and it provides the ability to opt in to doing so when it's desirable. Implicitly adding a buffer impacts all use cases (e.g. additional allocation), whereas this issue is localized to ReadByte.
Can you elaborate? What buffer are you referring to?
I didn't read the code closely, but in Inflater there is a thing called OutputWindow in .NET Framework, and a cursory read suggests this is some kind of buffer. I'm planning to use a BufferedStream as my workaround, but it feels like DeflateStream and GZipStream performance should be equivalent when upgrading from .NET Framework to .NET Core, so codebases don't have to go manually buffer these streams to upgrade without a performance regression, right?
That's part of the old managed implementation of decompression, which is only used if you explicitly opt out of using zlib via config, and is way slower, existing only for legacy compatibility.
DeflateStream and GZipStream in .NET Core are generally much faster than they are in .NET Framework when you read/write in reasonably-sized chunks (though many of the improvements made in .NET Core around compression were in fact ported back to .NET Framework, a rarity, but done due to their significant impact). This issue is about reading a single byte at a time; it is not the general case, nor, if you care about performance, is it a recommended way to consume data even without this issue, as it's way faster to read many bytes at a time even when single-byte reads perform well. And this issue remains open because we recognize there's something to be improved here. That doesn't mean forcing everyone to incur the cost of an extra buffer, in order to improve a case that's not the high-performance way of extracting the data, is the right tradeoff.
The reason I found this is that we have a similar pattern in our codebase. How do you opt out of zlib via config, so I can try that out?
On .NET Framework, you use this switch:
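For reference, I believe the switch in question is the AppContext switch added alongside the .NET Framework 4.7.2 zlib changes; the exact name below is from memory, so verify it against the retargeting docs before relying on it:

```xml
<configuration>
  <runtime>
    <!-- Assumed switch name: opts DeflateStream out of the native zlib decompression path. -->
    <AppContextSwitchOverrides value="Switch.System.IO.Compression.DoNotUseNativeZipLibraryForDecompression=true" />
  </runtime>
</configuration>
```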
Makes sense. Using a BufferedStream for now is a good way to go. Even if/when this issue is addressed, BufferedStream would still be likely to help your use case. |
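Applied to the earlier ZipArchive repro, the workaround would look something like this sketch (`localPath` is the downloaded zip from that repro; the 64 KB buffer size is a guess to tune per workload):

```csharp
using System.IO;
using System.IO.Compression;

using (var archive = new ZipArchive(File.OpenRead(localPath), ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        // Wrap the deflate stream so ReadByte hits a managed buffer
        // instead of reaching into the native inflater per byte.
        using (var buffered = new BufferedStream(entry.Open(), 64 * 1024))
        {
            int b;
            while ((b = buffered.ReadByte()) != -1)
            {
                // process b
            }
        }
    }
}
```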
Not sure how I missed this. Will take a look. |
Thanks, @jtkukunas. |
@stephentoub Yes, this behavior is a tradeoff for normal-sized operations. One byte at a time is ... suboptimal. I'd recommend buffering the data. |
Thanks, @jtkukunas. @KirillOsenkov, I think this issue can be closed now?
Sure, thanks |
Description
Reading a binlog file takes 4 seconds on .NET Framework (Desktop), but 40 seconds on .NET Core 3.1.
Originally reported by @vatsan-madhavan in KirillOsenkov/MSBuildStructuredLog#376
Repro:

```
git clone https://github.com/vatsan-madhavan/MSBuildStructuredLogSummary
msbuild /r
MSBuildBinLogSummary\bin\Debug\net472\MSBuildBinLogSummary.exe --log-file C:\temp\Vatsan\vatsan.binlog          (takes 4 seconds)
MSBuildBinLogSummary\bin\Debug\netcoreapp3.1\MSBuildBinLogSummary.exe --log-file C:\temp\Vatsan\vatsan.binlog   (takes 40 seconds)
```

The stack that I saw taking all the time is ReadByte from a GZipStream; it's called millions of times (to read the binlog):
Configuration
Regression?
Yes: .NET Framework 4.7.2 takes 4 seconds, .NET Core 3.1 takes 40 seconds.