GZipStream.ReadByte 10x slower than on Desktop #39233
It would be interesting to know the 5.0 numbers.
About the same as 3.1.
I had a look into this to see if it was anything I could diagnose and/or fix. The answer seems to be no, but what I found may be of use to someone more knowledgeable. Using managed profiling tools it looks like it's just spending a lot of time reading and spinning, which isn't helpful. Using VTune I think it's clear that it's spending a lot of time in window_output_flush, which is memcpying. Here's what I see:
It would be interesting to know whether we are calling zlib in the same pattern. Maybe we are asking it to do more work. I also wonder whether it would be difficult to swap in the older zlib library to remove that variable.
I focused on the part that stands out as taking more time between the two. It looks like the netcore version is doing byte reads from the stream: it reads a byte, and then I'd guess the native decompression shifts the rest of the contents back one byte and reads another byte to keep the buffer full, which would explain the memcpy to me. Netfx is doing array-based reads. I'd suggest that the msbuild library might want to add a buffered reader around the binary reader.
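A minimal sketch of that suggestion (the method name and buffer size here are illustrative, not msbuild's actual reader code): wrap the decompression stream in a BufferedStream before handing it to the BinaryReader, so the byte-at-a-time reads hit a managed buffer instead of the native inflater.

```csharp
using System.IO;
using System.IO.Compression;

static class BufferedBinlogReaderSketch
{
    // Hypothetical helper: open a gzip-compressed binlog for reading.
    static BinaryReader OpenBinlog(string path)
    {
        var file = File.OpenRead(path);
        var gzip = new GZipStream(file, CompressionMode.Decompress);
        // The 32 KB buffer absorbs the single-byte reads BinaryReader issues,
        // so a native inflate call happens once per buffer fill, not per byte.
        var buffered = new BufferedStream(gzip, 32 * 1024);
        return new BinaryReader(buffered);
    }
}
```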
I've tried wrapping the GZipStream in a BufferedStream, but against expectations it didn't change the picture much at all. There must be some real regression in the native zlib library. For easier investigation we should take the binlogs and the BinLogReader out of the picture and just write a simple console app that uses ReadByte() to scan a large gzip stream (~30 MB compressed). I'm sure we'll see a similar ~10x perf regression there as well.
A repro with direct code rather than using a library would make things easier to track down. The use of the channel for events also complicates the traces a lot, because you see a lot of waiting. I suspect you'll see this behaviour any time you read individual bytes from the netcore stream.
@stephentoub FYI reading binlogs on Core is 10x slower than on Desktop |
Are you sure about that? I made a smaller repro. Download this and add your msbuild.binlog to it. Run it on .NET Framework and Core and observe the difference. Then remove the buffered stream and observe. Not meant to be a proper benchmark, but it illustrates the point.

```csharp
var bytes = File.ReadAllBytes("msbuild.binlog");
using (var ms = new MemoryStream(bytes))
using (var gzs = new GZipStream(ms, CompressionMode.Decompress))
using (var bs = new BufferedStream(gzs))
{
    long position = 0, last = 0;
    const long mb = 1024 * 1024;
    var sw = Stopwatch.StartNew();
    while (bs.ReadByte() != -1)
    {
        if (++position - last > 100 * mb)
            Console.WriteLine($"{(last = position) / mb} MB");
    }
    Console.WriteLine($"Time: {sw.Elapsed}");
}
```

Here's what I'm seeing:
Without the BufferedStream:
Wild thought: looking at the traces here, could it be an interop issue? #39233 (comment)
Hmm, now that I'm re-reading this paragraph from @Wraith2 earlier in the thread, it starts making sense to me:
You could test whether peak throughput occurs when the buffered reader's buffer size equals the native buffer size; in that situation no memcpy is needed to refill the buffer, only a full read on the native side.
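A quick way to run that test might look like the following sketch. The set of buffer sizes is a guess meant to bracket the native window size; nothing here is a verified figure.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

class BufferSizeSweep
{
    static void Main()
    {
        // Assumes a large gzip-compressed file is available locally.
        byte[] compressed = File.ReadAllBytes("msbuild.binlog");
        foreach (int size in new[] { 4096, 16384, 32768, 65536, 131072 })
        {
            using var ms = new MemoryStream(compressed);
            using var gzs = new GZipStream(ms, CompressionMode.Decompress);
            using var bs = new BufferedStream(gzs, size);
            var sw = Stopwatch.StartNew();
            while (bs.ReadByte() != -1) { }
            // If the theory holds, throughput should peak when 'size'
            // matches the native window and refill memcpys disappear.
            Console.WriteLine($"buffer {size,7}: {sw.Elapsed}");
        }
    }
}
```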
See here for details: dotnet/runtime#39233 (comment)
My assumption when I investigated originally was that the native implementation had been optimized for throughput, and that small reads for headers followed by mostly large full-buffer reads was the intended use case. As always, optimizing for all use cases would be problematic. On the other hand, since the small reads degraded so badly and that was accepted, it doesn't seem there's any particular perf being measured for the native side. Perhaps @adamsitnik, the perf guru, may have some insight on that and whether it might be useful.
This looks like it might be the culprit. It is called on every ReadByte, and it does appear to be shifting the state buffer in zlib: runtime/src/libraries/Native/Windows/System.IO.Compression.Native/zlib-intel/inflate.h, line 187 (commit 99f3f55).
@jtkukunas, any thoughts on this issue? It seems like the memmove may have been introduced as part of |
@jtkukunas, any thoughts? |
Should we add this to the 7.0 milestone? It seems important enough to me that it shouldn't get lost in the Future milestone.
It would be good to identify the desired behaviour and then find a way to make the native lib support it. At the moment the optimization is for peak large-read throughput, but that causes small reads to suffer high buffer-consolidation overhead on the native side.
Perhaps we should open an issue against zlib directly?
Possibly, but I don't know if or by how much a standard version of zlib would regress performance relative to @jtkukunas's fork. It looks like the specific change for this is only in his fork.
I wasn't clear, I meant https://github.com/jtkukunas/zlib. Yes, we only use https://github.com/madler/zlib for Arm/Arm64. |
In which case, yes, I think opening an issue with a simple repro would probably be the next action.
I have a minimal repro of this issue in case it helps. With code like this:

```csharp
var localPath = Path.GetTempFileName();
new WebClient().DownloadFile(@"https://github.com/dotnet/runtime/archive/refs/tags/v6.0.5.zip", localPath);
var numBytes = 0;
var watch = Stopwatch.StartNew();
using (var archive = new ZipArchive(File.OpenRead(localPath), ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        using (var stream = entry.Open())
        {
            int b = 0;
            while ((b = stream.ReadByte()) != -1 && numBytes < 100_000_000)
                numBytes++;
        }
    }
}
watch.Stop();
Console.WriteLine($"Read {numBytes} bytes");
Console.WriteLine($"Time elapsed: {watch.Elapsed}");
```

When running this on .NET Framework 4.6.2, I get numbers like this:
When running in .NET 6, I get numbers like this:
So it's ~10-11x slower here. In some cases in a larger codebase I've seen it be 13-14x slower. @stephentoub, any thoughts about fixing this in .NET 7?
What would the fix be? (We still haven't heard back from @jtkukunas.) |
Could we insert a buffer into the stream? When I tested putting a BufferedStream atop the DeflateStream, it substantially improved perf (by 50x), and it looks like .NET Framework essentially had a buffering mechanism in this codepath.
If someone wants a buffer, they can add a BufferedStream themselves; BufferedStream exists because most streams don't implicitly buffer themselves, and it provides the ability to opt in to doing so when it's desirable. Implicitly adding a buffer impacts all use cases (e.g. additional allocation), whereas this issue is localized to ReadByte.
Can you elaborate? What buffer are you referring to?
I didn't read the code closely, but in Inflater there is a thing called OutputWindow in .NET Framework, and a cursory read suggests this is some kind of buffer. I'm planning to use a BufferedStream as my workaround, but it feels like DeflateStream and GZipStream performance should be equivalent when upgrading from .NET Framework to .NET Core, so codebases don't have to go manually buffer these streams to upgrade without a performance regression, right?
That's part of the old managed implementation of decompression, which is only used if you explicitly opt out of using zlib via config, and is way slower, existing only for legacy compatibility.
DeflateStream and GZipStream in .NET Core are generally much faster than they are in .NET Framework when you read/write in reasonably-sized chunks (though many of the improvements made in .NET Core around compression were in fact ported back to .NET Framework, a rarity, but done due to their significant impact). This issue is about reading a single byte at a time; it is not the general case, nor, if you care about performance, is it a recommended way to consume data even without this issue, as it's way faster to read many bytes at a time even when single-byte reads perform well. And this issue remains open because we recognize there's something to be improved here. That doesn't mean forcing everyone to incur the cost of an extra buffer, in order to improve a case that's not the high-performance way of extracting the data, is the right tradeoff.
The reason I found this is that we have a similar pattern in our codebase. How do you opt out of zlib via config, so I can try that out?
On .NET Framework, you use this switch:
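For reference, I believe the switch in question is the AppContext switch added alongside the .NET Framework 4.7.2 zlib changes; the exact name below is from memory, so verify it against the retargeting docs before relying on it:

```xml
<configuration>
  <runtime>
    <!-- Assumed switch name: opts DeflateStream out of the native zlib decompression path. -->
    <AppContextSwitchOverrides value="Switch.System.IO.Compression.DoNotUseNativeZipLibraryForDecompression=true" />
  </runtime>
</configuration>
```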
Makes sense. Using a BufferedStream for now is a good way to go. Even if/when this issue is addressed, BufferedStream would still be likely to help your use case. |
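Applied to the earlier ZipArchive repro, the workaround would look something like this sketch (`localPath` is the downloaded zip from that repro; the 64 KB buffer size is a guess to tune per workload):

```csharp
using System.IO;
using System.IO.Compression;

using (var archive = new ZipArchive(File.OpenRead(localPath), ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        // Wrap the deflate stream so ReadByte hits a managed buffer
        // instead of reaching into the native inflater per byte.
        using (var buffered = new BufferedStream(entry.Open(), 64 * 1024))
        {
            int b;
            while ((b = buffered.ReadByte()) != -1)
            {
                // process b
            }
        }
    }
}
```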
Not sure how I missed this. Will take a look. |
Thanks, @jtkukunas. |
@stephentoub Yes, this behavior is a tradeoff for normal-sized operations. One byte at a time is ... suboptimal. I'd recommend buffering the data. |
Thanks, @jtkukunas. @KirillOsenkov, I think this issue can be closed now?
Sure, thanks |
Description
Reading a binlog file takes 4 seconds on .NET Framework (Desktop), but 40 seconds on .NET Core 3.1.
Originally reported by @vatsan-madhavan in KirillOsenkov/MSBuildStructuredLog#376
Repro:

```
git clone https://github.com/vatsan-madhavan/MSBuildStructuredLogSummary
msbuild /r
MSBuildBinLogSummary\bin\Debug\net472\MSBuildBinLogSummary.exe --log-file C:\temp\Vatsan\vatsan.binlog          (takes 4 seconds)
MSBuildBinLogSummary\bin\Debug\netcoreapp3.1\MSBuildBinLogSummary.exe --log-file C:\temp\Vatsan\vatsan.binlog   (takes 40 seconds)
```

The stack that I saw taking all the time is ReadByte from a GZipStream; it's called millions of times (to read the binlog):
Configuration
Regression?
Yes: .NET Framework 4.7.2 takes 4 seconds, .NET Core 3.1 takes 40 seconds.