Bzip input stream simple vectorization #611

konrad-kruczynski · 2021-04-18T10:46:38Z

Hi guys.
Here's a simple approach to get some speed-up in BZip2 decompression. The rotation loop (yy[j] = yy[j - 1]) is vectorized automatically here, i.e. by means of Vector<byte> rather than a platform-specific type such as Vector128<byte>. The API is available from .NET Core/Standard 2.1 onwards, but the real gain starts with .NET Core 3.

The first commits add .NET Core 3.1 as a target platform for the performance tests, along with such a test for BZip2 decompression. The last commit is the actual vectorization. Here are some results from two Intel machines (one running Windows, the other a rather antique MacBook Air).
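
To make the idea concrete, here is a minimal sketch of such a rotation written with Vector<byte>. It is an illustration under assumptions, not the code from this PR: the class and method names (RotationSketch, RotateScalar, RotateVectorized) and the `front` parameter are hypothetical, and only the `yy` array name and the `yy[j] = yy[j - 1]` loop come from the description above.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration only; not taken from SharpZipLib.
public static class RotationSketch
{
    // Scalar reference: shift yy[0..j-1] one slot to the right,
    // then place `front` at index 0.
    public static void RotateScalar(byte[] yy, int j, byte front)
    {
        for (; j > 0; j--)
        {
            yy[j] = yy[j - 1];
        }
        yy[0] = front;
    }

    // Same effect, moving Vector<byte>.Count bytes per step where possible.
    public static void RotateVectorized(byte[] yy, int j, byte front)
    {
        int width = Vector<byte>.Count; // chosen at run time, e.g. 16 or 32

        // Walk from the top down: each block is fully loaded before it is
        // stored one slot higher, so the overlap never clobbers unread bytes.
        while (j >= width)
        {
            var block = new Vector<byte>(yy, j - width); // reads  yy[j-width .. j-1]
            block.CopyTo(yy, j - width + 1);             // writes yy[j-width+1 .. j]
            j -= width;
        }

        // Scalar tail for the remaining (fewer than `width`) elements.
        for (; j > 0; j--)
        {
            yy[j] = yy[j - 1];
        }
        yy[0] = front;
    }
}
```

A caller would save the element being moved first, e.g. `RotationSketch.RotateVectorized(yy, j, yy[j])`. Processing the blocks from the highest index downwards is what keeps the overlapping copy safe.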

Without vectorization:
First machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-USQLOW | .NET Core 2.1 | 4.595 s | 0.0185 s | 0.0164 s |  0.97 |
| DecompressData | Job-NMSPHA | .NET Core 3.1 | 4.730 s | 0.0490 s | 0.0434 s |  1.00 |
| DecompressData | Job-ORXZIN |        net461 | 4.723 s | 0.0314 s | 0.0279 s |  1.00 |

Second machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev |
|--------------- |----------- |-------------- |--------:|---------:|---------:|
| DecompressData | Job-TBYXJE | .NET Core 2.1 | 8.177 s | 0.0607 s | 0.0538 s |
| DecompressData | Job-ZFNFSZ | .NET Core 3.1 | 8.343 s | 0.1529 s | 0.2717 s |

With vectorization:
First machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-JSFVZF | .NET Core 2.1 | 4.534 s | 0.0386 s | 0.0302 s |  0.99 |
| DecompressData | Job-TNNLLN | .NET Core 3.1 | 2.335 s | 0.0252 s | 0.0211 s |  0.51 |
| DecompressData | Job-THQXFG |        net461 | 4.590 s | 0.0217 s | 0.0181 s |  1.00 |

Second machine

|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev |
|--------------- |----------- |-------------- |--------:|---------:|---------:|
| DecompressData | Job-DYBSPD | .NET Core 2.1 | 7.931 s | 0.1080 s | 0.0957 s |
| DecompressData | Job-FKJXRZ | .NET Core 3.1 | 5.131 s | 0.1007 s | 0.1273 s |

Machine details.
First machine

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
Intel Core i7-4770K CPU 3.50GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.201
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-JSFVZF : .NET Core 2.1.26 (CoreCLR 4.6.29812.02, CoreFX 4.6.29812.01), X64 RyuJIT
  Job-TNNLLN : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-THQXFG : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT

Second machine

BenchmarkDotNet=v0.12.1, OS=macOS 11.2.3 (20D91) [Darwin 20.3.0]
Intel Core i5-4250U CPU 1.30GHz (Haswell), 1 CPU, 4 logical and 2 physical cores
.NET Core SDK=5.0.202
  [Host]     : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-DYBSPD : .NET Core 2.1.27 (CoreCLR 4.6.29916.01, CoreFX 4.6.29916.03), X64 RyuJIT
  Job-FKJXRZ : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT

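On the test machines, vectorization cuts decompression time on .NET Core 3.1 by about 38-51% (from 4.730 s to 2.335 s on the first machine, and from 8.343 s to 5.131 s on the second). Note that the gain is only observable from .NET Core 3 onwards (I also tested on .NET 5; the results are similar).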

I certify that I own, and have sufficient rights to contribute, all source code and related material intended to be compiled or integrated with the source code for the SharpZipLib open source product (the "Contribution"). My Contribution is licensed under the MIT License.

codecov · 2021-04-18T10:48:24Z

Codecov Report

Merging #611 (d8811c1) into master (7ed87d1) will increase coverage by 2.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #611      +/-   ##
==========================================
+ Coverage   70.96%   73.27%   +2.30%     
==========================================
  Files          68       68              
  Lines       13417     8718    -4699     
==========================================
- Hits         9522     6388    -3134     
+ Misses       3895     2330    -1565

| Impacted Files | Coverage Δ | |
|---|---|---|
| .../ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs | `74.70% <100.00%> (+2.62%)` | ⬆️ |
| ...pZipLib/Core/Exceptions/StreamDecodingException.cs | `60.00% <0.00%> (-6.67%)` | ⬇️ |
| ...pLib/Core/Exceptions/StreamUnsupportedException.cs | `60.00% <0.00%> (-6.67%)` | ⬇️ |
| .../Core/Exceptions/UnexpectedEndOfStreamException.cs | `60.00% <0.00%> (-6.67%)` | ⬇️ |
| src/ICSharpCode.SharpZipLib/Checksum/Adler32.cs | `81.81% <0.00%> (-3.29%)` | ⬇️ |
| ...pCode.SharpZipLib/Zip/Compression/PendingBuffer.cs | `74.13% <0.00%> (-3.14%)` | ⬇️ |
| .../ICSharpCode.SharpZipLib/Core/FileSystemScanner.cs | `48.73% <0.00%> (-2.91%)` | ⬇️ |
| ...e.SharpZipLib/Zip/Compression/InflaterDynHeader.cs | `91.48% <0.00%> (-2.17%)` | ⬇️ |
| ...harpZipLib/Zip/Compression/Streams/OutputWindow.cs | `67.74% <0.00%> (-2.07%)` | ⬇️ |
| src/ICSharpCode.SharpZipLib/Core/PathFilter.cs | `10.25% <0.00%> (-1.46%)` | ⬇️ |

... and 57 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7ed87d1...d8811c1.

piksel

Nothing wrong with the implementation, but it took a while staring at it to understand what was happening. Perhaps some comments around it?

The speed improvements are really nice and badly needed for the bzip2 algo which is painfully slow (among other problems).

src/ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs

Numpsy · 2021-04-28T09:13:52Z

I once had a (very) brief go at using the SSE intrinsics in the deflate code, but never tried just System.Numerics.Vector on its own (those require a NetCore3+ TFM to build though).
Nice performance gain from an isolated change though :-)

konrad-kruczynski · 2021-04-30T13:35:09Z

> Nothing wrong with the implementation, but it took a while staring at it to understand what was happening. Perhaps some comments around it?

Sure, I'll add some.

> The speed improvements are really nice and badly needed for the bzip2 algo which is painfully slow (among other problems).

I would also like for some of you to confirm those speed improvements. Just to be sure that my conclusions are correct.

konrad-kruczynski · 2021-04-30T13:36:18Z

> I once had a (very) brief go at using the SSE intrinsics in the deflate code, but never tried just System.Numerics.Vector on its own (those require a NetCore3+ TFM to build though).
> Nice performance gain from an isolated change though :-)

Right. The sole purpose of choosing System.Numerics.Vector instead of going directly with the SSE classes was portability, e.g. it will work on ARM out of the box. Nonetheless, the first version was written using SSE directly.
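
As a rough illustration of that portability point, the sketch below queries what Vector<T> maps to on the current machine. Vector.IsHardwareAccelerated and Vector<byte>.Count are existing System.Numerics members; the VectorInfo class and its Describe method are hypothetical names used only for this example.

```csharp
using System;
using System.Numerics;

// Hypothetical helper for illustration only.
public static class VectorInfo
{
    // Reports whether the JIT accelerates Vector<T> here and how many
    // bytes a single Vector<byte> holds on this machine.
    public static string Describe()
    {
        return $"accelerated={Vector.IsHardwareAccelerated}, bytesPerVector={Vector<byte>.Count}";
    }
}
```

Printing `VectorInfo.Describe()` on different machines shows that the vector width is decided at run time, so the same source does not need to name a specific instruction set.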

Numpsy · 2021-04-30T15:27:15Z

> I would also like for some of you to confirm those speed improvements. Just to be sure that my conclusions are correct.

This is what I get using your branch:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.928 (2004/?/20H1)
Intel Core i7-5820K CPU 3.30GHz (Broadwell), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.300-preview.21180.15
  [Host]     : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-OUVGAT : .NET Core 2.1.26 (CoreCLR 4.6.29812.02, CoreFX 4.6.29812.01), X64 RyuJIT
  Job-MRCUFD : .NET Core 3.1.13 (CoreCLR 4.700.21.11102, CoreFX 4.700.21.11602), X64 RyuJIT
  Job-RAVTER : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT


|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-OUVGAT | .NET Core 2.1 | 4.850 s | 0.0354 s | 0.0332 s |  1.00 |
| DecompressData | Job-MRCUFD | .NET Core 3.1 | 2.457 s | 0.0063 s | 0.0056 s |  0.51 |
| DecompressData | Job-RAVTER |        net461 | 4.862 s | 0.0411 s | 0.0384 s |  1.00 |

konrad-kruczynski · 2021-05-02T11:14:23Z

Good, this is very similar to my results. Since I developed it on an M1, I could only run native tests on my daughter's laptop, and I also included some results from a friend's PC.

piksel · 2021-05-03T17:46:42Z

Yeah, got basically the same results:

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19042
AMD Ryzen 7 3800X, 1 CPU, 16 logical and 8 physical cores
.NET Core SDK=5.0.300-preview.21180.15
  [Host]     : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-NUXXKY : .NET Core 2.1.27 (CoreCLR 4.6.29916.01, CoreFX 4.6.29916.03), X64 RyuJIT
  Job-BNPYUR : .NET Core 3.1.14 (CoreCLR 4.700.21.16201, CoreFX 4.700.21.16208), X64 RyuJIT
  Job-FDMENM : .NET Framework 4.8 (4.8.4300.0), X64 RyuJIT


|         Method |        Job |     Toolchain |    Mean |    Error |   StdDev | Ratio |
|--------------- |----------- |-------------- |--------:|---------:|---------:|------:|
| DecompressData | Job-NUXXKY | .NET Core 2.1 | 3.698 s | 0.0520 s | 0.0461 s |  1.00 |
| DecompressData | Job-BNPYUR | .NET Core 3.1 | 1.900 s | 0.0373 s | 0.0415 s |  0.51 |
| DecompressData | Job-FDMENM |        net461 | 3.702 s | 0.0463 s | 0.0411 s |  1.00 |

konrad-kruczynski · 2021-05-04T10:31:39Z

Great, is there any obstacle left to have this merged soon?

konrad-kruczynski added 3 commits April 18, 2021 12:32

Added benchmark for BZip2 decompression.

c37ba26

Updated benchmarks to be run also on .NET Core 3.1.

47c5478

Simple automatic vectorization of the rotation loop.

7dac297

piksel reviewed Apr 27, 2021

src/ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs

Added comment describing vectorization.

d8811c1

piksel merged commit 1b9fcfc into icsharpcode:master May 4, 2021

piksel mentioned this pull request May 13, 2021

Fixed mismatched framework directives for vectorized memory move #635

Merged

HowToDoThis added a commit to HowToDoThis/SharpZipLib that referenced this pull request Aug 11, 2021

PR icsharpcode#611

fa2b3f8
