Improve compressed NRRD read performance #92
Merged
This PR makes two changes to improve the performance of the `nrrd.read` function for compressed NRRD files, i.e. gzip or bzip2 compressed data.

The first change is to switch the decompressed data buffer from a `bytes` object to a `bytearray`. A `bytes` object is immutable, so appending to it requires creating a new object in memory that contains the combined data. A `bytearray` is a mutable object similar to `bytes`, except that appending to a `bytearray` adds the data to the existing array and allocates memory only when necessary. In addition, importing the data into a NumPy array is switched from `np.fromstring` to `np.frombuffer`. This is done because we no longer have a string and because `np.fromstring` raises a deprecation warning. Performance tests (see the issue for more details) show a large speedup from this improvement.
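To illustrate the difference, here is a toy example rather than the actual pynrrd code:

```python
import numpy as np

# Appending to an immutable bytes object builds a brand-new object and
# copies all previously decompressed data each time (O(n^2) over a loop).
data = b''
data += b'chunk of decompressed bytes'

# A bytearray is mutable, so += extends it in place and only allocates
# more memory when necessary (amortized O(1) per append).
buffer = bytearray()
buffer += b'chunk of decompressed bytes'

# np.frombuffer accepts any object supporting the buffer protocol and
# replaces the deprecated np.fromstring.
array = np.frombuffer(buffer, dtype=np.uint8)
```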
The second change fine-tunes the chunk size parameter and how it is used in `nrrd.read`. Previously, the chunk size was set to 1 MB: a 1 MB chunk was read from the file and then decompressed, which is inefficient for larger files. This PR changes the chunk size to 1 GB and also changes the logic so that the entire compressed data is read at once and only the decompression is chunked. The reasoning is detailed below.

Based on an initial analysis, increasing the chunk size for `nrrd.write` actually increased the time required to write the file. The exact difference may vary between machines, but the general trend should be consistent. With that, the write chunk size was kept at 1 MB to conserve RAM while writing. The experiment writes random data, which likely affects the compression ratio, but additional tests with non-random data gave similar results.

Experiment for writing large amounts of data with various chunk sizes:
https://gist.github.com/addisonElliott/097de1ca1311026e2e116541c9eed0c5#file-write_experiment-py
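For context, chunked writing of compressed data looks roughly like the sketch below. This is a simplified illustration with an assumed helper name, not pynrrd's exact writer; the experiment above varies `chunk_size`.

```python
import zlib

def write_compressed(fh, raw_data, chunk_size=2 ** 20):
    """Compress raw bytes in chunks and write them to an open file handle.

    Simplified sketch; 16 + MAX_WBITS makes zlib emit a gzip-compatible stream.
    """
    compressor = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)

    # Feed the raw data to the compressor chunk_size bytes at a time so that
    # only one chunk of compressed output needs to be held before writing.
    for start in range(0, len(raw_data), chunk_size):
        fh.write(compressor.compress(raw_data[start:start + chunk_size]))

    # Flush whatever the compressor is still buffering internally.
    fh.write(compressor.flush())
```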
Changing the `nrrd.read` chunk size alone does not improve performance. Upon analysis, there was a large delay when calling `fh.read(CHUNK_SIZE)` with a large chunk size, for both small and large files. For example, a 3 kB file took 0.7 s to return with a 1 GB chunk size, versus hundreds of microseconds with a 1 MB chunk size. The experiment linked below shows a speedup for large files but an almost 50% slowdown for smaller files.

Experiment for reading small/large files with various chunk sizes:
https://gist.github.com/addisonElliott/097de1ca1311026e2e116541c9eed0c5#file-read_normal_experiment-py
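The delay is easy to reproduce with a quick timing check along these lines (the file name is a placeholder for any small compressed file); the slowdown presumably comes from the buffer that is allocated up front for the requested size:

```python
import time

def time_read(filename, chunk_size):
    """Time a single fh.read(chunk_size) call on the given file."""
    with open(filename, 'rb') as fh:
        start = time.perf_counter()
        fh.read(chunk_size)
        return time.perf_counter() - start

# 'small_file.nrrd' is a placeholder for a ~3 kB compressed NRRD file.
print('1 MB chunk size:', time_read('small_file.nrrd', 2 ** 20))
print('1 GB chunk size:', time_read('small_file.nrrd', 2 ** 30))
```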
As mentioned above, the resolution to the slowdown with a larger chunk size on smaller files is to read the entire file into memory at once using `fh.read()` (no argument). This has the disadvantage of using additional memory, but reading a raw-encoded NRRD file already loads the entire file into memory, and the data needs to be in memory anyway when it is converted to a NumPy array. Furthermore, the data being read will be decompressed, so a user who expects to hold the uncompressed data in RAM should also have room for the smaller compressed copy.

With the entire compressed file now read in at once, the chunk size is set to 1 GB to improve performance when reading larger files while keeping the same performance for smaller files. The chunk size now only controls how much data is decompressed at a time.
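Putting the read-side changes together, the approach amounts to something like the following sketch (gzip-only for brevity; bzip2 would use `bz2.BZ2Decompressor` instead, and the names here are illustrative rather than pynrrd's actual ones):

```python
import zlib

import numpy as np

# 1 GB: now only bounds how much compressed data is decompressed per call.
_READ_CHUNKSIZE = 2 ** 30

def read_compressed_data(fh, dtype):
    """Read an entire compressed payload at once and decompress it in chunks."""
    # Read the whole compressed stream in a single call; a raw-encoded NRRD
    # would be loaded entirely into memory anyway, and the compressed copy is
    # smaller than the decompressed result the user must hold regardless.
    compressed_data = fh.read()

    # 32 + MAX_WBITS lets zlib auto-detect zlib/gzip headers.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 32)
    decompressed_data = bytearray()

    # Chunking applies only to the decompression step now.
    for start in range(0, len(compressed_data), _READ_CHUNKSIZE):
        chunk = compressed_data[start:start + _READ_CHUNKSIZE]
        decompressed_data += decompressor.decompress(chunk)

    return np.frombuffer(decompressed_data, dtype=dtype)
```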
One potential concern with a larger chunk size is that older versions of Python have an issue where zlib is unable to decompress data larger than 4 GB. See issue #21 for more details. However, as far as I can tell, this is fixed in the latest version of Python 2.7 and in all versions of Python 3; see the sources here and here for more information. Note that these fixes came out after the `pynrrd` issue was reported.

The fixed benchmark can be seen here:
https://gist.github.com/addisonElliott/097de1ca1311026e2e116541c9eed0c5#file-read_fixed_experiment-py
Note that performance is similar for small files, and for the 1 GB file with a 1 GB chunk size, performance is increased by ~15%.
Fixes issue #88