
Improve compressed NRRD read performance #92

Merged

Conversation

addisonElliott
Collaborator

This PR makes two changes to improve the performance of the nrrd.read function for compressed NRRD files, i.e. gzip- or bzip2-compressed data.

The first change is to switch the decompressed data buffer from a bytes object to a bytearray. A bytes object is immutable, so appending to it requires creating a new object in memory containing all of the data accumulated so far. A bytearray is a mutable counterpart to bytes: appending extends the array in place and allocates memory only when necessary. In addition, importing the data into a NumPy array is switched from np.fromstring to np.frombuffer, both because the buffer is no longer a string-like bytes object and because np.fromstring is deprecated and emits a warning. Performance tests (see the issue for more details) show a large speedup from this change.
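To make the difference concrete, here is a minimal illustration of the buffer pattern (the names and dtype are mine, not the exact reader.py code):

```python
import numpy as np

# Illustrative comparison only; names and dtype are assumptions, not the reader.py code.

# Before: bytes is immutable, so every append copies the entire buffer
# accumulated so far into a brand new object.
buffer_old = b''
buffer_old += b'\x00\x01\x02\x03'

# After: bytearray is mutable, so appending extends the existing buffer
# in place and only reallocates when it runs out of capacity.
buffer_new = bytearray()
buffer_new += b'\x00\x01\x02\x03'

# np.frombuffer replaces the deprecated np.fromstring and reads the
# bytearray directly through the buffer protocol, without an extra copy.
data = np.frombuffer(buffer_new, dtype=np.uint8)
```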

The second change fine-tunes the chunk size parameter and how it is used in nrrd.read. Previously, the chunk size was set to 1MB, and the compressed data was read 1MB at a time with each chunk decompressed as it was read. For larger files, this is inefficient. This PR changes the chunk size to 1GB and reads the entire compressed data at once, so that only the decompression is chunked. The reasoning is detailed below.

Based on an initial analysis, increasing the chunk size for nrrd.write actually increased the time required to write the file. The exact difference may vary by machine, but the general trend should be consistent. The write chunk size was therefore kept at 1MB to conserve RAM while writing. The experiment writes random data, which likely affects the compression ratio, but additional tests with non-random data produced similar results.

Experiment for writing large amounts of data with various chunk sizes:
https://gist.github.com/addisonElliott/097de1ca1311026e2e116541c9eed0c5#file-write_experiment-py
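For reference, a rough sketch of what keeping the 1MB write chunking looks like for the gzip case (the function name, constant, and compression settings are illustrative assumptions, not the exact writer.py code):

```python
import zlib

WRITE_CHUNK_SIZE = 2 ** 20  # 1 MB; larger chunks did not help in the experiment above

def write_compressed(fh, raw_data):
    # wbits = MAX_WBITS | 16 selects the gzip container format (settings are assumed).
    compressobj = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS | 16)

    # Compress and write 1 MB at a time so the full compressed payload
    # never needs to be held in memory.
    for start in range(0, len(raw_data), WRITE_CHUNK_SIZE):
        chunk = raw_data[start:start + WRITE_CHUNK_SIZE]
        fh.write(compressobj.compress(chunk))

    # Flush whatever the compressor is still buffering internally.
    fh.write(compressobj.flush())
```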

Changing the nrrd.read chunk size alone does not improve performance. Upon analysis, there was a large delay when calling fh.read(CHUNK_SIZE) with a large chunk size, for both small and large files (presumably because a buffer of the requested size is allocated up front). For example, a 3kB file took 0.7s to read with a 1GB chunk size, versus hundreds of microseconds with a 1MB chunk size. The experiment linked below shows a speedup for large files but an almost 50% slowdown for smaller files.

Experiment for reading small/large files with various chunk sizes:
https://gist.github.com/addisonElliott/097de1ca1311026e2e116541c9eed0c5#file-read_normal_experiment-py

As mentioned above, the resolution to the slowdown with a larger chunk size on smaller files is to read the entire file into memory at once using fh.read() with no argument. This has the disadvantage of using additional memory, but reading a raw-encoded NRRD file already loads the entire file into memory, and the data must be in memory anyway when it is converted to a NumPy array. Moreover, since the decompressed data must fit in RAM, a user reading a compressed NRRD file should also have enough RAM to temporarily hold the (smaller) compressed payload.

With the entire compressed file now read in at once, the chunk size is set to 1GB to improve performance when reading larger files while keeping the same performance for smaller files. The chunk size now only controls how much data is decompressed at a time.
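Putting the pieces together, a sketch of the new read path for the gzip case (names are illustrative assumptions, not the exact reader.py code):

```python
import zlib
import numpy as np

READ_CHUNK_SIZE = 2 ** 30  # 1 GB; only controls how much is decompressed per call

def read_compressed(fh, dtype):
    # Read the entire compressed payload in a single call...
    compressed = fh.read()

    # ...and chunk only the decompression step, accumulating into a bytearray.
    decompobj = zlib.decompressobj(zlib.MAX_WBITS | 16)
    decompressed = bytearray()
    for start in range(0, len(compressed), READ_CHUNK_SIZE):
        decompressed += decompobj.decompress(compressed[start:start + READ_CHUNK_SIZE])

    return np.frombuffer(decompressed, dtype=dtype)
```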

One potential concern with a larger chunk size is an issue in older versions of Python where zlib is unable to decompress data larger than 4GB. See issue #21 for more details. However, as far as I can tell, this is fixed in the latest Python 2.7 releases and in all versions of Python 3 (see the linked sources for more information). Note that these fixes came out after the pynrrd issue was reported.

The fixed benchmark can be seen here:
https://gist.github.com/addisonElliott/097de1ca1311026e2e116541c9eed0c5#file-read_fixed_experiment-py

Note that performance is similar for small files, and for the 1GB file the 1GB chunk size improves performance by ~15%.

Fixes issue #88

@codecov-io

Codecov Report

Merging #92 into master will increase coverage by 0.27%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master      #92      +/-   ##
==========================================
+ Coverage   99.17%   99.45%   +0.27%     
==========================================
  Files           6        6              
  Lines         363      365       +2     
  Branches      117      116       -1     
==========================================
+ Hits          360      363       +3     
  Misses          1        1              
+ Partials        2        1       -1
Impacted Files    Coverage Δ
nrrd/reader.py    100% <100%> (ø) ⬆️
nrrd/writer.py    98.3% <100%> (+0.82%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@addisonElliott addisonElliott merged commit 16be776 into mhe:master Apr 2, 2019
@addisonElliott addisonElliott deleted the fix-compressed-nrrds-io-speed branch April 2, 2019 16:10