Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

I/O performance improvements #100

Merged
merged 34 commits into from
Sep 10, 2024
Merged

I/O performance improvements #100

merged 34 commits into from
Sep 10, 2024

Conversation

iguinn
Copy link
Contributor

@iguinn iguinn commented Jul 11, 2024

  • Use low level h5py API to for read. Note that this resulted in moving the handling of lists of files from _serializers.read.composite to store.read and core.read. Only the _serializers functions are using the low level API; read itself is still using h5py.File to open the file get the top level group.
  • Require h5py >= 3.10. Not sure why, but there is a noticeable performance improvement starting at this version
  • Encode string attributes to utf-8 before writing. This is something that happens implicitly anyway, so I'm not sure why doing this helps, but it does
  • Write files using paged aggregation. This is a setting that you use when opening a file to write. Right now I have it hard-coded to use 64 kB pages, which seemed to be optimal in terms of file size and read speed, although this was only tested using a file with many small datasets. Also use latest version of file for writing
    Among these changes, the low level API made the biggest difference (a factor of almost 2), while the other two changes combine for an improvement of maybe 1.5 or so. The low level API makes a difference without reprocessing our files, while the other two changes happen at file write time, so will only be noticed after reprocessing

Copy link

codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 81.13208% with 40 lines in your changes missing coverage. Please review.

Project coverage is 76.69%. Comparing base (7e2c9ee) to head (c134065).
Report is 102 commits behind head on main.

Files with missing lines Patch % Lines
...rc/lgdo/lh5/_serializers/read/vector_of_vectors.py 59.09% 9 Missing ⚠️
src/lgdo/lh5/_serializers/read/composite.py 69.56% 7 Missing ⚠️
src/lgdo/lh5/_serializers/read/ndarray.py 75.86% 7 Missing ⚠️
src/lgdo/lh5/store.py 82.92% 7 Missing ⚠️
src/lgdo/lh5/_serializers/read/utils.py 84.21% 3 Missing ⚠️
src/lgdo/lh5/core.py 91.42% 3 Missing ⚠️
src/lgdo/lh5/_serializers/read/encoded.py 83.33% 2 Missing ⚠️
src/lgdo/lh5/_serializers/read/array.py 88.88% 1 Missing ⚠️
src/lgdo/lh5/_serializers/read/scalar.py 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #100      +/-   ##
==========================================
+ Coverage   76.28%   76.69%   +0.41%     
==========================================
  Files          46       46              
  Lines        3036     3133      +97     
==========================================
+ Hits         2316     2403      +87     
- Misses        720      730      +10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@gipert gipert added performance Code performance lh5 HDF5 I/O labels Jul 12, 2024
@gipert
Copy link
Member

gipert commented Jul 12, 2024

@oschulz anything to be worried about from the Julia side?

@oschulz
Copy link

oschulz commented Jul 12, 2024

@oschulz anything to be worried about from the Julia side?

Probably not, but can you give a test output file to @apmypb for testing?

@oschulz
Copy link

oschulz commented Jul 12, 2024

Probably not, but can you give a test output file to @apmypb for testing?

The best thing would be to replace the HDF5 files (read and write again with these improvements) in the legend-testdata repo as part of a PR there. Then we can test on Julia and finally merge into legend-testdata.

@iguinn
Copy link
Contributor Author

iguinn commented Jul 20, 2024

Added a change to read files with locking=False (see #78 (comment))

@gipert
Copy link
Member

gipert commented Aug 12, 2024

I wonder if we really want to hardcode locking=False... Isn't it a useful feature?

@iguinn
Copy link
Contributor Author

iguinn commented Aug 12, 2024

Yeah, perhaps we should make this an argument that defaults to True for write and False for read?

There are some other settings that I hardcoded here:

                {
                    "fs_strategy": "page",
                    "fs_page_size": 65536,
                    "fs_persist": True,
                    "fs_threshold": 1,
                    "libver": ("latest", "latest"),
                }

which I can change.

@gipert
Copy link
Member

gipert commented Aug 13, 2024

I'm a bit reluctant about hardcoding settings that might produce unexpected behavior or sub-optimal performance on a different system than your test one. Looking at the h5py defaults:

https://github.com/h5py/h5py/blob/959604fe7f6ee1f2bdc5328e9036b96c0ec44323/h5py/_hl/files.py#L376

seems like your fs_threshold and libver values are the same? Should we avoid hardcoding then?

Why did you change fs_persist?

About fs_page_size: maybe we should override the page size value in the dataflow only? Or do you feel that this value should be universally good? Again, I'm worried about unexpected behavior in user applications.

@iguinn
Copy link
Contributor Author

iguinn commented Aug 27, 2024

I un-hardcoded these things. I added a locking option for read, which will default to False (I think this is ok in most of our use-cases and it helps with using the ro filesystem), and a page_buffer option for write which I am now defaulting to 0

@gipert gipert self-requested a review September 7, 2024 08:19
src/lgdo/lh5/store.py Outdated Show resolved Hide resolved
Comment on lines 292 to 300
if page_buffer:
file_kwargs.update(
{
"fs_strategy": "page",
"fs_page_size": page_buffer,
"fs_persist": True,
"fs_threshold": 1,
}
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is duplicated in this file and in core.py, can it be defined in only one place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back in this pull request #97, I moved a lot of the code that opened files and found the group being read in into store.read and core.read, which created a lot of redundancy. I think both of these functions should be revisited. I actually tried fixing this at some point by having store.read use core.read, but it caused an error that I didn't have time to sort through; as I recall it had to do with how the output is sometimes an lgdo object and some times a tuple.

src/lgdo/lh5/core.py Outdated Show resolved Hide resolved
src/lgdo/lh5/store.py Outdated Show resolved Hide resolved
src/lgdo/lh5/core.py Outdated Show resolved Hide resolved
@gipert
Copy link
Member

gipert commented Sep 10, 2024

Looks good to me, I'll merge. Thanks a lot Ian for all this!

@gipert gipert merged commit cd78184 into legend-exp:main Sep 10, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lh5 HDF5 I/O performance Code performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Increase read speed by x20-100 for most data LH5Store.read() performance
4 participants