-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
I/O performance improvements #100
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #100 +/- ##
==========================================
+ Coverage 76.28% 76.69% +0.41%
==========================================
Files 46 46
Lines 3036 3133 +97
==========================================
+ Hits 2316 2403 +87
- Misses 720 730 +10 ☔ View full report in Codecov by Sentry. |
@oschulz anything to be worried about from the Julia side? |
The best thing would be to replace the HDF5 files (read and write again with these improvements) in the legend-testdata repo as part of a PR there. Then we can test on Julia and finally merge into legend-testdata. |
Added a change to read files with locking=False (see #78 (comment)) |
I wonder if we really want to hardcode |
Yeah, perhaps we should make this an argument that defaults to True for write and False for read? There are some other settings that I hardcoded here:
which I can change. |
I'm a bit reluctant about hardcoding settings that might produce unexpected behavior or sub-optimal performance on a different system than your test one. Looking at the h5py defaults: seems like your Why did you change About |
I un-hardcoded these things. I added a locking option for read, which will default to False (I think this is ok in most of our use-cases and it helps with using the ro filesystem), and a page_buffer option for write which I am now defaulting to 0 |
src/lgdo/lh5/store.py
Outdated
if page_buffer: | ||
file_kwargs.update( | ||
{ | ||
"fs_strategy": "page", | ||
"fs_page_size": page_buffer, | ||
"fs_persist": True, | ||
"fs_threshold": 1, | ||
} | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is duplicated in this file and in core.py
, can it be defined in only one place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back in this pull request #97, I moved a lot of the code that opened files and found the group being read in into store.read
and core.read
, which created a lot of redundancy. I think both of these functions should be revisited. I actually tried fixing this at some point by having store.read
use core.read
, but it caused an error that I didn't have time to sort through; as I recall it had to do with how the output is sometimes an lgdo object and some times a tuple.
Co-authored-by: Luigi Pertoldi <[email protected]>
Co-authored-by: Luigi Pertoldi <[email protected]>
Co-authored-by: Luigi Pertoldi <[email protected]>
Looks good to me, I'll merge. Thanks a lot Ian for all this! |
_serializers.read.composite
tostore.read
andcore.read
. Only the_serializers
functions are using the low level API;read
itself is still usingh5py.File
to open the file get the top level group.Among these changes, the low level API made the biggest difference (a factor of almost 2), while the other two changes combine for an improvement of maybe 1.5 or so. The low level API makes a difference without reprocessing our files, while the other two changes happen at file write time, so will only be noticed after reprocessing