Poor h5repack / H5DOpen performance for virtual datasets with 1000+ real underlying datasets #5187

Open
crusaderky opened this issue Dec 20, 2024 · 1 comment


crusaderky commented Dec 20, 2024

versioned-hdf5 is an abstraction layer on top of h5py which adds diff versioning to HDF5 files. This is done with copy-on-write at chunk level. When the user modifies a dataset:
a. they end up with two read-only datasets, before and after the change, and
b. the total disk usage is the size of the original dataset plus the size of the changed chunks.

This is achieved by creating a virtual dataset for each version, stitched together from individual chunks of a plain "raw data" dataset. Every time the user commits a new version that changes a subset of chunks, the original chunks are not modified; instead versioned-hdf5 appends the changed chunks to the raw dataset and creates a brand new virtual dataset:

raw_data: shape=(5000, 1000), chunks=(1000, 1000)
(plain HDF5 dataset)
raw0
raw1
raw2
raw3
raw4 

version 1: shape=(2000, 2000), chunks=(1000, 1000)
(virtual dataset; stitched from 4 datasets which are different chunks of raw_data in the same file)
raw0 raw1
raw2 raw3

version 2: shape=(2000, 2000), chunks=(1000, 1000)
(virtual dataset; stitched from 4 datasets which are different chunks of raw_data in the same file)
raw0 raw4
raw2 raw3

The problem we're facing is that libhdf5 performs very poorly in H5DOpen, presumably because nobody else is handling huge numbers of virtual datasets, each stitched from a huge number of references to underlying datasets.
E.g. if you have 100 versions of a dataset with 10k chunks, the HDF5 file will contain 1 raw dataset plus 100 virtual datasets, each with 10k references to the same raw dataset.

The problem becomes unbearable with h5repack.
Here's a demo script that builds 95 datasets of 44 to 440 chunks each, then proceeds to create 100 incremental diff versions from them:
https://gist.github.com/crusaderky/b91549221447e966fb2b22c5177df724

h5repack is exceptionally slow to duplicate it (all tests on NVMe):

$ time h5repack ~/dset.h5 ~/out.h5

real    1m27.546s
user    1m17.833s
sys     0m9.685s

If I replace the virtual datasets with real datasets that conceptually store the same metadata, I get a drastic improvement:

$ time h5repack ~/dset_no_virtual.h5 ~/out.h5

real    0m12.131s
user    0m2.399s
sys     0m9.708s

Profiling shows that more than half the time is, unsurprisingly, spent by H5DOpen2 opening the virtual datasets:

[profiler screenshot]

Before I redesign versioned-hdf5 from scratch to avoid using virtual datasets, I would like to figure out if it's possible to improve the situation in libhdf5:

  1. When creating a virtual dataset, we're referencing the same raw dataset over and over again:
    https://github.com/deshaw/versioned-hdf5/blob/1a22450e90cea878ed16f99f42a4c82eb966249f/versioned_hdf5/backend.py#L465-L477
    When calling H5Pset_virtual, I wonder if it would be possible to allow leaving the dataset name NULL in all calls beyond the first one for a virtual dataset. H5DOpen would need to be changed to match. However, I'm unsure how much benefit such a change would bring.
  2. Are there any other performance improvements that I can attempt in H5DOpen?
  3. In h5repack itself, there are 5 calls to H5DOpen for each virtual dataset:

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)
H5TOOLS_GOTO_ERROR((-1), "H5Dopen2 failed");
if ((dset_out = H5Dopen2(fidout, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_out = H5Dopen2(fidout, travt->objs[i].name, H5P_DEFAULT)) < 0)

In theory, I could reduce them to two (one for the input and one for the output) by keeping the objects alive for the whole duration of h5repack; a sketch of this handle-caching idea follows this list. However, that would mean having all datasets open at the same time. Are there limits / caveats regarding the number of open objects referenced by hid_t?

  4. Alternatively to (3), I could refactor tools/src/h5repack/h5repack_copy.c::do_copy_objects and h5repack_refs.c::do_copy_refsobjs to process a single dataset from beginning to end before moving on to the next. My guess, however, is that the current algorithm is breadth-first because target datasets need to exist before cross-references to them can be created?
  5. Finally, I wonder if I actually need to call H5DOpen2 at all for virtual datasets? Are there conditions I can test beforehand that would let me skip it?
  6. Any other ideas?
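To make point 3 concrete, here is a rough sketch of the handle-caching idea; the struct and field names are hypothetical and do not match h5repack's actual trav_table_t layout:

#include "hdf5.h"

/* Hypothetical cache entry: each input dataset is opened once and the
 * handle is reused by every code path that currently reopens it. */
typedef struct {
    const char *name;     /* object path, as in travt->objs[i].name */
    hid_t       dset_in;  /* H5I_INVALID_HID until first use */
} obj_cache_t;

static hid_t get_dset_in(hid_t fidin, obj_cache_t *obj)
{
    if (obj->dset_in == H5I_INVALID_HID)                        /* first use: open */
        obj->dset_in = H5Dopen2(fidin, obj->name, H5P_DEFAULT);
    return obj->dset_in;                                        /* later uses: reuse the same hid_t */
}

The cached handles would only be released at the very end, so the question above still stands: how much memory and how many live hid_t IDs this implies for a file with thousands of virtual datasets is what decides whether it's viable.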
fortnern (Member) commented:
Thanks for the report. Do you see the same poor performance if you use the C API routine H5Ocopy() instead of h5repack?
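As a rough illustration of that suggestion, a minimal H5Ocopy-based copy loop might look like the following; the file names are placeholders, it copies every root-level object wholesale, skipping all of h5repack's per-dataset processing, and it assumes the HDF5 1.12+ H5Literate2 API:

#include "hdf5.h"

/* Copy one root-level object from the input file to the output file. */
static herr_t copy_one(hid_t root, const char *name,
                       const H5L_info2_t *info, void *op_data)
{
    hid_t fout = *(hid_t *)op_data;
    (void)info;
    return H5Ocopy(root, name, fout, name, H5P_DEFAULT, H5P_DEFAULT);
}

int main(void)
{
    hid_t fin  = H5Fopen("dset.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t fout = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    /* Walk the root group and copy each object (and everything below it). */
    H5Literate2(fin, H5_INDEX_NAME, H5_ITER_NATIVE, NULL, copy_one, &fout);
    H5Fclose(fout);
    H5Fclose(fin);
    return 0;
}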
