Poor h5repack / H5DOpen performance for virtual datasets with 1000+ real underlying datasets #5187

Open
crusaderky opened this issue Dec 20, 2024 · 1 comment


crusaderky commented Dec 20, 2024

versioned-hdf5 is an abstraction layer on top of h5py which adds diff versioning to HDF5 files. This is done with copy-on-write at chunk level. When the user modifies a dataset:
a. they end up with two read-only datasets, before and after the change, and
b. the total disk usage is the size of the original dataset plus the size of the changed chunks.

This is achieved by creating a virtual dataset for each version, stitched together from individual chunks of a plain "raw data" dataset. Every time the user commits a new version that changes a subset of chunks, the original chunks are not modified; instead versioned-hdf5 appends the changed chunks to the raw dataset and creates a brand new virtual dataset:

raw_data: shape=(5000, 1000), chunks=(1000, 1000)
(plain HDF5 dataset)
raw0
raw1
raw2
raw3
raw4 

version 1: shape=(2000, 2000), chunks=(1000, 1000)
(virtual dataset; stitched from 4 datasets which are different chunks of raw_data in the same file)
raw0 raw1
raw2 raw3

version 2: shape=(2000, 2000), chunks=(1000, 1000)
(virtual dataset; stitched from 4 datasets which are different chunks of raw_data in the same file)
raw0 raw4
raw2 raw3

The problem we're facing is that libhdf5 performs very poorly in H5DOpen, presumably because nobody else is handling huge numbers of virtual datasets, each stitched from a huge number of references to underlying datasets.
E.g. if you have 100 versions of a dataset with 10k chunks, the HDF5 file will contain 1 raw dataset plus 100 virtual datasets, each with 10k references to the same raw dataset.

The problem becomes unbearable with h5repack.
Here's a demo script that builds 95 datasets of 44 to 440 chunks each, then proceeds to create 100 incremental diff versions from them:
https://gist.github.com/crusaderky/b91549221447e966fb2b22c5177df724

h5repack is exceptionally slow to duplicate it (all tests on NVMe):

$ time h5repack ~/dset.h5 ~/out.h5

real    1m27.546s
user    1m17.833s
sys     0m9.685s

If I replace the virtual datasets with real datasets that conceptually store the same metadata, I get a drastic improvement:

$ time h5repack ~/dset_no_virtual.h5 ~/out.h5

real    0m12.131s
user    0m2.399s
sys     0m9.708s

Profiling shows that more than half the time is, unsurprisingly, spent by H5DOpen2 opening the virtual datasets:

[profiler screenshot]

Before I redesign versioned-hdf5 from scratch to avoid using virtual datasets, I would like to figure out if it's possible to improve the situation in libhdf5:

  1. When creating a virtual dataset, we're referencing the same raw dataset over and over again:
    https://github.com/deshaw/versioned-hdf5/blob/1a22450e90cea878ed16f99f42a4c82eb966249f/versioned_hdf5/backend.py#L465-L477
    When calling H5Pset_virtual, I wonder if it would be possible to allow leaving the dataset name NULL in all calls beyond the first one for a virtual dataset. H5DOpen would need to be changed to match. However, I'm unsure how much benefit such a change would bring.
  2. Are there any other performance improvements that I can attempt in H5DOpen?
  3. In h5repack itself, there are 5 calls to H5DOpen for each virtual dataset:

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)
H5TOOLS_GOTO_ERROR((-1), "H5Dopen2 failed");
if ((dset_out = H5Dopen2(fidout, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_in = H5Dopen2(fidin, travt->objs[i].name, H5P_DEFAULT)) < 0)

if ((dset_out = H5Dopen2(fidout, travt->objs[i].name, H5P_DEFAULT)) < 0)

In theory, I could reduce them to two (one for the input and one for the output) by keeping the objects alive for the whole duration of h5repack; a sketch of this handle-caching idea follows this list. However, that would mean having all datasets open at the same time. Are there limits / caveats regarding the number of open objects referenced by hid_t?

  4. Alternatively to (3), I could refactor tools/src/h5repack/h5repack_copy.c::do_copy_objects and h5repack_refs.c::do_copy_refsobjs to process a single dataset from beginning to end before moving on to the next. My guess, however, is that the current algorithm is breadth-first because target datasets need to exist before cross-references to them can be created?
  5. Finally, I wonder if I actually need to call H5DOpen2 at all for virtual datasets? Are there conditions I can test beforehand that would let me skip it?
  6. Any other ideas?
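To make point 3 concrete, here is a rough sketch of the handle-caching idea; the struct and field names are hypothetical and do not match h5repack's actual trav_table_t layout:

#include "hdf5.h"

/* Hypothetical cache entry: each input dataset is opened once and the
 * handle is reused by every code path that currently reopens it. */
typedef struct {
    const char *name;     /* object path, as in travt->objs[i].name */
    hid_t       dset_in;  /* H5I_INVALID_HID until first use */
} obj_cache_t;

static hid_t get_dset_in(hid_t fidin, obj_cache_t *obj)
{
    if (obj->dset_in == H5I_INVALID_HID)                        /* first use: open */
        obj->dset_in = H5Dopen2(fidin, obj->name, H5P_DEFAULT);
    return obj->dset_in;                                        /* later uses: reuse the same hid_t */
}

The cached handles would only be released at the very end, so the question above still stands: how much memory and how many live hid_t IDs this implies for a file with thousands of virtual datasets is what decides whether it's viable.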
fortnern (Member) commented:
Thanks for the report. Do you see the same poor performance if you use the C API routine H5Ocopy() instead of h5repack?
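As a rough illustration of that suggestion, a minimal H5Ocopy-based copy loop might look like the following; the file names are placeholders, it copies every root-level object wholesale, skipping all of h5repack's per-dataset processing, and it assumes the HDF5 1.12+ H5Literate2 API:

#include "hdf5.h"

/* Copy one root-level object from the input file to the output file. */
static herr_t copy_one(hid_t root, const char *name,
                       const H5L_info2_t *info, void *op_data)
{
    hid_t fout = *(hid_t *)op_data;
    (void)info;
    return H5Ocopy(root, name, fout, name, H5P_DEFAULT, H5P_DEFAULT);
}

int main(void)
{
    hid_t fin  = H5Fopen("dset.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t fout = H5Fcreate("out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    /* Walk the root group and copy each object (and everything below it). */
    H5Literate2(fin, H5_INDEX_NAME, H5_ITER_NATIVE, NULL, copy_one, &fout);
    H5Fclose(fout);
    H5Fclose(fin);
    return 0;
}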
