versioned-hdf5 is an abstraction layer on top of h5py which adds diff-based versioning to HDF5 files. It does so with copy-on-write at the chunk level. When the user modifies a dataset:
a. they now have two read-only datasets, before and after the change, and
b. the total disk usage is the size of the original dataset plus the size of the changed chunks.
This is achieved by creating a virtual dataset for each version, stitched together from individual chunks of a plain "raw data" dataset. Every time the user commits a new version which changes a subset of chunks, the original chunks are not modified; instead, versioned-hdf5 appends the changed chunks to the raw dataset and creates a brand new virtual dataset:
raw_data: shape=(5000, 1000), chunks=(1000, 1000)
(plain HDF5 dataset)

    [ raw0 ]
    [ raw1 ]
    [ raw2 ]
    [ raw3 ]
    [ raw4 ]

version 1: shape=(2000, 2000), chunks=(1000, 1000)
(virtual dataset; stitched from 4 mappings onto different chunks of raw_data in the same file)

    [ raw0 ][ raw1 ]
    [ raw2 ][ raw3 ]

version 2: shape=(2000, 2000), chunks=(1000, 1000)
(virtual dataset; stitched from 4 mappings onto different chunks of raw_data in the same file)

    [ raw0 ][ raw4 ]
    [ raw2 ][ raw3 ]
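To make the stitching concrete, here is a minimal C sketch of how one such version could be assembled with the plain VDS API. This is not versioned-hdf5's actual code (which goes through h5py); the file name, dtype and lack of error handling are simplifications, and the shapes follow the diagram above.

    #include <hdf5.h>

    #define CHUNK 1000

    int main(void)
    {
        hid_t file = H5Fcreate("versions.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

        /* plain raw dataset: 5 chunks of (1000, 1000) stacked vertically */
        hsize_t raw_dims[2]   = {5 * CHUNK, CHUNK};
        hsize_t chunk_dims[2] = {CHUNK, CHUNK};
        hid_t raw_space = H5Screate_simple(2, raw_dims, NULL);
        hid_t raw_dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(raw_dcpl, 2, chunk_dims);
        hid_t raw = H5Dcreate2(file, "raw_data", H5T_NATIVE_DOUBLE, raw_space,
                               H5P_DEFAULT, raw_dcpl, H5P_DEFAULT);

        /* virtual dataset "version_1": shape (2000, 2000), stitched from raw0..raw3 */
        hsize_t v_dims[2] = {2 * CHUNK, 2 * CHUNK};
        hid_t vspace = H5Screate_simple(2, v_dims, NULL);
        hid_t v_dcpl = H5Pcreate(H5P_DATASET_CREATE);

        hsize_t count[2] = {CHUNK, CHUNK};
        for (int i = 0; i < 4; i++) {
            /* destination chunk (row-major 2x2 grid) and source chunk (i-th row block) */
            hsize_t v_start[2]   = {(hsize_t)(i / 2) * CHUNK, (hsize_t)(i % 2) * CHUNK};
            hsize_t raw_start[2] = {(hsize_t)i * CHUNK, 0};

            H5Sselect_hyperslab(vspace, H5S_SELECT_SET, v_start, NULL, count, NULL);
            hid_t src_space = H5Screate_simple(2, raw_dims, NULL);
            H5Sselect_hyperslab(src_space, H5S_SELECT_SET, raw_start, NULL, count, NULL);

            /* one mapping per chunk; "." means "the source lives in this same file".
               Note that the source file and dataset names are repeated on every call. */
            H5Pset_virtual(v_dcpl, vspace, ".", "/raw_data", src_space);
            H5Sclose(src_space);
        }

        hid_t version1 = H5Dcreate2(file, "version_1", H5T_NATIVE_DOUBLE, vspace,
                                    H5P_DEFAULT, v_dcpl, H5P_DEFAULT);

        H5Dclose(version1);
        H5Pclose(v_dcpl);
        H5Sclose(vspace);
        H5Dclose(raw);
        H5Pclose(raw_dcpl);
        H5Sclose(raw_space);
        H5Fclose(file);
        return 0;
    }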
The problem we're facing is that libhdf5 performs very poorly in H5Dopen, presumably because nobody else is handling this many virtual datasets, each stitched from such a huge number of references to the underlying dataset.
E.g. if you have 100 versions of a dataset with 10k chunks, the HDF5 file will contain 1 raw dataset plus 100 virtual datasets, each with 10k references to the same raw dataset, i.e. a million mappings in total.
The problem becomes unbearable with h5repack. Here's a demo script that builds 95 datasets of 44 to 440 chunks each, then creates 100 incremental diff versions of them: https://gist.github.com/crusaderky/b91549221447e966fb2b22c5177df724
h5repack is exceptionally slow to duplicate the resulting file (all tests on NVMe):
$ time h5repack ~/dset.h5 ~/out.h5
real 1m27.546s
user 1m17.833s
sys 0m9.685s
If I replace the virtual datasets with real datasets that conceptually store the same metadata, I get a drastic improvement:
$ time h5repack ~/dset_no_virtual.h5 ~/out.h5
real 0m12.131s
user 0m2.399s
sys 0m9.708s
Profiling shows that more than half of the time is, unsurprisingly, spent in H5Dopen2 opening the virtual datasets.
Before I redesign versioned-hdf5 from scratch to avoid using virtual datasets, I would like to figure out if it's possible to improve the situation in libhdf5 or its tools.

For reference, this is where versioned-hdf5 builds its virtual datasets:
https://github.com/deshaw/versioned-hdf5/blob/1a22450e90cea878ed16f99f42a4c82eb966249f/versioned_hdf5/backend.py#L465-L477

When calling H5Pset_virtual, I wonder if it would be possible to allow leaving the dataset name NULL in all calls beyond the first one for a virtual dataset; H5Dopen would need to be changed to match. I'm unsure however how much benefit such a change would cause.
h5repack opens each dataset with H5Dopen2 at several points (h5repack_copy.c lines 803, 885, and 1315 to 1317, and h5repack_refs.c lines 102 and 318, as of 331193f), e.g.:

    if ((dset_out = H5Dopen2(fidout, travt->objs[i].name, H5P_DEFAULT)) < 0)
In theory, I could reduce them to two (one for the input and one for the output) by keeping the object alive for the whole duration of h5repack. However, that would mean having all datasets open at the same time. Are there limits / caveats regarding the number of open objects referenced by hid_t?
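I don't know the answer to that question; as a data point, the number of live handles can at least be inspected with H5Fget_obj_count. A minimal sketch, not h5repack code, reusing the toy file and dataset names from the sketch further above:

    #include <hdf5.h>
    #include <stdio.h>

    int main(void)
    {
        hid_t file = H5Fopen("versions.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        if (file < 0)
            return 1;

        /* keep a couple of datasets open at the same time, as the
           "two opens per dataset" idea above would do for every dataset */
        hid_t raw = H5Dopen2(file, "/raw_data", H5P_DEFAULT);
        hid_t v1  = H5Dopen2(file, "/version_1", H5P_DEFAULT);

        ssize_t n_dsets = H5Fget_obj_count(file, H5F_OBJ_DATASET);
        ssize_t n_all   = H5Fget_obj_count(file, H5F_OBJ_ALL);
        printf("open datasets: %zd, open objects including the file: %zd\n",
               n_dsets, n_all);

        H5Dclose(v1);
        H5Dclose(raw);
        H5Fclose(file);
        return 0;
    }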
Alternatively to the previous point, I could refactor tools/src/h5repack/h5repack_copy.c::do_copy_objects and h5repack_refs.c::do_copy_refsobjs to process a single dataset from beginning to end before moving on to the next. I guess, however, that the current algorithm is breadth-first because you need to ensure that target datasets exist before creating cross-references to them?
Finally, I wonder if I actually need to call H5Dopen2 at all for virtual datasets. Are there conditions I can test beforehand to skip it?
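The only way I know of to tell that a dataset is virtual is through its creation property list, which requires exactly the H5Dopen2 call I'm trying to avoid. For reference, a sketch of that check, again against the toy file from above:

    #include <hdf5.h>
    #include <stdio.h>

    int main(void)
    {
        hid_t file = H5Fopen("versions.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/version_1", H5P_DEFAULT);  /* the expensive call */
        hid_t dcpl = H5Dget_create_plist(dset);

        if (H5Pget_layout(dcpl) == H5D_VIRTUAL) {
            size_t count = 0;
            H5Pget_virtual_count(dcpl, &count);  /* number of source mappings */
            printf("virtual dataset with %zu mappings\n", count);
        }

        H5Pclose(dcpl);
        H5Dclose(dset);
        H5Fclose(file);
        return 0;
    }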
Any other ideas?