I'm waiting for pyiron to submit a large number of jobs right now, so I thought I'd have a look at how to make it faster. One of the major bottlenecks is calling `list_all()`/`list_groups()`/`list_nodes()` to check which datasets/groups are inside the HDF5 files. In fact, roughly 75%(!) of the time spent loading a small lammps job goes into `FileHDFio.list_all()`.

This PR makes two changes:
1. Read with `h5io.read_hdf5` directly instead of first checking whether the dataset is there and then reading it. This makes a simple read like `job['output/generic/energy_pot']` faster by about a factor of 5, so that it takes roughly the same amount of time as calling `h5io.read_hdf5` directly on a file.
2. Call `list_all()` instead of `list_nodes()` and `list_groups()` together. This saves opening the HDF5 file once.

Both changes together make loading a lammps job about 10% faster.
I want to mention that this is still about twice as slow as doing the read directly, so there's still room for improvement.
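For the second change, the point is that groups and nodes can be collected in a single pass over one open file handle. A rough sketch with `h5py` (illustrative only, not the actual `list_all()` implementation):

```python
import h5py

def list_all(filename, h5_path="/"):
    """Collect sub-groups and datasets under h5_path with one file
    open, rather than opening the file once for list_groups() and
    again for list_nodes(). Illustrative sketch, not the pyiron code."""
    groups, nodes = [], []
    with h5py.File(filename, "r") as f:
        for name, obj in f[h5_path].items():
            if isinstance(obj, h5py.Group):
                groups.append(name)
            else:
                nodes.append(name)
    return {"groups": groups, "nodes": nodes}
```

Opening an HDF5 file has a fixed cost (file locking, superblock parsing), so halving the number of opens per listing is a cheap win even before any caching.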