Improving performance for h5netcdf #195

hmaarrfk · 2022-11-30T19:07:00Z

What happened:

I've been trying to use h5netcdf since it gives access to the underlying h5py dataset which makes it easier to prototype with different HDF5 backends. netcdf4-c is much harder to compile and thus prototype with since many of the internals are not accessible.

My original concern arose from observations that, for some of our datasets, they open with

xarray + netcdf4-python in about 80 ms
xarray + h5netcdf: 600 ms.

Ultimately, I believe that I think things boil down to the fact an access to the _h5ds property is quite costly. This makes "every" operation "slow".

the code below:

import h5netcdf
from pathlib import Path

h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)

images = h5nc_file.variables['images']
for i in range(1000):
    images.shape

Using the spyder profile, one finds that 1000 calls to size, result in about 8000 calls to _h5ds

import h5netcdf
from pathlib import Path

h5nc_file = h5netcdf.File(
    Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc',
)

images = h5nc_file.variables['images']

for i in range(8000):
    images._h5ds

This is because this property is created on demand.

I think this has cascading effects because everything really relies on the access to the underlying h5py structure.

A previous suggestion tried to remove the property, but it seems like there was some concern about "resetting" the underlying _h5py pointer. I am unclear on where this kind of usecase would come up. My understanding is that this pointer would be something unchanging.

Expected Output

As a comparison, we can compare how long it takes to access the shape property from netcdf4:

import h5netcdf
from pathlib import Path
import netCDF4

h5nc_file = netCDF4.Dataset(
    str(Path.home() / 'sample_file_zebrafish_0.18.55_4uCams_small.nc'),
)

images = h5nc_file.variables['images']
for i in range(1000):
    images.shape

About 400 us instead of 350 ms.

Anything else we need to know?:

The sample image has 4 dimensions

sample_file_zebrafish_0.18.55_4uCams_small.zip

Version

Output of print(h5py.version.info, f"\nh5netcdf {h5netcdf.__version__}")

Summary of the h5py configuration

h5py 3.7.0
HDF5 1.12.2
Python 3.9.13 | packaged by ...
[GCC 10.4.0]
sys.platform linux
sys.maxsize 9223372036854775807
numpy 1.23.5
cython (built with) 0.29.32
numpy (built against) 1.20.3
HDF5 (built against) 1.12.2

h5netcdf 0.14.0.dev51+g8302776

The text was updated successfully, but these errors were encountered:

hmaarrfk · 2022-11-30T19:17:08Z

As an example, the following patch, avoids the creation of the _h5ds on demand, (it keeps a had reference is).

diff --git a/h5netcdf/core.py b/h5netcdf/core.py
index 39b0b5a..cb7f2b1 100644
--- a/h5netcdf/core.py
+++ b/h5netcdf/core.py
@@ -114,6 +114,10 @@ class BaseVariable(object):
         self._dimensions = dimensions
         self._initialized = True
 
+        # Always refer to the root file and store not h5py object
+        # subclasses:
+        self._h5ds = self._root._h5file[self._h5path]
+
     @property
     def _parent(self):
         return self._parent_ref()
@@ -122,12 +126,6 @@ class BaseVariable(object):
     def _root(self):
         return self._root_ref()
 
-    @property
-    def _h5ds(self):
-        # Always refer to the root file and store not h5py object
-        # subclasses:
-        return self._root._h5file[self._h5path]
-
     @property
     def name(self):
         """Return variable name."""
diff --git a/h5netcdf/dimensions.py b/h5netcdf/dimensions.py
index d291731..d5bda4d 100644
--- a/h5netcdf/dimensions.py
+++ b/h5netcdf/dimensions.py
@@ -86,8 +86,13 @@ class Dimension(object):
         else:
             self._root._max_dim_id += 1
         self._dimensionid = self._root._max_dim_id
-        if parent._root._writable and create_h5ds and not self._phony:
+        if self._phony:
+            self._h5ds = None
+        elif parent._root._writable and create_h5ds:
+            # Create scale will create the self._h5ds object
             self._create_scale()
+        else:
+            self._h5ds = self._root._h5file[self._h5path]
         self._initialized = True
 
     @property
@@ -131,12 +136,6 @@ class Dimension(object):
             return False
         return self._h5ds.maxshape == (None,)
 
-    @property
-    def _h5ds(self):
-        if self._phony:
-            return None
-        return self._root._h5file[self._h5path]
-
     @property
     def _isscale(self):
         return h5py.h5ds.is_scale(self._h5ds.id)
@@ -183,6 +182,7 @@ class Dimension(object):
                 dtype=">f4",
                 **kwargs,
             )
+        self._h5ds = self._root._h5file[self._h5path]
         self._h5ds.attrs["_Netcdf4Dimid"] = np.array(self._dimid, dtype=np.int32)
 
         if len(self._h5ds.shape) > 1:

but brings down from 500 ms (on my computer) to 22 ms

hmaarrfk · 2022-11-30T21:26:40Z

Relevant discussion: #117

kmuehlbauer · 2022-12-01T15:11:44Z

@hmaarrfk Thanks for providing all that information and your engagement to make h5netcdf more performant. I've taken a first step to minimize re-reading from underlying h5py/h5df in #196. The changes are minor, but have a huge effect on the presented use-case.

I've tried your patch here, but it broke the test suite. I'll look deeper into this the next days.

hmaarrfk · 2022-12-01T16:34:06Z

Thank you! My patch was more a proof of concept that h5netcdf doesn't have to be slow. I was trying to get pympler installed to recreate the cyclic dependency graphs before commenting further.

I also think that we (personally) need to tune our file structure. I remember reading about HDF5 metadata needing to be tuned for "large" files.

Were you able to recreate similar performance trends with a newly created hand crafted file? We likely generated the original netcdf4 with netcdf4-c (primarily due to performance concerns with h5netcdf).

kmuehlbauer · 2022-12-01T16:41:14Z

@hmaarrfk Do you think it would be useful to create several datasets from scratch with netCDF4-python/h5netcdf? We could build a benchmark suite with your and others specific use cases from that, also including examples from the past. At least this would give us some comparison between different approaches.

hmaarrfk · 2022-12-03T14:40:00Z

We could build a benchmark suite

Building benchmarking infrastructure is non-trivial. I'm happy to help creating the datasets, and "donating" some "real world" dataset.

hmaarrfk · 2022-12-03T15:23:52Z

For this particular case, I found that it isn't dependent on my own nc file

import h5netcdf
from pathlib import Path
from time import perf_counter, sleep

import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = Path(tmpdir) / 'h5netcdf_test.nc'

    h5nc_file = h5netcdf.File(filename, 'w')
    h5nc_file.dimensions = {'x': 1024}
    x = h5nc_file.dimensions['x']

    time_start = perf_counter()
    for i in range(1000):
        len(x)
    time_end = perf_counter()
    time_elapsed = time_end - time_start
    print(f"{time_elapsed:03.3f} --- Time to compute length 1000 times: ")
        
    time_start = perf_counter()
    for i in range(1000):
        x._h5ds
    time_end = perf_counter()
    time_elapsed = time_end - time_start
    print(f"{time_elapsed:03.3f} --- Time to access h5ds reference 1000 times")
    
    time_start = perf_counter()
    for i in range(1000):
        x._root._h5file[x._h5path]
    time_end = perf_counter()
    time_elapsed = time_end - time_start
    print(f"{time_elapsed:03.3f} --- Time to create h5ds reference 1000 times")
        
    h5nc_file.close()
    
# %% on main
# 0.053 --- Time to compute length 1000 times: 
# 0.046 --- Time to access h5ds reference 1000 times
# 0.046 --- Time to create h5ds reference 1000 times

# %% On https://github.com/h5netcdf/h5netcdf/pull/197
# 0.003 --- Time to compute length 1000 times: 
# 0.000 --- Time to access h5ds reference 1000 times
# 0.034 --- Time to create h5ds reference 1000 times

veenstrajelmer · 2024-08-21T13:09:21Z

I think I might be encountering the same issue when using xarray with engine="h5netcdf" (default is engine="netcdf4"). The memory usage is way lower with h5netcdf but reading the files is way slower. When profiling this I see that more than half the time is spent on name() in base.py, which seems not to be a complex function:
https://github.com/h5py/h5py/blob/f88df800a209d743d5dc1c8b30f0c1adc758ccf1/h5py/_hl/base.py#L287-L289

def name(self):
    """ Return the full name of this object.  None if anonymous. """
    return self._d(h5i.get_name(self.id))

However, the netcdf dataset I am testing with has >2410 variables and the call to name() is done >7300 times, which accounts for 29.8 seconds of the total 49.5 seconds for xr.open_mfdataset(). This seems to me a bit much. could it be that if you unfold your profiler it also ends up with a significant time spent on calls to name() or is it something related or completely different. I also found the h5py issue h5py/h5py#2395 to be potentially relevant.

hmaarrfk · 2024-08-21T13:15:11Z

I'm not sure if i cross referenced it here, but ultimately some of my speedups came in conflict with some file caching I couldn't figure out.

I tried to document it pydata/xarray#7359

veenstrajelmer · 2024-08-30T13:53:14Z

@hmaarrfk it seems that your xarray question is not being answered and the PR from @kmuehlbauer would be a useful fix but was also not merged. What is the best way to go forward? Is it realistic that either of these will be picked up further on the short term?

hmaarrfk · 2024-08-30T20:09:13Z

I'm honestly not sure. I've tried to understand the xarray caching file manager. It doesn't seem super intuitive to edit.

I feel like for many of my use cases, we've someewhat learned to live with it.

Sometimes it helps to learn that other people care. I open alot of issues on OSS projects, so sometimes its easy to think that "Mark is just opening them for awareness".

For example, I still owe this metadata:
pydata/xarray#9003 (comment)

hmaarrfk · 2024-08-30T22:07:44Z

Sometimes it helps to learn that other people care. I open alot of issues on OSS projects, so sometimes its easy to think that "Mark is just opening them for awareness".

I encourage you to comment on the issues I linked to.

veenstrajelmer · 2024-09-17T11:48:35Z

@hmaarrfk I cannot judge the last issue you linked, as for much of the other linked PR's and issues. I would gladly comment on issues if that seems helpful (I did on the one you linked just yet), but not sure where and who. I hope the open PR's/issues can be consolidated a bit and that the separate efforts can be combined.

kmuehlbauer mentioned this issue Dec 1, 2022

WIP: dimension performance #196

Open

3 tasks

kmuehlbauer added the enhancement label Jun 2, 2023

pnuu mentioned this issue Nov 3, 2023

Use cached properties to speed up dataset access #221

Open

3 tasks

veenstrajelmer mentioned this issue Aug 21, 2024

increase minimal xugrid version to speed up xu.open_dataset() Deltares/dfm_tools#968

Closed

6 tasks

This was referenced Aug 30, 2024

Consider h5netcdf dependency and default Deltares/dfm_tools#484

Open

high memory usage when appending uds to partitions list Deltares/xugrid#287

Closed

veenstrajelmer mentioned this issue Sep 17, 2024

Micro optimize dataset.isel for speed on large datasets pydata/xarray#9003

Open

4 tasks

veenstrajelmer mentioned this issue Oct 14, 2024

Python 3.13 is not supported mercator-ocean/copernicus-marine-toolbox#158

Closed

5 tasks

kmuehlbauer mentioned this issue Dec 29, 2024

Why not caching h5py dataset? #251

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improving performance for h5netcdf #195

Improving performance for h5netcdf #195

hmaarrfk commented Nov 30, 2022

Summary of the h5py configuration

hmaarrfk commented Nov 30, 2022

hmaarrfk commented Nov 30, 2022

kmuehlbauer commented Dec 1, 2022

hmaarrfk commented Dec 1, 2022

kmuehlbauer commented Dec 1, 2022

hmaarrfk commented Dec 3, 2022

hmaarrfk commented Dec 3, 2022

veenstrajelmer commented Aug 21, 2024 •

edited

Loading

hmaarrfk commented Aug 21, 2024

veenstrajelmer commented Aug 30, 2024 •

edited

Loading

hmaarrfk commented Aug 30, 2024

hmaarrfk commented Aug 30, 2024

veenstrajelmer commented Sep 17, 2024

Improving performance for h5netcdf #195

Improving performance for h5netcdf #195

Comments

hmaarrfk commented Nov 30, 2022

What happened:

Expected Output

Anything else we need to know?:

Version

Summary of the h5py configuration

hmaarrfk commented Nov 30, 2022

hmaarrfk commented Nov 30, 2022

kmuehlbauer commented Dec 1, 2022

hmaarrfk commented Dec 1, 2022

kmuehlbauer commented Dec 1, 2022

hmaarrfk commented Dec 3, 2022

hmaarrfk commented Dec 3, 2022

veenstrajelmer commented Aug 21, 2024 • edited Loading

hmaarrfk commented Aug 21, 2024

veenstrajelmer commented Aug 30, 2024 • edited Loading

hmaarrfk commented Aug 30, 2024

hmaarrfk commented Aug 30, 2024

veenstrajelmer commented Sep 17, 2024

veenstrajelmer commented Aug 21, 2024 •

edited

Loading

veenstrajelmer commented Aug 30, 2024 •

edited

Loading