
Support HDF5 reading/writing (with optional dependencies) #3520

Open · ritchie46 opened this issue May 28, 2022 · 41 comments
Labels
A-io (Area: reading and writing data) · accepted (Ready for implementation) · enhancement (New feature or an improvement of an existing feature) · help wanted (Extra attention is needed) · python (Related to Python Polars)

Comments

@ritchie46
Member

We can reduce friction by figuring out how to load data most efficiently to polars memory.

@ritchie46 ritchie46 added feature good first issue Good for newcomers labels May 28, 2022
@jorgecarleitao
Collaborator

If there is a backend for this in Rust, I think we could work it out in arrow2. It is quite an important format, imo.

@ritchie46
Member Author

That's a good idea too!

@ghuls
Collaborator

ghuls commented May 28, 2022

HDF5 has a very big specification: https://docs.hdfgroup.org/hdf5/develop/_f_m_t3.html
and as HDF5 is very similar to a filesystem, data stored in HDF5 can be stored in quite a lot of different ways.

Rust bindings to libhdf5 can be found at: https://github.com/aldanor/hdf5-rust

@ritchie46
Member Author

I think we should explore both. Rust backed under a feature flag, and python as optional dependency. I can imagine that it increases binary size quite a bit.

@JackKelly

I'm really excited about Polars. But almost all of my data is in large HDF5 files (actually, NetCDF). I can convert to Parquet files. But loading directly from HDF5 into Polars would be ideal 🙂

Are there any plans to add support to Polars to read HDF5 (ideally lazily)?

@ghuls
Collaborator

ghuls commented Jan 10, 2023

There is this rust crate for reading: https://github.com/georust/netcdf
But as far as I understand, NetCDF contains mostly multidimensional data instead of 1D arrays like the arrow format, so I am not sure how useful it would be in general to even consider support for this.

@JackKelly

Good point!

For my work, I use NetCDF for n-dimensional arrays, and 2d arrays.

But, if I'm reading this right, this comment suggests that tensor types might come to Polars at some point. It'd be absolutely amazing to be able to load n-dim NetCDF files directly into an n-dimensional in-memory array in Polars 🙂

@aldanor
Contributor

aldanor commented Feb 26, 2023

(hdf5-rust author here)

Rust bindings to libhdf5 can be found at: https://github.com/aldanor/hdf5-rust

These are not just bindings to libhdf5 (that's the hdf5-sys crate, which is part of it and which the netcdf crate itself depends on); there's quite a bit more, like the whole #[derive(...)] shebang for struct-like types, thread-safe HDF5 handle management, etc.

Re: NetCDF, while it uses HDF5 as a storage medium, it's not the same thing; it's more like an opinionated meta-format on top of it that is very popular in some communities (e.g. geo).

I think we could make HDF5 work with polars, but it would be nice to have something like a spec, or at least a wishlist with some examples – i.e. what do we want?

loading directly from HDF5 into Polars would be ideal

Problem is, polars/arrow/parquet etc. are column-based, whereas hdf5 is array-based. If you have a polars frame with columns a: i64, b: f64 and you plan on reading/writing that to HDF5, there are a few ways you can do this:

  • create a struct type { a: i64, b: f64 } and store it as one structured dataset
  • store each column separately, like /path/a, /path/b (I believe that's one of the possible ways to dump it to hdf5 from pandas?)
  • (can come up with other options)

There's even more ambiguity when reading existing data: if you have a structured HDF5 dataset with fields "a" and "b", you may want

  • to read it as a frame with two columns 'a' and 'b'
  • to read it as a series with a structured type
  • etc.

One way to go would be to check what pandas does and do the same thing, so you can dump a dataframe from pandas and read it back from polars. Perhaps that's the easiest way to get started.
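
To make the two write layouts concrete, here is a minimal h5py sketch; the file name, dataset names, and group path are just for illustration, not an agreed convention:

```python
import h5py
import numpy as np

a = np.arange(3, dtype=np.int64)
b = np.array([1.0, 2.0, 3.0])

with h5py.File("frame.h5", "w") as f:
    # option 1: one structured dataset with a compound type { a: i64, b: f64 }
    rows = np.empty(3, dtype=[("a", "<i8"), ("b", "<f8")])
    rows["a"], rows["b"] = a, b
    f.create_dataset("df_structured", data=rows)

    # option 2: one dataset per column under a common group
    grp = f.create_group("df_columns")
    grp.create_dataset("a", data=a)
    grp.create_dataset("b", data=b)
```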

@ritchie46
Member Author

One way to go would be to check what pandas does and do the same thing, so you can dump a dataframe from pandas and read it back from polars. Perhaps that's the easiest way to get started.

I believe that most of the hdf5 files that we are expected to be able to read are created by pandas. So, yes, I think we should start with supporting what they do.

@JackKelly

I believe that most of the hdf5 files that we are expected to be able to read are created by pandas.
So, yes, I think we should start with supporting what they do.

I agree, copying Pandas' behaviour sounds like a great place to start!

@ghuls
Collaborator

ghuls commented Mar 1, 2023

Pandas to_hdf() uses pytables to store its dataframes: https://www.pytables.org/

PyTables basic file format overview:
https://www.pytables.org/usersguide/file_format.html
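
For reference, a minimal sketch of that pandas/pytables round trip (requires the tables package to be installed):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# format="table" writes a queryable PyTables Table;
# the default format="fixed" is faster but not appendable
df.to_hdf("data.h5", key="df", mode="w", format="table")

round_tripped = pd.read_hdf("data.h5", key="df")
```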

@dougbrennan

dougbrennan commented Apr 2, 2023

I have an application that uses h5py to read h5 data into pandas and I've started to convert this app from Pandas to Polars - the data is relatively big and I'm looking for more performance.

I use h5py to read the h5 datasets into numpy structured arrays (heterogeneous types) and the numpy structured arrays transfer very easily into pandas dataframes.

But getting that same data into a Polars dataframe is proving to be a problem - basically the structure of the numpy structured array is lost and I end up with a single column dataframe with an object dtype.

I suspect there are many users who get their h5 data via h5py, and for these users, just supporting fast/easy construction of polars dataframes from numpy structured arrays would be ideal.
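
A sketch of that workflow (the file name, dataset name, and fields are hypothetical):

```python
import h5py
import pandas as pd

with h5py.File("measurements.h5", "r") as f:
    arr = f["results"][...]   # numpy structured array with heterogeneous field types

df = pd.DataFrame(arr)         # pandas picks up the columns from arr.dtype.names
```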

@dougbrennan

I believe that most of the hdf5 files that we are expected to be able to read are created by pandas.
So, yes, I think we should start with supporting what they do.

I agree, copying Pandas' behaviour sounds like a great place to start!

Copying Pandas behavior for creating dataframes from numpy structured arrays would be great!!!

@FObersteiner

FObersteiner commented Apr 4, 2023

I use h5py to read the h5 datasets into numpy structured arrays (heterogeneous types) and the numpy structured arrays transfer very easily into pandas dataframes.

It seems np structured array is only a helper here, which the pandas dataframe constructor knows how to handle. So the 'clean' way would be to access the hdf5 directly. But I must admit I have no idea if that is the easier option.
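
For instance, one could skip pandas entirely and build the frame column by column from the structured array (file and dataset names hypothetical, as above):

```python
import h5py
import polars as pl

with h5py.File("measurements.h5", "r") as f:
    arr = f["results"][...]   # structured array straight from the dataset

# one typed polars Series per field, no pandas in between
df = pl.DataFrame({name: arr[name] for name in arr.dtype.names})
```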

@Ben-Epstein

It may be worth considering leveraging Vaex for this. It can read/write hdf5 files natively, and they map directly to/from numpy arrays:

```python
import vaex
import numpy as np

x = np.arange(1000)
y = np.random.rand(1000)
z = np.array(['dog', 'cat'] * 500)

# build a vaex dataframe directly from numpy arrays
df_numpy = vaex.from_arrays(x=x, y=y, z=z)
print(df_numpy)  # display(df_numpy) in a notebook

# write to hdf5 and memory-map it back
df_numpy.export("file.hdf5")
df_numpy_hdf5 = vaex.open("file.hdf5")

# columns map straight back to numpy
x = df_numpy_hdf5.x.to_numpy()
print(x.dtype, x)
```

@alexander-beedie
Collaborator

alexander-beedie commented May 1, 2023

just supporting fast/easy construction of polars dataframes from numpy structured arrays would be ideal

Done... structured array support (both initialising frames and exporting from them) will be available in the upcoming 0.17.12 release.
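
A quick sketch of the new round trip (assuming polars >= 0.17.12; the structured-array output parameter shown is how that release exposes it, to the best of my knowledge):

```python
import numpy as np
import polars as pl

arr = np.array([(1, 2.0), (3, 4.0)], dtype=[("a", "<i8"), ("b", "<f8")])

df = pl.DataFrame(arr)                 # structured array in: two typed columns
back = df.to_numpy(structured=True)    # and a structured array out again
```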

@stinodego stinodego added enhancement New feature or an improvement of an existing feature and removed feature labels Jul 14, 2023
@denis-bz

denis-bz commented Aug 6, 2023

Folks,

Tall towers like

dask, zarr ...
xarray
netCDF4
hdf5

have many users in geographic info -- but towers get shaky as you add more and more stories
(medieval cathedrals got higher and higher, a few collapsed).
See numpy indexes small amounts of data 1000 faster than xarray (2019)

Not sure what's going on under the hood

and Moving-away-hdf5 (2016) -- excellent.

Fwiw, my use case: 100 or so 2 GB hdf5 files (24*30, 824, 848)
https://opendata.dwd.de/climate_environment/REA/COSMO_REA6/converted/hourly/2D/WS_100m.2D.201801.nc4
I want slices like wind_data[:, :100, :100].moveaxis(0, -1) in numpy.
[:, :n, :n] takes 9 sec with xarray on my old iMac with 16 GB, even for n=1.
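
For that access pattern, reading the slab straight through h5py avoids the xarray indexing layers (NetCDF-4 files are HDF5 containers, so h5py can open them directly); the variable name below is a guess:

```python
import h5py
import numpy as np

with h5py.File("WS_100m.2D.201801.nc4", "r") as f:
    ds = f["wind_speed"]            # hypothetical variable name; list(f) shows the real ones
    block = ds[:, :100, :100]       # only the touched chunks are read and decompressed
sub = np.moveaxis(block, 0, -1)     # (time, y, x) -> (y, x, time)
```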

@JackKelly

On the topic of the performance of Zarr (and xarray), we've started a concerted effort to significantly speed up Zarr, possibly by re-writing a bunch of it from scratch in Rust (partly inspired by Polars!): zarr-developers/zarr-python#1479

I agree 100% with your general point: A lot of "scientific Python" code is built on 30 year old foundations. And those foundations are starting to look a little shaky! (Because high-performance code today looks quite different to high-performance code from 30 years ago). So I do agree that there's an argument for thinking hard about re-building some of these things from the ground-up.

@denis-bz

denis-bz commented Aug 7, 2023

@JackKelly
[xkcd "Dependency" comic]

Seems that hdf5 files can be horribly complicated
(which I guess everybody knows, but I didn't --
after too-long surfing, h5stat of my .nc4 is 100 lines, wow:
2 GB flat to 1.5 GB compressed tree => s l o w reads).

From the top floor of this tower you won't even SEE the ancient crud way down below,
let alone be able to point a user to it.
Is there a testbench which can track time and space
through n layers of python, cython, dylibs ... in this simple case?

@hugopendlebury

I'm really excited about Polars. But almost all of my data is in large HDF5 files (actually, NetCDF). I can convert to Parquet files. But loading directly from HDF5 into Polars would be ideal 🙂

Are there any plans to add support to Polars to read HDF5 (ideally lazily)?

Interesting. Are you working with weather data? I knocked together a quick project which uses ECMWF's eccodes library and exposes the data as arrow: https://github.com/hugopendlebury/gribtoarrow

Would be interested in doing something similar with HDF5. Do you have any sample files?

@denis-bz

denis-bz commented Jan 27, 2024

Hi Hugo,

fwiw, a trivial test case is Randomcube(gb=), see at the end of Specify chunks in bytes;
also numpy indexes small amounts of data 1000 faster than xarray.

On weather data, https://opendata.dwd.de/climate_environment/REA/COSMO_REA6/converted/hourly/2D/*
has 2 GB files (compressed to 1.5 GB, so uncompressing is slow, compressing very slow).

A big unknown in runtimes is cache performance, SSD, L1, L2 ...;
I imagine they affect runtimes as much as CPU GHz,
but I don't have even a 0-order model, sum coefs * us.
Apple and Intel have surely put lots of $$$ and manpower into models of caches and filesystems --
they might have open test benches worth looking at, dunno.
(Is there a form of Amdahl's law for CPU + multilevel caches?)

cheers
-- denis

@vsisl

vsisl commented Mar 22, 2024

Pandas to_hdf() uses pytables to store its dataframes: https://www.pytables.org/

PyTables basic file format overview: https://www.pytables.org/usersguide/file_format.html

In one of my projects, we use a python stack with pandas doing all the DataFrame stuff. We're currently depending on an SQL database but want to migrate to pytables. It would be amazing if polars offered an easy way to load/store data from/to pytables HDF5 files!

@stinodego stinodego changed the title [Python] hdf5 reader/writer (with optional dependencies) Support HDF5 reading/writing (with optional dependencies) Mar 27, 2024
@stinodego stinodego added python Related to Python Polars accepted Ready for implementation labels Mar 27, 2024
@github-project-automation github-project-automation bot moved this to Ready in Backlog Mar 27, 2024
@stinodego
Member

Pandas to_hdf() uses pytables to store its dataframes: pytables.org
PyTables basic file format overview: pytables.org/usersguide/file_format.html

In one of my projects, we use a python stack with pandas doing all the DataFrame stuff. We're currently depending on an SQL database but want to migrate to pytables. It would be amazing if polars offered an easy way to load/store data from/to pytables HDF5 files!

Looks like pytables may be the way to go to support this on the Python side. That would probably be a good first step. We can look into Rust support later.
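
In the meantime, a minimal sketch of the pytables-backed route via pandas (assuming the file was written by pandas' to_hdf with key "df"):

```python
import pandas as pd
import polars as pl

# read the pytables-backed file with pandas, then hand it to polars
df = pl.from_pandas(pd.read_hdf("data.h5", key="df"))

# and back out again
df.to_pandas().to_hdf("out.h5", key="df", mode="w")
```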

@ritchie46
Member Author

Yes, we can start with pytables. For Rust support I first want to extend the plugin system with readers.

@galbwe

galbwe commented Apr 26, 2024

Hi, can I have a go at implementing this with pytables?

@stinodego
Member

@galbwe Definitely! You can take inspiration from read/scan/write_delta, which should be comparable in the sense that it is I/O functionality enabled by a third-party dependency on the Python side.

@timothyhutz

timothyhutz commented May 3, 2024

Python dev and newbie Rust dev here; I would like to try implementing the HDF5 crate and create a data-loading function.

https://github.com/aldanor/hdf5-rust

@timothyhutz

(hdf5-rust author here) […] One way to go would be to check what pandas does and do the same thing, so you can dump a dataframe from pandas and read it back from polars. Perhaps that's the easiest way to get started.

I'm going to work on adding this client crate into polars in my fork, when I'm ready I'll submit a PR.

@denis-bz

denis-bz commented May 10, 2024

@timothyhutz, I'd welcome a 1-page spec and a little testbench first.
Is that realistic -- what do you think?
(Pandas is deeply 2D; geopandas is pandas * 2 in size and complexity.
Afaik it's moving to PyArrow; see also nanoarrow --

This entire extension is currently experimental and awaiting use-cases that will drive future development.

Added: do you have a short list of max 5 use cases for people to vote on:
must-have, would-be-nice, too-complex?

@Ciobrelis

When could one expect the hdf5 implementation to be done?

@stinodego
Member

This does not have priority for us. A community contribution is welcome.

@stinodego stinodego added help wanted Extra attention is needed and removed good first issue Good for newcomers labels Jun 28, 2024
@ritchie46
Member Author

Actually, I am working on Polars IO plugins and this might be the first candidate.
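
As a rough illustration of what an HDF5 reader could look like as an IO plugin, here is a minimal sketch. It assumes polars' experimental register_io_source API and a file layout with one 1-D dataset per column under a group; the layout, group name, and scan_hdf5 helper are all hypothetical, not an existing polars feature:

```python
from typing import Iterator

import h5py
import polars as pl
from polars.io.plugins import register_io_source  # experimental API


def scan_hdf5(path: str, group: str = "data") -> pl.LazyFrame:
    # infer the schema from zero-length reads of each column dataset
    with h5py.File(path, "r") as f:
        empty = {name: f[group][name][:0] for name in f[group]}
    schema = pl.DataFrame(empty).schema

    def source(
        with_columns: list[str] | None,
        predicate: pl.Expr | None,
        n_rows: int | None,
        batch_size: int | None,
    ) -> Iterator[pl.DataFrame]:
        cols = with_columns or list(schema)
        with h5py.File(path, "r") as f:
            # honour projection and row-limit pushdown by slicing the datasets
            df = pl.DataFrame({c: f[group][c][:n_rows] for c in cols})
        if predicate is not None:
            df = df.filter(predicate)  # apply the pushed-down filter ourselves
        yield df

    return register_io_source(io_source=source, schema=schema)
```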

@ritchie46
Member Author

Ok, I did some investigation, and reading pandas-style hdf5 files is a no-go. Pandas style is just a glorified pickle. In the following example:

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["foo", "bar", None, "h"],
    "b": [None, True, False, True],
    "c": [1.0, None, 3.0, 4.0],
})
df.to_hdf("data.h5", key="df", mode="w")
```

Only column "c" was written in a sane way. Columns "a" and "b" were both pickled as an object type. No schema, nothing. We don't know what the original type was.

And going through python to unpickle and infer types is a no-go. It is super slow, ambiguous, and can execute arbitrary code (because of pickle).
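
Reading the file back makes the type loss visible:

```python
import pandas as pd

df = pd.read_hdf("data.h5", key="df")
print(df.dtypes)
# a     object   <- pickled Python objects; the original string type is gone
# b     object   <- same for the nullable booleans
# c    float64
```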

@JackKelly

For me, I'd mostly want to use Polars to read HDF5 (and NetCDF) files which were not created by Pandas. But maybe I'm an outlier?

@Ciobrelis

This does not have priority for us. A community contribution is welcome.

Unfortunately, I am not so advanced...

For me, I'd mostly want to use Polars to read HDF5 (and NetCDF) files which were not created by Pandas. But maybe I'm an outlier?

Cannot talk for the entire community, but I would also use it to read HDF5 files generated by sources other than Pandas.

@ritchie46
Member Author

Can you give me an example of what is in those files then? Numerical arrays? I'd like some examples/spec to know how to implement it.

Maybe share some files.

@JackKelly

JackKelly commented Aug 2, 2024

For me, I mostly work with weather predictions and satellite data, so I mostly work with NetCDF (which uses HDF5 under the hood, IIUC).

NOAA's amazing Open Data Dissemination programme (NODD) has released 59 petabytes of data on cloud object storage. Some of that data is NetCDF. For example, the GFS warm start data is netcdf. (The warm start dataset is the second dataset listed on that AWS page)
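
For files like those, a common bridge today is xarray (file and variable names hypothetical): flatten the dataset to a table, then hand it to polars.

```python
import xarray as xr
import polars as pl

ds = xr.open_dataset("sfc_data.nc")                    # NetCDF-4, i.e. HDF5 underneath
df = pl.from_pandas(ds.to_dataframe().reset_index())   # dims become ordinary columns
```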

@hugopendlebury

For me, I mostly work with weather predictions and satellite data, so I mostly work with NetCDF (which uses HDF5 under the hood, IIUC).

NOAA's amazing Open Data Dissemination programme (NODD) has released 59 petabytes of data on cloud object storage. Some of that data is NetCDF. For example, the GFS warm start data is netcdf. (The warm start dataset is the second dataset listed on that AWS page)

Interesting, since I also consume NOAA data but in GRIB format, which I download using Amazon. Is there any advantage to NetCDF? From what I've seen it's less flexible than GRIB but works better with tools like xarray (which are a bit restrictive and, by trying to create a cube of the data, can fabricate results, since it creates a cartesian product of the dimensions).

@ritchie46
Member Author

I tried to open one of those files in pytables, but it cannot open it.

```
tables.open_file("gdas.t00z.sfcf003.nemsio")

---------------------------------------------------------------------------
HDF5ExtError                              Traceback (most recent call last)
Cell In[6], line 1
----> 1 tables.open_file("gdas.t00z.sfcf003.nemsio")

File ~/miniconda3/lib/python3.10/site-packages/tables/file.py:294, in open_file(filename, mode, title, root_uep, filters, **kwargs)
    289             raise ValueError(
    290                 "The file '%s' is already opened.  Please "
    291                 "close it before reopening in write mode." % filename)
    293 # Finally, create the File instance, and return it
--> 294 return File(filename, mode, title, root_uep, filters, **kwargs)

File ~/miniconda3/lib/python3.10/site-packages/tables/file.py:744, in File.__init__(self, filename, mode, title, root_uep, filters, **kwargs)
    741 self.params = params
    743 # Now, it is time to initialize the File extension
--> 744 self._g_new(filename, mode, **params)
    746 # Check filters and set PyTables format version for new files.
    747 new = self._v_new

File ~/miniconda3/lib/python3.10/site-packages/tables/hdf5extension.pyx:512, in tables.hdf5extension.File._g_new()

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 836, in H5Fopen
    unable to synchronously open file
  File "H5F.c", line 796, in H5F__open_api_common
    unable to open file
  File "H5VLcallback.c", line 3863, in H5VL_file_open
    open failed
  File "H5VLcallback.c", line 3675, in H5VL__file_open
    open failed
  File "H5VLnative_file.c", line 128, in H5VL__native_file_open
    unable to open file
  File "H5Fint.c", line 2003, in H5F_open
    unable to read superblock
  File "H5Fsuper.c", line 392, in H5F__super_read
    file signature not found

End of HDF5 error back trace

Unable to open/create file 'gdas.t00z.sfcf003.nemsio'
```

I need a bit more handholding. I don't know anything about those formats, so I need a good cherry-picked example and the type of data you want out of it.

@JackKelly

No worries! I can try to help!

I think those .nemsio files aren't NetCDF (I'm actually not sure what the .nemsio extension means!)

Here are some files which definitely should be NetCDF: https://noaa-gfs-warmstart-pds.s3.amazonaws.com/index.html#gdas.20210102/00/RESTART/

@JackKelly

On the question of what data I'd want from these files...

TBH, I probably haven't tried to open a NetCDF file in Pandas for years. I pretty much exclusively use xarray for NetCDF files because all the NetCDF files I work with are multi-dimensional datasets with geospatial coordinates (like lat / lon) and I need to process these files in a tool which has good support for geospatial operations on n-dim arrays.

To be specific: I often work with "numerical weather predictions" (NWPs) which are dense, gridded weather forecasts produced by physical models of the weather, run on huge supercomputers. NWPs are dense n-dim arrays, where n can be as high as 7. The 7 dimensions are:

  • 3 spatial dimensions
  • 2 time dimensions (the initialisation time and the forecast timestep)
  • the physical quantity being forecast (e.g. temperature or wind speed)
  • the ensemble member

Which is all to say that, if I'm honest, I'm afraid I probably wouldn't see myself using Polars to process NWPs (because I need to handle up to 7 dimensions, and I need to process the spatial coords).

Or is the "grand plan" to build functionality into Polars that could enable something like "a Rust xarray" to be built on top of Polars, in a similar way to how Python's xarray uses Pandas?! (I think xarray mostly uses Pandas for its Index machinery.) I have no idea if that's even a sane path forwards, but I could be interested in helping to build an xarray in Rust (although we'd have to figure out whether it's a sane idea first!)
