HDF5 read/write functionality #303

williamjameshandley · 2023-06-21T15:19:22Z

Description

This update implements a rudimentary method for reading and writing samples to and from HDF5 files. The main strategy is to override the get, put and select methods of HDFStore so that, in addition to the usual storage, they also record the anesthetic type (Samples, NestedSamples, or MCMCSamples) alongside the _metadata attributes which we have added. This is then sufficient to reconstruct the original object from the HDF5 file. The to_hdf and read_hdf functions then simply ensure that this new anesthetic.io.HDFStore is used in place of the usual pandas.io.pytables.HDFStore.
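
As a rough illustration of the strategy described above (not the code in this PR), here is a minimal pure-Python sketch. Everything in it is an assumption made for illustration: `DictStore` is a hypothetical in-memory stand-in for pandas.io.pytables.HDFStore, the `Samples` and `NestedSamples` classes are toy list subclasses rather than the anesthetic ones, and `TypedStore` is a made-up name. Only put and get are shown; select would follow the same pattern.

```python
class DictStore:
    """Hypothetical in-memory stand-in for pandas.io.pytables.HDFStore."""

    def __init__(self):
        self._values = {}  # key -> plain payload (plays the role of the table)
        self._attrs = {}   # key -> attributes stored alongside the payload

    def put(self, key, value):
        # The base store keeps only the raw data: type and metadata are lost.
        self._values[key] = list(value)
        self._attrs[key] = {}

    def get(self, key):
        return list(self._values[key])


class Samples(list):
    """Toy stand-in for a samples object (assumption, not anesthetic's)."""

    _metadata = ['label']

    def __init__(self, data=(), label=None):
        super().__init__(data)
        self.label = label


class NestedSamples(Samples):
    """Toy subclass carrying one extra piece of metadata (assumption)."""

    _metadata = Samples._metadata + ['root']

    def __init__(self, data=(), label=None, root=None):
        super().__init__(data, label=label)
        self.root = root


class TypedStore(DictStore):
    """Override put/get so the type name and metadata travel with the data."""

    known_types = {cls.__name__: cls for cls in (Samples, NestedSamples)}

    def put(self, key, value):
        super().put(key, value)  # the usual storage
        self._attrs[key]['type'] = type(value).__name__
        self._attrs[key]['metadata'] = {
            name: getattr(value, name) for name in value._metadata
        }

    def get(self, key):
        raw = super().get(key)  # comes back as a plain list
        attrs = self._attrs[key]
        obj = self.known_types[attrs['type']](raw)  # rebuild the right class
        for name, val in attrs['metadata'].items():
            setattr(obj, name, val)  # restore the recorded metadata
        return obj


store = TypedStore()
store.put('samples', NestedSamples([1.0, 2.0], label='pc', root='./chains/pc'))
restored = store.get('samples')
```

Reading through the base class alone (`DictStore.get(store, 'samples')`) returns a plain list, which mirrors what happens when a file is read with the unmodified store: the data survive, but the type and metadata do not.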

I have an interest in this for creating HDF5 databases of chains for fast upload/download/caching as part of unimpeded.

Checklist:

I have performed a self-review of my own code
My code is PEP8 compliant (flake8 anesthetic tests)
My code contains compliant docstrings (pydocstyle --convention=numpy anesthetic)
New and existing unit tests pass locally with my changes (python -m pytest)
I have added tests that prove my fix is effective or that my feature works
I have appropriately incremented the semantic version number in both README.rst and anesthetic/_version.py

codecov · 2023-06-21T15:30:50Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (61bed43) 100.00% compared to head (3c8fd73) 100.00%.

Additional details and impacted files

@@            Coverage Diff            @@
##            master      #303   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           30        32    +2     
  Lines         2691      2764   +73     
=========================================
+ Hits          2691      2764   +73

Impacted Files	Coverage Δ
anesthetic/__init__.py	`100.00% <100.00%> (ø)`
anesthetic/_version.py	`100.00% <100.00%> (ø)`
anesthetic/read/hdf.py	`100.00% <100.00%> (ø)`
anesthetic/samples.py	`100.00% <100.00%> (ø)`
anesthetic/testing.py	`100.00% <100.00%> (ø)`
anesthetic/utils.py	`100.00% <100.00%> (ø)`

lukashergt

> I have an interest in this for creating HDF5 databases of chains for fast upload/download/caching as part of unimpeded.

What is unimpeded?

How does this HDF5 implementation compare to pandas parquet, as documented in our Reading and writing section? How do speeds compare? How do file sizes compare?

Can we add a bullet point for the HDF5 implementation in our Reading and writing docs?

williamjameshandley · 2023-06-23T17:22:18Z

> What is unimpeded?

slide 10

> How does this HDF5 implementation compare to pandas parquet, as documented in our Reading and writing section? How do speeds compare? How do file sizes compare?

It looks like the parquet route is actually broken:

import pandas as pd
from anesthetic import read_chains
samples = read_chains('./tests/example_data/pc')
samples.to_parquet('pc.parquet')
pd.read_parquet('pc.parquet')  # fails with KeyError: 'weights'

In terms of how this hdf5 functionality is an improvement:

It records the anesthetic type (e.g. NestedSamples) and restores it on reading, so the file is self-describing.
It does not strip out the metadata (['_labels', 'label', 'root', '_beta']).

williamjameshandley · 2023-06-28T22:52:41Z

OK, I've adjusted the docs to recommend hdf5, which is at least tested and working, and made a note to fix parquet at some point in the future.

lukashergt · 2023-06-29T17:26:29Z

anesthetic/io.py

I think this file belongs in the read folder...?

done in f432f69

although now that we have both to_hdf and read_hdf, it should really be the io folder rather than the read folder...

Do we want to make that change? That turns this into a 2.0.0 milestone, but that is what we are aiming for anyhow, right?

lukashergt · 2023-06-29T17:28:30Z

anesthetic/testing.py

Does this file belong in the tests folder?

No, in the same way that pandas.testing doesn't: it's general anesthetic testing functionality which other users/derivatives may wish to use.

lukashergt · 2023-06-29T18:22:48Z

docs/source/reading_writing.rst

+When loading in previously saved samples from hdf5, make sure to import the
+``anesthetic.io.read_hdf`` function, and not the ``pandas.read_hdf`` version. If
+you forget to do this, the samples will be read in as a ``DataFrame``, with a
+consequent loss of functionality
+
+
+* ``read_hdf``:

  ::

-      from pandas import read_parquet
-      from anesthetic import Samples  # or MCMCSamples, or NestedSamples
-      samples = Samples(read_parquet("filename.parquet"))
+      from anesthetic.io import read_hdf
+      samples = read_hdf("filename.h5", "samples")


In the interest of being similar to both read_chains and to pandas, I'd say it should be possible to import read_hdf directly from anesthetic, i.e.:

from anesthetic import read_hdf

done in 9426d0b

williamjameshandley added 6 commits June 21, 2023 13:05

Added example hdf5 reader/writer

3d8abb4

HDF5 functionality and tests

f20f0fb

Bumped version number

89ec30e

Updated pyproject.toml

247a5fa

Removed typing for python3.8

309d4ad

Lint corrections

d6c90af

williamjameshandley added this to the 2.1.0 milestone Jun 21, 2023

williamjameshandley added 2 commits June 21, 2023 17:01

Windows corrections

09b4d01

Fixed problems with pandas documentation

0d051a8

williamjameshandley requested a review from lukashergt June 21, 2023 18:53

Actually now making read_hdf documentation

40efc04

lukashergt requested changes Jun 22, 2023

anesthetic/utils.py (outdated, resolved)

Removed uncovered AttributeError

1f31c0d

williamjameshandley mentioned this pull request Jun 28, 2023

New pandas.DataFrame.attrs feature possibly useful for metadata such as root or label #236

Open

Updated the documentation

bed2ad6

williamjameshandley mentioned this pull request Jun 28, 2023

Fix parquet functionality #307

Open

williamjameshandley requested a review from lukashergt June 28, 2023 22:53

williamjameshandley added 4 commits June 29, 2023 09:18

Merge branch 'master' into hdf5_io

df96bd8

bumped version

a473b94

Merge branch 'master' into hdf5_io

0a37c47

Version bump

bb2ef05

lukashergt reviewed Jun 29, 2023

williamjameshandley added 5 commits June 29, 2023 20:39

Import read_hdf

9426d0b

Updated docs for new location

db6bc25

Moved hdf functionality to anesthetic.read

f432f69

Updated documentation

8747ccf

Merge branch 'master' into hdf5_io

8643387

williamjameshandley and others added 7 commits June 29, 2023 21:37

Bumped version

aa86347

moved circular import back out again

61e1f80

Removed inheritance from anesthetic

59008e6

fix hdf5 docstring thing

03e7ee4

Adjusted pandas.hdf to anesthetic.read_hdf

c0acee4

Merge branch 'hdf5_io' of github.com:handley-lab/anesthetic into hdf5_io

ac6aea0

Added back in class

3c8fd73

lukashergt approved these changes Jun 29, 2023

williamjameshandley merged commit 59af500 into master Jun 29, 2023

williamjameshandley deleted the hdf5_io branch June 29, 2023 21:25

lukashergt mentioned this pull request Jun 29, 2023

Make tests work without hdf5 #308

Closed

AdamOrmondroyd mentioned this pull request Jun 30, 2023

remove test_hdf5.h5 after test #309

Merged
