Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDF5 read/write functionality #303

Merged
merged 27 commits into from
Jun 29, 2023
Merged

HDF5 read/write functionality #303

merged 27 commits into from
Jun 29, 2023

Conversation

williamjameshandley
Copy link
Collaborator

Description

This update implements a rudimentary method for reading/writing files to HDF5 files. The main strategy is to override the get, put and select functions of HDFStore so that in addition to the usual storage they also record the anesthetic type (Samples, NestedSamples, or MCMCSamples), alongside the _metadata which we have added. This is then sufficient to reconstruct the array from the HDF5 file. The to_hdf and read_hdf then simply ensure this new anesthetic.io.HDFStore is used in place of the usual pandas.io.pytables.HDFStore.

I have an interest in this for creating HDF5 databases of chains for fast upload/download/cacheing as part of unimpeded.

Checklist:

  • I have performed a self-review of my own code
  • My code is PEP8 compliant (flake8 anesthetic tests)
  • My code contains compliant docstrings (pydocstyle --convention=numpy anesthetic)
  • New and existing unit tests pass locally with my changes (python -m pytest)
  • I have added tests that prove my fix is effective or that my feature works
  • I have appropriately incremented the semantic version number in both README.rst and anesthetic/_version.py

@williamjameshandley williamjameshandley added this to the 2.1.0 milestone Jun 21, 2023
@codecov
Copy link

codecov bot commented Jun 21, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (61bed43) 100.00% compared to head (3c8fd73) 100.00%.

Additional details and impacted files
@@            Coverage Diff            @@
##            master      #303   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           30        32    +2     
  Lines         2691      2764   +73     
=========================================
+ Hits          2691      2764   +73     
Impacted Files Coverage Δ
anesthetic/__init__.py 100.00% <100.00%> (ø)
anesthetic/_version.py 100.00% <100.00%> (ø)
anesthetic/read/hdf.py 100.00% <100.00%> (ø)
anesthetic/samples.py 100.00% <100.00%> (ø)
anesthetic/testing.py 100.00% <100.00%> (ø)
anesthetic/utils.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@lukashergt lukashergt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have an interest in this for creating HDF5 databases of chains for fast upload/download/cacheing as part of unimpeded.

What is unimpeded?

How does this HDF5 implementation compare to pandas parquet, as documented in our Reading and writing section? How do speeds compare? How do file sizes compare?

Can we add a bullet point for the HDF5 implementation in our Reading and writing docs?

anesthetic/utils.py Outdated Show resolved Hide resolved
@williamjameshandley
Copy link
Collaborator Author

What is unimpeded?

slide 10

How does this HDF5 implementation compare to pandas parquet, as documented in our Reading and writing section? How do speeds compare? How do file sizes compare?

It looks like the parquet is actually broken:

samples = read_chains('./tests/example_data/pc')
samples.to_parquet('pc.parquet')
pd.read_parquet('pc.parquet') #fails with KeyError 'weights'

In terms of how this hdf5 functionality is an improvement:

  1. It reads whether it is a NestedSamples object, so it is self-describing in the right manner
  2. It does not strip out metadata ['_labels', 'label', 'root', '_beta']

@williamjameshandley
Copy link
Collaborator Author

OK, I've adjusted the docs to now recommend hdf5, which is now at least tested and working, and made a note to fix parquet at some point in the future.

anesthetic/io.py Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this file belongs in the read folder...?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in f432f69

Copy link
Collaborator Author

@williamjameshandley williamjameshandley Jun 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

although now since we have to_hdf and from_hdf, it should really be the io folder rather than the read folder...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make that change? That turns this into a 2.0.0 milestone, but that is what we are aiming for anyhow, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this file belong in the tests folder?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no -- in the same way that pandas.testing doesn't -- it's general anesthetic testing functionality which other users/derivatives may wish to use.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

Comment on lines 98 to 109
When loading in previously saved samples from hdf5, make sure to import the
``anesthetic.io.read_hdf`` function, and not the ``pandas.read_hdf`` version. If
you forget to do this, the samples will be read in as a ``DataFrame``, with a
consequent loss of functionality


* ``read_hdf``:

::

from pandas import read_parquet
from anesthetic import Samples # or MCMCSamples, or NestedSamples
samples = Samples(read_parquet("filename.parquet"))
from anesthetic.io import read_hdf
samples = read_hdf("filename.h5", "samples")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the interest of being similar to both read_chains and to pandas, I'd say it should be possible to import read_hdf directly from anesthetic, i.e.:

from anesthetic import read_hdf

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in 9426d0b

@williamjameshandley williamjameshandley merged commit 59af500 into master Jun 29, 2023
@williamjameshandley williamjameshandley deleted the hdf5_io branch June 29, 2023 21:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants