HDF5 read/write functionality #303
Codecov Report
Patch coverage:
Additional details and impacted files:
@@ Coverage Diff @@
## master #303 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 30 32 +2
Lines 2691 2764 +73
=========================================
+ Hits 2691 2764 +73
> I have an interest in this for creating HDF5 databases of chains for fast upload/download/caching as part of `unimpeded`.

What is `unimpeded`?
How does this HDF5 implementation compare to pandas parquet, as documented in our Reading and writing section? How do speeds compare? How do file sizes compare?
Can we add a bullet point for the HDF5 implementation in our Reading and writing docs?
It looks like the parquet is actually broken:

    samples = read_chains('./tests/example_data/pc')
    samples.to_parquet('pc.parquet')
    pd.read_parquet('pc.parquet')  # fails with KeyError 'weights'

In terms of how this hdf5 functionality is an improvement:
OK, I've adjusted the docs to now recommend hdf5, which is now at least tested and working, and made a note to fix parquet at some point in the future.
anesthetic/io.py
Outdated
I think this file belongs in the `read` folder...?
done in f432f69
although now since we have `to_hdf` and `from_hdf`, it should really be the `io` folder rather than the `read` folder...
Do we want to make that change? That turns this into a 2.0.0 milestone, but that is what we are aiming for anyhow, right?
anesthetic/testing.py
Outdated
Does this file belong in the `tests` folder?
no -- in the same way that `pandas.testing` doesn't -- it's general anesthetic testing functionality which other users/derivatives may wish to use.
Agreed.
docs/source/reading_writing.rst
Outdated
When loading in previously saved samples from hdf5, make sure to import the
``anesthetic.io.read_hdf`` function, and not the ``pandas.read_hdf`` version. If
you forget to do this, the samples will be read in as a ``DataFrame``, with a
consequent loss of functionality


* ``read_hdf``:

  ::

      from pandas import read_parquet
      from anesthetic import Samples  # or MCMCSamples, or NestedSamples
      samples = Samples(read_parquet("filename.parquet"))
      from anesthetic.io import read_hdf
      samples = read_hdf("filename.h5", "samples")
In the interest of being similar to both `read_chains` and to pandas, I'd say it should be possible to import `read_hdf` directly from `anesthetic`, i.e.:

    from anesthetic import read_hdf
done in 9426d0b
Description
This update implements a rudimentary method for reading from and writing to HDF5 files. The main strategy is to override the `get`, `put` and `select` functions of `HDFStore` so that in addition to the usual storage they also record the anesthetic type (`Samples`, `NestedSamples`, or `MCMCSamples`), alongside the `_metadata` which we have added. This is then sufficient to reconstruct the array from the HDF5 file. The `to_hdf` and `read_hdf` functions then simply ensure this new `anesthetic.io.HDFStore` is used in place of the usual `pandas.io.pytables.HDFStore`.

I have an interest in this for creating HDF5 databases of chains for fast upload/download/caching as part of `unimpeded`.

Checklist:
- `flake8 anesthetic tests`
- `pydocstyle --convention=numpy anesthetic`
- `python -m pytest`
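
For readers unfamiliar with the pattern in the description, here is a minimal standard-library sketch of the idea: a store subclass whose `put` and `get` record the object's type name and its `_metadata` attributes next to the usual data, so the original subclass can be rebuilt on load. This is not the code in this PR. `Table`, `WeightedTable`, `PlainStore` and `TypedStore` are hypothetical stand-ins (for `DataFrame`, an anesthetic samples class, `pandas.io.pytables.HDFStore` and `anesthetic.io.HDFStore` respectively), and a JSON file stands in for HDF5.

```python
import json
import tempfile
from pathlib import Path


class Table:
    """Hypothetical stand-in for a plain DataFrame: just rows."""

    _metadata = []  # names of extra attributes that should be persisted

    def __init__(self, rows):
        self.rows = rows


class WeightedTable(Table):
    """Hypothetical stand-in for a samples subclass with extra metadata."""

    _metadata = ["label"]

    def __init__(self, rows, label=None):
        super().__init__(rows)
        self.label = label


class PlainStore:
    """Stand-in for the base store: it saves and loads the rows only."""

    def __init__(self, path):
        self.path = Path(path)

    def _load(self):
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def put(self, key, value):
        data = self._load()
        data[key] = {"rows": value.rows}
        self.path.write_text(json.dumps(data))

    def get(self, key):
        # The base store knows nothing about subclasses or metadata.
        return Table(self._load()[key]["rows"])


class TypedStore(PlainStore):
    """Overrides put/get to also record the type name and its metadata."""

    types = {cls.__name__: cls for cls in (Table, WeightedTable)}

    def put(self, key, value):
        super().put(key, value)  # usual storage first
        data = self._load()
        data[key]["type"] = type(value).__name__
        data[key]["metadata"] = {n: getattr(value, n) for n in value._metadata}
        self.path.write_text(json.dumps(data))

    def get(self, key):
        entry = self._load()[key]
        cls = self.types[entry["type"]]  # look up the recorded class
        return cls(entry["rows"], **entry["metadata"])


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "store.json"
    original = WeightedTable([[0.1, 0.2], [0.3, 0.4]], label="run 1")
    TypedStore(path).put("samples", original)
    restored = TypedStore(path).get("samples")  # subclass and label recovered
    plain = PlainStore(path).get("samples")     # falls back to the base type
```

Reading the same file through the base store returns the plain type without the extra attribute, which mirrors the docs warning above about using `pandas.read_hdf` instead of the anesthetic reader.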