Drop object-dtyped variables and coords before saving #2134
Conversation
I think this will also drop all text variables?
arviz/data/utils.py (Outdated)

    coords[k] = v

    ndata = Dataset(
        data_vars=vars,
Will this copy all data?
The `Dataset` has just three kwargs and I'm passing all of them.
I think the original question was about copying the data unnecessarily and wasting memory, not about more data than desired missing from the resulting dataset 🤔. If that is the concern, it won't copy the data: xarray stores the arrays in datasets without copying them.
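For what it's worth, a quick sketch (mine, not from the PR) showing that xarray wraps the underlying numpy array rather than copying it:

```python
import numpy as np
import xarray as xr

arr = np.zeros(3)
ds = xr.Dataset(data_vars={"x": ("dim", arr)})

arr[0] = 1.0  # mutate the original array after constructing the Dataset
print(ds["x"].values[0])  # 1.0 -> the Dataset sees the change, so no copy was made
```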
With the last push I modified the coords dropping. The CI should go green now too. ...ready to review =) cc @OriolAbril
Looks good to me. I like it: I add helper things to InferenceData that generally have object dtype, and now I won't need to delete them manually before saving.
Codecov Report

```
@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
+ Coverage   90.73%   90.74%   +0.01%
==========================================
  Files         118      118
  Lines       12566    12582      +16
==========================================
+ Hits        11402    11418      +16
  Misses       1164     1164
```
Was the string coordinate saved as object? e.g. if one saves one of the example datasets, will it drop coords?
I mean, that object dtype is totally valid for strings and there should not be any problems saving them to netcdf/zarr; it is just the compression that has problems with them, and current main does not try to compress them. I think we should probably have some other way to handle these things. Ideally we should not push invalid data into InferenceData and then try to automatically fix it on our end. What if the object dtype used can be saved to netcdf by xarray but we just remove it? Should users now have to start saving InferenceData objects manually?
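To illustrate the point about strings (my own sketch, with a made-up coordinate and filename): an object-dtyped string coordinate writes to netCDF fine as long as no compression is requested.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"y": ("school", np.random.randn(2))},
    coords={"school": np.array(["choate", "deerfield"], dtype=object)},
)
ds.to_netcdf("schools.nc")  # works: the strings are written as variable-length text
```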
I don't think [...]. And arbitrary [...]
They are [...]
Also, to enable adding strings to a particular array of strings, you basically need to use object dtype.
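For illustration (my own sketch, not from the thread): numpy's fixed-width string dtypes silently truncate on assignment, which is why object dtype ends up being used for mutable string arrays.

```python
import numpy as np

fixed = np.array(["a", "b"])                   # fixed-width dtype '<U1'
fixed[0] = "longer"                            # silently truncated to "l"

flexible = np.array(["a", "b"], dtype=object)  # holds Python str objects
flexible[0] = "longer"                         # keeps the full string
```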
Unfortunately it appears you're right. With that, it seems almost impossible to tell what constitutes an "invalid" data type before attempting to save it. Also, it appears that some tests are now failing on the updated example data, so that update should probably be done in a separate PR.
Part of the problem comes from numpy + string support; not sure if there are going to be any fixes. I'm not sure if there is a way for us to test serializing the first item of each variable (with or without compression) and then drop the invalid items.
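Something like the following is what that probing idea could look like; `drop_unserializable` is a hypothetical helper, not ArviZ API, and the in-memory probe uses xarray's netCDF3 writer, which may reject dtypes that a real netCDF4 file would accept:

```python
import xarray as xr

def drop_unserializable(ds: xr.Dataset) -> xr.Dataset:
    """Drop data variables whose first element cannot be serialized."""
    bad = []
    for name in ds.data_vars:
        # Probe with a one-element slice so the test write stays cheap.
        probe = ds[[name]].isel({dim: 0 for dim in ds[name].dims})
        try:
            probe.to_netcdf()  # no path: returns bytes instead of touching disk
        except (TypeError, ValueError):
            bad.append(name)
    return ds.drop_vars(bad)
```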
I don't think there is any reasonable way forward, given that there is no guarantee of a round trip when it comes to dtypes. In my opinion, scanning variables with try/except to check which can be saved and which can't is way too much work, and not the right place for such code; we save datasets in bulk, so going variable by variable and coordinate by coordinate is much more work and might need interfacing with netcdf directly. I think we should close this PR and try to get logging into xarray so users know which variables and/or coordinates are the ones breaking the write process.
Yes, I have made my PyMC changes independent of this, and more informative errors at the xarray level sound great.
Description
Variables or coordinates with `object` dtype often cause problems when saving. In PyMC I'm refactoring stats to include a `"warning"` stat that is a custom object containing detailed information about sampler divergences. This is, of course, not ideal, but a compromise until we can settle on a data structure that would be serializable. For this to work, we need the `"warning"` stat to be automatically dropped when doing `.to_netcdf()`, but this is an ArviZ-level implementation. Therefore, this PR adds helper functions to drop `object`-dtyped variables/coords before saving. I believe this should give additional robustness against issues such as #2079.
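As a rough sketch of the intended behavior (the name `drop_object_vars` is hypothetical; see the PR diff for the actual helpers):

```python
import numpy as np
import xarray as xr

def drop_object_vars(ds: xr.Dataset) -> xr.Dataset:
    """Return ``ds`` without its object-dtyped variables and coordinates."""
    drop = [name for name, var in ds.variables.items() if var.dtype == object]
    return ds.drop_vars(drop)

ds = xr.Dataset(
    data_vars={
        "mu": ("chain", np.arange(4.0)),
        "warning": ("chain", np.array([None] * 4, dtype=object)),
    },
)
drop_object_vars(ds).to_netcdf("trace.nc")  # saves cleanly; "warning" is gone
```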
Checklist
📚 Documentation preview 📚: https://arviz--2134.org.readthedocs.build/en/2134/