
Drop object-dtyped variables and coords before saving #2134

Closed

Conversation

@michaelosthege (Contributor) commented Oct 11, 2022

Description

Variables or coordinates with object dtype often cause problems when saving.

In PyMC I'm refactoring stats to include a "warning" stat that is a custom object containing detailed information about sampler divergences.
This is, of course, not ideal, but a compromise until we can settle on a data structure that would be serializable.

For this to work, we need the "warning" stat to be automatically dropped when doing .to_netcdf(), which requires an ArviZ-level implementation.

Therefore, this PR adds helper functions to drop object-dtyped variables/coords before saving.

I believe this should give additional robustness against issues such as #2079.
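
As a rough illustration of the approach (a minimal sketch with a hypothetical helper name; the actual function added in this PR may be named and structured differently):

```python
import xarray as xr

def drop_object_items(ds: xr.Dataset) -> xr.Dataset:
    """Return ``ds`` without object-dtyped data variables and coordinates.

    Hypothetical illustration; the helper added by this PR may differ.
    """
    object_names = [
        name
        for name, var in {**ds.data_vars, **ds.coords}.items()
        if var.dtype.hasobject
    ]
    return ds.drop_vars(object_names)

# The cleaned dataset can then be written, e.g. drop_object_items(ds).to_netcdf(...)
```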

Checklist

  • Follows official PR format
  • New features are properly documented (with an example if appropriate)?
  • Includes new or updated tests to cover the new feature
  • Code style correct (follows pylint and black guidelines)
  • Changes are listed in changelog

📚 Documentation preview 📚: https://arviz--2134.org.readthedocs.build/en/2134/

@ahartikainen (Contributor) commented

I think this will also drop all text variables?

Inline review comment on the diff:

    coords[k] = v

    ndata = Dataset(
        data_vars=vars,

(Contributor): Will this copy all data?

@michaelosthege (Contributor Author) replied:

The Dataset has just three kwargs and I'm passing all of them.

(Member) replied:

I think the original question was about copying the data unnecessarily and wasting memory, not about more data than desired missing from the resulting dataset 🤔. If that is the concern, it won't copy the data: xarray stores the arrays in datasets without copying them.
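
For instance, a quick way to check this (illustrative snippet, not from the PR) is to verify that the Dataset shares memory with the original NumPy array:

```python
import numpy as np
import xarray as xr

arr = np.zeros((4, 100))
ds = xr.Dataset(data_vars={"x": (("chain", "draw"), arr)})
# The Dataset wraps the existing array instead of copying it:
print(np.shares_memory(ds["x"].values, arr))  # True
```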

@michaelosthege force-pushed the object-droppings branch 3 times, most recently from eb777f8 to ab07a7c on October 11, 2022 21:29
@michaelosthege marked this pull request as ready for review on October 11, 2022 21:31
@michaelosthege (Contributor Author) commented

> I think this will also drop all text variables?

With the last push I modified the coords dropping to include one str coordinate, and it's not dropped.

The CI should go green now too.

...ready to review =)

cc @OriolAbril

@OriolAbril (Member) left a comment

Looks good to me. I like adding helper things to InferenceData, and those generally have object dtype; now I won't need to delete them manually before saving.

(Review comment on arviz/tests/base_tests/test_data.py: outdated, resolved)
@codecov bot commented Oct 11, 2022

Codecov Report

Merging #2134 (a21c3a9) into main (b2503b5) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head a21c3a9 differs from pull request most recent head f6ac0a5. Consider uploading reports for the commit f6ac0a5 to get more accurate results

@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
+ Coverage   90.73%   90.74%   +0.01%     
==========================================
  Files         118      118              
  Lines       12566    12582      +16     
==========================================
+ Hits        11402    11418      +16     
  Misses       1164     1164              
Impacted Files                  Coverage Δ
arviz/data/inference_data.py    83.75% <100.00%> (+0.80%) ⬆️
arviz/data/io_pystan.py         96.61% <0.00%> (-0.28%) ⬇️


@michaelosthege changed the title from "WIP: Drop object-dtyped variables and coords before saving" to "Drop object-dtyped variables and coords before saving" on Oct 11, 2022
@ahartikainen (Contributor) commented

> With the last push I modified the coords dropping to include one str coordinate, and it's not dropped. [...]

Was the string coordinate saved as object?

e.g. if one saves one of the example datasets, will it drop coords?

@ahartikainen (Contributor) commented

I mean, object dtype is totally valid for strings, and there should not be any problems saving them to netcdf/zarr; it is just the compression that has problems with them, and current main does not try to compress them.

I think we probably should have some other way to handle these things.

Ideally we should not push invalid data to InferenceData and then try to fix it automatically on our end. What if the object-dtyped variable can be saved to netcdf by xarray, but we just remove it? Should users then have to save InferenceData objects manually?

@michaelosthege (Contributor Author) commented

I don't think object is valid for strings. See arviz-devs/arviz_example_data#8, where the "school" coordinate is now of <U16 dtype, which has .hasobject == False.

And arbitrary objects will always cause problems, even if the underlying netcdf/zlib libraries can sometimes convert some of them.
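
For reference, the distinction can be checked with NumPy's dtype.hasobject (illustrative snippet, not part of the PR):

```python
import numpy as np

schools = np.array(["Choate", "Deerfield"])            # fixed-width unicode
print(schools.dtype, schools.dtype.hasobject)          # <U9 False

as_objects = schools.astype(object)                    # object-dtyped strings
print(as_objects.dtype, as_objects.dtype.hasobject)    # object True
```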

@ahartikainen (Contributor) commented

> I don't think object is valid for strings. [...]

They are; see pydata/xarray#2059.

@ahartikainen (Contributor) commented

Also, to enable adding new strings to an existing array of strings, you basically need to use the object dtype.
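
Illustration of why (not part of the PR): assigning a longer string to a fixed-width unicode array silently truncates it, whereas an object array keeps the full string:

```python
import numpy as np

fixed = np.array(["aa", "bb"])        # dtype('<U2'): fixed-width unicode
fixed[0] = "abcdef"                   # silently truncated to 2 characters
print(fixed)                          # ['ab' 'bb']

flexible = np.array(["aa", "bb"], dtype=object)
flexible[0] = "abcdef"                # object arrays keep the full string
print(flexible)                       # ['abcdef' 'bb']
```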


@michaelosthege (Contributor Author) commented

> Also, to enable adding new strings to an existing array of strings, you basically need to use the object dtype.

Unfortunately it appears you're right.
When I updated the (non_)centered_eight InferenceData in arviz-devs/arviz_example_data#8, the "school" coord had an <U16 dtype before saving.
Loading, however, results in a dtype('O').

With that it seems almost impossible to tell what constitutes an "invalid" data type before attempting to save it.

Also, it appears that some tests are now failing on the updated example data, so this update should probably be done in a separate PR.
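
A minimal sketch of the round trip described above (illustrative only; the exact dtypes depend on the xarray version and netCDF engine):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"school": np.array(["Choate", "Deerfield"])})
print(ds["school"].dtype)             # <U9 before saving

ds.to_netcdf("schools.nc")
with xr.open_dataset("schools.nc") as loaded:
    print(loaded["school"].dtype)     # often object ('O') after loading
```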

@ahartikainen (Contributor) commented

Part of the problem comes from numpy + string support. Not sure if there are going to be any fixes.

I'm not sure if there is a way for us to test serializing the first item for each variable (with or without compression) and then dropping invalid items.
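
A rough sketch of that per-variable probing idea (purely hypothetical, and as the discussion below concludes, probably too heavy-handed):

```python
import os
import tempfile
import xarray as xr

def probe_serializable_vars(ds: xr.Dataset) -> list:
    """Hypothetical: try writing the first item of each variable to netCDF."""
    ok = []
    for name in ds.data_vars:
        # Take only the first element along every dimension of the variable.
        first_item = ds[[name]].isel({dim: 0 for dim in ds[name].dims})
        with tempfile.TemporaryDirectory() as tmpdir:
            try:
                first_item.to_netcdf(os.path.join(tmpdir, "probe.nc"))
                ok.append(name)
            except Exception:
                pass  # the variable (or one of its coords) is not serializable
    return ok
```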

@OriolAbril (Member) commented

I don't think there is any reasonable way forward, given that there is no round-trip guarantee when it comes to dtypes. In my opinion, scanning variables with try/except to check which can be saved and which can't is way too much work, and not the right place for such code: we save datasets in bulk, so going variable by variable and coordinate by coordinate is much more work and might need interfacing with netCDF directly.

I think we should close this PR and try to get logging into xarray so users know which variables and/or coordinates are the ones breaking the write process.

@michaelosthege (Contributor Author) commented

Yes, I have made my PyMC changes independent of this, and more informative errors at the xarray level sound great.
