
Drop object-dtyped variables and coords before saving #2134

Closed

Conversation

@michaelosthege (Contributor) commented Oct 11, 2022

Description

Variables or coordinates with object dtype often cause problems when saving.

In PyMC I'm refactoring stats to include a "warning" stat that is a custom object containing detailed information about sampler divergences.
This is, of course, not ideal, but a compromise until we can settle on a data structure that would be serializable.

For this to work, we need the "warning" stat to be automatically dropped when doing .to_netcdf(), which requires an ArviZ-level implementation.

Therefore, this PR adds helper functions to drop object-dtyped variables/coords before saving.

I believe this should give additional robustness against issues such as #2079.
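
As a rough illustration of the approach (a minimal sketch with a hypothetical helper name; the actual function added in this PR may be named and structured differently):

```python
import xarray as xr

def drop_object_items(ds: xr.Dataset) -> xr.Dataset:
    """Return ``ds`` without object-dtyped data variables and coordinates.

    Hypothetical illustration; the helper added by this PR may differ.
    """
    object_names = [
        name
        for name, var in {**ds.data_vars, **ds.coords}.items()
        if var.dtype.hasobject
    ]
    return ds.drop_vars(object_names)

# The cleaned dataset can then be written, e.g. drop_object_items(ds).to_netcdf(...)
```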

Checklist

  • Follows official PR format
  • New features are properly documented (with an example if appropriate)?
  • Includes new or updated tests to cover the new feature
  • Code style correct (follows pylint and black guidelines)
  • Changes are listed in changelog

📚 Documentation preview 📚: https://arviz--2134.org.readthedocs.build/en/2134/

@ahartikainen (Contributor) commented

I think this will also drop all text variables?

Inline review comment on the diff:

    coords[k] = v

    ndata = Dataset(
        data_vars=vars,

(Contributor): Will this copy all data?

@michaelosthege (Contributor Author) replied:

The Dataset has just three kwargs and I'm passing all of them.

(Member) replied:

I think the original question was about copying the data unnecessarily and wasting memory, not about more data than desired missing from the resulting dataset 🤔. If that is the concern, it won't copy the data: xarray stores the arrays in datasets without copying them.
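
For instance, a quick way to check this (illustrative snippet, not from the PR) is to verify that the Dataset shares memory with the original NumPy array:

```python
import numpy as np
import xarray as xr

arr = np.zeros((4, 100))
ds = xr.Dataset(data_vars={"x": (("chain", "draw"), arr)})
# The Dataset wraps the existing array instead of copying it:
print(np.shares_memory(ds["x"].values, arr))  # True
```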

@michaelosthege force-pushed the object-droppings branch 3 times, most recently from eb777f8 to ab07a7c on October 11, 2022 21:29
@michaelosthege marked this pull request as ready for review on October 11, 2022 21:31
@michaelosthege (Contributor Author) commented

> I think this will also drop all text variables?

With the last push I modified the coords dropping to include one str coordinate, and it's not dropped.

The CI should go green now too.

...ready to review =)

cc @OriolAbril

@OriolAbril (Member) left a comment

Looks good to me. I like adding helper things to InferenceData, and those generally have object dtype; now I won't need to delete them manually before saving.

(Review comment on arviz/tests/base_tests/test_data.py: outdated, resolved)
@codecov bot commented Oct 11, 2022

Codecov Report

Merging #2134 (a21c3a9) into main (b2503b5) will increase coverage by 0.01%.
The diff coverage is 100.00%.

❗ Current head a21c3a9 differs from pull request most recent head f6ac0a5. Consider uploading reports for the commit f6ac0a5 to get more accurate results

@@            Coverage Diff             @@
##             main    #2134      +/-   ##
==========================================
+ Coverage   90.73%   90.74%   +0.01%     
==========================================
  Files         118      118              
  Lines       12566    12582      +16     
==========================================
+ Hits        11402    11418      +16     
  Misses       1164     1164              
Impacted Files                  Coverage Δ
arviz/data/inference_data.py    83.75% <100.00%> (+0.80%) ⬆️
arviz/data/io_pystan.py         96.61% <0.00%> (-0.28%) ⬇️


@michaelosthege changed the title from "WIP: Drop object-dtyped variables and coords before saving" to "Drop object-dtyped variables and coords before saving" on Oct 11, 2022
@ahartikainen (Contributor) commented

> With the last push I modified the coords dropping to include one str coordinate, and it's not dropped. [...]

Was the string coordinate saved as object?

e.g. if one saves one of the example datasets, will it drop coords?

@ahartikainen (Contributor) commented

I mean, object dtype is totally valid for strings, and there should not be any problems saving them to netcdf/zarr; it is just the compression that has problems with them, and current main does not try to compress them.

I think we probably should have some other way to handle these things.

Ideally we should not push invalid data to InferenceData and then try to fix it automatically on our end. What if the object-dtyped variable can be saved to netcdf by xarray, but we just remove it? Should users then have to save InferenceData objects manually?

@michaelosthege (Contributor Author) commented

I don't think object is valid for strings. See arviz-devs/arviz_example_data#8, where the "school" coordinate is now of <U16 dtype, which has .hasobject == False.

And arbitrary objects will always cause problems, even if the underlying netcdf/zlib libraries can sometimes convert some of them.
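
For reference, the distinction can be checked with NumPy's dtype.hasobject (illustrative snippet, not part of the PR):

```python
import numpy as np

schools = np.array(["Choate", "Deerfield"])            # fixed-width unicode
print(schools.dtype, schools.dtype.hasobject)          # <U9 False

as_objects = schools.astype(object)                    # object-dtyped strings
print(as_objects.dtype, as_objects.dtype.hasobject)    # object True
```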

@ahartikainen (Contributor) commented

> I don't think object is valid for strings. [...]

They are; see pydata/xarray#2059.

@ahartikainen (Contributor) commented

Also, to enable adding new strings to an existing array of strings, you basically need to use the object dtype.
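
Illustration of why (not part of the PR): assigning a longer string to a fixed-width unicode array silently truncates it, whereas an object array keeps the full string:

```python
import numpy as np

fixed = np.array(["aa", "bb"])        # dtype('<U2'): fixed-width unicode
fixed[0] = "abcdef"                   # silently truncated to 2 characters
print(fixed)                          # ['ab' 'bb']

flexible = np.array(["aa", "bb"], dtype=object)
flexible[0] = "abcdef"                # object arrays keep the full string
print(flexible)                       # ['abcdef' 'bb']
```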


@michaelosthege (Contributor Author) commented

> Also, to enable adding new strings to an existing array of strings, you basically need to use the object dtype.

Unfortunately it appears you're right.
When I updated the (non_)centered_eight InferenceData in arviz-devs/arviz_example_data#8, the "school" coord had an <U16 dtype before saving.
Loading, however, results in a dtype('O').

With that it seems almost impossible to tell what constitutes an "invalid" data type before attempting to save it.

Also, it appears that some tests are now failing on the updated example data, so this update should probably be done in a separate PR.
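
A minimal sketch of the round trip described above (illustrative only; the exact dtypes depend on the xarray version and netCDF engine):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(coords={"school": np.array(["Choate", "Deerfield"])})
print(ds["school"].dtype)             # <U9 before saving

ds.to_netcdf("schools.nc")
with xr.open_dataset("schools.nc") as loaded:
    print(loaded["school"].dtype)     # often object ('O') after loading
```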

@ahartikainen (Contributor) commented

Part of the problem comes from numpy + string support. Not sure if there are going to be any fixes.

I'm not sure if there is a way for us to test serializing the first item for each variable (with or without compression) and then dropping invalid items.
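
A rough sketch of that per-variable probing idea (purely hypothetical, and as the discussion below concludes, probably too heavy-handed):

```python
import os
import tempfile
import xarray as xr

def probe_serializable_vars(ds: xr.Dataset) -> list:
    """Hypothetical: try writing the first item of each variable to netCDF."""
    ok = []
    for name in ds.data_vars:
        # Take only the first element along every dimension of the variable.
        first_item = ds[[name]].isel({dim: 0 for dim in ds[name].dims})
        with tempfile.TemporaryDirectory() as tmpdir:
            try:
                first_item.to_netcdf(os.path.join(tmpdir, "probe.nc"))
                ok.append(name)
            except Exception:
                pass  # the variable (or one of its coords) is not serializable
    return ok
```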

@OriolAbril (Member) commented

I don't think there is any reasonable way forward, given that there is no round-trip guarantee when it comes to dtypes. In my opinion, scanning variables with try/except to check which can be saved and which can't is way too much work, and not the right place for such code: we save datasets in bulk, so going variable by variable and coordinate by coordinate is much more work and might need interfacing with netCDF directly.

I think we should close this PR and try to get logging into xarray so users know which variables and/or coordinates are the ones breaking the write process.

@michaelosthege (Contributor Author) commented

Yes, I have made my PyMC changes independent of this, and more informative errors at the xarray level sound great.
