Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite quantify and dequantify #17

Merged
merged 38 commits into from
Jul 17, 2020
Merged

Conversation

keewis
Copy link
Collaborator

@keewis keewis commented Jul 11, 2020

Since we now have the utils from #11, quantify and dequantify can be extended to also support coordinates.

I'm also hoping to fix #9, either by changing the check in Variable.data (I need to think more about the implications of that) or by stating that this will load non-dask arrays into memory.

@keewis
Copy link
Collaborator Author

keewis commented Jul 13, 2020

I still have to update the docstrings, but other than that this should be ready for review.

The failing tests are due to a design question: should we raise for quantify if no unit was given (neither by args nor by attrs)? And what about dequantify is called on a object without quantities (maybe dequantify was already called)?

I have been answering both questions with "no", but I didn't think too much before I wrote the code in conversion. I could be convinced either way: we might want to make sure people know the operation didn't do anything, but we might also want to allow something like

obj = ds.pint.dequantify()
... # more code
ds.pint.dequantify()  # or always dequantify in a function, e.g. before saving to a file

@TomNicholas
Copy link
Member

a design question: should we raise for quantify if no unit was given (neither by args nor by attrs)?

I would say yes. There is a distinction between specifying a dimensionless unit and not specifying a unit. @jthielen how do the CF conventions store "dimensionless" under the .units attribute? If the user calls quantify it should be because they think there are units on the data which should now be interpreted, and they will want to know if that assumption was wrong.

And what about dequantify is called on a object without quantities (maybe dequantify was already called)?

Similarly I think yes. The user thought there were units on the data, but were mistaken, so should be informed. If they want to cover both possibilities then they perhaps they should use a try except?

I am imagining the end-goal use of this package as always assuming the user is following this pattern:

# Load data using xarray

# Quantify using pint-xarray

# Go about your xarray analysis business all as normal, invoking pint-xarray ideally as little as possible

# Perhaps dequantify if you need to save the data

This approach would also make round-tripping simpler.

@jthielen
Copy link
Collaborator

I think the design decision comes down to what quantify/dequantify are expected to "do"

  • add units to/remove units from your numeric data structure?
  • make your numeric data structures Pint Quantities/remove Pint Quantities from your numeric data structures without loss of information?

I've always imagined it to be the latter due to its ease of use, and that is exactly why I proposed the names "quantify" and "dequantify" in hgrecco/pint#849 (and implemented it this way in MetPy's accessor). My responses to the questions below assume this point-of-view. That being said, if it is instead decided upon to be the former, then I could definitely see the point in erroring on quantify without units or dequantify on a non-Quantity.

a design question: should we raise for quantify if no unit was given (neither by args nor by attrs)?

I would say yes. There is a distinction between specifying a dimensionless unit and not specifying a unit. @jthielen how do the CF conventions store "dimensionless" under the .units attribute? If the user calls quantify it should be because they think there are units on the data which should now be interpreted, and they will want to know if that assumption was wrong.

I was going to say no here, since in Pint's view there is no distinction between specifying a dimensionless unit and not specifying a unit...ureg.Quantity(x, "dimensionless"), ureg.Quantity(x, ""), and ureg.Quantity(x) all give you the same thing. So, if the .quantify() method is viewed as "wrap all my numeric data in Quantities", then variables without units should just become dimensionless quantities, since that's what Pint does without units.

The CF canonical unit for dimensionless is "1". However, in many netCDF files I've seen "out in the wild" that do not fully conform to the CF conventions, they represent dimensionless variables with an empty unit string or no units attribute at all. Since this is still pint-xarray and not cf-xarray, I think we should be less strict here and follow Pint's behavior for dimensionless variables.

And what about dequantify is called on a object without quantities (maybe dequantify was already called)?

Similarly I think yes. The user thought there were units on the data, but were mistaken, so should be informed. If they want to cover both possibilities then they perhaps they should use a try except?

[...]

Again, I would say no...that instead of erroring, dequantify should do nothing on an object without quantities. I'd imagine a not-too-uncommon situation is for a user to end up with a "mixed Dataset" with some Quantity and some non-Quantity variables. We'd want a safe way for them to get to a Dataset that can be written out, and the easiest way to do so is just dequantify the Quantity variables and leave the others alone without the user having to go through and only dequantify the Quantity ones.

@TomNicholas
Copy link
Member

Thanks for the detailed reply @jthielen . I think your points are much stronger than mine, and now think that might be the best way to go!

I was going to say no here, since in Pint's view there is no distinction between specifying a dimensionless unit and not specifying a unit...

I did not realise this.

Since this is still pint-xarray and not cf-xarray, I think we should be less strict here and follow Pint's behavior for dimensionless variables.

That's a good point. Quirks with CF conventions should be handled by CF-specific code instead of pre-empted.

Perhaps the better way to think about this (and what you seem to be proposing) is that .Quantify should map directly onto the __init__ method of ureg.Quantity, and follow similar behaviour.

@jthielen
Copy link
Collaborator

Perhaps the better way to think about this (and what you seem to be proposing) is that .Quantify should map directly onto the __init__ method of ureg.Quantity, and follow similar behaviour.

Indeed! That's a much clearer way of phrasing it.

@keewis
Copy link
Collaborator Author

keewis commented Jul 14, 2020

the update of the docstrings is done and I fixed / removed the failing tests.

Something else we should probably discuss is whether or not to remove the "units" attribute in quantify. If we decide to do so, this isn't difficult: we just have to toggle the delete parameter to extract_unit_attributes.

I'm leaning towards removing since that way the attributes cannot be out-of-sync, but we could also add a keyword argument to control that.

@dcherian
Copy link

Something else we should probably discuss is whether or not to remove the "units" attribute in quantify. If we decide to do so, this isn't difficult: we just have to toggle the delete parameter to extract_unit_attributes.

Yes 👍 The units are now on the array itself so I think this is sensible. And as you point out, it won't be out of sync. Also DataArray.units == DataArray.attrs["units"] until we change it in xarray so keeping attrs["units"] would be really confusing

@codecov
Copy link

codecov bot commented Jul 14, 2020

Codecov Report

Merging #17 into master will increase coverage by 4.50%.
The diff coverage is 94.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #17      +/-   ##
==========================================
+ Coverage   82.41%   86.91%   +4.50%     
==========================================
  Files           7        7              
  Lines         614      688      +74     
==========================================
+ Hits          506      598      +92     
+ Misses        108       90      -18     
Impacted Files Coverage Δ
pint_xarray/accessors.py 74.82% <88.37%> (+12.15%) ⬆️
pint_xarray/conversion.py 95.20% <92.85%> (-0.95%) ⬇️
pint_xarray/tests/test_accessors.py 98.22% <100.00%> (+0.35%) ⬆️
pint_xarray/tests/test_conversion.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c376dd...08b2c1e. Read the comment docs.

@keewis
Copy link
Collaborator Author

keewis commented Jul 14, 2020

@jthielen, does this address your concerns from #9?

@jthielen
Copy link
Collaborator

@jthielen, does this address your concerns from #9?

Yes it does! Thanks for adding that. While it would be nice to not have to worry about the issue in the first place, making MemoryCachedArray wrappable by Pint with xarray still able to handle that properly could create a big mess. I'd much rather just recommend using Dask.

@keewis
Copy link
Collaborator Author

keewis commented Jul 14, 2020

then this should be ready for review and merge

Copy link
Collaborator

@jthielen jthielen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implementation-wise, this looks good to me! I just have two documentation questions:

  • how should we clarify to users that non-index coordinates are being quantified, but index coordinates are not at this point?
  • should probably document removal of units attribute (see comments below)

dict-like, it should map a variable name to the desired
unit (use the DataArray's name to refer to its data). If
not provided, will try to read them from
``DataArray.attrs['units']`` using pint's parser.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps good to comment something to the effect of "this will remove the 'units' attribute from DataArray.attrs"?

Dataset. It should map variable names to units (unit names
or ``pint.Unit`` objects). If not provided, will try to
read them from ``Dataset[var].attrs['units']`` using
pint's parser.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise here.

@keewis
Copy link
Collaborator Author

keewis commented Jul 15, 2020

how should we clarify to users that non-index coordinates are being quantified, but index coordinates are not at this point?

I was thinking about this, but then I thought I'd leave that to the PR that adds pseudo-support for indexes. Rereading the code, this not what happens: right now, it tries to quantify every variable, even if it is a dimension variable. This is something we want in the future, but right now that would mean we discard the units attribute (and get a UnitStrippedWarning). I'll make sure the attribute is properly set/updated and mention that in the documentation.

should probably document removal of units attribute (see comments below)

same here, I'll update the documentation

@keewis
Copy link
Collaborator Author

keewis commented Jul 15, 2020

done, I think? I updated the docstrings and modified the attribute extraction function to never change attrs in-place. Instead, there's a new attribute strip function that returns a copy without the attributes.

Copy link
Collaborator

@jthielen jthielen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! Just one little pedantic comment that doesn't need to hold up merging if you don't want to.

pint_xarray/tests/test_conversion.py Outdated Show resolved Hide resolved
pint_xarray/tests/test_conversion.py Outdated Show resolved Hide resolved
pint_xarray/tests/test_conversion.py Outdated Show resolved Hide resolved
pint_xarray/tests/test_conversion.py Outdated Show resolved Hide resolved
pint_xarray/tests/test_conversion.py Outdated Show resolved Hide resolved
@keewis
Copy link
Collaborator Author

keewis commented Jul 17, 2020

that's a simple fix, so no problem. We have lots of these in xarray's tests, too (especially Dataset tests).

@keewis keewis merged commit 7d0096e into xarray-contrib:master Jul 17, 2020
@keewis keewis deleted the quantify branch July 17, 2020 16:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quantification loads data into memory
4 participants