-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read small integers as float32, not float64 #1840
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This generally looks okay. It needs some tests so please add those. In the future, this level of change is probably worth proposing with a issue first.
doc/whats-new.rst
Outdated
@@ -50,6 +50,9 @@ Enhancements | |||
- :py:func:`~plot.line()` learned to draw multiple lines if provided with a | |||
2D variable. | |||
By `Deepak Cherian <https://github.com/dcherian>`_. | |||
- Reduce memory usage when decoding a variable with a scale_factor, by | |||
converting 8-bit and 16-bit integers to float32 instead of float64. | |||
By `Zac Hatfield-Dodds <https://github.com/Zac-HD>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add this PR number as a reference.
xarray/coding/variables.py
Outdated
if data.dtype.itemsize <= 2 and \ | ||
np.issubdtype(data.dtype, np.integer) and \ | ||
'add_offset' not in attributes and \ | ||
2 ** -23 < float(attributes.get('scale_factor', 1)) < 2 ** 8: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if statement is fairly complicated and its not really clear what is happening, can you
- add some more comments
- use parentheses instead of line continuation breaks
- consider just passing in the scale factor instead of attributes
c4fbcea
to
f659398
Compare
Thanks - I was actually writing up an issue and decided it would be easier to demonstrate the proposed fix in a PR, but I'll open an issue first next time. The checkbox about flake8 could be removed from the issue template now - since #1824 we run flake8 on everything in CI so if tests pass flake8 is passing too. Re: tests: what do you (and @shoyer) think about using Hypothesis for some property-based tests of variable coding? "encoding then decoding is a no-op" is a classic property 😄 Upside, more powerful and better at finding edge cases; downside slower simply because it checks more cases (a configurable number). |
As mentioned in #1842, maybe we should also make a point not to upcast float32 input? One possible concern with changing precision from float64 -> float32 is that some reduce operations like mean could become due to lower precision. So it's a good think @fujiisoup wrote #1841 so we can specify With regards to Hypothesis: I haven't used it, but it does seem very intriguing. I'm sure that it could turn up quite a few bugs in xarray. Would it make sense to add it as an optional dependency to the test suite? Even if we use Hypothesis to cover more edge cases, I think we will still want normal test coverage for most behavior. |
1c3c759
to
77b7793
Compare
Added tests; float32 not upcast, float16 ("intended for storage of many floating-point values where higher precision is not needed, not for performing arithmetic") is upcast but only to float32. I'll open a new issue to add a basic suite of property-based tests to Xarray 😄 |
@@ -51,6 +51,10 @@ Enhancements | |||
- :py:func:`~plot.line()` learned to draw multiple lines if provided with a | |||
2D variable. | |||
By `Deepak Cherian <https://github.com/dcherian>`_. | |||
- Reduce memory usage when decoding a variable with a scale_factor, by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this will also effect encoding as well, e.g., writing float32 rather than float64 by default for scale offset encoded float32 data? Let's note that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
encoded = coder.encode(original) | ||
assert encoded.dtype == np.float32 | ||
roundtripped = coder.decode(encoded) | ||
assert_identical(original, roundtripped) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's also add an explicit check that the decoded values are float32 (I don't think assert_identical does that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
xarray/coding/variables.py
Outdated
@@ -212,11 +212,31 @@ class CFScaleOffsetCoder(VariableCoder): | |||
decode_values = encoded_values * scale_factor + add_offset | |||
""" | |||
|
|||
@staticmethod | |||
def _choose_float_dtype(data, has_offset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you make this helper function use dtype
as an argument instead of data
? That would make it clearer that the result dtype is not data dependent.
xarray/coding/variables.py
Outdated
@@ -212,11 +212,31 @@ class CFScaleOffsetCoder(VariableCoder): | |||
decode_values = encoded_values * scale_factor + add_offset | |||
""" | |||
|
|||
@staticmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a small nit, but I prefer not to use staticmethod
when a normal function will do. We actually say "Avoid @staticmethod
" in our internal Google style guide (which has gotten out of sync with the public one).
AKA the "I just wasted 4.6 TB of memory" patch.
77b7793
to
2e25e5e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. @jhamman any further concerns?
This all looks good now. I merged in master and resolved a small conflict in the docs. I'll merge after the tests pass. |
flake8 xarray
(now part of tests)whats-new.rst
for all changesMost satellites produce images with color depth in the range of eight to sixteen bits, which are therefore often stored as unsigned integers (with the quality mask in another variable). If you're lucky, they also have a
scale_factor
attribute and Xarray can automatically convert the integers to floats representing albedo.This is fantastically convenient, and avoids all the bit-depth bugs from misremembered specifications. However, loading data as float64 when float32 is sufficient doubles memory usage in IO (even on multi-TB datasets...). While immediately downcasting helps, it's no substitute for doing the right thing first.
So this patch does some conservative checks, and if we can be sure float32 is safe we use that instead.