
Read small integers as float32, not float64 #1840

Merged · 3 commits · Jan 23, 2018

Conversation

@Zac-HD (Contributor) commented Jan 19, 2018

Most satellites produce images with color depth in the range of eight to sixteen bits, which are therefore often stored as unsigned integers (with the quality mask in another variable). If you're lucky, they also have a scale_factor attribute and Xarray can automatically convert the integers to floats representing albedo.

This is fantastically convenient, and avoids all the bit-depth bugs from misremembered specifications. However, loading data as float64 when float32 is sufficient doubles memory usage in IO (even on multi-TB datasets...). While immediately downcasting helps, it's no substitute for doing the right thing first.

So this patch does some conservative checks, and if we can be sure float32 is safe we use that instead.
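The memory argument is easy to see in plain NumPy (a sketch, not xarray code — the array below stands in for a packed satellite band):

```python
import numpy as np

# A packed 16-bit band, as it might come off disk (sketch data, all zeros).
packed = np.zeros((4096, 4096), dtype=np.uint16)

# Decoding to float64 vs. float32: same values, double the memory.
as_f64 = packed.astype(np.float64) * 0.001
as_f32 = packed.astype(np.float32) * 0.001

assert as_f32.dtype == np.float32          # float32 survives the scaling
assert as_f64.nbytes == 2 * as_f32.nbytes  # float64 costs twice the RAM
```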

@jhamman (Member) left a comment

This generally looks okay, but it needs some tests, so please add those. In the future, a change of this scope is probably worth proposing in an issue first.

@@ -50,6 +50,9 @@ Enhancements
- :py:func:`~plot.line()` learned to draw multiple lines if provided with a
2D variable.
By `Deepak Cherian <https://github.com/dcherian>`_.
- Reduce memory usage when decoding a variable with a scale_factor, by
converting 8-bit and 16-bit integers to float32 instead of float64.
By `Zac Hatfield-Dodds <https://github.com/Zac-HD>`_.
Member review comment:
Add this PR number as a reference.

if data.dtype.itemsize <= 2 and \
np.issubdtype(data.dtype, np.integer) and \
'add_offset' not in attributes and \
2 ** -23 < float(attributes.get('scale_factor', 1)) < 2 ** 8:
Member review comment:

This if statement is fairly complicated and it's not really clear what is happening. Can you:

  1. add some more comments
  2. use parentheses instead of line continuation breaks
  3. consider just passing in the scale factor instead of attributes

@Zac-HD force-pushed the efficient-scaling branch from c4fbcea to f659398 on January 19, 2018 at 06:51
@Zac-HD (Contributor, Author) commented Jan 19, 2018

Thanks - I was actually writing up an issue and decided it would be easier to demonstrate the proposed fix in a PR, but I'll open an issue first next time.

The checkbox about flake8 could be removed from the issue template now - since #1824 we run flake8 on everything in CI so if tests pass flake8 is passing too.

Re: tests — what do you (and @shoyer) think about using Hypothesis for some property-based tests of variable coding? "Encoding then decoding is a no-op" is a classic property 😄 Upside: more powerful and better at finding edge cases; downside: slower, simply because it checks more cases (a configurable number).
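For concreteness, the roundtrip property reads like this when spelled out by hand; Hypothesis would generate the inputs rather than the fixed grid below (plain NumPy, not xarray's actual coder API):

```python
import numpy as np

SCALE = np.float32(0.01)

def encode(values):
    # Pack floats back into uint16 by dividing out the scale factor.
    return np.round(values / SCALE).astype(np.uint16)

def decode(packed):
    return packed.astype(np.float32) * SCALE

# "Encoding then decoding is a no-op" over every representable value.
packed = np.arange(2 ** 16, dtype=np.uint16)
original = decode(packed)
assert original.dtype == np.float32
assert np.array_equal(encode(original), packed)
assert np.allclose(decode(encode(original)), original)
```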

@shoyer (Member) commented Jan 20, 2018

As mentioned in #1842, maybe we should also make a point not to upcast float32 input?

One possible concern with changing precision from float64 to float32 is that some reduction operations, like mean, could become less accurate due to the lower precision. So it's a good thing @fujiisoup wrote #1841, so we can specify dtype in reductions :).
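A standalone NumPy sketch of the point about reductions: a float32 accumulator can drift on long sums, and passing dtype= to the reduction (what #1841 enables at the xarray level) requests a wider accumulator instead:

```python
import numpy as np

x = np.full(1_000_000, 0.1, dtype=np.float32)

# Accumulate in float32 (matching the decoded data's dtype)...
m32 = x.mean(dtype=np.float32)
# ...or request a float64 accumulator for precision-sensitive reductions;
# the two results can differ in the trailing digits.
m64 = x.mean(dtype=np.float64)

assert m32.dtype == np.float32
assert m64.dtype == np.float64
```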

With regards to Hypothesis: I haven't used it, but it does seem very intriguing. I'm sure that it could turn up quite a few bugs in xarray. Would it make sense to add it as an optional dependency to the test suite? Even if we use Hypothesis to cover more edge cases, I think we will still want normal test coverage for most behavior.

@Zac-HD force-pushed the efficient-scaling branch from 1c3c759 to 77b7793 on January 20, 2018 at 11:58
@Zac-HD (Contributor, Author) commented Jan 20, 2018

Added tests; float32 is not upcast, and float16 ("intended for storage of many floating-point values where higher precision is not needed, not for performing arithmetic") is upcast, but only to float32.

I'll open a new issue to add a basic suite of property-based tests to Xarray 😄

@@ -51,6 +51,10 @@ Enhancements
- :py:func:`~plot.line()` learned to draw multiple lines if provided with a
2D variable.
By `Deepak Cherian <https://github.com/dcherian>`_.
- Reduce memory usage when decoding a variable with a scale_factor, by
Member review comment:

It looks like this will also affect encoding, e.g., writing float32 rather than float64 by default for scale/offset-encoded float32 data. Let's note that.

Author reply:

Done.

encoded = coder.encode(original)
assert encoded.dtype == np.float32
roundtripped = coder.decode(encoded)
assert_identical(original, roundtripped)
Member review comment:

Let's also add an explicit check that the decoded values are float32 (I don't think assert_identical does that).

Author reply:

Done.

@@ -212,11 +212,31 @@ class CFScaleOffsetCoder(VariableCoder):
decode_values = encoded_values * scale_factor + add_offset
"""

@staticmethod
def _choose_float_dtype(data, has_offset):
Member review comment:

Can you make this helper function use dtype as an argument instead of data? That would make it clearer that the result dtype is not data dependent.

@@ -212,11 +212,31 @@ class CFScaleOffsetCoder(VariableCoder):
decode_values = encoded_values * scale_factor + add_offset
"""

@staticmethod
Member review comment:

This is a small nit, but I prefer not to use staticmethod when a normal function will do. We actually say "Avoid @staticmethod" in our internal Google style guide (which has gotten out of sync with the public one).

@shoyer (Member) left a comment

Looks good to me. @jhamman any further concerns?

@jhamman (Member) commented Jan 23, 2018

This all looks good now. I merged in master and resolved a small conflict in the docs. I'll merge after the tests pass.

@jhamman jhamman merged commit 65e5f05 into pydata:master Jan 23, 2018
@Zac-HD Zac-HD mentioned this pull request Feb 25, 2018
@Zac-HD Zac-HD deleted the efficient-scaling branch April 19, 2018 02:50
@kmuehlbauer kmuehlbauer mentioned this pull request Apr 1, 2023
kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this pull request Apr 1, 2023
kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this pull request Apr 4, 2023