
Read small integers as float32, not float64 #1840

Merged · 3 commits · Jan 23, 2018

Conversation

@Zac-HD (Contributor) commented Jan 19, 2018

Most satellites produce images with color depth in the range of eight to sixteen bits, which are therefore often stored as unsigned integers (with the quality mask in another variable). If you're lucky, they also have a scale_factor attribute and Xarray can automatically convert the integers to floats representing albedo.

This is fantastically convenient, and avoids all the bit-depth bugs from misremembered specifications. However, loading data as float64 when float32 is sufficient doubles memory usage in IO (even on multi-TB datasets...). While immediately downcasting helps, it's no substitute for doing the right thing first.

So this patch does some conservative checks, and if we can be sure float32 is safe we use that instead.
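The memory argument is easy to see in plain NumPy (a sketch, not xarray code — the array below stands in for a packed satellite band):

```python
import numpy as np

# A packed 16-bit band, as it might come off disk (sketch data, all zeros).
packed = np.zeros((4096, 4096), dtype=np.uint16)

# Decoding to float64 vs. float32: same values, double the memory.
as_f64 = packed.astype(np.float64) * 0.001
as_f32 = packed.astype(np.float32) * 0.001

assert as_f32.dtype == np.float32          # float32 survives the scaling
assert as_f64.nbytes == 2 * as_f32.nbytes  # float64 costs twice the RAM
```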

@jhamman (Member) left a comment

This generally looks okay, but it needs some tests, so please add those. In the future, a change of this scope is probably worth proposing in an issue first.

@@ -50,6 +50,9 @@ Enhancements
- :py:func:`~plot.line()` learned to draw multiple lines if provided with a
2D variable.
By `Deepak Cherian <https://github.com/dcherian>`_.
- Reduce memory usage when decoding a variable with a scale_factor, by
converting 8-bit and 16-bit integers to float32 instead of float64.
By `Zac Hatfield-Dodds <https://github.com/Zac-HD>`_.
Member review comment:
Add this PR number as a reference.

if data.dtype.itemsize <= 2 and \
np.issubdtype(data.dtype, np.integer) and \
'add_offset' not in attributes and \
2 ** -23 < float(attributes.get('scale_factor', 1)) < 2 ** 8:
Member review comment:

This if statement is fairly complicated and it's not really clear what is happening. Can you:

  1. add some more comments
  2. use parentheses instead of line continuation breaks
  3. consider just passing in the scale factor instead of attributes

@Zac-HD force-pushed the efficient-scaling branch from c4fbcea to f659398 on January 19, 2018 at 06:51
@Zac-HD (Contributor, Author) commented Jan 19, 2018

Thanks - I was actually writing up an issue and decided it would be easier to demonstrate the proposed fix in a PR, but I'll open an issue first next time.

The checkbox about flake8 could be removed from the issue template now - since #1824 we run flake8 on everything in CI so if tests pass flake8 is passing too.

Re: tests — what do you (and @shoyer) think about using Hypothesis for some property-based tests of variable coding? "Encoding then decoding is a no-op" is a classic property 😄 Upside: more powerful and better at finding edge cases; downside: slower, simply because it checks more cases (a configurable number).
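For concreteness, the roundtrip property reads like this when spelled out by hand; Hypothesis would generate the inputs rather than the fixed grid below (plain NumPy, not xarray's actual coder API):

```python
import numpy as np

SCALE = np.float32(0.01)

def encode(values):
    # Pack floats back into uint16 by dividing out the scale factor.
    return np.round(values / SCALE).astype(np.uint16)

def decode(packed):
    return packed.astype(np.float32) * SCALE

# "Encoding then decoding is a no-op" over every representable value.
packed = np.arange(2 ** 16, dtype=np.uint16)
original = decode(packed)
assert original.dtype == np.float32
assert np.array_equal(encode(original), packed)
assert np.allclose(decode(encode(original)), original)
```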

@shoyer (Member) commented Jan 20, 2018

As mentioned in #1842, maybe we should also make a point not to upcast float32 input?

One possible concern with changing precision from float64 to float32 is that some reduction operations, like mean, could become less accurate due to the lower precision. So it's a good thing @fujiisoup wrote #1841, so we can specify dtype in reductions :).
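A standalone NumPy sketch of the point about reductions: a float32 accumulator can drift on long sums, and passing dtype= to the reduction (what #1841 enables at the xarray level) requests a wider accumulator instead:

```python
import numpy as np

x = np.full(1_000_000, 0.1, dtype=np.float32)

# Accumulate in float32 (matching the decoded data's dtype)...
m32 = x.mean(dtype=np.float32)
# ...or request a float64 accumulator for precision-sensitive reductions;
# the two results can differ in the trailing digits.
m64 = x.mean(dtype=np.float64)

assert m32.dtype == np.float32
assert m64.dtype == np.float64
```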

With regards to Hypothesis: I haven't used it, but it does seem very intriguing. I'm sure that it could turn up quite a few bugs in xarray. Would it make sense to add it as an optional dependency to the test suite? Even if we use Hypothesis to cover more edge cases, I think we will still want normal test coverage for most behavior.

@Zac-HD force-pushed the efficient-scaling branch from 1c3c759 to 77b7793 on January 20, 2018 at 11:58
@Zac-HD (Contributor, Author) commented Jan 20, 2018

Added tests; float32 is not upcast, and float16 ("intended for storage of many floating-point values where higher precision is not needed, not for performing arithmetic") is upcast, but only to float32.

I'll open a new issue to add a basic suite of property-based tests to Xarray 😄

@@ -51,6 +51,10 @@ Enhancements
- :py:func:`~plot.line()` learned to draw multiple lines if provided with a
2D variable.
By `Deepak Cherian <https://github.com/dcherian>`_.
- Reduce memory usage when decoding a variable with a scale_factor, by
Member review comment:

It looks like this will also affect encoding, e.g., writing float32 rather than float64 by default for scale/offset-encoded float32 data. Let's note that.

Author reply:

Done.

encoded = coder.encode(original)
assert encoded.dtype == np.float32
roundtripped = coder.decode(encoded)
assert_identical(original, roundtripped)
Member review comment:

Let's also add an explicit check that the decoded values are float32 (I don't think assert_identical does that).

Author reply:

Done.

@@ -212,11 +212,31 @@ class CFScaleOffsetCoder(VariableCoder):
decode_values = encoded_values * scale_factor + add_offset
"""

@staticmethod
def _choose_float_dtype(data, has_offset):
Member review comment:

Can you make this helper function use dtype as an argument instead of data? That would make it clearer that the result dtype is not data dependent.

@@ -212,11 +212,31 @@ class CFScaleOffsetCoder(VariableCoder):
decode_values = encoded_values * scale_factor + add_offset
"""

@staticmethod
Member review comment:

This is a small nit, but I prefer not to use staticmethod when a normal function will do. We actually say "Avoid @staticmethod" in our internal Google style guide (which has gotten out of sync with the public one).

@shoyer (Member) left a comment

Looks good to me. @jhamman any further concerns?

@jhamman (Member) commented Jan 23, 2018

This all looks good now. I merged in master and resolved a small conflict in the docs. I'll merge after the tests pass.

@jhamman jhamman merged commit 65e5f05 into pydata:master Jan 23, 2018
@Zac-HD Zac-HD mentioned this pull request Feb 25, 2018
@Zac-HD Zac-HD deleted the efficient-scaling branch April 19, 2018 02:50
@kmuehlbauer kmuehlbauer mentioned this pull request Apr 1, 2023
kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this pull request Apr 1, 2023
kmuehlbauer added a commit to kmuehlbauer/xarray that referenced this pull request Apr 4, 2023