
Optimize range calculation operations #344

Merged: 11 commits into master, May 8, 2017
Conversation

@gbrener (Contributor) commented May 5, 2017:

Optimize the operations that datashader uses to calculate the x_range and y_range variables when the ranges are not provided by the user. Addresses issue #336 and some comments that came out of #332. Agg2 now takes the same amount of time with df.persist() as it previously did with cachey.

Add a caching feature to Canvas which allows it to avoid recalculating the x_range and y_range when the Canvas object is reused for multiple aggregations. This caching feature can be turned off when instantiating the Canvas object with Canvas(..., cache_ranges=False), or circumvented per call with cvs.points(..., recalc_ranges=True) or cvs.line(..., recalc_ranges=True).

Update filetimes.py to do its own caching of the range variables. Also add a --recalc-ranges flag to see the timing difference when the range variables are not cached.

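For illustration, a minimal usage sketch of the caching behavior as described above. The cache_ranges and recalc_ranges keywords are taken from this description (they were later dropped within this PR in favor of memoization; see the commits below); the other Canvas arguments are the usual ones.

import numpy as np
import pandas as pd
import datashader as ds

df = pd.DataFrame({'x': np.random.random(1000), 'y': np.random.random(1000)})

# No x_range/y_range passed, so the Canvas computes them from the data;
# with caching enabled (the default) they would only be computed once.
cvs = ds.Canvas(plot_width=300, plot_height=300)
agg1 = cvs.points(df, 'x', 'y')
agg2 = cvs.points(df, 'x', 'y')                      # reuses the cached ranges
agg3 = cvs.points(df, 'x', 'y', recalc_ranges=True)  # forces a recalculation

# Opt out of range caching entirely:
cvs_nocache = ds.Canvas(plot_width=300, plot_height=300, cache_ranges=False)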
@gbrener requested a review from jbednar, May 5, 2017 00:29
@jbednar (Member) left a comment:

Looks great, thanks! Do you have results showing timings before and after the change? I'm not sure what you mean by 30s, 60s, and 70s above, since the entire Agg2 should take 2s or less...

-        self.x_range = tuple(x_range) if x_range is not None else x_range
-        self.y_range = tuple(y_range) if y_range is not None else y_range
+        self.x_range = tuple(x_range) if x_range is not None else None
+        self.y_range = tuple(y_range) if y_range is not None else None
Member:

I guess you found this way of writing it clearer (as it would seem to have no effect)? I personally would maybe prefer:

self.x_range = None if x_range is None else tuple(x_range)
self.y_range = None if y_range is None else tuple(y_range)

Contributor (Author):

Ok, not a problem - happy to make the change if it's easier to read.

@@ -29,10 +29,12 @@ def validate(self, in_dshape):
             raise ValueError('y must be real')

     def _compute_x_bounds(self, df):
-        return df[self.x].min(), df[self.x].max()
+        xs = df[self.x].values
+        return xs.min(), xs.max()
Member:

Should this and _compute_y_bounds be @jit so that Numba can combine the max and min into a single pass?

Contributor (Author):

Good point, I'll investigate.

@gbrener (Contributor, Author) commented May 6, 2017:

@jbednar I had some difficulties using numba with dask. For this reason there are now two versions of the bounds computations: one is for pandas dataframes, relying on numba and only doing one pass through the data (as we discussed). The other is a memoized version of the existing code (with np.nanmin and np.nanmax). The memoization gets us the same performance as cachey did - which is great news - but still does two passes through the data. The memoization does not work with pandas dataframes because they're mutable, whereas dask dataframes are not. However there is still a solid speedup to the pandas side of datashader. Hopefully someone with more dask knowledge will be able to incorporate numba into the memoized function for an additional potential speedup on the first aggregation.
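As a rough sketch of the memoized (dask) path described here — not the verbatim glyphs.py code, and the function name below is made up — the idea is to cache the two-pass nanmin/nanmax result, keyed on the dataframe and column:

import numpy as np
from toolz import memoize

@memoize
def compute_bounds_memoized(df, col):
    # Two passes over the data (np.nanmin, then np.nanmax), but the result is
    # cached, so reusing the same dataframe for several aggregations pays the
    # cost only once. Relies on the dataframe being hashable, which holds for
    # dask dataframes but not for (mutable) pandas DataFrames.
    xs = df[col].values
    return np.nanmin(xs), np.nanmax(xs)

The pandas path instead calls a numba-jitted single-pass function on the raw values array; a reconstruction of that function appears further down the thread.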

gbrener added 6 commits May 5, 2017 16:20
Remove caching feature from previous commits after discussion with Jim. Convert arr.min() and arr.max() calls to np.nanmin(arr) and np.nanmax(arr) to more-closely emulate the NaN-handling behavior of df.min() and df.max()
Use toolz.memoize - similar to how other code in glyphs.py is optimized - to cache the x/y bound computations. This takes advantage of the fact that dask dataframes are immutable/hashable, and has the desired result that the cache_ranges feature had before.
The tests are failing because pandas DataFrames are not hashable. Rather than using dask.dataframe.hashing.hash_pandas_object, it is probably more efficient to simply recalculate the min/max. So memoization only happens for dask.
Since I know of no straightforward way to incorporate numba functions into dask graphs, there are now two versions of the min/max computations; the pandas ones use numba, and the dask ones use memoization (relying on dask dataframe immutability).
Nothing has changed except the names, so that the tests can return to passing state again.
Update unit test so that it passes a numpy array to numba function instead of dataframe (as it was before).
@gbrener (Contributor, Author) commented May 6, 2017:

@jbednar Updated results posted here: #129

    @staticmethod
    @ngjit
    def _compute_x_bounds(xs):
        minval = maxval = xs[0]

@martindurant commented:

This is good in many cases, but probably falls over if the first item happens to be nan.
Another little loop to skip over any initial nan values would do here, and I suppose there should be a way to raise an error if it happens that all values are nan.

Contributor (Author):

Good catch @martindurant. Those cases are covered now. Also added the check for the empty array condition.

Contributor (Author):

Also, I'm on vacation this week - feel free to make additional changes to this PR as you see fit.

@jbednar added this to the 0.5.0 milestone, May 7, 2017
gbrener added 3 commits May 8, 2017 00:11
Add checks to bounds computations for cases where NaN(s) occur at the beginning of the x/y arrays, or the arrays are all NaNs, or arrays are empty.
Although the former code was shorter, separating bounds computations into two loops should yield a slight speed improvement for the common case.
@@ -32,25 +31,49 @@ def validate(self, in_dshape):
     @staticmethod
     @ngjit
     def _compute_x_bounds(xs):
         if len(xs) == 0:
             raise ValueError('x coordinate array is empty.')
Member:

Computing len() for a Dask dataframe is non-trivial, since it is loaded lazily, but I think this entire branch can be avoided by simply setting minval=maxval=np.NaN.


In that case x < minval and x > maxval will both always be False, so you would only ever get NaN out.

Member:

True; that would need an extra isnan(minval), etc. inside the loop. @jcrist points out that at this point the chunk of data has already been loaded, and so the len() should be cheap, so I'll leave it as-is.

Instead of doing 2 loops as suggested, revert back to 1 loop (to avoid computing the len on a dask dataframe).
@gbrener (Contributor, Author) commented May 8, 2017:

@jbednar How's this?

                 minval = x
-            elif x > maxval:
+            if np.isnan(maxval) or x > maxval:


Yes, I think this does the right thing at the slight cost of checking nan every loop.

@gbrener (Contributor, Author) commented May 8, 2017:

Indeed. But at least it can be reused for dask at some point in the future. These methods are only used for pandas at the moment; the ones used for dask are called by the same names, but have the _dask suffix.
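Piecing together the fragments quoted above (the empty-array check, the minval = maxval = xs[0] seed, and the np.isnan(...) or ... comparisons), the single-loop version under discussion looks roughly like the following. This is a reconstruction for illustration, with ngjit approximated by numba's nopython jit, not the exact merged code:

import numpy as np
from numba import jit

ngjit = jit(nopython=True, nogil=True)  # stand-in for datashader's ngjit decorator

@ngjit
def _compute_x_bounds(xs):
    if len(xs) == 0:
        raise ValueError('x coordinate array is empty.')
    # Seed with the first element; it may be NaN, which the isnan checks below repair.
    minval = maxval = xs[0]
    for x in xs:
        if np.isnan(minval) or x < minval:
            minval = x
        if np.isnan(maxval) or x > maxval:
            maxval = x
    # If every value is NaN, both bounds remain NaN.
    return minval, maxval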

Member:

Ok, I'll go ahead and merge this, and we can separately consider further optimizations in the future. Thanks, all!

@jbednar merged commit c0d378c into master, May 8, 2017
@jbednar deleted the optimize_max_min branch, May 8, 2017 21:09