
Optimize range calculation operations #344

Merged: 11 commits into master, May 8, 2017
Conversation

@gbrener (Contributor) commented May 5, 2017:

Optimize the operations that datashader uses to calculate the x_range and y_range variables when the ranges are not provided by the user. Addresses issue #336 and some comments that came out of #332. Agg2 now takes the same amount of time with df.persist() as it previously did with cachey.

Add a caching feature to Canvas which allows it to avoid recalculating the x_range and y_range when the Canvas object is reused for multiple aggregations. This caching feature can be turned off when instantiating the Canvas object with Canvas(..., cache_ranges=False), or circumvented per call with cvs.points(..., recalc_ranges=True) or cvs.line(..., recalc_ranges=True).

Update filetimes.py to do its own caching of the range variables. Also add a --recalc-ranges flag to see the timing difference when the range variables are not cached.

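For illustration, a minimal usage sketch of the caching behavior as described above. The cache_ranges and recalc_ranges keywords are taken from this description (they were later dropped within this PR in favor of memoization; see the commits below); the other Canvas arguments are the usual ones.

import numpy as np
import pandas as pd
import datashader as ds

df = pd.DataFrame({'x': np.random.random(1000), 'y': np.random.random(1000)})

# No x_range/y_range passed, so the Canvas computes them from the data;
# with caching enabled (the default) they would only be computed once.
cvs = ds.Canvas(plot_width=300, plot_height=300)
agg1 = cvs.points(df, 'x', 'y')
agg2 = cvs.points(df, 'x', 'y')                      # reuses the cached ranges
agg3 = cvs.points(df, 'x', 'y', recalc_ranges=True)  # forces a recalculation

# Opt out of range caching entirely:
cvs_nocache = ds.Canvas(plot_width=300, plot_height=300, cache_ranges=False)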
@gbrener requested a review from jbednar, May 5, 2017 00:29
@jbednar (Member) left a comment:

Looks great, thanks! Do you have results showing timings before and after the change? I'm not sure what you mean by 30s, 60s, and 70s above, since the entire Agg2 should take 2s or less...

-        self.x_range = tuple(x_range) if x_range is not None else x_range
-        self.y_range = tuple(y_range) if y_range is not None else y_range
+        self.x_range = tuple(x_range) if x_range is not None else None
+        self.y_range = tuple(y_range) if y_range is not None else None
Member:

I guess you found this way of writing it clearer (as it would seem to have no effect)? I personally would maybe prefer:

self.x_range = None if x_range is None else tuple(x_range)
self.y_range = None if y_range is None else tuple(y_range)

Contributor (Author):

Ok, not a problem - happy to make the change if it's easier to read.

@@ -29,10 +29,12 @@ def validate(self, in_dshape):
             raise ValueError('y must be real')

     def _compute_x_bounds(self, df):
-        return df[self.x].min(), df[self.x].max()
+        xs = df[self.x].values
+        return xs.min(), xs.max()
Member:

Should this and _compute_y_bounds be @jit so that Numba can combine the max and min into a single pass?

Contributor (Author):

Good point, I'll investigate.

@gbrener (Contributor, Author) commented May 6, 2017:

@jbednar I had some difficulties using numba with dask. For this reason there are now two versions of the bounds computations: one is for pandas dataframes, relying on numba and only doing one pass through the data (as we discussed). The other is a memoized version of the existing code (with np.nanmin and np.nanmax). The memoization gets us the same performance as cachey did - which is great news - but still does two passes through the data. The memoization does not work with pandas dataframes because they're mutable, whereas dask dataframes are not. However there is still a solid speedup to the pandas side of datashader. Hopefully someone with more dask knowledge will be able to incorporate numba into the memoized function for an additional potential speedup on the first aggregation.
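As a rough sketch of the memoized (dask) path described here — not the verbatim glyphs.py code, and the function name below is made up — the idea is to cache the two-pass nanmin/nanmax result, keyed on the dataframe and column:

import numpy as np
from toolz import memoize

@memoize
def compute_bounds_memoized(df, col):
    # Two passes over the data (np.nanmin, then np.nanmax), but the result is
    # cached, so reusing the same dataframe for several aggregations pays the
    # cost only once. Relies on the dataframe being hashable, which holds for
    # dask dataframes but not for (mutable) pandas DataFrames.
    xs = df[col].values
    return np.nanmin(xs), np.nanmax(xs)

The pandas path instead calls a numba-jitted single-pass function on the raw values array; a reconstruction of that function appears further down the thread.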

gbrener added 6 commits May 5, 2017 16:20
Remove caching feature from previous commits after discussion with Jim. Convert arr.min() and arr.max() calls to np.nanmin(arr) and np.nanmax(arr) to more-closely emulate the NaN-handling behavior of df.min() and df.max()
Use toolz.memoize - similar to how other code in glyphs.py is optimized - to cache the x/y bound computations. This takes advantage of the fact that dask dataframes are immutable/hashable, and has the desired result that the cache_ranges feature had before.
The tests are failing because pandas DataFrames are not hashable. Rather than using dask.dataframe.hashing.hash_pandas_object, it is probably more efficient to simply recalculate the min/max. So memoization only happens for dask.
Since I know of no straightforward way to incorporate numba functions into dask graphs, there are now two versions of the min/max computations; the pandas ones use numba, and the dask ones use memoization (relying on dask dataframe immutability).
Nothing has changed except the names, so that the tests can return to passing state again.
Update unit test so that it passes a numpy array to numba function instead of dataframe (as it was before).
@gbrener (Contributor, Author) commented May 6, 2017:

@jbednar Updated results posted here: #129

    @staticmethod
    @ngjit
    def _compute_x_bounds(xs):
        minval = maxval = xs[0]

@martindurant commented:

This is good in many cases, but probably falls over if the first item happens to be nan.
Another little loop to skip over any initial nan values would do here, and I suppose there should be a way to raise an error if it happens that all values are nan.

Contributor (Author):

Good catch @martindurant. Those cases are covered now. Also added the check for the empty array condition.

Contributor (Author):

Also, I'm on vacation this week - feel free to make additional changes to this PR as you see fit.

@jbednar added this to the 0.5.0 milestone, May 7, 2017
gbrener added 3 commits May 8, 2017 00:11
Add checks to bounds computations for cases where NaN(s) occur at the beginning of the x/y arrays, or the arrays are all NaNs, or arrays are empty.
Although the former code was shorter, separating bounds computations into two loops should yield a slight speed improvement for the common case.
@@ -32,25 +31,49 @@ def validate(self, in_dshape):
     @staticmethod
     @ngjit
     def _compute_x_bounds(xs):
         if len(xs) == 0:
             raise ValueError('x coordinate array is empty.')
Member:

Computing len() for a Dask dataframe is non-trivial, since it is loaded lazily, but I think this entire branch can be avoided by simply setting minval=maxval=np.NaN.


In that case x < minval and x > maxval will both always be False, so you would only ever get NaN out.

Member:

True; that would need an extra isnan(minval), etc. inside the loop. @jcrist points out that at this point the chunk of data has already been loaded, and so the len() should be cheap, so I'll leave it as-is.

Instead of doing 2 loops as suggested, revert back to 1 loop (to avoid computing the len on a dask dataframe).
@gbrener (Contributor, Author) commented May 8, 2017:

@jbednar How's this?

                 minval = x
-            elif x > maxval:
+            if np.isnan(maxval) or x > maxval:


Yes, I think this does the right thing at the slight cost of checking nan every loop.

@gbrener (Contributor, Author) commented May 8, 2017:

Indeed. But at least it can be reused for dask at some point in the future. These methods are only used for pandas at the moment; the ones used for dask are called by the same names, but have the _dask suffix.
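Piecing together the fragments quoted above (the empty-array check, the minval = maxval = xs[0] seed, and the np.isnan(...) or ... comparisons), the single-loop version under discussion looks roughly like the following. This is a reconstruction for illustration, with ngjit approximated by numba's nopython jit, not the exact merged code:

import numpy as np
from numba import jit

ngjit = jit(nopython=True, nogil=True)  # stand-in for datashader's ngjit decorator

@ngjit
def _compute_x_bounds(xs):
    if len(xs) == 0:
        raise ValueError('x coordinate array is empty.')
    # Seed with the first element; it may be NaN, which the isnan checks below repair.
    minval = maxval = xs[0]
    for x in xs:
        if np.isnan(minval) or x < minval:
            minval = x
        if np.isnan(maxval) or x > maxval:
            maxval = x
    # If every value is NaN, both bounds remain NaN.
    return minval, maxval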

Member:

Ok, I'll go ahead and merge this, and we can separately consider further optimizations in the future. Thanks, all!

@jbednar merged commit c0d378c into master, May 8, 2017
@jbednar deleted the optimize_max_min branch, May 8, 2017 21:09