Optimize range calculation operations #344
Conversation
Optimize the operations that datashader uses to calculate the x_range and y_range variables when the ranges are not provided by the user. Add a caching feature that avoids recalculating the x_range and y_range when the Canvas is reused for multiple aggregations. The caching feature can be turned off when instantiating the Canvas object, or invalidated by calling cvs.points(..., recalc_ranges=True) or cvs.line(..., recalc_ranges=True).
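For orientation, a rough usage sketch of the interface proposed here. The `cache_ranges` and `recalc_ranges` keywords come from this description and were removed later in the review, so they are not part of the final API; `df` is assumed to be a dataframe with 'x' and 'y' columns.

```python
import datashader as ds

cvs = ds.Canvas(plot_width=400, plot_height=400)      # caching on by default (as proposed)
agg1 = cvs.points(df, 'x', 'y')                       # x/y ranges computed and cached
agg2 = cvs.points(df, 'x', 'y')                       # cached ranges reused
agg3 = cvs.points(df, 'x', 'y', recalc_ranges=True)   # proposed: force recomputation

# proposed opt-out at Canvas construction time
cvs_nocache = ds.Canvas(plot_width=400, plot_height=400, cache_ranges=False)
```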
Looks great, thanks! Do you have results showing timings before and after the change? I'm not sure what you mean by 30s, 60s, and 70s above, since the entire Agg2 should take 2s or less...
datashader/core.py
Outdated
self.x_range = tuple(x_range) if x_range is not None else x_range
self.y_range = tuple(y_range) if y_range is not None else y_range
self.x_range = tuple(x_range) if x_range is not None else None
self.y_range = tuple(y_range) if y_range is not None else None
I guess you found this way of writing it clearer (as it would seem to have no effect)? I personally would maybe prefer:
self.x_range = None if x_range is None else tuple(x_range)
self.y_range = None if y_range is None else tuple(y_range)
Ok, not a problem - happy to make the change if it's easier to read.
datashader/glyphs.py
Outdated
@@ -29,10 +29,12 @@ def validate(self, in_dshape):
raise ValueError('y must be real')

def _compute_x_bounds(self, df):
return df[self.x].min(), df[self.x].max()
xs = df[self.x].values
return xs.min(), xs.max()
Should this and _compute_y_bounds be @jit so that Numba can combine the max and min into a single pass?
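For illustration, a minimal sketch of the kind of jitted single-pass min/max being suggested (function name is illustrative; NaN handling, which comes up later in this thread, is deliberately omitted):

```python
import numpy as np
from numba import njit

@njit
def minmax_single_pass(xs):
    # One pass over the array, tracking min and max together.
    # NaN handling is omitted here; see the discussion below.
    minval = maxval = xs[0]
    for x in xs:
        if x < minval:
            minval = x
        elif x > maxval:
            maxval = x
    return minval, maxval

print(minmax_single_pass(np.array([3.0, 1.0, 2.0])))  # (1.0, 3.0)
```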
Good point, I'll investigate.
@jbednar I had some difficulties using numba with dask. For this reason there are now two versions of the bounds computations: one is for pandas dataframes, relying on numba and only doing one pass through the data (as we discussed). The other is a memoized version of the existing code (with np.nanmin and np.nanmax). The memoization gets us the same performance as cachey did - which is great news - but still does two passes through the data. The memoization does not work with pandas dataframes because they're mutable, whereas dask dataframes are not. However there is still a solid speedup to the pandas side of datashader. Hopefully someone with more dask knowledge will be able to incorporate numba into the memoized function for an additional potential speedup on the first aggregation.
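A rough sketch of the memoized two-pass variant described here, assuming a dask dataframe argument. The function name and the use of dask's own NaN-skipping reductions are illustrative, not the exact code in glyphs.py:

```python
import dask
from toolz import memoize

@memoize
def compute_x_bounds_dask(ddf, x_col):
    # Two passes over the data (min, then max), both NaN-skipping like the
    # pandas equivalents. The result is cached keyed on the dask dataframe,
    # which is safe because dask dataframes are immutable (and hashable).
    lo, hi = dask.compute(ddf[x_col].min(), ddf[x_col].max())
    return float(lo), float(hi)
```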
Remove caching feature from previous commits after discussion with Jim. Convert arr.min() and arr.max() calls to np.nanmin(arr) and np.nanmax(arr) to more closely emulate the NaN-handling behavior of df.min() and df.max().
Use toolz.memoize - similar to how other code in glyphs.py is optimized - to cache the x/y bound computations. This takes advantage of the fact that dask dataframes are immutable/hashable, and has the desired result that the cache_ranges feature had before.
The tests are failing because pandas DataFrames are not hashable. Rather than using dask.dataframe.hashing.hash_pandas_object, it is probably more efficient to simply recalculate the min/max. So memoization only happens for dask.
Since I know of no straightforward way to incorporate numba functions into dask graphs, there are now two versions of the min/max computations; the pandas ones use numba, and the dask ones use memoization (relying on dask dataframe immutability).
Nothing has changed except the names, so that the tests return to a passing state.
Update unit test so that it passes a numpy array to numba function instead of dataframe (as it was before).
@staticmethod
@ngjit
def _compute_x_bounds(xs):
    minval = maxval = xs[0]
This is good in many cases, but probably falls over if the first item happens to be nan.
Another little loop to skip over any initial nan values would do here, and I suppose there should be a way to raise an error if it happens that all values are nan.
Good catch @martindurant. Those cases are covered now. Also added the check for the empty array condition.
Also, I'm on vacation this week - feel free to make additional changes to this PR as you see fit.
Add checks to bounds computations for cases where NaN(s) occur at the beginning of the x/y arrays, or the arrays are all NaNs, or arrays are empty.
Although the former code was shorter, separating bounds computations into two loops should yield a slight speed improvement for the common case.
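As a rough illustration of the approach described in these commits (the function name and error messages are illustrative, not the exact code in the PR):

```python
import numpy as np
from numba import njit

@njit
def compute_bounds_two_loops(xs):
    n = len(xs)
    if n == 0:
        raise ValueError('coordinate array is empty')
    # First loop: skip any leading NaNs to find a valid starting value.
    start = 0
    while start < n and np.isnan(xs[start]):
        start += 1
    if start == n:
        raise ValueError('all coordinate values are NaN')
    minval = maxval = xs[start]
    # Second loop: plain min/max scan, no isnan test in the hot path.
    # A NaN in the middle fails both comparisons and is skipped harmlessly.
    for i in range(start + 1, n):
        x = xs[i]
        if x < minval:
            minval = x
        elif x > maxval:
            maxval = x
    return minval, maxval
```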
datashader/glyphs.py
Outdated
@@ -32,25 +31,49 @@ def validate(self, in_dshape):
@staticmethod
@ngjit
def _compute_x_bounds(xs):
    if len(xs) == 0:
        raise ValueError('x coordinate array is empty.')
Computing len() for a Dask dataframe is non-trivial, since it is loaded lazily, but I think this entire branch can be avoided by simply setting minval=maxval=np.NaN.
In that case x < minval and x > maxval will both always be False, so you would only ever get NaN out.
True; that would need an extra isnan(minval), etc. inside the loop. @jcrist points out that at this point the chunk of data has already been loaded, and so the len() should be cheap, so I'll leave it as-is.
Instead of doing 2 loops as suggested, revert to 1 loop (to avoid computing the len on a dask dataframe).
@jbednar How's this?
datashader/glyphs.py
Outdated
minval = x
elif x > maxval:
if np.isnan(maxval) or x > maxval:
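For reference, a sketch of how the NaN-seeded single-loop variant under discussion might look in full (illustrative name; the merged code may differ). Seeding with NaN removes the need for the up-front len() check, at the cost of an isnan test on every iteration:

```python
import numpy as np
from numba import njit

@njit
def compute_bounds_one_loop(xs):
    # Seed with NaN; the isnan checks let the first non-NaN value replace
    # the seeds. An empty or all-NaN array yields (nan, nan).
    minval = maxval = np.nan
    for x in xs:
        if np.isnan(minval) or x < minval:
            minval = x
        if np.isnan(maxval) or x > maxval:
            maxval = x
    return minval, maxval
```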
Yes, I think this does the right thing at the slight cost of checking nan every loop.
Indeed. But at least it can be reused for dask at some point in the future. These methods are only used for pandas at the moment; the ones used for dask are called by the same names, but have the _dask suffix.
Ok, I'll go ahead and merge this, and we can separately consider further optimizations in the future. Thanks, all!
Optimize the operations that datashader uses to calculate the x_range and y_range variables when the ranges are not provided by the user. Addresses issue #336 and some comments that came out of #332. Agg2 now takes the same amount of time with df.persist() as it had previously with cachey.

Add a caching feature to Canvas which allows it to avoid recalculating the x_range and y_range when the Canvas object is reused for multiple aggregations. This caching feature can be turned off when instantiating the Canvas object with Canvas(..., cache_ranges=False), or be circumvented by calling cvs.points(..., recalc_ranges=True) or cvs.line(..., recalc_ranges=True).

Update filetimes.py to do its own caching of the range variables. Also add a --recalc-ranges flag to see the timing difference when the range variables are not cached.