-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fixes GH9311 groupby on datetime64 #9345
Conversation
@@ -1504,7 +1504,11 @@ def aggregate(self, values, how, axis=0): | |||
raise NotImplementedError | |||
out_shape = (self.ngroups,) + values.shape[1:] | |||
|
|||
if is_numeric_dtype(values.dtype): | |||
if is_datetime: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this do to operations like groupby('foo').mean()
for datetime columns? Currently these raise DataError: No numeric types to aggregate
but perhaps they shouldn't...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mean of datetime64 is a can of worms. It can be done, but completely separate from this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed... wanted to make sure to understand if that was changing that behavior for groupby aggregations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, I think that should be cause before here though (iow there is a test for it now that asserts that it fails)
This looks like some nice fixes but also looks likely under-tested -- you've changed a lot of functions and only added tests for one of them. |
@@ -1487,7 +1487,7 @@ def wrapper(*args, **kwargs): | |||
(how, dtype_str)) | |||
return func, dtype_str | |||
|
|||
def aggregate(self, values, how, axis=0): | |||
def aggregate(self, values, how, axis=0, is_datetime=False): | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can't put a flag like this here, its very odd. There are already introspection determinations in the grouper objects.
I would probably scale this back to only change operations without any possibility of precision/overflow issues -- I think that's Also, it is almost impossible to write too many tests :). |
Awesome. Thanks for the feedback. I'll have updates, and more tests in here shortly. |
@shoyer or @jreback - I'm very close to having the above fixes ready on this PR. I do need a bit of guidance on one thing though. Now that the Cython functions are using iNaT for nan values on the integer functions, they are returning the integer value (-9223372036854775808) back in the ndarray, as I would expect. In order to get it working so that na functions will work (like fillna), I need to replace those values back with something that will be evaluated as na. Is there a helper function somewhere to do that? |
@iwschris iNaT -> NaT happens automatically when you cast back to datetime64. You can use |
Awesome. Thanks! |
aff1443
to
2219ef7
Compare
@shoyer -- Pretty sure this is ready for review again. I was able to keep the safe casting turned on in internals.py and I validated that vbench is still looking pretty good. Here's the output from vbench:
|
@@ -4,6 +4,11 @@ | |||
# or we get a bootstrapping problem | |||
from StringIO import StringIO | |||
|
|||
MAX_INT8 = 127 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the dtype properties instead of hard coding these, e.g., https://github.com/pydata/pandas/blob/484f66814a32324248e3d03203ea991ed896fd7a/pandas/core/common.py#L52
OK, this looks much more reasonable to me. I'm slightly troubled by the |
@@ -85,6 +87,8 @@ | |||
_dataframe_apply_whitelist = \ | |||
_common_apply_whitelist | frozenset(['dtypes', 'corrwith']) | |||
|
|||
_non_arithmetic_agg = ('first', 'last', 'min', 'max', 'nth', 'count') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be a frozenset
@shoyer Thanks for looking at it so quickly. I'll make the changes suggested, and I'll use a profiler to figure out why |
Alright, I found it. Here's the relevant function: def _try_coerce_result(self, result):
""" reverse of try_coerce_args """
if isinstance(result, np.ndarray):
if result.dtype == 'i8':
result = tslib.array_to_datetime(
result.astype(object).ravel()).reshape(result.shape)
elif result.dtype.kind in ['i', 'f', 'O']:
result = result.astype('M8[ns]', casting='safe')
elif isinstance(result, (np.integer, np.datetime64)):
result = lib.Timestamp(result)
return result We aren't even hitting the
What's more, this block... if result.dtype == 'i8':
result = tslib.array_to_datetime(
result.astype(object).ravel()).reshape(result.shape) ...I don't believe is necessary any more, as I think it was covering for this error in numpy 1.6. I'm going to clean this function up and make sure all the tests are passing and that vbench looks better, and I'll push up my changes. |
@iwschris Interesting -- sounds like a good plan to me! |
@iwschris The point of that coercion was to handle the cases where a returned input was non-i8. E.g. an op was applied to a datetime64 (say mean in a groubpy) that returned a float / object. So it prob wasn't hit very much (maybe not tested at all). |
2219ef7
to
0acdaab
Compare
@shoyer I think I've made all the changes that you requested, and the groupby performance is fixed. Let me know if you see anything else that needs to be changed. |
d33dae7
to
01e6526
Compare
Cool. I think that last push takes care of everything mentioned up to this point. Anything else? |
just FYI
The reason we always want to take a view on a M8/m8 object is that it does NOT copy. so these DO need to be treated separately (from a regular int64) |
01e6526
to
9fff246
Compare
I see. Good info. Newest stuff now pushed. |
@jreback interesting -- didn't realize that about view vs astype for datetime64 |
@iwschris can you just run another vbench vs master (you can limit it with |
You bet. Should have it in a few minutes. |
@jreback looks like objects weren't accounted for in that group of |
@iwschris right, yeh that should be the else. |
9fff246
to
dd89607
Compare
datetime64 columns were changing at the nano-second scale when applying a groupby aggregator.
dd89607
to
5f6cbf8
Compare
|
@jreback - vbench output posted, and the modification for Let me know if there is anything else. |
vbench looks pretty good to me -- some nice speedups for grouped aggregations! |
the vbenchs are due to other PR's actually. Here is what I get
@iwschris I think you were benching against an older version. in any event it looks fine. ping when green |
Good to know for future PR's. It just takes a fetch and a rebase, right? |
yep, then I run with
if you originally checked out with (or you can set the tracking branch, e.g. |
Yep, our vbench's match now. |
@jreback it's green. |
BUG: Fixes GH9311 groupby on datetime64
Thanks for working with me on it! I'll take a peek at #4095. |
Indeed, really nicely done |
…already fixed in master)
datetime64 columns were changing at the nano-second scale when
applying a groupby aggregator.
closes #9311
closes #6620