Unexpected behaviour when grouping datetime column containing null-values, SeriesGroupby #10979
Comments
Sorry, I forgot to show the version, but I did check that I was using the most up-to-date version, 0.16.2. Maybe one of my dependencies is old. Here is the output of
ahh, this was actually a regression from 0.16.0 to 0.16.2; it's fixed in master in any event, so the fix will be in 0.17.0
Ok cool. There's a simple workaround for now. Cheers!
@eoincondron np, thanks for the report. If you'd like to see where this actually happened (xref #10980), that would be great.
Ok, I'm just learning how to use Git but I'll give it a try.
contributing docs are here: http://pandas.pydata.org/pandas-docs/stable/contributing.html
…already fixed in master)
This bug should be reopened because it still persists when one group has all NaT values:
    df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                       'groups': ['a', 'b']*2})
    df.loc[0, 'datetime'] = pd.NaT
    df.loc[2, 'datetime'] = pd.NaT
    df.groupby('groups').datetime.min()
which results in:
Note that the DataFrameGroupBy handles this correctly:
    df.groupby('groups')[['datetime']].min()
and gives as result:
My pandas version:
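A quick way to check whether an installed pandas is affected is to compare the two code paths from the snippet above. This is only a sketch: the names `series_min`, `frame_min`, `is_datetime` and `agrees` are illustrative, and it assumes the behaviour this thread treats as correct (NaT ignored, an all-NaT group giving NaT, and a datetime64 result).

```python
import pandas as pd

# Same frame as in the report: group 'a' ends up entirely NaT.
df = pd.DataFrame({'datetime': pd.date_range('20150903', periods=4),
                   'groups': ['a', 'b'] * 2})
df.loc[[0, 2], 'datetime'] = pd.NaT

# SeriesGroupBy path (reported as broken) and DataFrameGroupBy path.
series_min = df.groupby('groups')['datetime'].min()
frame_min = df.groupby('groups')[['datetime']].min()['datetime']

# On an unaffected version the result keeps a datetime dtype
# and both paths return the same values.
is_datetime = pd.api.types.is_datetime64_any_dtype(series_min.dtype)
agrees = series_min.equals(frame_min)
print(is_datetime, agrees)
```

If either flag prints `False`, the SeriesGroupBy result has lost its dtype or disagrees with the DataFrameGroupBy result, which matches the symptom reported here.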
@jreback Sorry about that! Thanks!
I found some unexpected behaviour when looking for the group minima of a datetime column containing null values. It appears that when the min method is called on a SeriesGroupBy of dtype datetime64 with null values, the values are cast to floats before the minima are computed. Consider the following:
The float value of pd.NaT is -2^63 and so it is determined to be the minimum of any group which contains it. The expected behaviour would be for null values to be ignored and the minima of the non-null values returned as datetime64 objects. Interestingly, the max method seems to work as expected;
The min method of the DataFrameGroupBy object is kind of half way between; it fails to ignore the null values and gives pd.NaT as the min of any group which contains it, but it does return the correct data type:
I tried to trace the source of the error and I got as far as the call to
where 'obj' is a (the only) set of values in self._iterate_slices. Within self.grouper.aggregate the lines
and
seem relevant. It might be worth noting that self.aggregate(lambda x: np.min(x, axis=self.axis)) has the desired output while self.aggregate(np.min) does not. Also, changing the definition of the min method to
fixes this particular problem.
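The sentinel effect described above can be modelled in plain Python. This is a toy sketch, not pandas' actual groupby code: `group_min`, `skip_nat` and the sample timestamps are hypothetical, and the only fact borrowed from the report is that NaT is encoded as -2^63.

```python
from collections import defaultdict

NAT = -2**63  # integer sentinel standing in for NaT

def group_min(keys, values, skip_nat):
    """Per-group minimum over raw integer timestamps."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    result = {}
    for key, vals in groups.items():
        if skip_nat:
            # Drop the sentinel before comparing.
            vals = [v for v in vals if v != NAT]
        # A group with nothing left (all NaT) stays NaT.
        result[key] = min(vals) if vals else NAT
    return result

keys = ['a', 'b', 'a', 'b']
values = [NAT, 20, 10, 30]  # hypothetical timestamps; one NaT in group 'a'

naive = group_min(keys, values, skip_nat=False)
masked = group_min(keys, values, skip_nat=True)
print(naive)   # {'a': -9223372036854775808, 'b': 20}
print(masked)  # {'a': 10, 'b': 20}
```

Because the sentinel is the smallest representable value, a comparison that does not mask it will always pick it as the minimum, while a maximum is unaffected. That is consistent with max appearing to work in the report.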