agg() function on groupby dataframe changes dtype of datetime64[ns] column to float64 if all items in a single group are NaT #12821

lvphj · 2016-04-07T12:24:32Z

The example below shows two variations of a dataframe which contains a date column set to datetime64[ns] format.

In the first example, there is a single missing (NaT) date. After groupby and agg(), the dtypes of all the columns in the aggregated dataframe are the same as the original dataframe, as expected (and as desired).

However, in the second example, there are several missing dates, arranged so that all the dates in one group are NaT. After the same groupby and agg() procedures, the dtype of the date column is changed to float64. This is undesired behaviour in my situation and I believe it is a bug.

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

# Introduce a single missing value in the date column
print('Dataframe with single missing date value')
print('========================================')
phjTempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                          'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                          'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                          'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

phjTempDF = phjTempDF.sort_values(['gender','age','date'])

# .loc is label-based, as .ix was on this integer index
phjTempDF.loc[1,'date'] = 'missing'

# Convert date to datetime64 format
phjTempDF['date'] = pd.to_datetime(phjTempDF['date'],errors='coerce')

print('\nWhole dataframe')
print('---------------')
print(phjTempDF)
print('\nOriginal types')
print('---------------')
print(phjTempDF.dtypes)

phjTempDF = phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'}).reset_index(drop=False)

print('\nAggregated dataframe')
print('--------------------')
print(phjTempDF)
print('\nPost-aggregation types')
print('-----------------------')
print(phjTempDF.dtypes)

# Introduce multiple missing values in the date column (one group contains all missing values)
print('\n\nDataframe with multiple missing dates values')
print('============================================')
phjTempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                          'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                          'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                          'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

phjTempDF = phjTempDF.sort_values(['gender','age','date'])

# .loc is label-based, as .ix was on this integer index
phjTempDF.loc[[1,2,5],'date'] = 'missing'

# Convert date to datetime64 format
phjTempDF['date'] = pd.to_datetime(phjTempDF['date'],errors='coerce')

print('\nWhole dataframe')
print('---------------')
print(phjTempDF)
print('\nOriginal types')
print('---------------')
print(phjTempDF.dtypes)

phjTempDF = phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'}).reset_index(drop=False)

print('\nAggregated dataframe')
print('--------------------')
print(phjTempDF)
print('\nPost-aggregation types')
print('-----------------------')
print(phjTempDF.dtypes)

Expected Output

The expected output would be for the dtypes in the dataframe after aggregation to be the same as those in the original dataframe.
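The expectation can be stated as a check. The following is only a minimal sketch, not part of the original report: the toy frame and the helper `changed_dtypes` are hypothetical names made up for illustration, and the result depends on the pandas version in use (on an affected version the `agg()` path should report the `date` column as changed).

```python
import pandas as pd


def changed_dtypes(before, after):
    """Return {column: (dtype_before, dtype_after)} for columns whose dtype differs."""
    return {
        col: (before[col].dtype, after[col].dtype)
        for col in after.columns
        if col in before.columns and before[col].dtype != after[col].dtype
    }


# Hypothetical toy frame: every date in group "b" is missing (NaT).
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "date": pd.to_datetime(["2015-04-02 02:34", "2015-04-06 12:34", None, None]),
    "id": [1, 2, 3, 4],
})

# Aggregate the whole frame at once.
direct = df.groupby("group").first()
# Aggregate per column through agg(), the path this report is about.
via_agg = df.groupby("group").agg({"date": "first", "id": "first"})

# An empty dict means every aggregated column kept its original dtype.
print(changed_dtypes(df, direct))
print(changed_dtypes(df, via_agg))
```

Comparing dtypes column by column, rather than eyeballing the printed frame, makes the regression easy to spot in a test.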

output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 1.5.6
setuptools: 3.6
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: 2.3.0
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None


jreback · 2016-04-07T13:12:41Z

So this works for a frame-level aggregation (`.first()`), but not for `.agg`. OK, this is a bug.

Pull requests welcome.

In [32]:  phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).first()
Out[32]: 
                            date  id
gender age                          
female old                   NaT   2
       young 2015-12-05 14:19:00  11
male   old   2015-12-04 01:00:00   4
       young 2015-02-04 02:34:00   1

In [33]: phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'})
Out[33]: 
                      date  id
gender age                    
female old             NaN   2
       young  1.449325e+18  11
male   old    1.449191e+18   4
       young  1.423017e+18   1
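The floats in Out[33] appear to be the timestamps' underlying integer values, that is, nanoseconds since the Unix epoch, with NaT coerced to NaN. As a rough illustration using only the standard library (the helper `ns_to_datetime` is a hypothetical name, and treating the naive timestamps as UTC is an assumption made here for the arithmetic):

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ns_to_datetime(ns):
    """Read a float as nanoseconds since the Unix epoch; NaN (a coerced NaT) maps to None."""
    if math.isnan(ns):
        return None
    # datetime only resolves microseconds, so sub-microsecond digits are dropped.
    return EPOCH + timedelta(microseconds=int(ns) // 1000)


# 1.44932514e18 is the unrounded value behind the 1.449325e+18 shown for ('female', 'young').
print(ns_to_datetime(1.44932514e18))  # 2015-12-05 14:19:00+00:00
print(ns_to_datetime(float("nan")))   # None
```

So the values themselves survive the aggregation; what is lost is the datetime64[ns] dtype.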

facaiy · 2016-04-08T08:47:22Z

I'd like to take a look.

jreback added the Bug, Datetime, Groupby, Missing-data and Difficulty Intermediate labels Apr 7, 2016

jreback added this to the 0.18.1 milestone Apr 7, 2016

jreback mentioned this issue Apr 21, 2016

Operations on NaT returning float instead of datetime64[ns] #12941

Closed

jreback modified the milestones: 0.18.2, 0.18.1 Apr 21, 2016

facaiy mentioned this issue Apr 26, 2016

BUG: agg() function on groupby dataframe changes dtype of datetime64[ns] column to float64 #12992

Closed

lvphj mentioned this issue May 15, 2016

Converting float64 values to datetime64[ns] format using pd.to_datetime results in NaT if errors='coerce' #13180

Closed

jreback modified the milestones: 0.19.0, 0.18.2 Jul 6, 2016

jreback mentioned this issue Jul 29, 2016

Unexpected behaviour when grouping datetime column containing null-values, SeriesGroupby #10979

Closed

jreback modified the milestones: 0.19.0, 0.20.0 Jul 29, 2016

jreback closed this as completed in 63e8f68 Aug 6, 2016

jreback mentioned this issue Aug 28, 2016

first/last converts datetime into float #14104

Closed

jreback mentioned this issue Oct 29, 2016

Wrong index type from groupby on an empty dataframe using user-defined agg func #14538

Closed
