Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

agg() function on groupby dataframe changes dtype of datetime64[ns] column to float64 if all items in a single group are NaT #12821

Closed
lvphj opened this issue Apr 7, 2016 · 2 comments
Labels
Bug Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@lvphj
Copy link

lvphj commented Apr 7, 2016

The example below shows two variations of a dataframe which contains a date column set to datetime64[ns] format.

In the first example, there is a single missing (NaT) date. After groupby and agg(), the dtypes of all the columns in the aggregated dataframe are the same as the original dataframe, as expected (and as desired).

However, in the second example, there are several missing dates, arranged so that all the dates in one group are NaT. After the same groupby and agg() procedures, the dtype of the date column is changed to float64. This is undesired behaviour in my situation and I believe it is a bug.

Code Sample, a copy-pastable example if possible

# Introduce single missing values in the date column
print('Datafreme with single missing date value')
print('========================================')
phjTempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                          'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                          'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                          'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

phjTempDF = phjTempDF.sort_values(['gender','age','date'])

phjTempDF.ix[1,'date'] = 'missing'

# Convert date to datetime64 format
phjTempDF['date'] = pd.to_datetime(phjTempDF['date'],errors='coerce')

print('\nWhole dataframe')
print('---------------')
print(phjzempdf)
print('\nOriginal types')

print('---------------')
print(phjTempDF.dtypes)

phjTempDF = phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'}).reset_index(drop=False)

print('\nAggregated dataframe')
print('--------------------')
print(phjzempdf)
print('\nPost-aggregation types')

print('-----------------------')
print(phjTempDF.dtypes)

# Introduce multiple missing values in the date column (one group contains all missing values)
# Introduce single missing values in the date column
print('\n\nDataframe with multiple missing dates values')
print('============================================')
phjTempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                          'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                          'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                          'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

phjTempDF = phjTempDF.sort_values(['gender','age','date'])

phjTempDF.ix[[1,2,5],'date'] = 'missing'

# Convert date to datetime64 format
phjTempDF['date'] = pd.to_datetime(phjTempDF['date'],errors='coerce')

print('\nWhole dataframe')
print('---------------')
print(phjzempdf)
print('\nOriginal types')#
print('---------------')
print(phjTempDF.dtypes)

phjTempDF = phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'}).reset_index(drop=False)

print('\nAggregated dataframe')
print('--------------------')
print(phjzempdf)
print('\nPost-aggregation types')#
print('-----------------------')
print(phjTempDF.dtypes)

Expected Output

The expected output would be for the dtypes in the dataframe after aggregation to be the same as those in the original dataframe.

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.4.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.0
nose: 1.3.7
pip: 1.5.6
setuptools: 3.6
Cython: None
numpy: 1.10.1
scipy: None
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: 2.3.0
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jreback
Copy link
Contributor

jreback commented Apr 7, 2016

So this works for a frame aggregation, but doesn't for the .agg, ok a bug.

pull-requests welcome.

In [32]:  phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).first()
Out[32]: 
                            date  id
gender age                          
female old                   NaT   2
       young 2015-12-05 14:19:00  11
male   old   2015-12-04 01:00:00   4
       young 2015-02-04 02:34:00   1

In [33]: phjTempDF.sort_values(['gender','age','id']).groupby(['gender','age']).agg({'date': 'first','id': 'first'})
Out[33]: 
                      date  id
gender age                    
female old             NaN   2
       young  1.449325e+18  11
male   old    1.449191e+18   4
       young  1.423017e+18   1

@jreback jreback added Bug Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Difficulty Intermediate labels Apr 7, 2016
@jreback jreback added this to the 0.18.1 milestone Apr 7, 2016
@facaiy
Copy link
Contributor

facaiy commented Apr 8, 2016

I'd like to take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
3 participants