Rounding errors with Timestamps and .last() #19526

Closed
jbandlow opened this issue Feb 3, 2018 · 5 comments
Labels
Dtype Conversions (Unexpected or buggy dtype conversions), Groupby

@jbandlow
Contributor

jbandlow commented Feb 3, 2018

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd 
In [3]: ts = pd.Timestamp('2016-10-14 21:00:44.557')                                   
In [4]: pd.DataFrame({'a': range(2), 'b': [ts, pd.NaT]}).groupby('a').b.last()         
Out[4]:                                                                                
a                                                                                      
0   2016-10-14 21:00:44.556999936                                                      
1                             NaT                                                      
Name: b, dtype: datetime64[ns]   

Problem description

The value of the Timestamp in the output is not equal to the value in the input.

Expected Output

Out[4]:                                                                                
a                                                                                      
0   2016-10-14 21:00:44.557                                                            
1                       NaT                                                            
Name: b, dtype: datetime64[ns]

The issue is a bit subtle. Here are three related examples, all of which work as expected:

I haven't been able to repro with .nth(-1) in place of last():

In [1]: import pandas as pd 
In [3]: ts = pd.Timestamp('2016-10-14 21:00:44.557') 
In [5]: pd.DataFrame({'a': range(2), 'b': [ts, pd.NaT]}).groupby('a').b.nth(-1)        
Out[5]:                                                                                
a                                                                                      
0   2016-10-14 21:00:44.557                                                            
1                       NaT                                                            
Name: b, dtype: datetime64[ns]

I haven't been able to repro if NaT is not present:

In [6]: pd.DataFrame({'a': range(2), 'b': [ts, ts]}).groupby('a').b.last()
Out[6]: 
a
0   2016-10-14 21:00:44.557
1   2016-10-14 21:00:44.557
Name: b, dtype: datetime64[ns]

I haven't been able to repro for times which are integer seconds:

In [7]: ts = pd.Timestamp('2016-10-14 21:00:44')
In [8]: pd.DataFrame({'a': range(2), 'b': [ts, pd.NaT]}).groupby('a').b.last()
Out[8]: 
a
0   2016-10-14 21:00:44
1                   NaT
Name: b, dtype: datetime64[ns]

Output of pd.show_versions()

INSTALLED VERSIONS                                                                     
------------------                                                                     
commit: None                                                                           
python: 3.6.3.final.0                                                                  
python-bits: 64                                                                        
OS: Linux                                                                              
OS-release: 4.13.0-32-generic                                                          
machine: x86_64                                                                        
processor: x86_64                                                                      
byteorder: little                                                                      
LC_ALL: None                                                                           
LANG: en_US.UTF-8                                                                      
LOCALE: en_US.UTF-8                                                                    

pandas: 0.22.0                                                                         
pytest: 3.2.1                                                                          
pip: 9.0.1                                                                             
setuptools: 38.2.4                                                                     
Cython: 0.26.1                                                                         
numpy: 1.14.0                                                                          
scipy: 0.19.1                                                                          
pyarrow: None                                                                          
xarray: None                                                                           
IPython: 6.1.0                                                                         
sphinx: 1.6.3                                                                          
patsy: 0.4.1                                                                           
dateutil: 2.6.1                                                                        
pytz: 2017.3                                                                           
blosc: None                                                                            
bottleneck: 1.2.1                                                                      
tables: 3.4.2                                                                          
numexpr: 2.6.2                                                                         
feather: None                                                                          
matplotlib: 2.1.0                                                                      
openpyxl: 2.4.8                                                                        
xlrd: 1.1.0                                                                            
xlwt: 1.3.0                                                                            
xlsxwriter: 1.0.2                                                                      
lxml: 4.1.0                                                                            
bs4: 4.6.0                                                                             
html5lib: 0.9999999                                                                    
sqlalchemy: 1.1.13                                                                     
pymysql: None                                                                          
psycopg2: None                                                                         
jinja2: 2.9.6                                                                          
s3fs: None                                                                             
fastparquet: None                                                                      
pandas_gbq: None                                                                       
pandas_datareader: None

@jbrockmendel
Member

Can you reproduce without the groupby step?

@jbandlow
Contributor Author

jbandlow commented Feb 4, 2018

I can't repro without groupby. I did a little more experimentation, and it appears that the aggregation functions that have the issue are exactly these: max, min, first, and last.
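A quick way to check all four aggregations at once is sketched below. It reuses the DataFrame from the original report; the `preserved` dictionary is just an illustrative name. On a pandas version that includes the fix, every entry would be expected to be True, while on 0.22.0 the report above suggests they would be False.

```python
import pandas as pd

ts = pd.Timestamp("2016-10-14 21:00:44.557")
df = pd.DataFrame({"a": range(2), "b": [ts, pd.NaT]})
grouped = df.groupby("a")["b"]

# For each aggregation named in the comment above, check whether the
# timestamp in group 0 comes back unchanged.
preserved = {
    name: bool(getattr(grouped, name)().loc[0] == ts)
    for name in ("max", "min", "first", "last")
}
print(preserved)
```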

@jreback
Contributor

jreback commented Feb 4, 2018

this should not be going through a float round trip, which is the symptom you are seeing.
feel free to take a more detailed look and see where the issue is
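The float round trip can be illustrated without pandas. The sketch below assumes only that a `datetime64[ns]` value is stored as an integer count of nanoseconds since the Unix epoch; the helper names `to_ns` and `float_round_trip` are made up for this example. A float64 has a 53-bit significand, so near 1.5e18 it can only represent multiples of 256, and most millisecond-precision values do not survive the conversion.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(dt, millis=0):
    """Integer nanoseconds since the Unix epoch, built with integer arithmetic only."""
    whole_seconds = (dt - EPOCH) // timedelta(seconds=1)
    return whole_seconds * 10**9 + millis * 10**6


def float_round_trip(ns):
    """Convert an integer to float64 and back, as a lossy cast would."""
    return int(float(ns))


moment = datetime(2016, 10, 14, 21, 0, 44, tzinfo=timezone.utc)
with_millis = to_ns(moment, millis=557)   # 21:00:44.557
whole_second = to_ns(moment)              # 21:00:44

print(with_millis - float_round_trip(with_millis))    # 64
print(whole_second - float_round_trip(whole_second))  # 0
```

The 64 ns loss matches the `...44.556999936` shown in the report, and the whole-second value happens to be a multiple of 256, which is consistent with the observation that integer-second timestamps did not reproduce the problem.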

@jbandlow
Contributor Author

jbandlow commented Feb 4, 2018

Thanks for the hint @jreback . The offending block seems to be here where the integer result is getting cast as float. The coercion back to a datetime happens in here, and from what I can tell, that "does the right thing" with iNaT.
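To make the difference concrete, here is a pure-Python sketch of the two approaches. It is not the pandas code path: the function names are hypothetical, and the only assumption taken from pandas is that iNaT is the minimum int64 value. One version marks missing entries as NaN in float space and converts back; the other keeps integers throughout and compares against the sentinel directly.

```python
import math

INAT = -2**63  # assumed sentinel: the minimum int64 value, used to mark NaT


def last_valid_via_float(values):
    """Lossy sketch: represent missing entries as NaN, which forces float64."""
    floats = [math.nan if v == INAT else float(v) for v in values]
    valid = [f for f in floats if not math.isnan(f)]
    return int(valid[-1]) if valid else INAT


def last_valid_int(values):
    """Lossless sketch: stay in integer space and test for the sentinel."""
    valid = [v for v in values if v != INAT]
    return valid[-1] if valid else INAT


ns = 1476478844557000000  # 2016-10-14 21:00:44.557 as nanoseconds since the epoch

print(last_valid_via_float([ns]))  # 1476478844556999936
print(last_valid_int([ns]))        # 1476478844557000000
print(last_valid_int([INAT]) == INAT)
```

Both versions return the sentinel for an all-missing group, so the integer path does not need NaN to express a missing result.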

This is my first look at pandas internals, but if the fix is likely to be just ripping out those lines and adding a test, I can probably manage to put a PR together.

@jreback
Contributor

jreback commented Feb 4, 2018

it’s not likely to be ‘ripping it out’ but pls have a look

@jreback added the Groupby, Dtype Conversions (Unexpected or buggy dtype conversions), and Difficulty Intermediate labels Feb 4, 2018
@jreback added this to the 0.23.0 milestone Feb 4, 2018
@jreback closed this as completed in 983d71f Feb 6, 2018
harisbal pushed a commit to harisbal/pandas that referenced this issue Feb 28, 2018
closes pandas-dev#19526

Author: Jason Bandlow

Closes pandas-dev#19530 from jbandlow/timestamp_float_conversion and squashes the following commits:

2fb23d6 [Jason Bandlow] merge
af37225 [Jason Bandlow] BUG: Fix ts precision issue with groupby and NaT (pandas-dev#19526)