Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 20927 fix resolves read_sas error for dates/datetimes greater than 2262-04-11 #28047

Merged
merged 43 commits into from
May 25, 2020
Merged

Issue 20927 fix resolves read_sas error for dates/datetimes greater than 2262-04-11 #28047

merged 43 commits into from
May 25, 2020

Conversation

paul-lilley
Copy link
Contributor

@mroeschke
Copy link
Member

Couple of thoughts:

  1. It's more idiomatic for pandas use Periods to work with out of range dates, so we may want to return period objects in this case: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#representing-out-of-bounds-spans

  2. We should probably raise a UserWarning nonetheless to warn the user that different data dtype is being returned in this case.

@mroeschke mroeschke added IO SAS SAS: read_sas Period Period data type Datetime Datetime data dtype labels Aug 21, 2019
@paul-lilley
Copy link
Contributor Author

Thanks for the thoughts @mroeschke

I'll add the UserWarning about returning different types.

As the code is at the moment, it will return datetime.date/datetime for all rows if there is at least one SAS date/time that is > 2262-04-11.
However, if read_sas is called as an iterator, chunks with all SAS date/time < 2262-04-11 will return with np.datetime64 and other chunks with at least one SAS date/time that is > 2262-04-11 will return with datetime.date/datetime - i.e. the caller will get a mix of datatypes.

I get some odd results for Periods with dates larger than 2262-04-11, e.g.

import pandas as pd
g = pd.Period(year=2262, month=4, day=11, freq='D')
print(g.start_time)
g2 = pd.Period(year=2262, month=4, day=12, freq='D')
print(g2.start_time)

gives
2262-04-11 00:00:00 which is correct, but then
1677-09-21 00:25:26.290448384 which is clearly wrong for 2262-04-12

This is maybe a bug in Period (?) but for the time being returning datetime.date/time gives a better representation of the original SAS dataset

@mroeschke
Copy link
Member

Could you open up a new issue for the Period overflowing? That looks like a bug.

datetime.date do not have a lot of support in pandas compared to datetime.datetime, so I am a little weary of returning those objects.

@paul-lilley
Copy link
Contributor Author

Ok, I'll raise a new issue for the Period overflow.

I'll also alter read_sas to return datetime.datetime for both dates and datetimes > 2262-04-11. There will still be the possibility of a mix of np.datetime64 and datetime.datetime types in the final dataset if read_sas is called as an iterator.

…s > pd.Timestamp.max

Also added test for iterator behaviour,
chunks that have all dates/datetimes < pd.Timestamp.max will return np.datetime64 as usual,
chunks with any date/datetime > pd.Timestamp.max will return datetime.datetime
@pep8speaks
Copy link

pep8speaks commented Aug 22, 2019

Hello @paul-lilley! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-25 20:25:35 UTC

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do our other parsers do for out of bounds datetimes?

doc/source/whatsnew/v0.25.1.rst Outdated Show resolved Hide resolved
pandas/io/sas/sas7bdat.py Outdated Show resolved Hide resolved
pandas/io/sas/sas7bdat.py Outdated Show resolved Hide resolved
doc/source/whatsnew/v0.25.1.rst Outdated Show resolved Hide resolved
@jreback
Copy link
Contributor

jreback commented Oct 6, 2019

can you merge master and update to comments

@paul-lilley
Copy link
Contributor Author

Hi, apologies for the delayed response. I'll try to address Tom's points, then merge master and update the comments.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

our default behavior of to_datetime is to convert to datetime when OOB
so i think it would be ok to just make this change w/o a keyword arg

pandas/io/sas/sas7bdat.py Outdated Show resolved Hide resolved
@paul-lilley
Copy link
Contributor Author

Hi @jreback
I've made the minor changes from @bashtage and merged master, can you take a look? Thanks.

@paul-lilley paul-lilley requested a review from jreback February 12, 2020 13:44
@WillAyd
Copy link
Member

WillAyd commented May 20, 2020

Looks like this one fell through but is close - @paul-lilley can you fix the merge conflict?

@paul-lilley
Copy link
Contributor Author

Hi @WillAyd - I've merged master (it went through without merge conflicts)

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noticed that the docstring does not quite align with the code.

pandas/io/sas/sas7bdat.py Outdated Show resolved Hide resolved
@paul-lilley
Copy link
Contributor Author

Hi @WillAyd , @bashtage I've altered the comment and merged master

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 minor comments, ping on green

pandas/io/sas/sas7bdat.py Show resolved Hide resolved
doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
@jreback jreback merged commit d75100d into pandas-dev:master May 25, 2020
@jreback
Copy link
Contributor

jreback commented May 25, 2020

thanks @paul-lilley very nice!

@jbrockmendel
Copy link
Member

@paul-lilley im trying to understand the comment on test_max_sas_date

    # NB. max datetime in SAS dataset is 31DEC9999:23:59:59.999
    #    but this is read as 29DEC9999:23:59:59.998993 by a buggy
    #    sas7bdat module

Does this mean that the test here is still wrong and it really should be coming back with datetime(9999, 12, 31, 23, 59, 59, 999000)?

@jbrockmendel
Copy link
Member

@bashtage do you have convenient access to SAS and willingness to help me track down #28047 (comment)

@paullilley
Copy link

Hi @jbrockmendel
Correct, it really should be coming back with datetime(9999, 12, 31, 23, 59, 59, 999000) as max datetime in SAS dataset is 31DEC9999:23:59:59.999 (ie. millisecs precision not microsecs).
However, I'd recommend moving all the SAS dataset reading to use https://github.com/Roche/pyreadstat (currently used in Pandas for SPSS data) as it is way faster

@bashtage
Copy link
Contributor

I don't have access to SAS easily, but @paullilley answered.

@paullilley
Copy link

I do have access to SAS if needed.

@jbrockmendel
Copy link
Member

Thanks @paul-lilley. When I read the file with pyreadstat it comes back as datetime(9999, 12, 29, 23, 59, 59), i.e. the day is 29 instead of 31 and the seconds are rounded down. Is there a chance the file was written with Dec 29 instead of Dec 31, or is there a problem that common to both the pandas and pyreadstat readers?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Datetime Datetime data dtype IO SAS SAS: read_sas Period Period data type
Projects
None yet
Development

Successfully merging this pull request may close these issues.

read_sas OverflowError: int too big to convert with '31DEC9999'd formatted as DATE9.
9 participants