BUG: Fix error when reading postgres table with timezone #7139 #7364
Conversation
I did not yet test your code, but one remark we may have to think about: you also have the […] So we could also do that if you specify […]
(This comment became really long; there are cliff notes at the bottom.)

Passing the column to […] The cause is if you have a […]

And this code converts all of the […] However, when the bug in #7139 was triggered, what happens is that the […] For the sake of uniformity, it probably makes sense for […]

Now the second part. I thought about somehow putting the UTC conversion information into […] For example, in some cases you want to specify the format of the datetime field you are parsing, and the only way to do that is via […] Granted, the above functionality is not currently possible because […] Frankly, the whole thing is probably going to be confusing to end users when different […] My personal opinion is that if […]

Cliff notes: […]
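For reference, a minimal sketch of the `parse_dates` pass-through this comment appears to be discussing (the connection string, query, and column names here are assumptions for illustration). The dict form of `parse_dates` forwards its values to `pandas.to_datetime`, either as a strftime format string or as a dict of keyword arguments, which is how per-column options like a format or `utc=True` can be given:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://localhost/mydb")  # assumed database

df = pd.read_sql_query(
    "SELECT * FROM events",  # assumed query
    engine,
    parse_dates={
        "created_at": {"utc": True},        # kwargs forwarded to to_datetime
        "updated_at": "%Y-%m-%d %H:%M:%S",  # or a strftime format string
    },
)
```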
(Just rebased on master so travis will re-run and fix that spurious failure - didn't make any changes)
Well actually I did have to make a minor update to […]
Just rebased on master, all tests still passing. Though there is a big wall of text up there, the bottom line is that taking this change will fix the issue, cause no new problems, and be perfectly compatible with future improvements or changes. I'm in no hurry to get it committed, but I think it is a pretty safe change as far as changes go.
@jorisvandenbossche ? prob push this
@danbirken Sorry for totally dropping the ball here. I'll try to look at it and test it with my case as soon as possible.
Force-pushed from db17ed6 to 6c98596
Alright, here is an updated version rebased on master (and updating the doc notes to milestone 0.15.0). I haven't done a detailed look to see if anything has changed in the past 3 months that would make this change no longer good, but I did a cursory glance and everything still seems to apply and be relevant and correct.
@danbirken OK, I got to look at this. Given your explanation, I am currently thinking: […]
But to conclude: by default, for now, I think we should do what […]
This was fixed back in October and is a real issue. Any chance it can get merged so we can get it in 0.15.3? Many thanks for the fix!
@phaefele if you look at the last comment by @jorisvandenbossche there is still an open issue on this.
The original thing is still broken. I just trimmed down the change to what I hope is the minimum viable change that fixes the original bug and the other thing that @jorisvandenbossche brought up, without adding anything to the read_sql_table* interfaces. It will be posted by the end of tonight; just waiting for tests to finish.
Force-pushed from 6c98596 to efbc460
There you go. Differences from the previous change: […]
So this still fixes the initial bug, doesn't clutter up all of the […]
Which essentially does what […]
you don't want to use […]
It doesn't work because this doesn't return a […]

So after thinking about it more, and based on what you said and that nice Datetime accessor interface, I think the best solution is just to convert everything to UTC and make the column a datetime64 column. This is a super simple change and is much simpler than even the current one. It would operate just like the json serializer (and probably the other ones too, I didn't check):
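The snippet itself was lost in transcription; presumably something along these lines (a reconstruction with assumed values, not the original code), where mixed-offset timestamps get normalized to UTC so the column can be held as a uniform datetime64 dtype:

```python
import pandas as pd

mixed = pd.Series([
    "2014-06-01 12:00:00-04:00",  # summer row, UTC-4 (EDT)
    "2014-12-01 12:00:00-05:00",  # winter row, UTC-5 (EST)
])

utc = pd.to_datetime(mixed, utc=True)  # both offsets collapse to UTC
print(utc.dtype)  # a single, uniform datetime64 dtype (UTC) in recent pandas
```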
So it takes your nice timezoned datetime, converts it to UTC, and then you are dealing with UTC from now on. This is sort of enforcing a best practice, but it is consistent, and clearly somebody already decided that it was the right approach for json. So I agree with that person and think this is the right approach for sql serialization/deserialization. It probably will break backwards compatibility for some rare people, but given this bug prevents a lot of people from using this feature, it probably isn't that big a deal.

or...

The current changelist I think is the "minimum viable change": it fixes the bug, disturbs the least amount of other stuff, and has the sanest fallback.
Just ran into this issue, would like to see it merged too ;)
P.S. for people looking for a partial workaround (since it took me hours to figure this out):
I had TZ-aware datetimes in the DB with TZ=UTC, but they were read back with the Postgres client's default TZ offset. Since this adjusted for local DST, the resulting objects had multiple TZ offsets, triggering the bug.
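One workaround consistent with that description (my assumption, not the poster's exact recipe) is to pin the client session to UTC so every row comes back with the same offset; with SQLAlchemy and psycopg2 that can be done through `connect_args`:

```python
import pandas as pd
from sqlalchemy import create_engine

# Force the postgres session time zone to UTC so all rows share one offset
# (connection string and table name are assumed).
engine = create_engine(
    "postgresql://localhost/mydb",
    connect_args={"options": "-c timezone=utc"},
)
df = pd.read_sql_table("events", engine)
```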
Force-pushed from efbc460 to 2e1233b
Alright, I've scrapped my previous work and coded what I argued in my last post and what I believe is the correct change. It also is the simplest change. I believe the correct behavior here is to convert […]

If pandas had better native support for a Series containing timestamps with different time zones it would be a different story, but pandas does not currently. Re-writing […]

This is a BC break, but I don't think it will have a huge impact because a) it only affects people importing tables from SQL, not any core pandas functionality, and b) this feature is currently broken so it probably isn't in wide usage. Additionally, if this did break your code, the fix is very simple:
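The fix snippet was lost in transcription; presumably something like this (column name and target zone are assumed), localizing the now-UTC values and converting them back to the zone you want:

```python
# df comes from read_sql_table(); after this change the column holds UTC values.
df["created_at"] = (
    df["created_at"]
    .dt.tz_localize("UTC")        # mark the naive UTC values as UTC
    .dt.tz_convert("US/Eastern")  # then view them in the original zone
)
```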
@danbirken Thanks a lot! So to summarize: […]
That the […]

I was also wondering whether it wouldn't be an option to keep it as is (object dtype with Timestamps with a fixed offset), and fix it so that the case that errors now (different offsets) also returns that. But this would make the fix more complex, I think. And if you want to work with such data (with a FixedOffset), you will almost always first have to convert it to UTC before you can do something with it.

So to conclude my thoughts: I think this is fine. Maybe just add a note about this in the docs?
Just to clarify: is the fundamental issue that neither Series, DatetimeIndexes, nor np.datetime64 can correctly handle arrays of datetimes with multiple time zones? I think it would make sense in that case that if you pass […] This way you always have a "high fidelity" option that simply avoids this incompatibility. At the moment (at least the last time I checked), passing […]
So what you get back from the sql query is a list of values like:
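The example values were lost in transcription; presumably values of roughly this shape (assumed, for illustration), where the driver attaches a fixed-offset tzinfo per row, so rows on either side of a DST switch carry different offsets:

```python
import datetime
from psycopg2.tz import FixedOffsetTimezone  # psycopg2 2.x's fixed-offset tzinfo
                                             # (newer drivers use datetime.timezone)

rows = [
    datetime.datetime(2014, 6, 1, 12, 0, tzinfo=FixedOffsetTimezone(offset=-240)),   # UTC-4 (EDT)
    datetime.datetime(2014, 12, 1, 12, 0, tzinfo=FixedOffsetTimezone(offset=-300)),  # UTC-5 (EST)
]
```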
So if we want to preserve the exact information of the database, the options are to store it as object columns with datetime.datetime or Timestamp values. A series with […] For now, […]
As far as my knowledge goes, I agree with everything in the most recent comment.
My first change, like 9 months ago, did this (though with UTC conversion not the default). I think this is wrong because pandas right now is just not built to support a Series of Timestamps which have different time zones. I think people should be forced, willingly or not, into UTC, because that is what pandas supports. The deeper I go into the rabbit hole, the more incompatibilities I find. For example: https://github.com/pydata/pandas/blob/master/pandas/tseries/index.py#L310. If you access a Datetime-like Series with different time zones via the DatetimeProperties accessor, pandas will just convert it to UTC anyway, without warning.
The problem is that the default situation stores them as an […] And of course, if anybody wants the timezone information added, it is very easy to do that via […]
Force-pushed from 2e1233b to 8b4eb76
I did just make one minor edit where I added a "Notes" section to the […]
@danbirken fully agree. One comment: I think the note you added is not fully correct for […]
`read_sql_table()` will break if it reads a table with a `timestamp with time zone` column if individual rows within that column have different time zones. This is very common due to daylight saving time. Pandas right now does not have good support for a Series containing datetimes with different time zones (hence this bug). So this change simply converts a `timestamp with time zone` column into UTC during import, which pandas has great support for.
Force-pushed from 8b4eb76 to 5cc35f4
Good point, I updated the documentation for both […] I don't think this change should do anything about […]
So I think it will just lead to more confusion. My opinion is that when a great hero comes along and either creates a datetime-with-tz storage system for a pandas Series or decides time zones aren't supported in pandas, that will fix the permanent issue of the […]
There was in the meantime a conflict in the whatsnew file, so I merged as c3eeb57 (so you didn't have to rebase again ...). @danbirken Thank you very much for fixing this, and for your endurance, as this took a bit longer than necessary. But in the end, a good fix was put in!
👍
Yes - thank you for addressing that!
Closes #7139, merged via c3eeb57
Fixes an issue where read_sql_table() will throw an error if it is reading a postgres table with `timestamp with time zone` fields that contain entries with different time zones (such as for DST).

This also adds a new keyword, convert_dates_to_utc, which optionally allows the caller to convert `timestamp with time zone` datetimes into UTC, which allows pandas to store them internally as `numpy.datetime64`s, allowing for more efficient storage and operation speed on these columns.