
BUG: Fix error when reading postgres table with timezone #7139 #7364

Closed

Conversation

@danbirken (Contributor)

Closes #7139, merged via c3eeb57

Fixes an issue where read_sql_table() will throw an error if it is
reading a postgres table with timestamp with time zone fields that
contain entries with different time zones (such as for DST).

This also adds a new keyword, convert_dates_to_utc, which optionally
allows the caller to convert `timestamp with time zone` datetimes into
UTC, letting pandas store them internally as numpy.datetime64 values
for more efficient storage and faster operations on these columns.
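
A sketch of the proposed interface (the table name and engine are illustrative; convert_dates_to_utc is the keyword this PR adds, default False):

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql://localhost/mydb')  # illustrative

# Default behavior: tz-aware values no longer raise, even with mixed offsets
df = pd.read_sql_table('my_table', engine)

# Opt in to UTC conversion to get a plain datetime64[ns] column
df = pd.read_sql_table('my_table', engine, convert_dates_to_utc=True)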

@jorisvandenbossche (Member)

I did not yet test your code, but one remark we may have to think about:

You also have the parse_dates keyword argument. At this moment, if I have a column with timestamp with time zone, it returns datetime.datetime with some psycopg2.tz.FixedOffsetTimezone.
However, if I specify to parse this as a datetime with parse_dates='col_name', I also get a column of object dtype but filled with pd.Timestamp values with the same timezone, which is also something odd (maybe this changed with your new code).

So we could also say that if you specify parse_dates, the column is in that case converted to datetime64? (with the consequence that it is converted to UTC, as datetime64 does not really support timezones?)

@danbirken (Contributor, Author)

(this comment became really long and there are cliff notes at the bottom)

Passing the column to parse_dates doesn't change the behavior, but you are right that there are two code paths here that can result in the column either being full of datetime.datetime or full of Timestamp values. However, Timestamp and datetime.datetime are essentially equivalent (Timestamp is a subclass of datetime.datetime), so this is weird but not a problem.

The cause is if you have a timestamp with time zone column and it contains all timestamps with the same time zone, then this is the code that is generating the final column:

values, tz = tslib.datetime_to_datetime64(arg)
return DatetimeIndex._simple_new(values, None, tz=tz)

And this code converts all of the datetime.datetimes into datetime64[ns] and extracts the timezone information, and then when it gets turned back into the DatetimeIndex, the timezone is put back and somehow all of the values become pd.Timestamps that are equivalent to the initial datetime.datetimes (with tzinfo). This has always worked, and my change does nothing to this existing behavior.

However, when the bug in #7139 was triggered, what happens is that the tslib.datetime_to_datetime64(arg) call fails (because it can't extract a uniform timezone offset for all the values), so my change to the code at this point just returns the column back in its initial form, which is a list of datetime.datetime values that were coming from the postgres database.
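
A minimal sketch of the failing input (the dates and offsets are made up; psycopg2 returns similar fixed-offset values across a DST boundary):

import datetime
import pytz
import pandas as pd

# Two entries whose UTC offsets differ, as postgres can return across a DST boundary
vals = [
    datetime.datetime(2013, 3, 9, tzinfo=pytz.FixedOffset(-300)),   # UTC-05:00
    datetime.datetime(2013, 3, 11, tzinfo=pytz.FixedOffset(-240)),  # UTC-04:00
]

# Before the fix: tslib.datetime_to_datetime64 cannot extract one uniform
# offset, so this raised.  With the change: the values come back as-is,
# as an object array of the original datetime.datetimes.
pd.to_datetime(vals)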

For the sake of uniformity, it probably makes sense for to_datetime to convert these datetime.datetime values into Timestamp values in this case, because it certainly is able to do so. I didn't do this in my change for the sake of simplicity, but I easily could.


Now the second part. I thought about somehow putting the UTC conversion information into parse_dates, but I think having a separate field is better.

For example, in some cases you want to specify the format of the datetime field you are parsing, and the only way to do that is via parse_dates, i.e. `parse_dates={'custom-dt': '%Y-%m-%dT%H:%M:%S.%f %Z'}`, and then you might additionally want the choice of whether to preserve the time zone information or instead convert to UTC.

Granted, the above functionality is not currently possible, because to_datetime with a format does not currently do anything with timezone information (I think), but that is something that is certainly possible to add in the future, so I think this is the most flexible interface.
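
For illustration, the hypothetical combination argued for here (neither the format-aware timezone parsing nor this exact call is current pandas behavior; 'events' and 'custom-dt' are made up):

# Hypothetical: parse with a format via parse_dates, and independently choose
# whether to convert to UTC via the separate keyword.
df = pd.read_sql_table('events', engine,
                       parse_dates={'custom-dt': '%Y-%m-%dT%H:%M:%S.%f %Z'},
                       convert_dates_to_utc=True)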


Frankly, the whole thing is probably going to be confusing to end users when different datetime fields become different things and nobody knows why. My change is designed to fix the issue without changing a single other current behavior of pandas, but it does nothing to fix the confusion.

My personal opinion is that if pandas is given a column it can convert into a datetime64[ns] it should do that. If that means converting to UTC, then so be it. For the people that don't want that, I think you could have a way to disable that behavior (like having convert_dates_to_utc=True be the default, and letting people opt out of it if they want). This will result in the most uniformity, because pandas is really good at converting datetime-like things into datetime64s if you let it. However, this would be a BC-break, and I have no idea how popular the use cases are out there that rely on timezones staying intact.


Cliff notes:

  1. I can make to_datetime also return pd.Timestamp values in the special bug case from the original issue. I didn't do it for simplicity, but I could, and (I think) it shouldn't make much of a difference since datetime.datetime and Timestamp are both datetime.datetimes.

  2. I think having a separate field outside of parse_dates for determining whether or not to convert stuff to UTC is better.

  3. If we are willing to make ideological changes and BC breaks, I think converting everything to UTC by default and letting people opt out of it will result in less end-user confusion and is better than the current situation (basically just flipping convert_dates_to_utc=False to convert_dates_to_utc=True in this change).

@danbirken (Contributor, Author)

(Just rebased on master so travis will re-run and fix that spurious failure - didn't make any changes)

@danbirken (Contributor, Author)

Well actually I did have to make a minor update to test_timeseries.py because some weird timezone offset issue was causing it to fail on the python 3.4 version, but all is well now.

@danbirken (Contributor, Author)

Just rebased on master, all tests still passing. Though there is a big wall of text up there, the short version is: taking this change will fix the issue, cause no new problems, and be perfectly compatible with future improvements or changes.

I'm in no hurry to get it committed, but I think it is a pretty safe change as far as changes go.

@jreback added the SQL label on Jul 7, 2014
@jreback added this to the 0.15.0 milestone on Jul 7, 2014
@jreback (Contributor) commented on Oct 2, 2014

@jorisvandenbossche ? prob push this

@jorisvandenbossche modified the milestones: 0.15.1, 0.15.0 on Oct 2, 2014
@jorisvandenbossche (Member)

@danbirken Sorry for totally dropping the ball here. I'll try to look at it and test it with my case as soon as possible.

@danbirken force-pushed the psql-dt-with-time-zone-fix branch from db17ed6 to 6c98596 on October 3, 2014 00:40
@danbirken (Contributor, Author)

Alright well here is an updated version rebased on master (and updating the doc notes to milestone 0.15.0).

I haven't done a detailed look to see if anything has changed in the past 3 months that would make this change no longer good, but I did a cursory glance and everything still seems to apply and be relevant and correct.

@jorisvandenbossche (Member)

@danbirken OK, I got to look at this. Given your explanation, I am currently thinking:

  • just preventing that an error occurs seems simple: the return would then be an object column of Timestamps or datetime.datetimes (as with read_sql_query).
  • on adding the convert_dates_to_utc I am not fully sure; I think we should also find another name (maybe just utc, as in to_datetime)

But to conclude: by default, for now, I think we should do what to_datetime does when it gets such data, which is returning an object series with Timestamps.
This would be a rather simple fix, and maybe we can already put that in quickly (for 0.15, so at least the error is fixed)? And think a bit more about whether we want to add a keyword to do the conversion to UTC automatically, or whether we just say the user should do tz_convert('UTC') themselves.

@jreback modified the milestones: 0.16.0, Next Major Release on Mar 2, 2015
@phaefele commented on Mar 3, 2015

This was fixed back in October and is a real issue. Any chance it can get merged so we can get it in 0.15.3? Many thanks for the fix!

@jreback (Contributor) commented on Mar 4, 2015

@phaefele if you look at the last comment by @jorisvandenbossche there is still an open issue on this.
You are welcome to address.

@danbirken (Contributor, Author)

The original bug is still there. I just trimmed down the change to what I hope is the minimum viable change that fixes the original bug and the other thing that @jorisvandenbossche brought up, without adding anything to the read_sql_table* interfaces.

Will be posted by end of tonight, just waiting for tests to finish.

@danbirken force-pushed the psql-dt-with-time-zone-fix branch from 6c98596 to efbc460 on March 4, 2015 01:31
@danbirken (Contributor, Author)

There you go.

Differences from previous change:

  • Got rid of all of the convert_dates_to_utc for the SQL functions
  • Made to_datetime turn the datetime.datetimes into Timestamps if possible

So this still fixes the initial bug, doesn't clutter up all of the read_sql_* functions, and allows this pretty convenient syntax:

# Assume we've imported a df from a sql table, and the
# 'dt' column was a `timestamp with time zone` column
In [1]: df
Out[1]:
    id                         dt
0   39  2011-01-01 00:00:00-08:00
1   40  2011-01-01 00:00:00-08:00
2   41  2011-01-01 00:00:00-08:00

In [2]: df['dt64'] = df['dt'].apply(lambda x: x.to_datetime64())

In [3]: df
Out[3]:
    id                         dt                dt64
0   39  2011-01-01 00:00:00-08:00 2011-01-01 08:00:00
1   40  2011-01-01 00:00:00-08:00 2011-01-01 08:00:00
2   41  2011-01-01 00:00:00-08:00 2011-01-01 08:00:00

Which essentially does what convert_dates_to_utc did. The only potential remote backwards-compatibility issue I can think of is that this will convert anything to a Timestamp that can possibly be converted (like a datetime.date), but I think that is better anyway for uniformity's sake, and there is no data loss.

@jreback (Contributor) commented on Mar 4, 2015

you don't want to use .apply on a datetimelike; rather, use the vectorized solutions (also, to_datetime64() converts to a numpy datetime64, which doesn't handle time zones correctly, e.g. it sets them to your local time zone, which is weird)

In [1]: df = DataFrame({'date' : pd.date_range('20130101',periods=5,tz='CET')})

In [2]: df
Out[2]: 
                        date
0  2013-01-01 00:00:00+01:00
1  2013-01-02 00:00:00+01:00
2  2013-01-03 00:00:00+01:00
3  2013-01-04 00:00:00+01:00
4  2013-01-05 00:00:00+01:00

In [3]: df['date'].dt.tz_localize(None)
Out[3]: 
0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
4   2013-01-05
dtype: datetime64[ns]

In [4]: df['date'].dt.tz_convert('EST')
Out[4]: 
0    2012-12-31 18:00:00-05:00
1    2013-01-01 18:00:00-05:00
2    2013-01-02 18:00:00-05:00
3    2013-01-03 18:00:00-05:00
4    2013-01-04 18:00:00-05:00
dtype: object

In [5]: df['date'].dt.tz_convert('EST').dt.tz_localize(None)
Out[5]: 
0   2012-12-31 18:00:00
1   2013-01-01 18:00:00
2   2013-01-02 18:00:00
3   2013-01-03 18:00:00
4   2013-01-04 18:00:00
dtype: datetime64[ns]

@danbirken (Contributor, Author)

It doesn't work because this doesn't return a DatetimeIndex; it returns just an np.array() full of Timestamps. In this special case you can't return a DatetimeIndex, because that only works if the timezone is uniform across all of the values, which it isn't here. This is why this change is tricky. If your database column happens to have all its timestamps in the same time zone, you get a DatetimeIndex, life is good, and everything works. But if you happen to have different timezones, then a DatetimeIndex is out, so you have a bunch of weird options.
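
A small sketch of that constraint (the values are made up; the uniform case infers a single tz, the mixed case has no single tz to attach):

import datetime
import pytz
import pandas as pd

uniform = [datetime.datetime(2011, 1, 1, tzinfo=pytz.FixedOffset(-480))] * 2
mixed = [datetime.datetime(2011, 1, 1, tzinfo=pytz.FixedOffset(-480)),
         datetime.datetime(2011, 6, 1, tzinfo=pytz.FixedOffset(-420))]

pd.DatetimeIndex(uniform)  # fine: the single offset becomes the index's tz
pd.DatetimeIndex(mixed)    # fails: there is no single tz to attach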

So after thinking about it more and based on what you said and that nice Datetime accessor interface, I think the best solution is just to convert everything to UTC and make the column a datetime64 column. This is a super simple change and is much simpler than even the current one. It would operate just like the json serializer (and probably the other ones too, I didn't check):

In [19]: df = pd.DataFrame({'date' : pd.date_range('20130309',periods=4,tz='US/Eastern')})

In [20]: df
Out[20]:
                        date
0  2013-03-09 00:00:00-05:00
1  2013-03-10 00:00:00-05:00
2  2013-03-11 00:00:00-04:00
3  2013-03-12 00:00:00-04:00

In [21]: df['date']
Out[21]:
0    2013-03-09 00:00:00-05:00
1    2013-03-10 00:00:00-05:00
2    2013-03-11 00:00:00-04:00
3    2013-03-12 00:00:00-04:00
Name: date, dtype: object

In [22]: df2 = pd.read_json(df.to_json())

In [23]: df2
Out[23]:
                 date
0 2013-03-09 05:00:00
1 2013-03-10 05:00:00
2 2013-03-11 04:00:00
3 2013-03-12 04:00:00

In [24]: df2['date']
Out[24]:
0   2013-03-09 05:00:00
1   2013-03-10 05:00:00
2   2013-03-11 04:00:00
3   2013-03-12 04:00:00
Name: date, dtype: datetime64[ns]

So it takes your nice timezoned datetime, converts to UTC, and then you are dealing with UTC from now on. This is sort of enforcing a best practice, but it is consistent and clearly somebody already decided that it was the right approach for json. So I agree with that person and think this is the right approach for sql serialization/deserialization. It probably will break backwards compatibility for some rare people, but given this bug prevents a lot of people from using this feature, it probably isn't that big a deal.

or...

The current changelist, I think, is the "minimum viable change": it fixes the bug, disturbs the least amount of other stuff, and has the sanest fallback.
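
For reference, the user-side equivalent of that UTC conversion, as a sketch (utc=True is existing to_datetime behavior; the tz_localize(None) step assumes a pandas version where the intermediate result is a tz-aware datetime64 column):

# Convert an object column of tz-aware values to uniform UTC, then drop the
# tz to get the plain datetime64[ns] column seen in the json round-trip above.
utc = pd.to_datetime(df['date'], utc=True)
df['date'] = utc.dt.tz_localize(None)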

@mangecoeur (Contributor)

Just ran into this issue, would like to see it merged too ;)

@mangecoeur (Contributor)

P.S. for people looking for a partial workaround (since it took me hours to figure this out):

  • you can set the default client connection timezone to UTC in postgresql.conf (the location of this file varies by OS). This can help in certain cases:

I had TZ-aware datetimes in the DB with TZ=UTC, but they were read with the Postgres default client TZ offset. Since this adjusts for local DST, the resulting objects had multiple TZ offsets, triggering the bug.
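
A per-connection variant of that workaround, as a sketch (assumes SQLAlchemy with the psycopg2 driver, where libpq options can set the session timezone; the URL is illustrative):

from sqlalchemy import create_engine

# Force this connection's session timezone to UTC instead of editing
# postgresql.conf, so tz-aware columns come back with a single UTC offset.
engine = create_engine(
    'postgresql+psycopg2://user:pass@localhost/mydb',
    connect_args={'options': '-c timezone=UTC'},
)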

@danbirken force-pushed the psql-dt-with-time-zone-fix branch from efbc460 to 2e1233b on March 24, 2015 02:09
@danbirken (Contributor, Author)

Alright I've scrapped my previous work and coded what I argued in my last post and what I believe is the correct change. It also is the simplest change. I believe the correct behavior here is to convert timestamp with time zone columns into UTC, at which point all pandas functionality works as expected.

If pandas had better native support for a Series containing timestamps with different time zones it would be a different story, but pandas does not currently. Re-writing DatetimeIndex to support this would be an unknown amount of work, it would make it less efficient, and probably is a feature that very few people care about. It also would be a complicated change that touches a lot of stuff. This is a simple change which touches very few things.

This is a BC break, but I don't think it will have a huge impact because a) it only affects people importing tables from SQL, not any core pandas functionality and b) this feature is currently broken so it probably isn't in wide usage. Additionally, if this did break your code the fix is very simple:

In [1]: df = pd.read_sql_table('testing', engine)

In [2]: df
Out[2]:
   id                  dt
0   1 2011-01-01 08:00:00
1   2 2011-06-01 07:00:00

# Oh no, my timezones!  Oh wait, I can easily get them back.  And remember, if my
# datetimes had different timezones then this feature would have never worked in the first
# place so there is no BC break.

In [3]: df['dt'] = df['dt'].dt.tz_localize('UTC').dt.tz_convert('US/Pacific')

In [4]: df
Out[4]:
   id                         dt
0   1  2011-01-01 00:00:00-08:00
1   2  2011-06-01 00:00:00-07:00

@jorisvandenbossche modified the milestones: 0.16.1, Next Major Release on Mar 24, 2015
@jorisvandenbossche (Member)

@danbirken Thanks a lot!

So to summarize:

  • Previously, you got:
    • with a column of TIMESTAMP WITH TIMEZONE type, it resulted in a timestamp with timezone information like tz=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None).
    • but when having different offsets: error (the issue being fixed here)
  • Behaviour after this fix:
    • In both cases, the values are converted to UTC, resulting in a column of datetime64 values without timezone information.

That the psycopg2.tz.FixedOffsetTimezone is gone is indeed a backwards incompatible change, but maybe not that bad? (as this is not very handy timezone information anyway). The only question I have is whether we maybe should explicitly say it is UTC (so store the values as Timestamps with UTC timezone).

I was also thinking about whether it wouldn't be an option to keep it as is (object dtype with Timestamps with fixed offset), and fix it so that the error case (different offsets) now also returns that. But this would make the fix more complex, I think? And if you want to work with such data (with a FixedOffset), you will almost always first have to convert it to UTC before you can do something with it.

So to conclude my thoughts, I think this is fine. Maybe just add a note about this in the docs?

@mangecoeur (Contributor)

Just to clarify: is the fundamental issue that neither Series, DatetimeIndexes, nor np.datetime64 can correctly handle arrays of datetimes with multiple time zones?

I think it would make sense in that case that if you pass parse_dates=False, then no SQL TIMESTAMP to datetime conversion should be performed, so you just get an array of Python datetime objects as returned by the DB driver.

This way you always have a "high fidelity" option that simply avoids this incompatibility. At the moment (at least the last time I checked) passing parse_dates=False means you don't attempt to parse any datetime-like columns (such as text or integer time stamps), but SQL TIMESTAMP columns are still passed through pd.to_datetime, which means you end up dealing with these issues...
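
A sketch of the suggested (hypothetical, not currently implemented) behavior:

# Hypothetical: parse_dates=False would skip pd.to_datetime entirely and keep
# whatever the DB driver returned ('events' and the column name are made up).
df = pd.read_sql_table('events', engine, parse_dates=False)
df['ts'].iloc[0]
# -> datetime.datetime(2012, 1, 1, 0, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=-480, name=None))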

@jorisvandenbossche (Member)

So what you get back from the sql query are a list of values like dt.datetime(2012, 1, 1, 0, 0, tzinfo=pytz.FixedOffset(-480)). But you are correct:

  • datetime64 dtype cannot hold timezone information (although it does display one, that is just a conversion to the system timezone on display)
  • so a Series with datetime64 dtype cannot hold timezone info either
  • a Series can have object dtype to hold Timestamp or datetime.datetime values with a timezone
  • a DatetimeIndex has to have one timezone (and the different offsets are often actually one timezone if expressed with a name like 'CET', but expressed as FixedOffsets they are different timezones)

So if we want to preserve the exact information of the database, the options are to store it as object columns with datetime.datetime or Timestamp values. A series with datetime.datetime values is actually what you get back from read_sql_query when having datetimes with timezones (one timezone or multiple, that does not matter in this case).

For now, parse_dates is only used to specify to parse a specific column to datetimes (which is typically used for parsing string columns), and it is a bit independent of harmonizing the column types of the resulting dataframe with the sql types in the database (the _harmonize_columns method).
But I do like the suggestion of using parse_dates=False for this (this option was not used up to now; it was just converted to []).

@danbirken (Contributor, Author)

As far as my knowledge goes, I agree with everything in the most recent comment.

I was also thinking about whether it wouldn't be an option to keep it as is (object dtype with Timestamps with fixed offset), and fix it so that the error case (different offsets) now also returns that. But this would make the fix more complex, I think? And if you want to work with such data (with a FixedOffset), you will almost always first have to convert it to UTC before you can do something with it.

My first change like 9 months ago did this (though with UTC conversion not the default). I think this is wrong because pandas is right now just not built to support a Series of Timestamps which have different time zones. I think people should be forced, willingly or not, into UTC because that is what pandas supports. The deeper I go into the rabbit hole the more incompatibilities I find.

For example: https://github.com/pydata/pandas/blob/master/pandas/tseries/index.py#L310

If you access a Datetime-like Series with different time zones via the DatetimeProperties accessor, pandas will just convert it to UTC anyways without warning.

In [6]: df = pandas.DataFrame()

In [7]: df['dt'] = [datetime.datetime.now(tz=pytz.UTC), datetime.datetime.now(tz=pytz.timezone('US/Eastern'))]

In [8]: df['dt']
Out[8]:
0    2015-03-24 19:00:55.496832+00:00
1    2015-03-24 15:00:55.496860-04:00
Name: dt, dtype: object

In [9]: df['dt'].dt.hour
Out[9]:
0    19
1    19
dtype: int64

In [10]: df['dt'].dt.to_pydatetime()
Out[10]:
array([datetime.datetime(2015, 3, 24, 19, 0, 55, 496832),
       datetime.datetime(2015, 3, 24, 19, 0, 55, 496860)], dtype=object)

That the psycopg2.tz.FixedOffsetTimezone is gone is indeed a backwards incompatible change, but maybe not that bad? (as this is not a very handy timezone information anyway). The only question I have is if we maybe should explicitely say it is UTC (so store it as Timestamps with utc timezone).

The problem is that the default situation stores them as an np.datetime64 series which is a more efficient container for datetime values. Once you add timezone information they instead are stored as an object array full of Timestamp values with timezone information. Since in practice these two different Series will generally behave the same, it makes sense for them to be stored by default in the more efficient container.

And of course, if anybody wants the timezone information added it is very easy to do that via .dt.tz_localize('UTC')
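
For example, a one-line sketch (assumes the post-fix naive UTC column):

# Re-attach explicit UTC timezone information after the import, if desired
df['dt'] = df['dt'].dt.tz_localize('UTC')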

@danbirken force-pushed the psql-dt-with-time-zone-fix branch from 2e1233b to 8b4eb76 on March 24, 2015 19:16
@danbirken (Contributor, Author)

I did just make one minor edit where I added a "Notes" section to the read_sql_* functions warning people that datetime values with time zones would be converted to UTC.

@jorisvandenbossche (Member)

@danbirken fully agree.
I think having an option to just preserve the raw output of the database (datetime.datetimes with fixedoffset) would be nice (@danbirken what do you think of parse_dates=False for this? but it can go in another PR). But indeed, having it as object Timestamps by default is not that useful.

One comment: I think the note you added is not fully correct for read_sql_query?



`read_sql_table()` will break if it reads a table with a `timestamp
with time zone` column if individual rows within that column have
different time zones. This is very common due to daylight saving time.

Pandas right now does not have good support for a Series containing
datetimes with different time zones (hence this bug).  So this change
simply converts a `timestamp with time zone` column into UTC during
import, which pandas has great support for.
@danbirken force-pushed the psql-dt-with-time-zone-fix branch from 8b4eb76 to 5cc35f4 on March 26, 2015 03:00
@danbirken (Contributor, Author)

Good point, I updated the documentation for both read_sql_query and read_sql_table.

I don't think this change should do anything about parse_dates, because that is already kind of a disaster and I don't want to make it more complicated for the future. I think for this particular case of not parsing datetime columns, the best way to do it would be a new parameter like automatically_parse_datetime_cols=True, and then you can set that to False if you want. But I don't think this works either, because pandas itself will automatically convert regular datetimes to datetime64s if it can, regardless of the source of the data:

In [7]: df = pd.DataFrame({'dt': [datetime.datetime.now(), datetime.datetime.now()]})

In [8]: df['dt']
Out[8]:
0   2015-03-25 20:32:18.378324
1   2015-03-25 20:32:18.378333
Name: dt, dtype: datetime64[ns]

So I think it will just lead to more confusion.

My opinion is when a great hero comes along and creates a datetime w/tz storage system for a pandas Series, or decides time zones aren't supported in pandas, that will fix the permanent issue of the parse_dates conundrum. For now I would rather leave it to them.

@jorisvandenbossche (Member)

There was in the meantime a conflict in the whatsnew file, so I merged as c3eeb57 (so you didn't have to rebase again ...)

@danbirken Thank you very much for fixing this, and for your endurance, as this took a bit longer than necessary. But in the end, a good fix was put in!

@danbirken (Contributor, Author)

👍

@phaefele
Yes - thank you for addressing that!

