-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API: Localize Series when calling to_datetime with utc=True (#6415) #17109
Conversation
@@ -2130,7 +2130,7 @@ def test_set_index_datetime(self): | |||
'2011-07-19 08:00:00', '2011-07-19 09:00:00'], | |||
'value': range(6)}) | |||
df.index = pd.to_datetime(df.pop('datetime'), utc=True) | |||
df.index = df.index.tz_localize('UTC').tz_convert('US/Pacific') | |||
df.index = df.index.tz_convert('US/Pacific') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for future reference, can you explain here why you needed to change this test?
pandas/core/tools/datetimes.py
Outdated
@@ -508,6 +508,10 @@ def _convert_listlike(arg, box, format, name=None, tz=tz): | |||
from pandas import Series | |||
values = _convert_listlike(arg._values, False, format) | |||
result = Series(values, index=arg.index, name=arg.name) | |||
if utc and isinstance(values, np.ndarray): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be inside convert_list_like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how to include this logic in convert_listlike as is.
._values
is called on Series
before it's passed into convert_listlike which turns the Series
into a np.array
, so convert_listlike doesn't "know" that a Series
was the original argument.
Not sure why this was done originally, but I could have convert_listlike handle a Series
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure how easy that is, but you could.
@@ -270,6 +270,32 @@ def test_to_datetime_utc_is_true(self): | |||
expected = pd.DatetimeIndex(data=date_range) | |||
tm.assert_index_equal(result, expected) | |||
|
|||
# GH 6415: UTC=True with Series | |||
data = ['20100102 121314', '20100102 121315'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make a new test
data = ['20100102 121314', '20100102 121315'] | ||
result = pd.to_datetime(pd.Series(data), | ||
format='%Y%m%d %H%M%S', | ||
utc=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should work if box is True iow an Index or Series is passed
doc/source/whatsnew/v0.21.0.txt
Outdated
@@ -260,7 +260,7 @@ Conversion | |||
|
|||
- Bug in assignment against datetime-like data with ``int`` may incorrectly converte to datetime-like (:issue:`14145`) | |||
- Bug in assignment against ``int64`` data with ``np.ndarray`` with ``float64`` dtype may keep ``int64`` dtype (:issue:`14001`) | |||
|
|||
- Bug in :func:`to_datetime` incorrectly localizing dates in `Series` when `utc=True` (:issue:`6415`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a sub section as it's a major change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What section in the whatsnew should I put this under? Enhancement? API breaking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this an API breaking change. Enhancement implies it adds functionality that's beneficial and isn't patching existing behavior per se.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a major change that is API breaking.
12b9c58
to
e85263d
Compare
doc/source/whatsnew/v0.21.0.txt
Outdated
UTC Localization with Series | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Previously, :func:`to_datetime` did not localize datetime ``Series`` data as when ``utc=True`` was passed. Now, :func:`to_datetime` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just say what its going to do, this is redundant otherwise.
doc/source/whatsnew/v0.21.0.txt
Outdated
Previously, :func:`to_datetime` did not localize datetime ``Series`` data as when ``utc=True`` was passed. Now, :func:`to_datetime` | ||
will correctly localize `Series` with a `datetime64[ns, UTC]` data type. (:issue:`6415`) | ||
|
||
Old Behavior |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
previous behaviour
pandas/core/tools/datetimes.py
Outdated
@@ -359,7 +359,9 @@ def _convert_listlike(arg, box, format, name=None, tz=tz): | |||
return DatetimeIndex(arg, tz=tz, name=name) | |||
except ValueError: | |||
pass | |||
|
|||
from pandas import Series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should also work for a DTI.
use ABCSeries rather than importing.
I am not sure this is the best location for this.
pandas/core/tools/datetimes.py
Outdated
@@ -379,11 +381,12 @@ def _convert_listlike(arg, box, format, name=None, tz=tz): | |||
raise TypeError('arg must be a string, datetime, list, tuple, ' | |||
'1-d array, or Series') | |||
|
|||
arg = _ensure_object(arg) | |||
obj_arg = _ensure_object(arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are you changing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I was trying to have _convert_listlike
handle Series
arguments directly, I was using the arg
variable to keep track if a Series
was initially passed so I know to use dt.tz_localize()
after the dates have been parsed. arg = _ensure_object(arg)
converts arg
from a Series
to a np.array
so I no longer know what data structure was originally passed.
I guess I could create a boolean variable arg_is_series
in the beginning and keep track of this information that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you revert the obj_arg
thing, hard to tell what's changing here
pandas/core/tools/datetimes.py
Outdated
if is_datetime64_dtype(result) and box: | ||
result = DatetimeIndex(result, tz=tz, name=name) | ||
# GH 6415 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too specific change here I think (see my comment above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
better to make a wrapper function which handles the utc argument (IOW if its a DTI / Series and utc is true it will localize, otherwise it will pass thru).
@@ -270,6 +270,38 @@ def test_to_datetime_utc_is_true(self): | |||
expected = pd.DatetimeIndex(data=date_range) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you didn't have to change any tests here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned previously to move these tests to their own function. Is that what you were referring to?
doc/source/whatsnew/v0.21.0.txt
Outdated
UTC Localization with Series | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
:func:`to_datetime` now correctly localizes datetime ``Series`` data as when ``utc=True`` was passed with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove correctly. localizes to UTC when utc=True
is passed
pandas/core/tools/datetimes.py
Outdated
if isinstance(arg, ABCSeries): | ||
arg = arg.dt.tz_localize('UTC') | ||
elif isinstance(arg, DatetimeIndex): | ||
arg = arg.tz_convert(None).tz_localize('UTC') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, why are you converting to naive for a DTI?
pandas/core/tools/datetimes.py
Outdated
@@ -379,11 +381,12 @@ def _convert_listlike(arg, box, format, name=None, tz=tz): | |||
raise TypeError('arg must be a string, datetime, list, tuple, ' | |||
'1-d array, or Series') | |||
|
|||
arg = _ensure_object(arg) | |||
obj_arg = _ensure_object(arg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you revert the obj_arg
thing, hard to tell what's changing here
I noticed that some IO SQL tests have been failing. It appears that in the conversion from SQL to DataFrame the function I feel we shouldn't assume these database dates are utc by default i.e. remove hardcoded |
@mroeschke yeah, that sql issue might be a tricky issue, as I recall we had a long discussion about that. The But, that also dates back from the time that a Series could not hold timezone aware data. So now, I think it would be OK (or even better) to have it as timezone aware but still UTC data (the reason that it has to be utc, if I recall, is that the database uses fixed offsets, which can lead to different timezones around DST, which results in errors when converting that to a datetime64 column. Therefore we decided to always convert timezone aware data to UTC) So given that, it might be that we just need to update the tests? |
actually the db issue might be more complex and just changing tests is not the correct thing to do. See what is actually returned currently in the database. IIRC postgres stores a datetime-aware tz, but its actually in the local timezone, IOW it loses the offset info (it does convert to UTC from wherever it was, it just stores it locally). SO you need to reverse this on de-serialization. |
@jreback looking at the issues I linked to (but it is a long time ago I actually looked at it ..), the data we get from postgres are datetime objects with |
@jorisvandenbossche postgres loses information in the timezone. They are UTC dates, but in the local timezone. you need to match the existing behavior. |
Sorry, what do you mean? Do you mean existing behaviour of |
I agree with @jorisvandenbossche For reference, postgres
But more importantly, the psycopg2 driver returns datetime objects with |
Hello @mroeschke! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on September 01, 2017 at 07:00 Hours UTC |
Codecov Report
@@ Coverage Diff @@
## master #17109 +/- ##
==========================================
- Coverage 91.03% 91.02% -0.02%
==========================================
Files 163 163
Lines 49580 49581 +1
==========================================
- Hits 45137 45130 -7
- Misses 4443 4451 +8
Continue to review full report at Codecov.
|
doc/source/whatsnew/v0.21.0.txt
Outdated
|
||
pd.to_datetime(s, utc=True) | ||
|
||
This new behavior will also localize datetime columns in DataFrames returned from :func:`read_sql` which previously returned datetime columns as naive UTC. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is would fighten me. you have to be expicit that this is only for columns that are already datetime tz-aware.
.. _whatsnew_0210.api.utc_localization_with_series: | ||
|
||
UTC Localization with Series | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue number and provide a short expl.
pandas/core/tools/datetimes.py
Outdated
if arg.tz is None: | ||
arg = arg.tz_localize('UTC') | ||
else: | ||
arg = arg.tz_convert('UTC') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a BIG change here. we don't convert existing tz-aware timezones. why are you doing this?
pandas/core/tools/datetimes.py
Outdated
if is_datetime64_dtype(result) and box: | ||
result = DatetimeIndex(result, tz=tz, name=name) | ||
# GH 6415 | ||
elif arg_is_series and utc: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is kludgy. pls find a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I am doing this is because a Series is converted to a numpy array for the date conversion, and _convert_listlike
needs an indicator to return back a Series.
The alternative is to have _convert_listlike
return an array and have to_datetime
do the conversion and localization. Which option is preferable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, name this is_series
and do this outside of the convert_listlike function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need this if here, instead do it inside the maybe_convert_to_utc function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think (didn't look in detail) that this kludgyness is coming from the fact that for a Series _convert_listlike
now converts an array (box
is to False when dealing with a Series), while for DatetimeIndex is returns a DatetimeIndex. And, an array cannot hold tz aware data, so the utc
keyword needs to be handles separately for Series and DatetimeIndex, when using this current implementation.
But, can't we just remove the box=False
for converting Series objects? And so unify the code for returning Series vs DatetimeIndex (conversion from DatetimeIndex to Series (instead of from array to Series) should be rather performant?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche right idea but not directly related to box
.
_convert_listlike
can essentially be summarized in two operations:
- If the data already has a datetime dtype, take a short cut.
- Otherwise, make sure the data is of object dtype and try to parse.
In step 2, a Series is passed into _ensure_object
which turns a Series into a numpy array with object dtype. Therefore, the kludge is due to attempting to have _convert_listlike
"remember" that a Series was passed.
But sure, I can try to move this to the _maybe_convert_to_utc
function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but for for step 2, why do you then need to have _convert_listlike
handle Series vs arrays/lists of strings differently?
(As far as I can see, this is because for a Series the values are not converted to a DatetimeIndex at the end of _convert_listlike
, and so the handling of utc has to be done afterwards again, but I can be wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first fix for this issue was to have _convert_listlike
just return an array and have to_datetime
feed that array into a Series constructor and then localize to utc. To avoid leaking the conversion logic out of _convert_listlike
, I am trying to have _convert_listlike
return a localized Series outright.
So yes, for a Series the values are not converted to a DatetimeIndex (which can handle the localization with the tz
arg) at the end of _convert_listlike
, and when passing the values to a Series constructor, an extra step is need to localize the Series to utc i.e. Series(converted_array).dt.tz_localize('UTC')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my suggestion is: can't we just also for Series return a DatetimeIndex (which already handles the utc keyword fine), and then afterwards just convert this to a Series as we now convert the array to a Series (then there is no need to either leak the conversion out of _convert_listlike
or to deal with Series construction inside _convert_listlike
).
(but it might be that there are other reasons for having it as an array that might break, I don't know)
utc=True) | ||
expected = pd.Series(expected_data) | ||
tm.assert_series_equal(result, expected) | ||
result = pd.to_datetime(pd.Index(data), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use blank lines in between different sub-tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of duplicating lots of code, break this big tests into separate tests and use parametrize to deal with Series/Index converions
pandas/tests/io/test_sql.py
Outdated
@@ -1319,14 +1323,15 @@ def check(col): | |||
def test_date_parsing(self): | |||
# No Parsing | |||
df = sql.read_sql_table("types_test_data", self.conn) | |||
|
|||
# Now that GH 6415 is fixed, dates are automatically parsed to UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write these reversed, IOW
dates are automatically parsed to UTC (GH 6415). A future reader likely won't look at the reference unless the comment is not good enough., the comment needs to stand on its own.
pandas/tests/io/test_sql.py
Outdated
@@ -1352,7 +1357,11 @@ def test_datetime(self): | |||
# with read_table -> type information from schema used | |||
result = sql.read_sql_table('test_datetime', self.conn) | |||
result = result.drop('index', axis=1) | |||
tm.assert_frame_equal(result, df) | |||
# After GH 6415, dates outbound from a db will be localized to UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if d.Af is localized to UTC (IOW before to_sql). does it work?
pandas/tests/io/test_sql.py
Outdated
# comes back as datetime64 | ||
tm.assert_series_equal(res['a'], to_datetime(df['a'])) | ||
# GH 6415 comes back as datetime64[ns, UTC] | ||
tm.assert_series_equal(res['a'], to_datetime(df['a'], utc=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use result, expected here, otherwise too hard to read
@mroeschke Based on the diff in the sql tests, it seems that you now have |
@jorisvandenbossche Ah yes, you're right. I think I was too aggressive in changing the SQL tests. I will have to dig deeper into SQL routines to see if I can determine the original SQL type and pass that info along to the |
Addressed your comments @jorisvandenbossche and all green. |
pandas/io/sql.py
Outdated
@@ -821,7 +821,12 @@ def _harmonize_columns(self, parse_dates=None): | |||
|
|||
if (col_type is datetime or col_type is date or | |||
col_type is DatetimeTZDtype): | |||
self.frame[col_name] = _handle_date_column(df_col) | |||
if col_type is DatetimeTZDtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just
utc = col_type is DatetimeTZDtype
self.frame[.....] = _handle_date_column(df_col, utc=utc)
doc/source/whatsnew/v0.21.0.txt
Outdated
UTC Localization with Series | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
Previously, :func:`to_datetime` did not localize datetime ``Series`` data when ``utc=True`` was passed. Now, :func:`to_datetime` will correctly localize `Series` with a `datetime64[ns, UTC]` data type to be consistent with how list-like and Index data are handled. (:issue:`6415`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use double backticks around Index
and Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also around the datetime64[ns, UTC]
; otherwise these get lost in the text
say dtype instead of 'data type'
pandas/tests/io/test_sql.py
Outdated
.astype('datetime64[ns, UTC]')) | ||
col = expected.DateColWithTz | ||
assert is_datetime64tz_dtype(col.dtype) | ||
# Removed ".astype('datetime64[ns, UTC]')"after GH 6415 was fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this comment; you can explain why things are, but refering to the past is confusing
pandas/tests/test_multilevel.py
Outdated
@@ -2130,7 +2130,8 @@ def test_set_index_datetime(self): | |||
'2011-07-19 08:00:00', '2011-07-19 09:00:00'], | |||
'value': range(6)}) | |||
df.index = pd.to_datetime(df.pop('datetime'), utc=True) | |||
df.index = df.index.tz_localize('UTC').tz_convert('US/Pacific') | |||
# Removed 'tz_localize('utc') below after GH 6415 was fixed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
[('2013-01-01 00:00:00-01:00', None), | ||
('2013-01-01 00:00:00-01:00', 'datetime64[ns]'), | ||
('2013-01-01 01:00:00', None), | ||
('2013-01-01 01:00:00', 'datetime64[ns]')]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can write this a bit shorter by doing:
@pytest.mark.parametrize("data", ['2013-01-01 00:00:00-01:00', '2013-01-01 00:00:00-01:00'])
@pytest.mark.parametrize("dtype", [None, 'datetime64[ns]'])
def test...
then you get all combinations of the two automatically.
But I still don't fully understand the point of the test (or for the '2013-01-01 01:00:00' case. Isn't that covered by the test above?)
What is not tested above is a string that included tz offset, or a series with an already datetime64 dtype (both aware or naive) So I would just test those two cases? (and both also seem distinct, so would put them in different tests)
Modify test case Comment about test edit, move conversion logic to convert_listlike Add new section in whatsnew and update test Alter SQL tests Modify whatsnew and make new wrapper function to handle UTC conversion Simiplified whatsnew and reverted arg renaming
Fixed the whatsnew and extraneous comments and created additional tests for tz-aware strings and Series with I also stumbled upon this inconsistency when I was trying to add an additional test for
I think these should be consistent and |
I think all outstanding comments were addressed now, so time to finally get this merged :-)
I think Out[4] is certainly correct, for the |
git diff upstream/master -u -- "*.py" | flake8 --diff
When
_convert_listlike
passes anp.array
into aSeries
, theSeries
seems to initially be of naive datetime dtype. LocalizedSeries
ifutc=True
.I made one minor change to an existing test that
tz_localize
and index when it originally set the index with aSeries
withpd.to_datetime(...utc=True)
.