BUG: sql different offsets (1/2 modify infer_dtype) #30492
Conversation
Detect columns containing only datetimes/NULL that are recognized as object, using the function _is_datetime_column_with_dtype_object, and coerce them to UTC for read_sql and to_sql.
If the user gives a series of datetimes with mixed timezones, presumably they did it on purpose. So we should try to preserve that information rather than convert to UTC. Or am I missing something here?
I don't think that we are really losing information here (see the sections on to_sql and read_sql below, and please correct me if I'm wrong).

to_sql

Mixed offsets, and actually mixed timezones, will prevent to_sql from working, because the function io.sql._sqlalchemy_type uses the .dt accessor and this raises a ValueError with mixed timezones. Since the offsets are not saved in the database anyway (for postgres and SQLite3, AFAIK), converting to UTC should solve the issue without changes for the user.

read_sql

If all datetimes are in the same timezone and offset, then we get a Series with dtype datetime64 and no offset.

Example

import datetime
import psycopg2
import sqlalchemy
import pandas as pd
import pytz
from pandas.io.sql import read_sql, to_sql
# CREATE ENGINE
# engine = sqlalchemy.create_engine("CONNECTION_STRING")
# create tz-aware (UTC) datetimes at dates whose local offsets differ (winter vs. summer)
data = [[datetime.datetime(2019, 11, 14, 15, 12, 0, tzinfo=datetime.timezone.utc),
         datetime.datetime(2019, 11, 14, 15, 12, 0, tzinfo=datetime.timezone.utc)],
        [datetime.datetime(2019, 11, 7, 13, 37, 4, tzinfo=datetime.timezone.utc),
         datetime.datetime(2019, 8, 14, 15, 12, 0, tzinfo=datetime.timezone.utc)]]
df = pd.DataFrame(data, columns=['ts_tz', 'ts_tz2'])
# save data - no exceptions occur
to_sql(frame=df,
       name='timezone_diff_offsets_test',
       con=engine,
       schema='public',
       if_exists='replace',
       index=False)
# read data - offsets are dropped and dtype of column ts_tz is datetime64[ns, UTC]
df = read_sql('SELECT * FROM timezone_diff_offsets_test', engine)
print(df,'\n')
print(df.dtypes)

Output of master branch
Output of my branch (bugfix_sql_different_offsets)
EDIT: typos
Can you add test cases? It's a little unclear what this is doing without those. With respect to writing mixed timezones, I'm not sure about this. I actually think the ValueError getting raised makes more sense.
@jbrockmendel we have a precedent of converting timezones (which should include mixed ones) to UTC when reading from DBs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#datetime-data-types. When writing mixed timezones, I agree with @WillAyd that raising an error is preferable to coercing to UTC if we can't maintain the mixed timezones.
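For reference, a minimal self-contained sketch of that documented coercion; the fixed offsets below stand in for what a driver like psycopg2 would return:

import datetime
import pandas as pd

# two aware datetimes whose fixed offsets differ (winter vs. summer time)
tz_plus1 = datetime.timezone(datetime.timedelta(hours=1))
tz_plus2 = datetime.timezone(datetime.timedelta(hours=2))
mixed = pd.Series([datetime.datetime(2019, 11, 14, 16, 12, tzinfo=tz_plus1),
                   datetime.datetime(2019, 8, 7, 15, 37, tzinfo=tz_plus2)])

print(mixed.dtype)                            # object: no single tz-aware dtype fits
print(pd.to_datetime(mixed, utc=True).dtype)  # datetime64[ns, UTC]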
pandas/io/sql.py
Outdated
# handle datetime Series with different offsets
# see GH30207
if _is_datetime_series_with_different_offsets(df_col):
    data_frame[col_name] = to_datetime(df_col, utc=True)
Just pass utc=True into data_frame[col_name] = _handle_date_column(df_col, format=fmt) above.
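For context, _handle_date_column is essentially a thin wrapper around to_datetime (its body appears in the diff below); a self-contained sketch of the suggested behavior, with illustrative sample strings:

import pandas as pd

def _handle_date_column(col, utc=None, format=None):
    # mirrors the one-liner from pandas/io/sql.py shown in the diff below
    return pd.to_datetime(col, errors="coerce", format=format, utc=utc)

col = pd.Series(["2019-11-14 15:12:00+01:00", "2019-08-07 13:37:04+02:00"])
print(_handle_date_column(col, utc=True).dtype)  # datetime64[ns, UTC]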
Yes, you are right, it should be utc=True. I suppose this was already inferred by pandas, but we should always enforce it because, as you mentioned:
When reading TIMESTAMP WITH TIME ZONE types, pandas will convert the data to UTC. source
Please note that we still need the logic from L149 to L152 (by the way, it should be "elif _is_datetime_series_with_different_offsets"; I shall make a commit soon with this correction and also the utc=True from above) for the case where we have different offsets (with postgres, psycopg2 adds the offset of the local timezone, which we do not want anyway), because is_datetime64tz_dtype(df_col) is False in this case.
pandas/io/sql.py
Outdated
@@ -97,6 +97,38 @@ def _handle_date_column(col, utc=None, format=None):
    return to_datetime(col, errors="coerce", format=format, utc=utc)


def _is_datetime_series_with_different_offsets(df_col):
I think infer_dtype can detect whether the dtype should be upcast to datetimetz. Otherwise the logic should live there.
Or use maybe_convert_objects
I have tested infer_dtype, maybe_convert_objects, is_datetime64tz_dtype, and my function _is_datetime_series_with_different_offsets. Only my function works, so we should keep it unless someone has a better idea.
Test of infer_dtype, maybe_convert_objects, is_datetime64tz_dtype, _is_datetime_series_with_different_offsets
from pandas.api.types import infer_dtype
from pandas._libs.lib import maybe_convert_objects
from pandas.api.types import is_datetime64tz_dtype
from pandas.io.sql import _is_datetime_series_with_different_offsets
import pandas as pd
import psycopg2
import datetime
import numpy as np
# this is the kind of data we receive from a psycopg2 cursor when we have a timestamptz column in a postgres table
data_with_diff_offsets = [datetime.datetime(2019, 11, 14, 16, 12, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60)),
                          datetime.datetime(2019, 8, 7, 15, 37, 4, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120))]
data_with_identical_offsets = [datetime.datetime(2019, 11, 14, 16, 12, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60)),
                               datetime.datetime(2019, 8, 7, 15, 37, 4, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60))]
print('infer_dtype with different offsets')
print('as list:',infer_dtype(data_with_diff_offsets))
print('as Series:',infer_dtype(pd.Series(data_with_diff_offsets)),'\n')
print('infer_dtype with identical offsets')
print('as list:',infer_dtype(data_with_identical_offsets))
print('as Series:',infer_dtype(pd.Series(data_with_identical_offsets)),'\n')
print('maybe_convert_objects (only works with numpy array) with different offsets')
print('dtype='+str(maybe_convert_objects(np.array(data_with_diff_offsets)).dtype),'\n')
print('maybe_convert_objects with identical offsets')
print('dtype='+str(maybe_convert_objects(np.array(data_with_identical_offsets)).dtype),'\n')
print('is_datetime64tz_dtype (used in _parse_columns for read_sql) with different offsets')
print('as list:',is_datetime64tz_dtype(data_with_diff_offsets))
print('as Series:',is_datetime64tz_dtype(pd.Series(data_with_diff_offsets)),'\n')
print('is_datetime64tz_dtype with identical offsets')
print('as list:',is_datetime64tz_dtype(data_with_identical_offsets))
print('as Series:',is_datetime64tz_dtype(pd.Series(data_with_identical_offsets)),'\n')
print('_is_datetime_series_with_different_offsets with different offsets')
print(_is_datetime_series_with_different_offsets(pd.Series(data_with_diff_offsets)),'\n')
print('_is_datetime_series_with_different_offsets with identical offsets')
print(_is_datetime_series_with_different_offsets(pd.Series(data_with_identical_offsets)))
infer_dtype with different offsets
as list: datetime
as Series: datetime
infer_dtype with identical offsets
as list: datetime
as Series: datetime64
maybe_convert_objects (only works with numpy array) with different offsets
dtype=object
maybe_convert_objects with identical offsets
dtype=object
is_datetime64tz_dtype (used in _parse_columns for read_sql) with different offsets
as list: False
as Series: False
is_datetime64tz_dtype with identical offsets
as list: False
as Series: True
_is_datetime_series_with_different_offsets with different offsets
True
_is_datetime_series_with_different_offsets with identical offsets
False
EDIT: improve variable names and prints
pls don't re-invent the wheel; @mroeschke is correct here. If infer_dtype is not doing its job exactly right, then let's fix that rather than creating an all-new, mostly untested code path.
If we want to change infer_dtype, we'd have to add an additional possible return value such as "datetimetz", because it currently does not differentiate between naive and timezone-aware datetimes (see example below). And I don't think we want to convert postgres naive datetimes into UTC.
from pandas.api.types import infer_dtype
import datetime
assert infer_dtype([datetime.datetime(2019,1,1,0,0,0, tzinfo = datetime.timezone.utc)]) == 'datetime'
assert infer_dtype([datetime.datetime(2019,1,1,0,0,0)]) == 'datetime'
Should we do that? Otherwise I see 2 other possibilities:
- change is_datetime64tz_dtype so it returns True even with different offsets
- simplify the logic in _is_datetime_series_with_different_offsets (please note this is less "precise" than what I did)
This would be sufficient (if we do this, we could get rid of the function and do it directly in _parse_date_columns):
if infer_dtype(df_col) == 'datetime' and df_col.dtype == 'object':
    return True
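A quick self-contained check of that simplified condition (datetime.timezone stands in here for psycopg2's fixed-offset tzinfo):

import datetime
import pandas as pd
from pandas.api.types import infer_dtype

tz_plus1 = datetime.timezone(datetime.timedelta(hours=1))
tz_plus2 = datetime.timezone(datetime.timedelta(hours=2))
df_col = pd.Series([datetime.datetime(2019, 11, 14, tzinfo=tz_plus1),
                    datetime.datetime(2019, 8, 7, tzinfo=tz_plus2)])
# mixed offsets: the column stays object, but the values still infer as datetime
print(infer_dtype(df_col) == 'datetime' and df_col.dtype == 'object')  # True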
i would rather infer_dtype return a new type (option 1)
we have extensive test coverage in this routine so adding this would be ok
Yeah I think infer_dtype should be able to return "datetimetz", somewhere around here:
Line 1290 in 7ab73a9
# if all values are nan/NaT
I have made a new commit with an update to infer_dtype (it now separates datetime and datetimetz). It seems like quite a big change (infer_dtype is used in lots of places).
I had to modify a method of SQLTable, _sqlalchemy_type, to include this "datetimetz". Also, in the same function, the error message when using to_sql with different offsets/timezones changed, so I had to kinda reinvent the wheel there to give a useful error message 😿
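I don't have the exact wording at hand, but the shape of the guard is roughly this (a sketch with illustrative names, not the actual pandas code):

import pandas as pd
from pandas.api.types import infer_dtype

def _check_mixed_tz_column(col: pd.Series) -> None:
    # object dtype whose values still infer as datetime usually means
    # mixed offsets/timezones that could not be consolidated into one dtype
    if col.dtype == object and infer_dtype(col) == "datetime":
        raise ValueError(
            f"Column '{col.name}' has datetimes with mixed timezones or "
            "offsets; convert to a single timezone (e.g. UTC) before to_sql."
        )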
…ion) 1) improve docstring of _is_datetime_series_with_different_offsets 2) in _parse_date_columns enforce utc if the dtype of df_col is recognized as datetime with tz 3) correct mistake in _parse_date_columns (elif _is_datetime_series_with_different_offsets) 4) make to_sql raise again with diff timezones/offsets by deleting "except ValueError as e" block (L1015-L1025)
So I have pushed another commit with the suggestions from @mroeschke. EDIT: sorry, I forgot to do local testing before pushing 😞. I see there are some errors. I suppose they may be due to this utc=True I added. I will run some tests locally to figure this out. I'd gladly add tests, but I am really lost in test_sql.py; I don't even know where to start. I suppose I could add tests in one of the functions here. It would be easier if I had access to the database (or if I could use my own for local testing). Maybe I just need some more time to figure this out, but if someone wants to help, I already did some testing with postgres (it passed; see below).

Additional tests

import datetime
import psycopg2
import sqlalchemy
import pandas as pd
import pytz
from pandas.io.sql import read_sql, to_sql
# CREATE ENGINE
# engine = sqlalchemy.create_engine("CONNECTION_STRING")
# create test data
df = pd.DataFrame()
## CASE 1 (all UTC)
## no exception should occur with to_sql,
## it should be saved as timestamp with timezone (where compatible e.g. postgres)
## read_sql should give back datetime64[ns, UTC]
## and not object dtype (discard local offsets winter/summer time)
df['ts_utc_winter_summer'] = [datetime.datetime(2019, 11, 7, 13, 37, 4,
                                                tzinfo=datetime.timezone.utc),
                              datetime.datetime(2019, 8, 14, 15, 12, 0,
                                                tzinfo=datetime.timezone.utc)]
## CASE 2 (different offsets)
## to_sql should raise (it did with master already)
df['ts_diff_offset'] = [datetime.datetime(2019, 11, 14, 16, 12, 0,
                                          tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60)),
                        datetime.datetime(2019, 8, 7, 15, 37, 4,
                                          tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120))]
## CASE 3 (different timezones)
## to_sql should raise (it did with master already)
df['ts_diff_tz'] = [datetime.datetime(2019, 11, 14, 16, 12, 0,
                                      tzinfo=datetime.timezone.utc),
                    datetime.datetime(2019, 11, 14, 16, 12, 0,
                                      tzinfo=pytz.timezone('Europe/Paris'))]
## CASE 4 (naive)
## no exception should occur with to_sql,
## it should be saved as timestamp without timezone
## read_sql should give back datetime64[ns] (it did with master already)
df['ts_naive'] = [datetime.datetime(2019, 11, 7, 13, 37, 4),
                  datetime.datetime(2019, 8, 14, 15, 12, 0)]
# show df
pd.options.display.width = 500
print(df,'\n')
# make sure we have the expected types
# where different timezones/offsets we should have an object dtype
expected_dtypes = {'ts_utc_winter_summer': 'datetime64[ns, UTC]',
                   'ts_diff_offset': 'O',
                   'ts_diff_tz': 'O',
                   'ts_naive': '<M8[ns]'}
for col in df.columns:
    assert df[col].dtype == expected_dtypes[col]
# try to_sql one column at a time (as a DataFrame subset not a Series)
for col in df.columns:
    try:
        to_sql(frame=df[[col]],
               name='timezone_tests',
               con=engine,
               schema='public',
               if_exists='replace',
               index=False)
        df_db = read_sql('SELECT * FROM timezone_tests', engine)
        if col == 'ts_naive':
            assert df_db['ts_naive'].dtype == 'datetime64[ns]'
        else:
            assert df_db[col].dtype == 'datetime64[ns, UTC]'
    except ValueError as e:
        print(f"Error with column {col}:", e)
I think you can just add the new tests to …
It seems like local testing with SQL is not possible (it always fails because it won't connect). But I think I have found the culprit. By adding utc=True in the function _parse_date_columns (see the comment of @mroeschke here), we changed the behavior of the "parse_dates" argument. See the example below:

Example

from sqlalchemy import create_engine
import pandas as pd
import datetime
engine = create_engine('sqlite:///sqlite_test.sqlite3')
# create and save test data
data = [datetime.datetime(2019, 11, 11, 9, 0, 0,
                          tzinfo=datetime.timezone.utc),
        datetime.datetime(2019, 6, 1, 9, 0, 0,
                          tzinfo=datetime.timezone.utc)]
df = pd.DataFrame(data, columns=['ts'])
df.to_sql('test_utc_true', con=engine, index=False, if_exists='replace')
# with the utc=True change
df = pd.read_sql('SELECT * FROM test_utc_true',
                 con=engine,
                 parse_dates=['ts'])
print(df, '\n')
print(df.dtypes)
# without the utc=True change
df = pd.read_sql('SELECT * FROM test_utc_true',
                 con=engine,
                 parse_dates=['ts'])
print(df, '\n')
print(df.dtypes)
With the considerations from above, I made a separate block for the columns in parse_dates in a new commit.
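Roughly, the idea of the separate block looks like this (a simplified sketch of _parse_date_columns; the real implementation in pandas/io/sql.py handles formats and more options):

import pandas as pd
from pandas.api.types import infer_dtype

def _parse_date_columns_sketch(data_frame, parse_dates):
    parse_dates = parse_dates or []
    for col_name, df_col in data_frame.items():
        if col_name in parse_dates:
            # user-requested parsing: keep the old behavior, no forced UTC
            data_frame[col_name] = pd.to_datetime(df_col, errors="coerce")
        elif df_col.dtype == object and infer_dtype(df_col) == "datetime":
            # inferred tz-aware column with mixed offsets: coerce to UTC
            data_frame[col_name] = pd.to_datetime(df_col, utc=True)
    return data_frame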
@ThibTrip i would prefer that you just do the infer_dtype change in this PR, then submit a followup after that is merged for the sql changes. This is going to be tricky as-is.
I merged master again (commit id ca3bfcc) and cythonized again, so everything should be up to date. I don't understand why I had an issue with two of the CI tests. It seems to be related to tab completion. I will investigate, but I don't really see any relation to what I did... Also @jreback, I am leaving you some notes on your change proposals. EDIT: @jreback I am not sure what you meant with "grep the codebase for infer_dtype...". You wanted me to update the code, right?
Tab completion is unrelated, don’t worry about it
So I realized one check did not seem necessary, as I was able to pass all my tests (plus the other tests I already showed you with psycopg2) and pandas' tests without it (apart from the tab completion, and it seems I made a bad indentation somewhere), so I just removed it. I think I added it to fix something, then realized the trigger was elsewhere, and then forgot about it. I also created and used a function to check for datetimetz with nan (I got inspired by the integer na function). I hope this will seem better to you, because I am running out of ideas 😞
pandas/_libs/lib.pyx
Outdated
try:
    values = getattr(value, '_values', getattr(value, 'values', value))
except TypeError:
    # This gets hit if we have an EA, since cython expects `values`
    # to be an ndarray
    # before using try_infer_map check if there is a tz attribute
this doesn't belong here at all, this is only for handling a specific dtype w/o special cases, which is what you are doing here.
Yes, I think I get the idea. I moved the timezone-aware Series check above in my new commit.
pandas/_libs/lib.pyx
Outdated
@@ -1362,7 +1372,11 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

elif PyDateTime_Check(val):
    if is_datetime_array(values):
        return "datetime"
    if (is_datetimetz_na_array(values)
you just need is_datetimetz_array
Sorry, this was quite stupid: the part with is_datetimetz_array(values) would never be executed. I have fixed this.
pandas/_libs/lib.pyx
Outdated
return validator.validate(values)


cdef class DatetimeTZNaValidator(TemporalValidator):
you don't need this at all
Yes, I realize now this is redundant: the DatetimeTZValidator works fine with missing values (and we don't need to differentiate between datetimetz and datetimetz-with-na either).
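This is easy to confirm from the public API; assuming skipna=True skips the missing value, released pandas infers plain "datetime" here, and the goal of this PR is "datetimetz":

import numpy as np
import pandas as pd
from pandas.api.types import infer_dtype

arr = np.array([pd.NaT, pd.Timestamp("2011-01-01", tz="UTC")], dtype=object)
print(infer_dtype(arr, skipna=True))  # 'datetime' today; 'datetimetz' with this PR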
pandas/_libs/lib.pyx
Outdated
return is_null_datetime64(value)


cpdef bint is_datetimetz_na_array(ndarray values):
you don't need this
I removed this in my new commit.
@@ -963,7 +963,7 @@ def _sqlalchemy_type(self, col):
    TIMESTAMP,
)

-if col_type == "datetime64" or col_type == "datetime":
+if col_type in ("datetime64", "datetime", "datetimetz"):
can you revert all of these changes in this file
I think we need this to pass the tests. As you can see, there were some failures. I will try some tests locally (I have to copy-paste them, since I am not sure how to change the configuration for the connection strings) and give you feedback when I'm done :).
@@ -794,6 +794,12 @@ def test_infer_dtype_datetime(self):
    arr = np.array([n, datetime(2011, 1, 1)])
    assert lib.infer_dtype(arr, skipna=True) == "datetime"

    arr = np.array([n, datetime(2011, 1, 1, tzinfo=pytz.utc)])
can you move these to a new test
Done. I took the liberty of adding some more tests for cases that seem relevant to me. I hope that's alright.
Hey, I wanted to say thanks for your patience, @jreback. This PR has been and is quite difficult for me, and if anyone wants to help in my branch or even take over, I do not mind at all. Not that I want to give up or anything, this is actually very interesting! Anyhow, after testing the sql.py module minutely for the failures of the last CI run, I can confirm that we need to check whether the inferred col_type is equal to "datetimetz" in the method SQLTable._sqlalchemy_type; otherwise test_sqlalchemy_type_mapping and test_datetime_with_timezone_roundtrip (both in pandas/tests/io/test_sql.py) will fail with a sqlalchemy engine connected to a SQLite or postgres database (because the sql dtype of the table model will be Text instead of TIMESTAMP). So basically, if I re-add this change to the sql.py module, the tests should pass. However, the other bit of code I changed, in SQLiteTable._sql_type_name, where I also check whether the inferred col_type is equal to "datetimetz", does not seem to be necessary to pass the tests. But without it we would not get the expected data type in the table model, so it seems logical to me to make this change as well (see the SQLite test details).

SQLite test

import sqlite3
from pandas import DataFrame, to_datetime, date_range
from pandas.io import sql
from sqlalchemy import create_engine
# with engine = sqlite3.connect(":memory:") we get the same output
engine = create_engine('sqlite:///:memory:')
db = sql.SQLDatabase(engine)
expected = DataFrame({"A": date_range("2013-01-01 09:00:00",
                                      periods=3, tz="US/Pacific")})
table = sql.SQLiteTable("test_type", db, frame=expected)
# if we do not check for "datetimetz" in SQLiteTable._sql_type_name
# we end up with "A" as "TEXT" instead of "TIMESTAMP" (see output below)
table.table

['CREATE TABLE "test_type" (\n"index" INTEGER,\n "A" TEXT \n)',
 'CREATE INDEX "ix_test_type_index"ON "test_type" ("index")']

How shall I proceed? Thanks in advance.
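For illustration, the shape of the _sql_type_name change is roughly this (a sketch; the actual type map in pandas/io/sql.py is larger and lives on the class):

# illustrative subset of the SQLite fallback type map
_SQL_TYPES_SKETCH = {"string": "TEXT", "integer": "INTEGER", "datetime": "TIMESTAMP"}

def _sql_type_name_sketch(col_type: str) -> str:
    # treat the new "datetimetz" inference like "datetime"/"datetime64"
    if col_type in ("datetime64", "datetimetz"):
        col_type = "datetime"
    return _SQL_TYPES_SKETCH.get(col_type, "TEXT")

print(_sql_type_name_sketch("datetimetz"))  # TIMESTAMP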
Hi @ThibTrip. Sorry your PR went unnoticed for a good period of time. Do you have time to merge master? I suspect we need to add that …
Hi @mroeschke! Not a problem at all! Also, sorry I took so long to reply; I have been very busy with work and personal projects. I have merged master. I had a check for datetimetz, but I removed it (we need it at the very least here: https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L973), because the idea IIRC was to touch as few files as possible and do the change in two PRs (in this one we would only make sure infer_dtype recognizes we are in presence of datetimes with timezone).
@mroeschke I managed to test SQL locally 👍! It was not complicated at all; I just did not have sufficient knowledge back when I started this PR. I just needed to install postgres and create a database pandas_nosetest (in the tests it connects with the default postgres user on localhost). Anyhow, I have lots of errors locally that I did not have before, and in other test modules, so I am rather surprised. Some tests like pandas.tests.plotting.test_datetimelike.TestTSPlot seem completely irrelevant to my PR yet still fail. So I pushed to see if I have the same errors in the CI tools. EDIT: So I added the column type checks I mentioned earlier (see commit 6f9e8ab), as they are necessary to pass the tests.
pandas/_libs/lib.pyx
Outdated
@@ -1240,6 +1243,10 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

elif hasattr(value, 'dtype'):
    # this will handle ndarray-like
    # e.g. categoricals
    # begin by checking if there is a tz attribute
you need to move this
@jreback I think you mentioned the section about DatetimeTZDtype detection before, and I did not know how to place it otherwise back then. Sorry about that. I think I know now what you had in mind and have pushed a new commit. Please tell me if that's good like this 👍.
@@ -834,6 +834,30 @@ def test_infer_dtype_datetime(self):
    arr = np.array([np.nan, "2011-01-01", pd.Timestamp("2011-01-02")])
    assert lib.infer_dtype(arr, skipna=True) == "mixed"

def test_infer_dtypes_datetimetz(self):

    # starts with nan
Can you split this into 3 different tests? And for these "# starts with nan" tests, could you use pytest.mark.parametrize?
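Something along these lines, perhaps (a sketch that assumes the PR's "datetimetz" inference is in place; the exact values come from the existing test):

from datetime import datetime

import numpy as np
import pytest
import pytz

from pandas._libs import lib

@pytest.mark.parametrize("na_value", [np.nan, None])
def test_infer_dtype_datetimetz_starts_with_na(na_value):
    arr = np.array([na_value, datetime(2011, 1, 1, tzinfo=pytz.utc)])
    assert lib.infer_dtype(arr, skipna=True) == "datetimetz"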
Just one comment on the tests. The changes to … I'll defer to @jreback on.
@@ -1283,6 +1286,11 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

if util.is_array(value):
    values = value

elif hasattr(value, 'dtype') and hasattr(value.dtype, 'tz'):
doesn't belong here at all
@@ -1364,7 +1372,10 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

elif PyDateTime_Check(val):
    if is_datetime_array(values):
        return "datetime"
    if is_datetimetz_array(values):
this is likely enough
I will very simply delete the part above and test locally. I'll check again which paths the test functions take. Perhaps it is not useful, or perhaps there is a better way.
So @jreback, I executed my plan, and one of the tests failed (see the code of the failed test below) after I removed the part which is not in the correct place (elif hasattr(value, 'dtype') and hasattr(value.dtype, 'tz'): ...).
I followed the path taken during the test. As you can see from the code and my comments below (look for the comments with lots of hyphens), it goes through _try_infer_map (see code below), where it is not recognized as "datetimetz" but as "datetime64" (s.dtype.kind is "M", which in _TYPE_MAP translates to "datetime64").
So the kind and base attributes of s.dtype don't help identify our Series as "datetimetz" in _try_infer_map (see code below), and since it's a mapping and not, say, a regex, the name attribute of s.dtype, which is "datetime64[ns, US/Pacific]", won't match (we'd have to include all timezones in the mapper, which does not seem like the right solution).
Between this and the fact that the dtype is supposed to be returned afterwards, I come to the conclusion that the check for DatetimeTZDtype must be right before _try_infer_map, as it was in a previous commit (or we would have to do some surgery and check that no other inferences break).
failed_test
import pandas as pd
from pandas._libs import lib
from pandas import Series
# pandas/tests/series/test_constructors.py
# TestSeriesConstructors.test_constructor_with_datetime_tz
# inference
s = Series([pd.Timestamp("2013-01-01 13:00:00-0800", tz="US/Pacific"),
            pd.Timestamp("2013-01-02 14:00:00-0800", tz="US/Pacific")])
# AssertionError: 'datetime64' == 'datetimetz'
assert lib.infer_dtype(s, skipna=True) == "datetimetz"
infer_dtype
if util.is_array(value):
    values = value
elif hasattr(value, 'dtype'):  # <------------ GOES HERE
    # this will handle ndarray-like
    # e.g. categoricals
    try:
        values = getattr(value, '_values', getattr(value, 'values', value))
    except TypeError:  # <------------ GOES HERE
        # This gets hit if we have an EA, since cython expects `values`
        # to be an ndarray
        value = _try_infer_map(value)  # -----SEE _try_infer_map code below----
        if value is not None:
            return value
        # its ndarray like but we can't handle
        raise ValueError(f"cannot infer type for {type(value)}")
else:
    if not isinstance(value, list):
        value = list(value)
    from pandas.core.dtypes.cast import (
        construct_1d_object_array_from_listlike)
    values = construct_1d_object_array_from_listlike(value)
try_infer_map
_TYPE_MAP = {'categorical': 'categorical', 'category': 'categorical',
             'int8': 'integer', 'int16': 'integer', 'int32': 'integer',
             'int64': 'integer', 'i': 'integer', 'uint8': 'integer',
             'uint16': 'integer', 'uint32': 'integer', 'uint64': 'integer',
             'u': 'integer', 'float32': 'floating', 'float64': 'floating',
             'f': 'floating', 'complex64': 'complex', 'complex128': 'complex',
             'c': 'complex', 'string': 'string', 'S': 'bytes', 'U': 'string',
             'bool': 'boolean', 'b': 'boolean', 'datetime64[ns]': 'datetime64',
             'M': 'datetime64', 'timedelta64[ns]': 'timedelta64',
             'm': 'timedelta64', 'interval': 'interval'}
# NOT COPIED: types which only exist on certain platforms
# (float16, float128 and complex256)

cdef _try_infer_map(v):
    """
    If its in our map, just return the dtype.
    """
    cdef:
        object attr, val
    for attr in ['name', 'kind', 'base']:
        # -----------------------------------
        # for our example:
        # name: datetime64[ns, US/Pacific]
        # kind: M
        # base: dtype('<M8[ns]')
        # note: base won't be checked because kind "M" is in _TYPE_MAP
        # -----------------------------------
        val = getattr(v.dtype, attr)
        if val in _TYPE_MAP:
            return _TYPE_MAP[val]
    return None
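In other words, the conclusion above amounts to a dtype check just before _try_infer_map; a minimal sketch using the public dtype class:

import pandas as pd
from pandas.api.types import DatetimeTZDtype

def _infer_ea_dtype_sketch(value):
    # check DatetimeTZDtype explicitly, since _try_infer_map only sees
    # name/kind/base and kind "M" maps to plain "datetime64"
    if isinstance(getattr(value, "dtype", None), DatetimeTZDtype):
        return "datetimetz"
    return None  # fall through to _try_infer_map

s = pd.Series([pd.Timestamp("2013-01-01 13:00:00-0800", tz="US/Pacific")])
print(_infer_ea_dtype_sketch(s))  # datetimetz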
can we do this without doing multiple passes through the data?
@jbrockmendel cython.datetime does not seem to provide a way to distinguish between datetimes with and without a timezone; however, this is something we can do, I think (I tested it locally and it works): in DatetimeValidator we can check for the absence of tzinfo.
cdef class DatetimeValidator(TemporalValidator):
    cdef bint is_value_typed(self, object value) except -1:
        return PyDateTime_Check(value) and (value.tzinfo is None)  # <-----
And then we change the part with the repetition to:
elif PyDateTime_Check(val):
    if is_datetime_array(values):
        return "datetime"
    elif is_datetimetz_array(values):
        return "datetimetz"
# pd.Timestamps with different timezones
arr = [
    pd.Timestamp("2011-01-02", tz="UTC"),
you can use pytest.mark.parametrize over [datetime, Timestamp] and construct these like the datetime form you have above
Yep... makes sense! I don't know why I did not think about it! I will make a new commit after I do some tests related to the request of @jreback.
return PyDateTime_Check(value) and (value.tzinfo is not None)


cdef inline bint is_valid_null(self, object value) except -1:
    return is_null_datetime64(value)
we might want to only accept value is NaT here
@jbrockmendel sorry, but I think that we should at least test for None as well, since infer_dtype will be used on arrays we receive from the various SQL drivers, and None is, AFAIK, the default null value there. As you can see in the example below (this is the issue that we are currently trying to fix), the array has not been parsed to a DatetimeTZDtype and still contains None:
import pandas as pd
from sqlalchemy import create_engine
# REPLACE with valid postgresql connection string
engine = create_engine('postgresql://............')
df = pd.DataFrame({'dt': [pd.Timestamp('2019-11-14 15:12:00+0000', tz='UTC'),
                          pd.Timestamp('2019-08-07 13:37:04+0000', tz='UTC'),
                          pd.NaT]})
df.to_sql('null_value_test', con=engine, schema='tests', if_exists='replace',
          index=False)
pd.read_sql('SELECT * FROM tests.null_value_test', con=engine)
|   | dt                        |
|---|---------------------------|
| 0 | 2019-11-14 16:12:00+01:00 |
| 1 | 2019-08-07 15:37:04+02:00 |
| 2 | None                      |
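The same point can be made without a database: treating that None as a null is what lets the column be inferred as datetime at all (a self-contained illustration):

import datetime
import numpy as np
from pandas.api.types import infer_dtype

# what a driver hands back: aware datetimes mixed with None for SQL NULL
raw = np.array([datetime.datetime(2019, 11, 14, 16, 12,
                                  tzinfo=datetime.timezone.utc),
                None], dtype=object)
print(infer_dtype(raw, skipna=True))  # 'datetime' -- the None is skipped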
Thank you for clarifying. I'll need to give some thought to whether this is the appropriate place to do this check.
I think these changes are nice, but closing as stale. Ping / open a new PR if you want to continue.
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Explanations
For to_sql, no errors are raised anymore when we have Series containing datetimes with timezone and different offsets (unlike in GH30207), and the data is saved as expected (I tested with postgres, which does not save offsets anyway).
Concerning read_sql however there are several points to discuss:
if we want to have datetime64 instead of "object" (as described in GH30207), we lose the offsets. So this would be a breaking change and would require first a warning in the next version, then implementing it in the version after, right?
is this patch I made acceptable?
The problem is that read_sql uses pd.DataFrame.from_records, and it reads datetimes with different offsets as "object" instead of datetime64[ns, ...] (e.g. datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)]).
If we wanted to modify this behavior, we would have to modify the C extension _libs.lib. I don't know how to help with that, and I am not sure we want to do this. Even if we could pack the datetimes with different offsets into a Series with a datetime64 dtype, the methods of the .dt accessor are not compatible anyway: currently a ValueError is thrown if one attempts to use the accessor in this case (see traceback in the example in "Before PR" below).
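To make that last point concrete, here is a small demonstration of the accessor limitation (the exact exception type may vary across pandas versions, hence the broad except):

import datetime
import pandas as pd

tz_plus1 = datetime.timezone(datetime.timedelta(hours=1))
tz_plus2 = datetime.timezone(datetime.timedelta(hours=2))
s = pd.Series([datetime.datetime(2019, 1, 1, tzinfo=tz_plus1),
               datetime.datetime(2019, 6, 1, tzinfo=tz_plus2)])
try:
    s.dt  # object dtype: the datetimelike accessor is unavailable
except (AttributeError, ValueError) as err:
    print(err)  # e.g. "Can only use .dt accessor with datetimelike values"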
Before PR
Traceback
After PR