
BUG: sql different offsets (1/2 modify infer_dtype) #30492

Closed
wants to merge 33 commits into from

Conversation

@ThibTrip (Contributor) commented Dec 26, 2019

Explanations

For to_sql, no errors are raised anymore when we have a Series containing timezone-aware datetimes with different offsets (unlike in GH30207), and the data is saved as expected (I tested with postgres, which does not store offsets anyway).

Concerning read_sql, however, there are several points to discuss:

  1. if we want to have datetime64 instead of "object" (as described in GH30207) we lose the offsets. So this would be a breaking change and would require a deprecation warning in the next version before implementing the change in the version after, right?

  2. is this patch I made acceptable?

The problem is that read_sql uses pd.DataFrame.from_records, which reads datetimes with different offsets as "object" instead of datetime64[ns, ...] (e.g. datetime64[ns, psycopg2.tz.FixedOffsetTimezone(offset=60, name=None)]).

If we wanted to modify this behavior we would have to modify the C extension _libs.lib. I don't know how to help with that and I am not sure we want to do this. Even if we could pack the datetimes with different offsets into a Series with a datetime64 dtype, the methods of the .dt accessor are not compatible anyway: currently a ValueError is thrown if one attempts to use the accessor in this case (see traceback in the example in "Before PR" below).
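
A minimal, database-free sketch of that inference behavior (using datetime.timezone fixed offsets instead of psycopg2's FixedOffsetTimezone, so no driver is needed):

import datetime
import pandas as pd

# two tz-aware datetimes with different fixed offsets (+01:00 and +02:00)
records = [
    (datetime.datetime(2019, 11, 14, 16, 12,
                       tzinfo=datetime.timezone(datetime.timedelta(hours=1))),),
    (datetime.datetime(2019, 8, 7, 15, 37, 4,
                       tzinfo=datetime.timezone(datetime.timedelta(hours=2))),),
]

df = pd.DataFrame.from_records(records, columns=['ts_tz'])
print(df.dtypes)   # ts_tz is "object", not datetime64[ns, <tz>]

# coercing explicitly gives a single tz-aware dtype
print(pd.to_datetime(df['ts_tz'], utc=True).dtype)   # datetime64[ns, UTC]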

Before PR

import datetime
import psycopg2
import sqlalchemy
import pandas as pd
from pandas.io.sql import read_sql, to_sql

# CREATE ENGINE
# engine = sqlalchemy.create_engine("CONNECTION_STRING")

# create datetimes with timezone that have different offsets
data = [[datetime.datetime(2019, 11, 14, 16, 12, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))],
        [datetime.datetime(2019, 8, 7, 15, 37, 4, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None))]]

df = pd.DataFrame(data, columns = ['ts_tz'])

# save data
to_sql(frame = df, 
       name = 'timezone_diff_offsets_test',
       con = engine,
       schema = 'public',
       if_exists = 'replace',
       index = False)
Traceback
    ---------------------------------------------------------------------------

    ValueError                                Traceback (most recent call last)

    ~/GitHub/pandas/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
       1968         try:
    -> 1969             values, tz_parsed = conversion.datetime_to_datetime64(data)
       1970             # If tzaware, these values represent unix timestamps, so we


    ~/GitHub/pandas/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()


    ValueError: Array must be all same time zone

    
    During handling of the above exception, another exception occurred:


    ValueError                                Traceback (most recent call last)

    <ipython-input-2-a66775d04fe3> in <module>
         20        schema = 'public',
         21        if_exists = 'replace',
    ---> 22        index = False)
    

    ~/GitHub/pandas/pandas/io/sql.py in to_sql(frame, name, con, schema, if_exists, index, index_label, chunksize, dtype, method)
        519         chunksize=chunksize,
        520         dtype=dtype,
    --> 521         method=method,
        522     )
        523 


    ~/GitHub/pandas/pandas/io/sql.py in to_sql(self, frame, name, if_exists, index, index_label, schema, chunksize, dtype, method)
       1318             index_label=index_label,
       1319             schema=schema,
    -> 1320             dtype=dtype,
       1321         )
       1322         table.create()


    ~/GitHub/pandas/pandas/io/sql.py in __init__(self, name, pandas_sql_engine, frame, index, if_exists, prefix, index_label, schema, keys, dtype)
        620         if frame is not None:
        621             # We want to initialize based on a dataframe
    --> 622             self.table = self._create_table_setup()
        623         else:
        624             # no data provided, read-only mode


    ~/GitHub/pandas/pandas/io/sql.py in _create_table_setup(self)
        869         from sqlalchemy import Table, Column, PrimaryKeyConstraint
        870 
    --> 871         column_names_and_types = self._get_column_names_and_types(self._sqlalchemy_type)
        872 
        873         columns = [


    ~/GitHub/pandas/pandas/io/sql.py in _get_column_names_and_types(self, dtype_mapper)
        861         column_names_and_types += [
        862             (str(self.frame.columns[i]), dtype_mapper(self.frame.iloc[:, i]), False)
    --> 863             for i in range(len(self.frame.columns))
        864         ]
        865 


    ~/GitHub/pandas/pandas/io/sql.py in <listcomp>(.0)
        861         column_names_and_types += [
        862             (str(self.frame.columns[i]), dtype_mapper(self.frame.iloc[:, i]), False)
    --> 863             for i in range(len(self.frame.columns))
        864         ]
        865 


    ~/GitHub/pandas/pandas/io/sql.py in _sqlalchemy_type(self, col)
        971             # timezone information
        972             try:
    --> 973                 if col.dt.tz is not None:
        974                     return TIMESTAMP(timezone=True)
        975             except AttributeError:


    ~/GitHub/pandas/pandas/core/accessor.py in _getter(self)
         83         def _create_delegator_property(name):
         84             def _getter(self):
    ---> 85                 return self._delegate_property_get(name)
         86 
         87             def _setter(self, new_values):


    ~/GitHub/pandas/pandas/core/indexes/accessors.py in _delegate_property_get(self, name)
         60         from pandas import Series
         61 
    ---> 62         values = self._get_values()
         63 
         64         result = getattr(values, name)


    ~/GitHub/pandas/pandas/core/indexes/accessors.py in _get_values(self)
         51                 return PeriodArray(data, copy=False)
         52             if is_datetime_arraylike(data):
    ---> 53                 return DatetimeIndex(data, copy=False, name=self.name)
         54 
         55         raise TypeError(


    ~/GitHub/pandas/pandas/core/indexes/datetimes.py in __new__(cls, data, freq, tz, normalize, closed, ambiguous, dayfirst, yearfirst, dtype, copy, name)
        266             dayfirst=dayfirst,
        267             yearfirst=yearfirst,
    --> 268             ambiguous=ambiguous,
        269         )
        270 


    ~/GitHub/pandas/pandas/core/arrays/datetimes.py in _from_sequence(cls, data, dtype, copy, tz, freq, dayfirst, yearfirst, ambiguous)
        418             dayfirst=dayfirst,
        419             yearfirst=yearfirst,
    --> 420             ambiguous=ambiguous,
        421         )
        422 


    ~/GitHub/pandas/pandas/core/arrays/datetimes.py in sequence_to_dt64ns(data, dtype, copy, tz, dayfirst, yearfirst, ambiguous)
       1864             #  or M8[ns] to denote wall times
       1865             data, inferred_tz = objects_to_datetime64ns(
    -> 1866                 data, dayfirst=dayfirst, yearfirst=yearfirst
       1867             )
       1868             tz = maybe_infer_tz(tz, inferred_tz)


    ~/GitHub/pandas/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
       1972             return values.view("i8"), tz_parsed
       1973         except (ValueError, TypeError):
    -> 1974             raise e
       1975 
       1976     if tz_parsed is not None:


    ~/GitHub/pandas/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
       1963             dayfirst=dayfirst,
       1964             yearfirst=yearfirst,
    -> 1965             require_iso8601=require_iso8601,
       1966         )
       1967     except ValueError as e:


    ~/GitHub/pandas/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()


    ~/GitHub/pandas/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()


    ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

# save datetimes with UTC timezone in different periods of the year
## create test data
data = [[datetime.datetime(2019, 11, 14, 15, 12, 0, tzinfo = datetime.timezone.utc)],
        [datetime.datetime(2019, 8, 7, 13, 37, 4,  tzinfo = datetime.timezone.utc)]]

df = pd.DataFrame(data, columns = ['ts_tz'])

## save test data
to_sql(frame = df, 
       name = 'timezone_diff_offsets_test',
       con = engine,
       schema = 'public',
       if_exists = 'replace',
       index = False)

# upon reading if you have for instance "Europe/Paris"
# configured as timezone on your local machine then you get datetimes with timezone
# that have different offsets (because they are in different periods of the year)
# due to this you have "object" dtype and not datetime64[...]
df = read_sql('SELECT * FROM timezone_diff_offsets_test', engine)
print(df,'\n')
print(df.dtypes)
                       ts_tz
0  2019-11-14 16:12:00+01:00
1  2019-08-07 15:37:04+02:00 

ts_tz    object
dtype: object

After PR

import datetime
import psycopg2
import sqlalchemy
import pandas as pd
from pandas.io.sql import read_sql, to_sql

# CREATE ENGINE
# engine = sqlalchemy.create_engine("CONNECTION_STRING")

# create datetimes with timezone that have different offsets
data = [[datetime.datetime(2019, 11, 14, 16, 12, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))],
        [datetime.datetime(2019, 8, 7, 15, 37, 4, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120, name=None))]]

df = pd.DataFrame(data, columns = ['ts_tz'])

# save data - no exceptions occur
to_sql(frame = df, 
       name = 'timezone_diff_offsets_test',
       con = engine,
       schema = 'public',
       if_exists = 'replace',
       index = False)

# read data - offsets are dropped and dtype of column ts_tz is datetime64[ns, UTC]
df = read_sql('SELECT * FROM timezone_diff_offsets_test', engine)
print(df,'\n')
print(df.dtypes)
                      ts_tz
0 2019-11-14 15:12:00+00:00
1 2019-08-07 13:37:04+00:00 

ts_tz    datetime64[ns, UTC]
dtype: object

ThibTrip added 3 commits December 26, 2019 18:39
detect columns containing only datetimes/NULL recognized as object with function _is_datetime_column_with_dtype_object and coerce them to UTC for read_sql and to_sql
@jbrockmendel (Member) commented:

If the user gives a series of datetimes with mixed timezones, presumably they did it on purpose. So we should try to preserve that information rather than convert to UTC. Or am I missing something here?

@ThibTrip (Contributor Author) commented Dec 26, 2019

If the user gives a series of datetimes with mixed timezones, presumably they did it on purpose. So we should try to preserve that information rather than convert to UTC. Or am I missing something here?

I don't think that we are really losing information here (see sections to_sql and read_sql below and please correct me if I'm wrong).

to_sql

Mixed offsets and actually mixed timezones will prevent to_sql from working, because the function io.sql._sqlalchemy_type uses the .dt accessor, and this raises a ValueError with mixed timezones. Since the offsets are not saved in the database anyway (for postgres and SQLite3 AFAIK), converting to UTC should solve the issue without changes for the user.
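
A self-contained sketch of that accessor failure (using datetime.timezone offsets; in the pandas version discussed here this raised a ValueError as in the traceback above, while newer versions may raise AttributeError instead):

import datetime
import pandas as pd

s = pd.Series([
    datetime.datetime(2019, 1, 1,
                      tzinfo=datetime.timezone(datetime.timedelta(hours=1))),
    datetime.datetime(2019, 6, 1,
                      tzinfo=datetime.timezone(datetime.timedelta(hours=2))),
])

try:
    s.dt.tz  # what _sqlalchemy_type checks to pick TIMESTAMP(timezone=True)
except (ValueError, AttributeError) as exc:
    print(type(exc).__name__, exc)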

read_sql

If all datetimes are in the same timezone and offset, then we get a Series with dtype datetime64 and no offset.
If not - for instance in my case one datetime is in winter time and one in summer time, so that when psycopg2 adds the offsets for my local timezone (Europe/Paris) we get two different offsets (+1 and +2 hours) - then pandas does not convert to UTC.
I find this rather confusing and I think the example below illustrates this:

Example
import datetime
import psycopg2
import sqlalchemy
import pandas as pd
import pytz
from pandas.io.sql import read_sql, to_sql

# CREATE ENGINE
# engine = sqlalchemy.create_engine("CONNECTION_STRING")

# create datetimes with timezone that have different offsets
data = [[datetime.datetime(2019, 11, 14, 15, 12, 0, tzinfo = datetime.timezone.utc),
         datetime.datetime(2019, 11, 14, 15, 12, 0, tzinfo = datetime.timezone.utc)],
        [datetime.datetime(2019, 11, 7, 13, 37, 4,  tzinfo = datetime.timezone.utc),
         datetime.datetime(2019, 8, 14, 15, 12, 0, tzinfo = datetime.timezone.utc)]]

df = pd.DataFrame(data, columns = ['ts_tz','ts_tz2'])

# save data - no exceptions occur
to_sql(frame = df, 
       name = 'timezone_diff_offsets_test',
       con = engine,
       schema = 'public',
       if_exists = 'replace',
       index = False)

# read data - compare the outputs of master and my branch below
df = read_sql('SELECT * FROM timezone_diff_offsets_test', engine)
print(df,'\n')
print(df.dtypes)

Output of master branch

                      ts_tz                     ts_tz2
0 2019-11-14 15:12:00+00:00  2019-11-14 16:12:00+01:00
1 2019-11-07 13:37:04+00:00  2019-08-14 17:12:00+02:00 

ts_tz     datetime64[ns, UTC]
ts_tz2                 object
dtype: object

Output of my branch (bugfix_sql_different_offsets)

                      ts_tz                    ts_tz2
0 2019-11-14 15:12:00+00:00 2019-11-14 15:12:00+00:00
1 2019-11-07 13:37:04+00:00 2019-08-14 15:12:00+00:00 

ts_tz     datetime64[ns, UTC]
ts_tz2    datetime64[ns, UTC]
dtype: object

EDIT: typos

@WillAyd (Member) commented Dec 26, 2019

Can you add test cases? It's a little unclear what this is doing without those.

With respect to writing mixed timezones, I'm not sure about this. I actually think the ValueError getting raised makes more sense.

@WillAyd added the IO SQL and Timezones labels Dec 26, 2019
@mroeschke (Member) commented:

@jbrockmendel we have a precedent of converting timezones (which should include mixed ones) to UTC when reading from DBs: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#datetime-data-types

When writing mixed timezones, I agree with @WillAyd in that raising an error is preferable to coercing to UTC if we can't maintain the mixed timezones.

pandas/io/sql.py Outdated
    # handle datetime Series with different offsets
    # see GH30207
    if _is_datetime_series_with_different_offsets(df_col):
        data_frame[col_name] = to_datetime(df_col, utc=True)
@mroeschke (Member):

Just pass utc=True into data_frame[col_name] = _handle_date_column(df_col, format=fmt) above.

@ThibTrip (Contributor Author):

Yes, you are right, it should be utc=True. I suppose this was already inferred by pandas, but we should always enforce it because, as you mentioned:

When reading TIMESTAMP WITH TIME ZONE types, pandas will convert the data to UTC. source

Please note that we still need the logic from L149 to L152 (by the way, it should be "elif _is_datetime_series_with_different_offsets"; I shall make a commit soon with this correction and also the utc=True from above) for the case where we have different offsets (with postgres, psycopg2 adds the offset for the local timezone, which we do not want anyway), because is_datetime64tz_dtype(df_col) is False in this case.
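
For reference, a minimal demonstration of what utc=True changes (behavior as of the pandas versions discussed here; newer versions may raise on mixed offsets without utc=True):

import pandas as pd

mixed = pd.Series(['2019-11-14 16:12:00+01:00', '2019-08-07 15:37:04+02:00'])

# with utc=True every value is converted, yielding one tz-aware dtype
print(pd.to_datetime(mixed, utc=True).dtype)  # datetime64[ns, UTC]

# without utc=True, pandas at the time returned an object column of
# Timestamps with two different offsets (newer versions raise instead)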

pandas/io/sql.py Outdated
@@ -97,6 +97,38 @@ def _handle_date_column(col, utc=None, format=None):
    return to_datetime(col, errors="coerce", format=format, utc=utc)


def _is_datetime_series_with_different_offsets(df_col):
Member:

I think infer_dtype can detect whether the dtype should be upcast to datetimetz. Otherwise the logic should live there.

Member:

Or use maybe_convert_objects

@ThibTrip (Contributor Author), Dec 27, 2019:

I have tested infer_dtype, maybe_convert_objects, is_datetime64tz_dtype, and my function _is_datetime_series_with_different_offsets. Only my function works, so we should keep it unless someone has a better idea.

Test of infer_dtype, maybe_convert_objects, is_datetime64tz_dtype, _is_datetime_series_with_different_offsets
from pandas.api.types import infer_dtype
from pandas._libs.lib import maybe_convert_objects
from pandas.api.types import is_datetime64tz_dtype
from pandas.io.sql import _is_datetime_series_with_different_offsets
import pandas as pd
import psycopg2
import datetime
import numpy as np


# this is the kind of data we receive from a psycopg2 cursor when we have a timestamptz column in a postgres table
data_with_diff_offsets = [datetime.datetime(2019, 11, 14, 16, 12, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60)),
                          datetime.datetime(2019, 8, 7, 15, 37, 4, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120))]

data_with_identical_offsets = [datetime.datetime(2019, 11, 14, 16, 12, 0, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60)),
                               datetime.datetime(2019, 8, 7, 15, 37, 4, tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60))]


print('infer_dtype with different offsets')
print('as list:',infer_dtype(data_with_diff_offsets))
print('as Series:',infer_dtype(pd.Series(data_with_diff_offsets)),'\n')

print('infer_dtype with identical offsets')
print('as list:',infer_dtype(data_with_identical_offsets))
print('as Series:',infer_dtype(pd.Series(data_with_identical_offsets)),'\n')

print('maybe_convert_objects (only works with numpy array) with different offsets')
print('dtype='+str(maybe_convert_objects(np.array(data_with_diff_offsets)).dtype),'\n')

print('maybe_convert_objects with identical offsets')
print('dtype='+str(maybe_convert_objects(np.array(data_with_identical_offsets)).dtype),'\n')

print('is_datetime64tz_dtype (used in _parse_columns for read_sql) with different offsets')
print('as list:',is_datetime64tz_dtype(data_with_diff_offsets))
print('as Series:',is_datetime64tz_dtype(pd.Series(data_with_diff_offsets)),'\n')

print('is_datetime64tz_dtype with identical offsets')
print('as list:',is_datetime64tz_dtype(data_with_identical_offsets))
print('as Series:',is_datetime64tz_dtype(pd.Series(data_with_identical_offsets)),'\n')

print('_is_datetime_series_with_different_offsets with different offsets')
print(_is_datetime_series_with_different_offsets(pd.Series(data_with_diff_offsets)),'\n')

print('_is_datetime_series_with_different_offsets with identical offsets')
print(_is_datetime_series_with_different_offsets(pd.Series(data_with_identical_offsets)))
infer_dtype with different offsets
as list: datetime
as Series: datetime 

infer_dtype with identical offsets
as list: datetime
as Series: datetime64 

maybe_convert_objects (only works with numpy array) with different offsets
dtype=object 

maybe_convert_objects with identical offsets
dtype=object 

is_datetime64tz_dtype (used in _parse_columns for read_sql) with different offsets
as list: False
as Series: False 

is_datetime64tz_dtype with identical offsets
as list: False
as Series: True 

_is_datetime_series_with_different_offsets with different offsets
True 

_is_datetime_series_with_different_offsets with identical offsets
False

EDIT: improve variable names and prints

Contributor:

Please don't re-invent the wheel; @mroeschke is correct here. If infer_dtype is not doing its job exactly right, then let's fix that, rather than creating an all-new, mostly untested code path.

@ThibTrip (Contributor Author), Dec 27, 2019:

If we want to change infer_dtype we'd have to add an additional possible return value such as "datetimetz", because it currently does not differentiate between naive and timezone-aware datetimes (see example below). And I don't think we want to convert postgres naive datetimes to UTC.

from pandas.api.types import infer_dtype
import datetime

assert infer_dtype([datetime.datetime(2019,1,1,0,0,0, tzinfo = datetime.timezone.utc)]) == 'datetime'
assert infer_dtype([datetime.datetime(2019,1,1,0,0,0)]) == 'datetime'

Should we do that? Otherwise I see 2 other possibilities:

  1. change is_datetime64tz_dtype so it returns True even with different offsets
  2. simplify the logic in _is_datetime_series_with_different_offsets. Please note this is less "precise" than what I did.
    This would be sufficient (if we do this we could get rid of the function and do it directly in _parse_date_columns; a function-wrapped sketch follows below):
if infer_dtype(df_col) == 'datetime' and df_col.dtype == 'object':
    return True
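
Wrapped as a function, option 2 might look like this (a hypothetical sketch, not the PR's actual helper):

from pandas.api.types import infer_dtype

def _is_datetime_series_with_different_offsets(df_col):
    # hypothetical sketch of option 2, not the PR's actual implementation:
    # tz-aware datetimes left as object dtype, which is what happens with
    # mixed offsets (identical offsets already get a datetime64[ns, tz]
    # dtype at Series construction, see the tests above)
    return df_col.dtype == object and infer_dtype(df_col) == 'datetime'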

Contributor:

I would rather infer_dtype return a new type (option 1).

We have extensive test coverage of this routine, so adding this would be ok.

Member:

Yeah I think infer_dtype should be able to return "datetimetz", somewhere around here.

# if all values are nan/NaT
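
The desired end state, as proposed in this thread (the commented return values are what the PR aims for, not the behavior at the time of writing):

import datetime
from pandas.api.types import infer_dtype

aware = [datetime.datetime(2019, 1, 1, tzinfo=datetime.timezone.utc)]
naive = [datetime.datetime(2019, 1, 1)]

print(infer_dtype(aware))  # "datetimetz" with this PR (was "datetime" before)
print(infer_dtype(naive))  # "datetime", unchanged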

@ThibTrip (Contributor Author), Dec 27, 2019:

I have made a new commit with an update to infer_dtype (it now distinguishes between datetime and datetimetz). It seems like quite a big change (infer_dtype is used in lots of places).

I had to modify a method of SQLTable (_sqlalchemy_type) to include this "datetimetz". Also, in the same function, the error message when using to_sql with different offsets/timezones changed, so I had to somewhat reinvent the wheel there to give a useful error message 😿 (a sketch of what I mean follows).
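
Something along these lines (a hypothetical reconstruction of the branch described above, not the actual commit; only the col_type condition is confirmed by the diff later in this thread):

import pandas as pd
from sqlalchemy import TIMESTAMP

def sqlalchemy_timestamp_type(col: pd.Series):
    # hypothetical sketch, not the PR's actual code
    try:
        if col.dt.tz is not None:
            return TIMESTAMP(timezone=True)
    except (AttributeError, ValueError):
        # object column with mixed timezones/offsets: the .dt accessor is
        # unusable, so raise a readable error instead of the raw accessor one
        raise ValueError(f"Column {col.name} has mixed timezones/offsets")
    return TIMESTAMP(timezone=False)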

…ion)

1) improve docstring of _is_datetime_series_with_different_offsets
2) in _parse_date_columns enforce utc if the dtype of df_col is recognized as datetime with tz
3) correct a mistake in _parse_date_columns (elif _is_datetime_series_with_different_offsets)
4) make to_sql raise again with diff timezones/offsets by deleting "except ValueError as e" block (L1015-L1025)
@ThibTrip (Contributor Author) commented Dec 27, 2019

So I have pushed another commit with the suggestions from @mroeschke.

EDIT: sorry I forgot to do local testing before pushing 😞. I see there are some errors. I suppose they may be due to this utc=True I added. I will run some tests locally to figure this out.

I'd gladly add tests but I am really lost in test_sql.py; I don't even know where to start. I suppose I could add tests in one of the functions here. It would be easier if I had access to the database (or if I could use my own for local testing). Maybe I just need some more time to figure this out, but if someone wants to help, I already did some testing with postgres (it passed; see below).

Additional tests
import datetime
import psycopg2 
import sqlalchemy
import pandas as pd
import pytz
from pandas.io.sql import read_sql, to_sql

# CREATE ENGINE
# engine = sqlalchemy.create_engine("CONNECTION_STRING")

# create test data
df = pd.DataFrame()

## CASE 1 (all UTC)
## no exception should occur with to_sql, 
## it should be saved as timestamp with timezone (where compatible e.g. postgres)
## read_sql should give back datetime64[ns, UTC] 
## and not object dtype (discard local offsets winter/summer time)
df['ts_utc_winter_summer'] = [datetime.datetime(2019, 11, 7, 13, 37, 4,  
                                                tzinfo=datetime.timezone.utc),
                              datetime.datetime(2019, 8, 14, 15, 12, 0, 
                                                tzinfo=datetime.timezone.utc)]

## CASE 2 (different offsets)
## to_sql should raise (it did with master already)
df['ts_diff_offset'] = [datetime.datetime(2019, 11, 14, 16, 12, 0, 
                             tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60)),
                        datetime.datetime(2019, 8, 7, 15, 37, 4, 
                             tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120))]

## CASE 3 (different timezones)
## to_sql should raise (it did with master already)
df['ts_diff_tz'] = [datetime.datetime(2019, 11, 14, 16, 12, 0, 
                                   tzinfo=datetime.timezone.utc),
                    datetime.datetime(2019, 11, 14, 16, 12, 0, 
                                   tzinfo=pytz.timezone('Europe/Paris'))]

## CASE 4 (naive)
## no exception should occur with to_sql, 
## it should be saved as timestamp without timezone
## read_sql should give back datetime64[ns] (it did with master already)
df['ts_naive'] = [datetime.datetime(2019, 11, 7, 13, 37, 4),
                  datetime.datetime(2019, 8, 14, 15, 12, 0)]

# show df
pd.options.display.width = 500
print(df,'\n')

# make sure we have the expected types
# where different timezones/offsets we should have an object dtype
expected_dtypes = {'ts_utc_winter_summer': 'datetime64[ns, UTC]',
                   'ts_diff_offset': 'O',
                   'ts_diff_tz': 'O',
                   'ts_naive': '<M8[ns]'}

for col in df.columns:
    assert df[col].dtype == expected_dtypes[col]

# try to_sql one column at a time (as a DataFrame subset not a Series)
for col in df.columns:
    try:
        to_sql(frame = df[[col]], 
               name = 'timezone_tests',
               con = engine,
               schema = 'public',
               if_exists = 'replace',
               index = False)

        df_db = read_sql('SELECT * FROM timezone_tests', engine)

        if col == 'ts_naive':
            assert df_db['ts_naive'].dtype == 'datetime64[ns]'
        else:
            assert df_db[col].dtype == 'datetime64[ns, UTC]'
    except ValueError as e:
        print(f"Error with column {col}:", e)
       ts_utc_winter_summer             ts_diff_offset                 ts_diff_tz            ts_naive
0 2019-11-07 13:37:04+00:00  2019-11-14 16:12:00+01:00  2019-11-14 16:12:00+00:00 2019-11-07 13:37:04
1 2019-08-14 15:12:00+00:00  2019-08-07 15:37:04+02:00  2019-11-14 16:12:00+00:09 2019-08-14 15:12:00 

Error with column ts_diff_offset: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True
Error with column ts_diff_tz: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

@WillAyd (Member) commented Dec 27, 2019

I think you can just add the new tests to _TestSQLApi.

@ThibTrip (Contributor Author) commented Dec 27, 2019

It seems like local testing with SQL is not possible (it always fails because it won't connect).

But I think I have found the culprit: by adding utc=True in the function _parse_date_columns (see comment of @mroeschke here), we changed the behavior of the argument "parse_dates". See example below:

Example
from sqlalchemy import create_engine
import pandas as pd
import datetime
engine = create_engine('sqlite:///sqlite_test.sqlite3')

# create and save test data
data = [datetime.datetime(2019, 11, 11, 9, 0, 0,  
                          tzinfo=datetime.timezone.utc),
        datetime.datetime(2019, 6, 1, 9, 0, 0, 
                          tzinfo=datetime.timezone.utc)]

df = pd.DataFrame(data, columns = ['ts'])
df.to_sql('test_utc_true', con = engine, index = False, if_exists = 'replace')

# with utc=True change
df = pd.read_sql('SELECT * FROM test_utc_true', 
                 con = engine, 
                 parse_dates = ['ts'])
print(df,'\n')
print(df.dtypes)
                         ts
0 2019-11-11 09:00:00+00:00
1 2019-06-01 09:00:00+00:00 

ts    datetime64[ns, UTC]
dtype: object

# without utc=True change
df = pd.read_sql('SELECT * FROM test_utc_true', 
                 con = engine, 
                 parse_dates = ['ts'])
print(df,'\n')
print(df.dtypes)
                   ts
0 2019-11-11 09:00:00
1 2019-06-01 09:00:00 

ts    datetime64[ns]
dtype: object

With the considerations above, I made a separate block for the columns in parse_dates in a new commit.
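
Roughly, the separation looks like this (a hypothetical sketch of the idea, not the PR's actual code):

import pandas as pd
from pandas.api.types import infer_dtype

def parse_date_columns_sketch(data_frame, parse_dates=None):
    # hypothetical illustration of the separate blocks described above
    parse_dates = parse_dates or []
    for col_name in data_frame.columns:
        df_col = data_frame[col_name]
        if col_name in parse_dates:
            # explicitly requested columns keep the previous behavior
            # (no forced UTC, so naive timestamps stay naive)
            data_frame[col_name] = pd.to_datetime(df_col, errors='coerce')
        elif df_col.dtype == object and infer_dtype(df_col) == 'datetime':
            # tz-aware datetimes left as object (mixed offsets): coerce to UTC
            data_frame[col_name] = pd.to_datetime(df_col, utc=True,
                                                  errors='coerce')
    return data_frame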

@jreback (Contributor) left a comment:

@ThibTrip I would prefer that you just do the infer_dtype change in this PR, then submit a follow-up for the sql changes after that is merged. This is going to be tricky as-is.

@ThibTrip (Contributor Author) commented Jan 26, 2020

I merged master again (commit id ca3bfcc) and cythonized again, so everything should be up to date. I don't understand why I had an issue with two of the CI tests; it seems to be related to tab completion. I will investigate, but I don't really see any relation to what I did... Also @jreback, I am leaving you some notes on your proposed changes.

EDIT: @jreback I am not sure what you meant by "grep the codebase for infer_dtype...". You wanted me to update the code, right?

@jbrockmendel (Member) commented Jan 26, 2020 via email

@ThibTrip (Contributor Author) commented Feb 2, 2020

So I realized one check did not seem necessary, as I was able to pass all my tests (plus the other tests I already showed you with psycopg2) and pandas' tests without it (apart from the tab completion issue, and it seems I also made a bad indentation somewhere). So I just removed it. I think I added it to fix something, then realized the trigger was elsewhere and forgot about it.

I also created and used a function to check for datetimetz with nan (I got inspired by the integer NA function). I hope this will seem better to you, because I am running out of ideas 😞 (a Python-level sketch of the idea follows).
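
In Python terms, the check described above is presumably equivalent to something like this (hypothetical; the real code is Cython in pandas/_libs/lib.pyx):

import pandas as pd

def is_datetimetz_na_array_sketch(values):
    # hypothetical sketch: True if every element is either a missing value
    # or a tz-aware datetime, with at least one tz-aware datetime present
    seen_tz = False
    for v in values:
        if v is None or v is pd.NaT:
            continue
        if getattr(v, 'tzinfo', None) is None:
            return False
        seen_tz = True
    return seen_tz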

    try:
        values = getattr(value, '_values', getattr(value, 'values', value))
    except TypeError:
        # This gets hit if we have an EA, since cython expects `values`
        # to be an ndarray
        # before using try_infer_map check if there is a tz attribute
Contributor:

This doesn't belong here at all; this block is for generic handling without special cases for a specific dtype, which is what you are doing here.

@ThibTrip (Contributor Author):

Yes, I think I get the idea. I moved the timezone-aware Series check higher up in my new commit.

@@ -1362,7 +1372,11 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

    elif PyDateTime_Check(val):
        if is_datetime_array(values):
            return "datetime"
        if (is_datetimetz_na_array(values)
Contributor:

you just need is_datetimetz_array

@ThibTrip (Contributor Author):

Sorry, this was quite stupid: the is_datetimetz_array(values) part would never be executed. I have fixed this.

    return validator.validate(values)


cdef class DatetimeTZNaValidator(TemporalValidator):
Contributor:

you don't need this at all

@ThibTrip (Contributor Author), Feb 9, 2020:

Yes, I realize now this is redundant: the DatetimeTZValidator works fine with missing values (and we don't need to differentiate between datetimetz and datetimetz with na either).

        return is_null_datetime64(value)


cpdef bint is_datetimetz_na_array(ndarray values):
Contributor:

you don't need this

@ThibTrip (Contributor Author):

I removed this in my new commit.

@@ -963,7 +963,7 @@ def _sqlalchemy_type(self, col):
            TIMESTAMP,
        )

-        if col_type == "datetime64" or col_type == "datetime":
+        if col_type in ("datetime64", "datetime", "datetimetz"):
Contributor:

Can you revert all of these changes in this file?

@ThibTrip (Contributor Author), Feb 9, 2020:

I think we need this to pass the tests. As you can see, there were some failures. I will try some tests locally (I have to copy-paste them since I am not sure how to change the configuration for the connection strings) and give you feedback when I'm done :).

@@ -794,6 +794,12 @@ def test_infer_dtype_datetime(self):
        arr = np.array([n, datetime(2011, 1, 1)])
        assert lib.infer_dtype(arr, skipna=True) == "datetime"

        arr = np.array([n, datetime(2011, 1, 1, tzinfo=pytz.utc)])
Contributor:

Can you move these to a new test?

@ThibTrip (Contributor Author):

Done. I took the liberty of adding some more tests for cases that seem relevant to me. I hope that's alright.

@ThibTrip (Contributor Author) commented Feb 10, 2020

Hey, I wanted to say thanks for your patience, @jreback.

This PR has been and is quite difficult for me, and if anyone wants to help in my branch or even take over I do not mind at all. Not that I want to give up or anything; this is actually very interesting!

Anyhow, after minutely testing the sql.py module for the failures of the last CI run, I can confirm that we need to check whether the inferred col_type is equal to "datetimetz" in the method SQLTable._sqlalchemy_type; otherwise test_sqlalchemy_type_mapping and test_datetime_with_timezone_roundtrip (both in pandas/tests/io/test_sql.py) will fail with an sqlalchemy engine connected to a SQLite or postgres database (because the sql dtype in the table model will be Text instead of TIMESTAMP).

So basically, if I re-add this change to the sql.py module, the tests should pass.

However, the other bit of code I changed, in SQLiteTable._sql_type_name, where I also check whether the inferred col_type is equal to "datetimetz", does not seem to be necessary to pass the tests. But without it we would not get the expected data type in the table model, so it seems logical to me to make this change as well (see SQLite test details).

SQLite test
import sqlite3
from pandas import DataFrame, to_datetime, date_range
from pandas.io import sql
from sqlalchemy import create_engine

# with engine = sqlite3.connect(":memory:") we get the same output
engine = create_engine('sqlite:///:memory:') 
db = sql.SQLDatabase(engine)
expected = DataFrame({"A": date_range("2013-01-01 09:00:00", 
                                      periods=3, tz="US/Pacific")})
table = sql.SQLiteTable("test_type", db, frame=expected)
# if we do not check for "datetimetz" in SQLiteTable._sql_type_name
# we end up with "A" as "TEXT" instead of "TIMESTAMP" (see table below)
table.table 
['CREATE TABLE "test_type" (\n"index" INTEGER,\n  "A" TEXT \n)',
 'CREATE INDEX "ix_test_type_index"ON "test_type" ("index")']

How shall I proceed? Thanks in advance.

@mroeschke (Member) commented:

Hi @ThibTrip.

Sorry your PR went unnoticed for a good period of time. Do you have time to merge master?

I suspect we need to add that col_name == 'datetimetz' check somewhere.

@ThibTrip (Contributor Author) commented Apr 5, 2020

Hi @mroeschke! Not a problem at all! Also sorry I took so long to reply; I have been very busy with work and personal projects. I have merged master. I had a check for datetimetz but removed it (we need it at the very least here: https://github.com/pandas-dev/pandas/blob/master/pandas/io/sql.py#L973), because the idea IIRC was to touch as few files as possible and do the change in two PRs (in this one we would only make sure infer_dtype recognizes that we are in the presence of datetimes with timezone).
I am going to add it back in another commit, test what I can locally (sadly I never figured out how to test SQL locally), and then we'll see.

@ThibTrip (Contributor Author) commented Apr 6, 2020

@mroeschke I managed to test SQL locally 👍! It was not complicated at all; I just did not have sufficient knowledge back when I started this PR. I only needed to install postgres and create a database pandas_nosetest (in the tests it connects with the default postgres user on localhost).
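
For anyone else setting this up, the connection presumably looks something like this (the exact connection parameters are an assumption on my part):

from sqlalchemy import create_engine

# default postgres user on localhost, database pandas_nosetest
engine = create_engine('postgresql+psycopg2://postgres@localhost/pandas_nosetest')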

Anyhow, I have lots of errors locally that I did not have before, and in other test modules, so I am rather surprised. Some tests like pandas.tests.plotting.test_datetimelike.TestTSPlot seem completely irrelevant to my PR yet still fail. So I pushed to see if I get the same errors in the CI tools.

EDIT:

So I added the column type checks I mentioned earlier (see commit 6f9e8ab), as they are necessary to pass the tests.

@@ -1240,6 +1243,10 @@ def infer_dtype(value: object, skipna: bool = True) -> str:
    elif hasattr(value, 'dtype'):
        # this will handle ndarray-like
        # e.g. categoricals
        # begin by checking if there is a tz attribute
Contributor:

you need to move this

@ThibTrip (Contributor Author) commented Apr 7, 2020

@jreback I think you mentioned the section about DatetimeTZDtype detection before, and back then I did not know how to place it otherwise. Sorry about that. I think I now know what you had in mind and have pushed a new commit. Please tell me if it's good like this 👍.

@@ -834,6 +834,30 @@ def test_infer_dtype_datetime(self):
        arr = np.array([np.nan, "2011-01-01", pd.Timestamp("2011-01-02")])
        assert lib.infer_dtype(arr, skipna=True) == "mixed"

    def test_infer_dtypes_datetimetz(self):

        # starts with nan
Member:

Can you split this into 3 different tests?

And for these # starts with nan tests, could you use pytest.mark.parametrize?

@mroeschke (Member) commented:

Just one comment on the tests. The changes to sql.py look good.

I'll defer to @jreback on infer_dtype.

@@ -1283,6 +1286,11 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

    if util.is_array(value):
        values = value

    elif hasattr(value, 'dtype') and hasattr(value.dtype, 'tz'):
Contributor:

doesn't belong here at all

@@ -1364,7 +1372,10 @@ def infer_dtype(value: object, skipna: bool = True) -> str:

    elif PyDateTime_Check(val):
        if is_datetime_array(values):
            return "datetime"
        if is_datetimetz_array(values):
Contributor:

this is likely enough

@ThibTrip (Contributor Author):

I will simply delete the part above and test locally. I'll check again which paths the tests take. Perhaps it is not useful, or perhaps there is a better way.

@ThibTrip (Contributor Author):

So @jreback I executed my plan, and one of the tests failed (see the code of the failed test below) after I removed the part that was not in the correct place (elif hasattr(value, 'dtype') and hasattr(value.dtype, 'tz'): ...).

I followed the path taken during the test. As you can see from the code and my comments below (look for the comments with lots of hyphens), it goes through _try_infer_map (see code below), where the Series is not recognized as "datetimetz" but as "datetime64" (s.dtype.kind is "M", which in _TYPE_MAP translates to "datetime64").

So the kind and base attributes of s.dtype don't help identify our Series as "datetimetz" in _try_infer_map (see code below), and since it's a mapping and not, say, a regex, the name attribute of s.dtype, which is "datetime64[ns, US/Pacific]", won't match (we'd have to include all timezones in the mapper, which does not seem like the right solution).

Between this and the fact that the dtype is supposed to be returned afterwards, this brings me to the conclusion that the check for DatetimeTZDtype must be right before _try_infer_map, as it was in a previous commit (or we would have to do some surgery and check that no other inferences break).

Failed test

import pandas as pd
from pandas._libs import lib
from pandas import Series

# pandas/tests/series/test_constructors.py
# TestSeriesConstructors.test_constructor_with_datetime_tz
# inference
s = Series([pd.Timestamp("2013-01-01 13:00:00-0800", tz="US/Pacific"),
            pd.Timestamp("2013-01-02 14:00:00-0800", tz="US/Pacific")])
# Assertion Error 'datetime64' == 'datetimetz'
assert lib.infer_dtype(s, skipna=True) == "datetimetz"

infer_dtype

if util.is_array(value):
    values = value

elif hasattr(value, 'dtype'):  # <------------ GOES HERE
    # this will handle ndarray-like
    # e.g. categoricals
    try:
        values = getattr(value, '_values', getattr(value, 'values', value))
    except TypeError:  # <------------ GOES HERE
        # This gets hit if we have an EA, since cython expects `values`
        #  to be an ndarray
        value = _try_infer_map(value) # -----SEE try_infer_map code below----
        if value is not None:
            return value

        # its ndarray like but we can't handle
        raise ValueError(f"cannot infer type for {type(value)}")

else:
    if not isinstance(value, list):
        value = list(value)
    from pandas.core.dtypes.cast import (
        construct_1d_object_array_from_listlike)
    values = construct_1d_object_array_from_listlike(value)

try_infer_map

_TYPE_MAP = {'categorical': 'categorical', 'category': 'categorical',
             'int8': 'integer', 'int16': 'integer', 'int32': 'integer',
             'int64': 'integer', 'i': 'integer', 'uint8': 'integer',
             'uint16': 'integer', 'uint32': 'integer', 'uint64': 'integer',
             'u': 'integer', 'float32': 'floating', 'float64': 'floating',
             'f': 'floating', 'complex64': 'complex', 'complex128': 'complex',
             'c': 'complex', 'string': 'string', 'S': 'bytes', 'U': 'string', 
             'bool': 'boolean', 'b': 'boolean', 'datetime64[ns]': 'datetime64',
             'M': 'datetime64', 'timedelta64[ns]': 'timedelta64',
             'm': 'timedelta64', 'interval': 'interval'}

# NOT COPIED: types which only exist on certain platforms
# (float16, float128 and complex256)

cdef _try_infer_map(v):
    """
    If its in our map, just return the dtype.
    """
    cdef:
        object attr, val
    for attr in ['name', 'kind', 'base']:
        # -----------------------------------
        # for our example:
        # name datetime64[ns, US/Pacific]
        # kind M
        # base dtype('<M8[ns]')
        # note: base won't be called bc kind "M" in is _TYPE_MAP
        # -----------------------------------
        val = getattr(v.dtype, attr)
        if val in _TYPE_MAP:
            return _TYPE_MAP[val]
    return None

@jbrockmendel (Member):

Can we do this without doing multiple passes through the data?

@ThibTrip (Contributor Author):

@jbrockmendel cython.datetime does not seem to provide a way to distinguish between datetimes with and without timezone; however, there is something we can do, I think (I tested it locally and it works): in DatetimeValidator we can check for the absence of tzinfo.

cdef class DatetimeValidator(TemporalValidator):
    cdef bint is_value_typed(self, object value) except -1:
        return PyDateTime_Check(value) and (value.tzinfo is None) # <-----

And then we change the part with the repetition to:

    elif PyDateTime_Check(val):
        if is_datetime_array(values):
            return "datetime"
        elif is_datetimetz_array(values):
            return "datetimetz"    


# pd.Timestamps with different timezones
arr = [
pd.Timestamp("2011-01-02", tz="UTC"),
Member:

You can use pytest.mark.parametrize over [datetime, Timestamp] and construct these like the datetime form you have above.

@ThibTrip (Contributor Author):

Yep... makes sense! Don't know why I did not think of it! I will make a new commit after I do some tests related to the request of @jreback.

        return PyDateTime_Check(value) and (value.tzinfo is not None)

    cdef inline bint is_valid_null(self, object value) except -1:
        return is_null_datetime64(value)
@jbrockmendel (Member):

we might want to only accept value is NaT here

@ThibTrip (Contributor Author):

@jbrockmendel sorry, but I think that we should at least test for None as well, since infer_dtype will be used on arrays we receive from the various SQL drivers, and None is AFAIK the default null value there. As you can see in the example below (this is the issue we are currently trying to fix), the array has not been parsed to a DatetimeTZDtype and still contains None:

import pandas as pd
from sqlalchemy import create_engine

# REPLACE with valid postgresql connection string
engine = create_engine('postgresql://............')

df = pd.DataFrame({'dt': [pd.Timestamp('2019-11-14 15:12:00+0000', tz='UTC'),
                          pd.Timestamp('2019-08-07 13:37:04+0000', tz='UTC'),
                          pd.NaT]})
df.to_sql('null_value_test', con=engine, schema='tests', if_exists='replace',
          index=False)
pd.read_sql('SELECT * FROM tests.null_value_test', con=engine)
dt
0 2019-11-14 16:12:00+01:00
1 2019-08-07 15:37:04+02:00
2 None

@jbrockmendel (Member):

Thank you for clarifying. I'll need to give some thought to whether this is the appropriate place to do this check.

@jreback (Contributor) commented Jun 14, 2020

I think these changes are nice, but I'm closing this as stale. Ping or open a new PR if you want to continue.

@jreback closed this Jun 14, 2020
Labels: IO SQL, Timezones

Successfully merging this pull request may close these issues.

pd.read_sql timestamptz converted to object dtype