Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.read_sql timestamptz converted to object dtype #30207

Open
ThibTrip opened this issue Dec 11, 2019 · 4 comments
Open

pd.read_sql timestamptz converted to object dtype #30207

ThibTrip opened this issue Dec 11, 2019 · 4 comments
Assignees
Labels
Bug IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype

Comments

@ThibTrip
Copy link
Contributor

Code Sample, a copy-pastable example if possible

from pandas import Timestamp, NaT
import pandas as pd
from sqlalchemy import create_engine

# create test data
data = {'create_date': [Timestamp('2019-11-22 10:59:44+0000', tz='UTC'),
                        Timestamp('2019-11-21 15:27:41+0000', tz='UTC'),
                        Timestamp('2019-11-21 15:25:42+0000', tz='UTC'),
                        Timestamp('2019-11-19 14:35:52+0000', tz='UTC'),
                        Timestamp('2019-11-19 13:54:44+0000', tz='UTC'),
                        Timestamp('2019-11-14 15:12:00+0000', tz='UTC'),
                        Timestamp('2019-08-07 13:37:04+0000', tz='UTC'),
                        Timestamp('2019-12-04 14:47:42+0000', tz='UTC')],
       'change_date': [Timestamp('2019-11-22 10:59:44+0000', tz='UTC'),
                       Timestamp('2019-11-21 15:27:42+0000', tz='UTC'),
                       Timestamp('2019-11-21 15:25:43+0000', tz='UTC'),
                       Timestamp('2019-11-19 14:35:53+0000', tz='UTC'),
                       Timestamp('2019-11-19 13:54:45+0000', tz='UTC'),
                       Timestamp('2019-11-14 15:13:33+0000', tz='UTC'),
                       Timestamp('2019-08-09 13:01:13+0000', tz='UTC'),
                       Timestamp('2019-12-04 14:54:57+0000', tz='UTC')],
        'unsubscribe_date': [NaT,
                             Timestamp('2019-12-09 12:58:01+0000', tz='UTC'),
                             Timestamp('2019-12-09 12:58:28+0000', tz='UTC'),
                             NaT,
                             Timestamp('2019-12-09 12:58:24+0000', tz='UTC'),
                             Timestamp('2019-12-09 12:58:19+0000', tz='UTC'),
                             NaT,
                             NaT]}

# create DataFrame from test data 
df_test = pd.DataFrame(data)
print('test DataFrame\n', df_test, '\n')
print(df_test.dtypes, '\n')


# connect to a postgres database (set CONNECTION_STRING)
engine = create_engine('CONNECTION_STRING')


# save DataFrame to postgres
df_test.to_sql(name = 'timezone_test',
               con = engine, 
               index = False,
               if_exists = 'fail')

# make sure the data type in postgres is "timestamp with time zone"
df_db_dtypes = pd.read_sql("""SELECT column_name, data_type FROM information_schema.columns 
                              WHERE table_name = 'timezone_test'
                              AND table_schema = 'public';""", 
                           con = engine, 
                           index_col = 'column_name')
print('datatypes of test DataFrame in postgres\n', df_db_dtypes, '\n')

# read DataFrame from postgres
df_db = pd.read_sql('SELECT * FROM timezone_test', con = engine)
print('test DataFrame as read from postgres\n', df_db, '\n')
# data types should all be datetime64[ns, UTC]
print(df_db.dtypes, '\n')


# attempt to append the same data in the postgres table
try:
    df_db.to_sql(name = 'timezone_test', 
                 con = engine,
                 index = False,
                 if_exists = 'append')
except Exception as e:
    # we should get the error "Tz-aware datetime.datetime cannot be converted to datetime64\
    # unless utc=True"
    print('could not append the same data:',e,'\n')


    
# I think the problem is that pandas recognizes the data types in postgres are "timestamptz"\
# and tries to convert from "object" to datetime64 UTC dtype before saving but fails
try:
    pd.to_datetime(df_db['create_date']) # it would work with argument utc = True
except Exception as e:
    print('could not convert column "create_date" to datetime64[ns, UTC]:', e)

Output

test DataFrame
                 create_date               change_date          unsubscribe_date
0 2019-11-22 10:59:44+00:00 2019-11-22 10:59:44+00:00                       NaT
1 2019-11-21 15:27:41+00:00 2019-11-21 15:27:42+00:00 2019-12-09 12:58:01+00:00
2 2019-11-21 15:25:42+00:00 2019-11-21 15:25:43+00:00 2019-12-09 12:58:28+00:00
3 2019-11-19 14:35:52+00:00 2019-11-19 14:35:53+00:00                       NaT
4 2019-11-19 13:54:44+00:00 2019-11-19 13:54:45+00:00 2019-12-09 12:58:24+00:00
5 2019-11-14 15:12:00+00:00 2019-11-14 15:13:33+00:00 2019-12-09 12:58:19+00:00
6 2019-08-07 13:37:04+00:00 2019-08-09 13:01:13+00:00                       NaT
7 2019-12-04 14:47:42+00:00 2019-12-04 14:54:57+00:00                       NaT 

create_date         datetime64[ns, UTC]
change_date         datetime64[ns, UTC]
unsubscribe_date    datetime64[ns, UTC]
dtype: object 

datatypes of test DataFrame in postgres
                                  data_type
column_name                               
create_date       timestamp with time zone
change_date       timestamp with time zone
unsubscribe_date  timestamp with time zone 

test DataFrame as read from postgres
                  create_date                change_date          unsubscribe_date
0  2019-11-22 11:59:44+01:00  2019-11-22 11:59:44+01:00                       NaT
1  2019-11-21 16:27:41+01:00  2019-11-21 16:27:42+01:00 2019-12-09 12:58:01+00:00
2  2019-11-21 16:25:42+01:00  2019-11-21 16:25:43+01:00 2019-12-09 12:58:28+00:00
3  2019-11-19 15:35:52+01:00  2019-11-19 15:35:53+01:00                       NaT
4  2019-11-19 14:54:44+01:00  2019-11-19 14:54:45+01:00 2019-12-09 12:58:24+00:00
5  2019-11-14 16:12:00+01:00  2019-11-14 16:13:33+01:00 2019-12-09 12:58:19+00:00
6  2019-08-07 15:37:04+02:00  2019-08-09 15:01:13+02:00                       NaT
7  2019-12-04 15:47:42+01:00  2019-12-04 15:54:57+01:00                       NaT 

create_date                      object
change_date                      object
unsubscribe_date    datetime64[ns, UTC]
dtype: object 

could not append the same data: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True 

could not convert column "create_date" to datetime64[ns, UTC]: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

Problem description

There are 2 problems here:

  1. pandas reads 2 columns of the test DataFrame I save in posgres as "object" and not "datetime64[ns, UTC]" although in postgres the data types are all "timestamp with time zone"

  2. when I attempt to append the postgres table to itself via pandas (pd.DataFrame.read_sql then pd.DataFrame.to_sql) it fails when trying to convert the 2 columns "object" to datetime. So perhaps there is an issue with pd.to_datetime or I am missing something here.

I would make another issue for the second problem but I cannot reproduce it other than with this workflow.

Expected Output

  • When reading the data from the postgres table I expect those datatypes:
df_db = pd.read_sql('SELECT * FROM timezone_test', con = engine)
print(df_db.dtypes)
create_date         datetime64[ns, UTC]
change_date         datetime64[ns, UTC]
unsubscribe_date    datetime64[ns, UTC]
dtype: object 
  • Also I expect pd.DataFrame.to_sql to not throw any exception even in the case where "create_date" and "change_date" are of "object" dtypes.
df_db.to_sql(name = 'timezone_test', 
             con = engine,
             index = False,
             if_exists = 'append')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 85 Stepping 4, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.12
pytest : 5.0.1
hypothesis : None
sphinx : 2.1.2
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.4
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.10.1
IPython : 7.6.1
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.4
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.14.0
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.5
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@jbrockmendel jbrockmendel added the IO SQL to_sql, read_sql, read_sql_query label Dec 11, 2019
@ThibTrip
Copy link
Contributor Author

I have found the culprit and there was a similar problem in the PR #11216.

So in the end the issue is not directly related to SQL and I think the example below pinpoints the problem. I don't see a solution other than discarding offsets.

import pandas as pd
import datetime
import psycopg2
from pandas.api.types import is_datetime64_any_dtype

# create datetimes with different offsets (60 and 120 minutes respectively)
data = [[datetime.datetime(2019, 11, 14, 16, 12, 
         tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60))],
        [datetime.datetime(2019, 8, 7, 15, 37, 4, 
         tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=120))]]

# different offsets causes the data to be read as object dtype\
# instead of any datetime dtype
# pd.DataFrame.from_records is what is used in functions that read_sql uses
df = pd.DataFrame.from_records(data, columns = ['ts'])
df.dtypes
ts    object
dtype: object

# also this outputs False instead of True
is_datetime64_any_dtype(df['ts'])
False

# upon using pd.to_sql pd.to_datetime will be executed but
# it won't work since there are different offsets
pd.to_datetime(df['ts'])
Traceback
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/GitHub/pandas_master/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1968         try:
-> 1969             values, tz_parsed = conversion.datetime_to_datetime64(data)
   1970             # If tzaware, these values represent unix timestamps, so we

~/GitHub/pandas_master/pandas/_libs/tslibs/conversion.pyx in pandas._libs.tslibs.conversion.datetime_to_datetime64()

ValueError: Array must be all same time zone

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-55-46efc3100ca6> in <module>
----> 1 pd.to_datetime(df['ts'])

~/GitHub/pandas_master/pandas/core/tools/datetimes.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, format, exact, unit, infer_datetime_format, origin, cache)
    717             result = arg.map(cache_array)
    718         else:
--> 719             values = convert_listlike(arg._values, format)
    720             result = arg._constructor(values, index=arg.index, name=arg.name)
    721     elif isinstance(arg, (ABCDataFrame, abc.MutableMapping)):

~/GitHub/pandas_master/pandas/core/tools/datetimes.py in _convert_listlike_datetimes(arg, format, name, tz, unit, errors, infer_datetime_format, dayfirst, yearfirst, exact)
    431             errors=errors,
    432             require_iso8601=require_iso8601,
--> 433             allow_object=True,
    434         )
    435 

~/GitHub/pandas_master/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1972             return values.view("i8"), tz_parsed
   1973         except (ValueError, TypeError):
-> 1974             raise e
   1975 
   1976     if tz_parsed is not None:

~/GitHub/pandas_master/pandas/core/arrays/datetimes.py in objects_to_datetime64ns(data, dayfirst, yearfirst, utc, errors, require_iso8601, allow_object)
   1963             dayfirst=dayfirst,
   1964             yearfirst=yearfirst,
-> 1965             require_iso8601=require_iso8601,
   1966         )
   1967     except ValueError as e:

~/GitHub/pandas_master/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

~/GitHub/pandas_master/pandas/_libs/tslib.pyx in pandas._libs.tslib.array_to_datetime()

ValueError: Tz-aware datetime.datetime cannot be converted to datetime64 unless utc=True

Solution with caveat (offsets are dropped)

pd.to_datetime(df['ts'], utc = True)
0   2019-11-14 15:12:00+00:00
1   2019-08-07 13:37:04+00:00
Name: ts, dtype: datetime64[ns, UTC]

I tested this directly in the master so you'll find the output of pd.show_versions() here again.

Output of pd.show_versions() INSTALLED VERSIONS ------------------ commit : None python : 3.7.3.final.0 python-bits : 64 OS : Linux OS-release : 5.3.0-20-generic machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 0.26.0.dev0+1382.g3577b5a34.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191201
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.55.4
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.10.3
IPython : 7.10.1
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.2
s3fs : 0.4.0
scipy : 1.4.0
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6

@mroeschke mroeschke added the Timezones Timezone data dtype label Dec 24, 2019
@mroeschke
Copy link
Member

It's pandas' policy to convert TIMESTAMP WITH TIME ZONE database types to UTC in pandas, so converting the incoming data to UTC is an acceptable solution. Happy to have a PR with the change!

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#datetime-data-types

@ThibTrip
Copy link
Contributor Author

Hi @mroeschke, thanks for your answer :). I was able to patch to_sql function.
I was also able to patch the read_sql function but perhaps this is not a desired behavior. I will explain everything in a PR soon (after Christmas time).

@ThibTrip
Copy link
Contributor Author

take

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO SQL to_sql, read_sql, read_sql_query Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants