
ENH: sql support for NaN/NaT conversions #2754

Closed
jreback opened this issue Jan 25, 2013 · 13 comments · Fixed by #8208
Labels: Bug, Dtype Conversions, IO Data, IO SQL

Comments

@jreback (Contributor) commented Jan 25, 2013

UPDATE from @jorisvandenbossche:

Overview of current status writing nan values (see also tests added in #7100):

  • For MySQL using pymysql/MySQLdb nothing works: you get the error Unknown column 'nan' in 'field list' (see also e.g. http://stackoverflow.com/questions/23353732/python-pandas-write-to-sql-with-nan-values)
  • Numeric columns:
    • working for sqlite and postgresql
    • only all-NaN columns come back as None in sqlite
  • Object columns (e.g. strings):
    • for postgresql: NaN is converted to the string u'NaN', which is not really what we want
    • for sqlite it is returned as None
  • NaT:
    • postgresql: gives an error on inserting "0001-255-255T00:00:00"
    • sqlite3: writing works, but reading it back with a query returns '-001--1--1 -1:-1:-1.-00001'
  • MSSQL: not working, with the message "The supplied value is not a valid instance of data type float"; see Write Null values to mssql table #8088 for more details

not sure exactly what SQL expects (presumably None) rather than np.nan (or NaT)

https://groups.google.com/forum/?fromgroups#!topic/pydata/lxhnFtuzvWQ

also: return pandas datetime64[ns] instead of datetime/date types (#3532)
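A common workaround until to_sql handles this natively (a sketch in plain pandas, not part of this issue's eventual fix): cast to object dtype and swap NaN/NaT for a real Python None, which DB-API drivers translate to SQL NULL on insert.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": pd.to_datetime(["2013-01-01", None, "2013-01-03"]),
})

# Cast to object so a cell can hold a real Python None,
# then replace anything pandas considers missing.
cleaned = df.astype(object)
for col in cleaned.columns:
    cleaned.loc[cleaned[col].isna(), col] = None
```

The cleaned frame's rows can then be handed to any DB-API driver, which binds None as NULL.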

@danielballan (Contributor)

Yes, I believe it expects Nones. I will post some code this weekend.

@danielballan (Contributor)

I may be in over my head here, actually, because performance is crucial. Would it be helpful to consider #2717?

@jreback (Contributor, Author) commented Mar 7, 2013

this is actually tricky, you might need to do some cython to get this fast...

e.g. in pytables I had to pass a numpy rec-array because that is how it likes it

how are you passing the data: as an ndarray of dtype=object? a rec-array? a list?

@jreback (Contributor, Author) commented Mar 7, 2013

try looking at this:

pandas/io/pytables.py, AppendableTable.write_data (write a test table and break there)

it uses this to write basically a numpy rec-array

lib.pyx/create_hdf_rows_2d

pretty sure you could do something similar
(essentially you are creating a tuple of the data, pulling it out of the frame, and then replacing missing data with whatever you need)

this is a bit more complicated, because I use a mask to avoid writing completely empty rows (which is much more common in say writing a panel)
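That tuple-of-the-data idea, sketched for the SQL case (hypothetical, not the eventual implementation): pull each row out of the frame as a tuple and substitute None for anything missing before handing it to the driver.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan], "y": ["a", None]})

# One plain tuple per row, with anything pandas considers missing
# replaced by None (which DB-API drivers turn into SQL NULL).
rows = [
    tuple(None if pd.isna(v) else v for v in row)
    for row in df.itertuples(index=False, name=None)
]
```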

@jreback (Contributor, Author) commented Mar 15, 2013

see #3047 for a strategy for how you should deal with conversion to datetime64[ns] to make sure that you are returning NaT for missing values

(this might actually be reasonably efficient too)

@jreback (Contributor, Author) commented Mar 15, 2013

you can also do this if you have the array directly

wesm commented 10 minutes ago
I'd recommend using pandas.to_datetime(df['date'])
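For context, in modern pandas to_datetime with errors="coerce" returns NaT for missing or unparseable values, which matches the strategy discussed in #3047 (a sketch, not the code in this issue):

```python
import pandas as pd

s = pd.Series(["2013-01-01", None, "not-a-date"])

# Missing and unparseable entries both come back as NaT
# instead of raising.
dt = pd.to_datetime(s, errors="coerce")
```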

@jorisvandenbossche (Member)

Actually I think we should try to fix this before 0.14.
Basic NaN handling is really essential.

@jorisvandenbossche jorisvandenbossche changed the title ENH: sql to provided NaN/NaT conversions ENH: sql support for NaN/NaT conversions May 13, 2014

@jreback (Contributor, Author) commented May 13, 2014

in HDF I do a translation of NaN to a string that can be user specified (as I can't store the NaN directly)

I think that is a reasonable solution in case it can't be natively stored (or maybe always?)

http://pandas-docs.github.io/pandas-docs-travis/io.html#string-columns (see nan_rep)
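The nan_rep idea, translated to plain pandas (a sketch; the sentinel string is arbitrary and user-specified, as with HDF's nan_rep):

```python
import numpy as np
import pandas as pd

nan_rep = "nan"  # user-specified sentinel, standing in for NaN in the store

s = pd.Series(["a", np.nan, "c"])
stored = s.fillna(nan_rep)                   # what would be written out
restored = stored.replace(nan_rep, np.nan)   # round-trip back to NaN on read
```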

@jorisvandenbossche (Member)

But in theory it should be possible to store it natively in SQL, I think, as there is the NULL value.
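Indeed, DB-API drivers bind a Python None as SQL NULL, as a quick sqlite3 round-trip shows (a minimal sketch using only the standard library):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a REAL)")

# Python None is bound as SQL NULL by the DB-API driver.
conn.execute("INSERT INTO t (a) VALUES (?)", (None,))

# The NULL round-trips back as Python None.
value = conn.execute("SELECT a FROM t").fetchone()[0]
```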

@jreback (Contributor, Author) commented May 13, 2014

There are 2 approaches you can use here:

This is what I do in sqlalchemy (I am using a custom date-time type; not Timestamp, but the same idea):
it's easy enough to convert Timestamps to datetimes directly and convert NaT to None

import sqlalchemy as sql

class DateTime(sql.types.TypeDecorator):
    """Convert to/from the DateTime type."""

    impl = sql.types.DateTime

    def process_bind_param(self, value, dialect):
        # on the way in: convert to a naive datetime for the db (None passes through)
        f = getattr(value, 'as_naive_datetime_for_db', None)
        if f is not None:
            return f()
        return value

    def process_result_value(self, value, dialect):
        # on the way out: NULL -> None, else wrap in the project's date type
        if value is None:
            return None
        return dates.Date(value)

alternatively, you can vectorize the conversion

In [23]: df = DataFrame(dict(A = Timestamp('20130101')),index=range(3))

In [26]: df.iloc[1] = np.nan

In [27]: df
Out[27]: 
           A
0 2013-01-01
1        NaT
2 2013-01-01

[3 rows x 1 columns]

In [35]: s = df['A'].astype(object).values

In [36]: s
Out[36]: 
array([datetime.datetime(2013, 1, 1, 0, 0), nan,
       datetime.datetime(2013, 1, 1, 0, 0)], dtype=object)

In [38]: s[isnull(s)] = None

In [39]: s
Out[39]: 
array([datetime.datetime(2013, 1, 1, 0, 0), None,
       datetime.datetime(2013, 1, 1, 0, 0)], dtype=object)

@jreback (Contributor, Author) commented May 15, 2014

@jorisvandenbossche ok...move to 0.14.1?

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.14.1, 0.14.0 May 15, 2014
@jorisvandenbossche (Member)

yep

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.15.0, 0.14.1 Jul 1, 2014