
BUG: data written with to_sql legacy mode (sqlite/mysql) not persistent #6846

Closed · Acanthostega opened this issue Apr 9, 2014 · 18 comments · Fixed by #6875
Labels: Bug · IO SQL (to_sql, read_sql, read_sql_query) · Regression (functionality that used to work in a prior pandas version)
Milestone: 0.14.0

Comments

@Acanthostega (Contributor)

UPDATE: this issue actually turned out to be a bug in the new legacy code, so all data written were only held in memory and never really committed to the database file itself (see discussion below).


Hi everybody,

I still have a problem writing data into an SQL database. With the following example, the data never make it into the database file, although the table structure is created (I'm assuming a pylab environment started with IPython, which provides randint and rand; see the explicit import below):

>>> import pandas as pd
>>> import sqlite3
>>> from numpy.random import randint, rand  # provided implicitly by a pylab session

>>> data = pd.DataFrame({"galid":randint(2**63-1, size=100), "objid": randint(2**63-1, size=100), "alpha": rand(100)})

>>> conn = sqlite3.connect("/tmp/bidulechouette.db")

>>> data.to_sql("DATA", conn, if_exists="replace", flavor="sqlite", index=False)

>>> # just to be sure...
>>> conn.close()
>>> conn = sqlite3.connect("/tmp/bidulechouette.db")

>>> result = pd.read_sql("SELECT objid FROM DATA;", conn)
>>> len(result)
0

I tried it on two different systems with different versions of sqlite3 and Python (3.2 and 3.4).

If I kill IPython and rerun the same code on the same database file without the if_exists option, it complains that the table already exists, even if I manually delete the sqlite3 database file. This makes me suspect that a reference to the database is kept somewhere, which is strange because IPython was killed... Or the file being written to isn't the one I expect: with a lot of data the call takes a long time, so it is clearly writing the data somewhere, which would also explain how a table can "already exist" in a deleted database...

INSTALLED VERSIONS

commit: None
python: 3.4.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.8-1-ARCH
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: fr_FR.utf8

pandas: 0.13.1-605-g61ea0a3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.13.3
statsmodels: None
IPython: 2.0.0
sphinx: 1.2.2
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.2
bottleneck: None
tables: 3.1.0
numexpr: 2.3.1
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

jorisvandenbossche added this to the 0.14.0 milestone on Apr 9, 2014
@jorisvandenbossche (Member)

See also #6843.

I can't try to reproduce it at the moment (I get OverflowError: Python int too large to convert to C long on Python 2.7, Windows, 64-bit; I think this is Windows-specific, since a C long is 32 bits on Windows).

But can you try it also with the new sqlalchemy-based functions? It should be something like:

from sqlalchemy import create_engine
engine = create_engine('sqlite:////tmp/bidulechouette.db')
data.to_sql("DATA", engine, if_exists="replace", index=False)

to see if you have the same issue with this.

@jreback (Contributor) commented Apr 9, 2014

This one is about Windows:

numpy on Windows is not friendly to large ints in the random number generation. It might be a bug, as there is no way to specify the dtype upfront (e.g. randint doesn't take a dtype argument). It could be an implementation detail as well.

So I think you would have to generate int32s, cast them to int64, then multiply to get int64-like numbers (which pandas will handle); see the sketch after the output below.

>>> np.random.randint(2**63-1,size=100)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 871, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:6682)
OverflowError: Python int too large to convert to C long

>>> np.random.randint(2**31-1,size=100)
array([1065151596,  141587709, 1034731056, 1421678991, 1113510421,
       1053591656,  721945385,  417353460,  588311441,  451124932,
       2011103288,  606400194, 1767870596, 1577034881,  801116609,
       1094153300,  681848258, 1216074099,  828001710, 1818981627,
        583590097, 1893057734, 2041021821, 1719114523,  659638699,
         36901870, 1500212655,   80923766, 1147783744, 1802598765,
        148074008,  637763026, 2105303772, 1512947348, 1699075859,
        842315446, 1365873318,  820098773,  676695949, 1474271470,
        335374748, 1477530337, 1537800347,   87529923,  516659836,
       1551506257, 1743913044, 1156954863, 1286249176, 1279615765,
       1740947158,  448293258, 1524004071, 1527588882,  795421749,
       1231177972,  264830723,  589977924,  762902232, 1805756724,
        696292676,  915029891, 1311767606,  247202537, 1929326874,
       1238932552, 2026394145,  394577189, 1241699925,  910466670,
       1911660830, 1717398704,  761388787, 1527405720, 1132149302,
        930729795, 1350701423,   50980838, 1224099292,  705686236,
       1529564451,  747506807, 1142508936,  224622576, 1217950640,
       1208551918,  365819633,  456296798,  715871164, 1228926134,
       2030700788, 1192490214, 1276642215, 1186011869, 1479536059,
        334653481, 1961647202, 1838092157,  645948253, 1868626295])
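
A minimal sketch of that int32-to-int64 workaround (the 2**32 scaling factor here is just illustrative):

import numpy as np

# Draw values in the int32 range, cast to int64, then scale up so the
# values exercise the int64 range; (2**31 - 2) * 2**32 still fits below
# the int64 maximum of 2**63 - 1.
vals = np.random.randint(2**31 - 1, size=100).astype(np.int64) * 2**32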

@jreback (Contributor) commented Apr 9, 2014

This DOES work with @jorisvandenbossche's method (Python 3.4, numpy 1.8, on 64-bit Linux); maybe a problem with the sqlite3 dtype conversion of int64s (in the legacy code)?

@jorisvandenbossche (Member)

I can't reproduce this on Windows (Python 2.7, 64-bit), with either the legacy or the new sqlalchemy method (using the int32-to-int64 conversion and multiplication suggested by @jreback above).

About the problem of the database not being deleted properly: do you also see this with other tables (e.g. a toy example with small values)? Or does it only occur with this example, with these large values / the dtype problem?

@Acanthostega (Contributor, Author)

I confirm that it works with the new SQLAlchemy method, but not with the legacy one.

This is not a problem specific to long integers (int64), since I can reproduce it with 2**8 instead of 2**63-1.

If I don't close the connection returned by sqlite3.connect in the middle, the data can be read back (from memory). But once the IPython console is closed, the database file is still empty (apart from the table structure).

@cpcloud (Member) commented Apr 9, 2014

@Acanthostega can you try writing a small test case and running nosetests the_test.py, to see if this is an IPython issue? thx

@Acanthostega (Contributor, Author)

I tried with nosetests and with a plain python interpreter, but got the same result...

@jorisvandenbossche (Member)

@Acanthostega Can you post the test you used?

@Acanthostega (Contributor, Author)

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import sqlite3

data = pd.DataFrame(
    {
        "galid": np.random.randint(2**8, size=100),
        "truc": np.random.rand(100),
    }
)

conn = sqlite3.connect("/tmp/machinchose.db")

data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")

conn.close()
conn = sqlite3.connect("/tmp/machinchose.db")

result = pd.read_sql("SELECT galid FROM DATA;", conn)

print(len(result))

I'm not familiar with nosetests, so I wrote a simple script to check the resulting database and ran it with the python interpreter. I can improve the test if you want, but I would have to read the docs a little first...

@cpcloud (Member) commented Apr 9, 2014

Just write a function called test_database, then run it with nosetests the_file_containing_test_database.py.

@jorisvandenbossche (Member)

OK, so I think the problem is not with too-large ints or with mixed dtypes or anything like that, but with the database connection not being closed correctly. Because if you leave out the conn.close() in the above script, it works for me.

@Acanthostega (Contributor, Author)

@jorisvandenbossche So, do you reproduce the problem when closing the connection to the database? Because if I don't close conn, the data can be read back (from memory), but nothing is present in the database file.

@cpcloud I rewrote the test:

#!/usr/bin/env python3
#-*- coding: utf-8 -*-

import pandas as pd
import numpy as np
import sqlite3

def test_database():
    data = pd.DataFrame(
        {
            "galid": np.random.randint(2**8, size=100),
            "truc": np.random.rand(100),
        }
    )

    conn = sqlite3.connect("/tmp/machinchose.db")

    data.to_sql("DATA", conn, if_exists="replace", index=False, flavor="sqlite")

    conn.close()
    conn = sqlite3.connect("/tmp/machinchose.db")

    result = pd.read_sql("SELECT galid FROM DATA;", conn)

    print(len(result))
    assert len(result) == len(data)

Same problem.

@jorisvandenbossche (Member)

Yes, indeed I reproduce the problem when closing and reopening the connection manually, so the data are indeed never written to the database file. I think a conn.commit() (see https://docs.python.org/2/library/sqlite3.html) is missing in the insert function in the legacy code (https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L779).
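
For reference, a minimal standalone sketch of the underlying sqlite3 behaviour (the file path and table name are just illustrative): without an explicit commit, rows inserted through a connection are rolled back when it is closed, while the CREATE TABLE survives, which matches the symptom above exactly.

import sqlite3

con = sqlite3.connect("/tmp/commit_demo.db")
con.execute("CREATE TABLE IF NOT EXISTS demo (x INTEGER)")  # runs in autocommit mode, so the table persists
con.execute("INSERT INTO demo VALUES (1)")                  # opens an implicit transaction...
# con.commit()                                              # ...which is lost without this commit
con.close()

con = sqlite3.connect("/tmp/commit_demo.db")
print(con.execute("SELECT count(*) FROM demo").fetchall())  # [(0,)]
con.close()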

@jorisvandenbossche
Copy link
Member

There was a conn.commit() in the previous version in 0.13.1 (https://github.com/pydata/pandas/blob/v0.13.1/pandas/io/sql.py#L240), so I think this is a regression.

@Acanthostega (Contributor, Author)

@jorisvandenbossche I added a call to commit() in the insert method, and the problem seems to disappear. Do you want me to propose a patch, or do you think it's better if you do it yourself? I think a test for writing and reading an SQL database with the legacy code should also be added, with the connection to the database closed before reading, to avoid any future regression. But I don't know the pandas internals well enough to propose one...

@jorisvandenbossche (Member)

@Acanthostega Good, if you want, certainly put up a PR!
A test would certainly be good too. I only don't know how to do this with sqlite, as we use an in-memory sqlite database in the tests, and I think that loses all content when the connection is closed, so we can't really test this bug that way (a file-based sketch is below).
For mysql we should certainly test it.
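
A minimal sketch of what a file-based persistence test could look like (the temp-file handling and the name test_legacy_sqlite_commit are hypothetical, not the actual test pandas ended up with):

import os
import sqlite3
import tempfile

import numpy as np
import pandas as pd

def test_legacy_sqlite_commit():
    # Write through the legacy sqlite path, close the connection, then
    # reopen the file and check that the rows actually persisted.
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    try:
        frame = pd.DataFrame({"a": np.arange(10)})
        con = sqlite3.connect(path)
        frame.to_sql("test_commit", con, flavor="sqlite", index=False)
        con.close()

        con = sqlite3.connect(path)
        result = pd.read_sql("SELECT a FROM test_commit;", con)
        con.close()
        assert len(result) == len(frame)
    finally:
        os.remove(path)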

@Acanthostega (Contributor, Author)

@jorisvandenbossche OK! If I have enough time, I will try to do it this weekend!

jorisvandenbossche changed the title from "Can't write data into SQL database with sqlite3" to "BUG: data written with to_sql legacy mode (sqlite/mysql) not persistent" on Apr 9, 2014
@jorisvandenbossche (Member)

@Acanthostega OK, let me know how it goes!

I updated the issue title/description to reflect the discussion.
