"too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue #21103
Comments
This is also an issue when connecting to MSSQL. It fails silently: it neither stops the function nor raises an exception, but it will throw one if you set "chunksize" manually. |
@kristang: I just played around a bit more. Have you tried a chunksize limit of 75 or less? With this chunksize my case works for the local db, and strangely a chunksize of 499 works for a sqlite db in memory; neither of these is the limit stated by sqlite. Have you tried different chunksizes in your case? |
I tried 999, 1000 and 2100, but then just ended up downgrading to 0.22 to get on with my day. |
@tripkane could I bother you to narrow down the original example a bit? It should be minimal, and something that's appropriate for sticking in a unit test. Could you also include the traceback you get? |
@kristang: :) Fair enough. Maybe try out these lower numbers when you have time. |
@TomAugspurger: Hi Tom. Yep, can do. So reduce it to my actual case (i.e. a locally saved database, not in memory) and create the data from np.arange like the :memory: version? |
In-memory is fine if it reproduces the error.
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports may be
helpful.
|
@TomAugspurger: I have shortened it and updated the comments. Basically, chunksize must be used, and it has to be estimated case by case from the data size of one row. This person implemented this solution with peewee: https://stackoverflow.com/questions/35616602/peewee-operationalerror-too-many-sql-variables-on-upsert-of-only-150-rows-8-c Anyway, I hope this helps. For those with this issue, in the meantime you can't go wrong with a very small chunksize such as 1 :) |
@TomAugspurger: After further testing the initial assumption seems correct and is based on the SQLITE_MAX_VARIABLE_NUMBER limit of 999. The max allowable chunksize associated with pd.df_to_sql is given by |
@tripkane thanks for the minimal example! I can confirm this regression. |
@jorisvandenbossche: You're welcome, happy to help. |
Since you guys are looking into this... Can you guys check if the change that caused this regression actually improved performance? As I reported on #8953 this change actually made the operation slower for me. And IMO should be reverted... |
Best to open a separate issue to track that. Could you do some profiling as
well, to see where the slowdown is?
|
@TomAugspurger: I tested this quickly with essentially the same example as above for df's of two different lengths (20k and 200k rows), comparing chunksize=1 with chunksize=999//(cols+1) for cols = 10. In both cases it was twice as fast when connected to a local sqlite db. |
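The rule of thumb quoted above, chunksize = 999 // (cols + 1), can be sketched as a small helper. The function name and its parameters are hypothetical and not part of pandas. The sketch assumes that every column plus the index consumes one bound variable per row, and that the SQLite build in use has the 999-variable limit discussed in this thread (other builds may be compiled with a different limit).

```python
def max_rows_per_insert(n_columns, index_columns=1, max_variables=999):
    """Largest chunksize whose multi-row INSERT stays within the variable limit.

    Hypothetical helper: each row binds one variable per data column plus
    one per index column, so a chunk of k rows binds
    k * (n_columns + index_columns) variables in total.
    """
    variables_per_row = n_columns + index_columns
    if variables_per_row > max_variables:
        raise ValueError("a single row already exceeds the variable limit")
    return max_variables // variables_per_row


# One data column plus the index, as in the in-memory example of this thread.
print(max_rows_per_insert(1))   # 499
# Ten data columns plus the index, as in the timing comparison above.
print(max_rows_per_insert(10))  # 90
```

For a DataFrame written without its index (index=False in to_sql), the same helper would be called with index_columns=0.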
@TomAugspurger I created issue #21146 to track this performance regression. |
@pandas-dev/pandas-core How do people feel about reverting #19664? It both breaks existing code and degrades performance significantly. Happy to put up the patch. |
Reverting seems reasonable for now.
Is there anything we can salvage from #19664 so that users can opt into
multi-row inserts if they want?
|
cc @danfrankj |
In the original issue (#8953), an optional keyword for this was discussed, since which default would be best depends a lot on the exact circumstances (this discussion was largely ignored in the PR, sorry about not raising this in the PR). I think it would still be useful for someone with some time to go through the discussion in that issue to check the use cases raised there, how it interacts with chunksize, etc., as I don't recall everything that was discussed there. |
IMO pandas should allow more flexibility here: users pass in a function that will get called with the SQLAlchemy table and they can construct the DML themselves. The default DML construction function would be the previous working code. Then any such optimization that comes up in the future doesn't need yet another keyword argument. I wonder if there isn't already a way to do this without monkey patching. |
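As a rough illustration of the idea above (the caller passes in a function that builds and executes the insert, and the default is the row-by-row behaviour), here is a minimal sketch. It uses the standard-library sqlite3 module rather than SQLAlchemy, and every name in it (write_rows, insert_row_by_row, insert_multi_values, insert_method) is hypothetical; it is not the pandas implementation or the interface that was eventually proposed.

```python
import sqlite3


def _column_list(columns):
    """Quote column names so that names like "index" or "0" are valid SQL."""
    return ", ".join(f'"{c}"' for c in columns)


def insert_row_by_row(conn, table, columns, rows):
    """Default strategy: a single-row INSERT, executed once per row."""
    placeholders = ", ".join("?" for _ in columns)
    sql = f'INSERT INTO "{table}" ({_column_list(columns)}) VALUES ({placeholders})'
    conn.executemany(sql, rows)


def insert_multi_values(conn, table, columns, rows):
    """Opt-in strategy: one INSERT statement carrying every row of the chunk.

    Binds len(rows) * len(columns) variables, so the chunk must be small
    enough to stay under the database's variable limit.
    """
    row_group = "(" + ", ".join("?" for _ in columns) + ")"
    values = ", ".join(row_group for _ in rows)
    sql = f'INSERT INTO "{table}" ({_column_list(columns)}) VALUES {values}'
    flat_params = [value for row in rows for value in row]
    conn.execute(sql, flat_params)


def write_rows(conn, table, columns, rows, chunksize, insert_method=insert_row_by_row):
    """Split rows into chunks and hand each chunk to the chosen strategy."""
    rows = list(rows)
    for start in range(0, len(rows), chunksize):
        insert_method(conn, table, columns, rows[start:start + chunksize])
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test ("index" INTEGER, "0" INTEGER)')
data = [(i, i * 2) for i in range(20000)]

# Default behaviour: one bound row per executed statement.
write_rows(conn, "test", ["index", "0"], data, chunksize=499)
# Opt-in multi-values: 499 rows * 2 columns = 998 variables per statement.
write_rows(conn, "test", ["index", "0"], data, chunksize=499,
           insert_method=insert_multi_values)

row_count = conn.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(row_count)
```

Because the strategy is an argument of each call rather than a global patch, two writes in the same application can use different strategies, which is the flexibility a monkeypatch cannot give.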
@cpcloud As of today there is no way to achieve this without monkey patching. I could work on a patch to make this more flexible. I suggest adding a parameter |
…s-dev#21103) Also revert default insert method to NOT use multi-value.
The way without monkeypatching is subclassing the table class and overriding the appropriate method, but that is not necessarily cleaner than monkeypatching, as it is private methods you are overriding anyway. I like the proposed idea of allowing more flexibility. But we should think a bit about the exact interface we want (what does a user need to create the appropriate construct? The sqlalchemy table, the data (in which form?), the connection, ...). |
Also you would need to be careful with subclassing the table class, as there are 2 implementations: SQLAlchemy and SQLite. Another problem of monkeypatching/subclassing (as opposed to a parameter) is that the same method has to be used globally by an application, and cannot be fine-tuned for each call of
I think just follow the API already used for
That was enough for me to implement postgresql |
Yeah, for Anyway, let's leave that discussion as I think we agree that we need some way to customize it (at least to switch between the single / multi values to solve the regression).
Would there be occasions where a user would want the actual dataframe instead of @cpcloud what do you think of this interface? @schettino72 One option (for the PR) would also be to implement this parameter, but leave the postgres-copy one as an example of how to use this (to not add an option that only works for a single engine for now; although given its superior speed, we should really think how to deal with this). |
Hi all! My apologies for the regression introduced. I like the solution in #21199 much better. For context, the motivation for needing a multivalues insert was to interact with a presto database. In that case, a normal insert occurs row by row incurring a network call thus making it prohibitively slow. This may be the case for other analytic databases as well (redshift, bigquery). But definitely we should not have this be the default so thank you @schettino72 for cleaning this up and again apologies. |
Just verified that multivalues has this behavior for Redshift as well.
|
@schettino72 do you have time to work on the PR? |
Sure. I was waiting for more feedback on the API. From my side I guess I just need to update docstrings and docs... Should I target 0.24 or 0.23.1? |
I think 0.23.1
|
Yes, we certainly want to do something for 0.23.1 (at a minimum, revert). But it would be nice to get the option in there at the same time as well. |
Had the initial issue using 0.23.0; updating to 0.23.4 corrected the problem. Thanks for the fix. |
On 1.0.1 and still facing the same issue |
@theSuiGenerisAakash try removing "method='multi'". This might help. |
reduce to see whether it will work |
This defeats the purpose |
This happens on pandas 1.0.1; this issue should be reopened!
OS: Windows 10. It seems some people don't have this problem on Linux and Mac. |
Getting the error on 1.1.5. For a wide dataframe (72 cols), I cannot go above a chunksize of 10. Uploading my dataframe (of only 5 MB) initially took 30 min, adding I am working on Win10. EDIT: It actually sometimes works and sometimes fails. That's unfortunate. |
Problem description
In pandas 0.22 I could write a dataframe of reasonable size to sql without error. Now I receive this error: "OperationalError: (sqlite3.OperationalError) too many SQL variables". I am converting a dataframe with ~20k+ rows to sql. After looking around, I suspect the problem lies in the limit set by sqlite3, SQLITE_MAX_VARIABLE_NUMBER, which is set to 999 by default (based on their docs). This can apparently be changed by recompiling sqlite and adjusting this variable accordingly. I can confirm that this works for a df of length (rows) 499. I can also confirm that this test version works with a length of 20k rows when a chunksize of 499 is passed to df.to_sql. In my real case the limit is 76. These numbers are clearly dependent on the data size of each row, so a method is required to estimate this based on data type and number of columns. #19664
Expected Output
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Traceback
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
Traceback (most recent call last):
File "", line 1, in
runfile('H:/Tests/Pandas_0.23_test.py', wdir='H:/Tests')
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "H:/Tests/Pandas_0.23_test.py", line 19, in
df.to_sql('test',engine, if_exists='fail',chunksize=500)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 2127, in to_sql
dtype=dtype)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 450, in to_sql
chunksize=chunksize, dtype=dtype)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 1149, in to_sql
table.insert(chunksize)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 663, in insert
self._execute_insert(conn, keys, chunk_iter)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\sql.py", line 638, in _execute_insert
conn.execute(*self.insert_statement(data, conn))
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 948, in execute
return meth(self, multiparams, params)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\sql\elements.py", line 269, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1060, in _execute_clauseelement
compiled_sql, distilled_params
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1200, in _execute_context
context)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1413, in _handle_dbapi_exception
exc_info
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 203, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\util\compat.py", line 186, in reraise
raise value.with_traceback(tb)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\base.py", line 1193, in _execute_context
context)
File "C:\Users\kane.hill\AppData\Local\Continuum\anaconda3\lib\site-packages\sqlalchemy\engine\default.py", line 507, in do_execute
cursor.execute(statement, parameters)
OperationalError: (sqlite3.OperationalError) too many SQL variables [SQL: 'INSERT INTO test ("index", "0") VALUES (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?, ?), (?,
Output of pd.show_versions()
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None
pandas: 0.23.0
pytest: 3.5.1
pip: 10.0.1
setuptools: 39.1.0
Cython: 0.28.2
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: 1.7.4
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: 1.2.1
tables: 3.4.3
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.5.3
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.4
lxml: 4.2.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.7
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None