Use multi-row inserts for massive speedups on to_sql over high latency connections #8953
Comments
This seems reasonable. Thanks for investigating this! For the implementation, it will depend on how sqlalchemy deals with database flavors that do not support this (I can't test this at the moment, but it seems that sqlalchemy raises an error, e.g. http://stackoverflow.com/questions/23886764/multiple-insert-statements-in-mssql-with-sqlalchemy). Also, if it has the consequence that a lot of people will have to set chunksize, this is indeed not a good idea to do as the default (unless we set chunksize to a value by default). |
Apparently SQLAlchemy has a flag for this. Since this has the potential to speed up inserts a lot, and we can check for support easily, I'm thinking maybe we could do it by default, and also set chunksize to a default value (e.g. 16kb chunks... not sure what's too big in most situations). If the multirow insert fails, we could throw an exception suggesting lowering the chunksize? |
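(For later readers: the flag meant here is presumably SQLAlchemy's supports_multivalues_insert dialect attribute; a minimal sketch of checking it, offered as an assumption rather than taken from the comment:)

```python
# Presumably the SQLAlchemy dialect flag in question; the engine URL is a placeholder.
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")
if engine.dialect.supports_multivalues_insert:
    print("dialect supports multi-row INSERT ... VALUES")
```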
Now I just need to persuade the SQLAlchemy folks to set that flag for my dialect. On a more on-topic note, I think the chunksize could be tricky. On my mysql setup (which I probably configured to allow large packets), I can set chunksize=5000; on my SQLServer setup, 500 was too large, but 100 worked fine. However, it's probably true that most of the benefit from this technique comes from going from inserting 1 row at a time to 100, rather than from 100 to 1000. |
What if |
On that last comment "
But of course this does not use the multi-row feature |
We've figured out how to monkey patch - might be useful to someone else. Have this code before importing pandas.
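The snippet itself did not survive in this thread; a reconstruction of the kind of patch being described, based on the _execute_insert override shown later in the thread (pandas' private SQLTable API, so it may break across versions), is:

```python
# Reconstruction of the monkey patch discussed in this thread, not the poster's
# exact code. It overrides pandas' private SQLTable._execute_insert so that each
# chunk is sent as a single multi-row INSERT ... VALUES (...), (...) statement.
import pandas.io.sql as pdsql

def _execute_insert_multi(self, conn, keys, data_iter):
    # data_iter yields one tuple per DataFrame row; build a list of dicts and
    # let SQLAlchemy render a multi-row VALUES clause from it.
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(self.table.insert().values(data))

pdsql.SQLTable._execute_insert = _execute_insert_multi
```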
|
Maybe we can just start with adding this feature through a new keyword. @maxgrenderjones @nhockham interested to do a PR to add this? |
@jorisvandenbossche I think it's risky to start adding keyword arguments to address specific performance profiles. If you can guarantee that it's faster in all cases (if necessary by having it determine the best method based on the inputs) then you don't need a flag at all. Different DB setups may have different performance optimizations (different DB perf profiles, local vs network, big memory vs fast SSD, etc.), and if you start adding keyword flags for each it becomes a mess. I would suggest creating subclasses of SQLDatabase and SQLTable to address performance-specific implementations; they would be used through the object-oriented API. Perhaps a "backend switching" method could be added, but frankly using the OO API is very simple, so this is probably overkill for what is already a specialized use case. I created such a subclass for loading large datasets to Postgres (it's actually much faster to save data to CSV and then use the built-in non-standard COPY FROM sql command than to use inserts, see https://gist.github.com/mangecoeur/1fbd63d4758c2ba0c470#file-pandas_postgres-py). To use it, you just use the subclass directly through the OO interface, as sketched below. |
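A minimal sketch of what using the OO layer directly looks like, assuming pandas' private SQLDatabase class (signatures have shifted between pandas versions); the gist's Postgres subclass would be dropped in where SQLDatabase is used here:

```python
# Sketch only: SQLDatabase is private pandas API; the engine URL and table name
# are placeholders, not taken from the linked gist.
import pandas as pd
from pandas.io.sql import SQLDatabase
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")
df = pd.DataFrame({"a": range(1000), "b": 2.0})

pandas_sql = SQLDatabase(engine)   # or a performance-oriented subclass of it
pandas_sql.to_sql(df, "my_table", if_exists="replace", index=False)
```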
Just for reference, I tried running the code by @jorisvandenbossche (Dec 3rd post) using the multirow feature. It's quite a bit slower, so the speed trade-off here is not trivial:

```
In [5]: df = pd.DataFrame(np.random.randn(50000, 10))

In [6]: %timeit df.to_sql('test_default', engine, if_exists='replace')

In [7]: from pandas.io.sql import SQLTable

In [8]: def _execute_insert(self, conn, keys, data_iter):

In [9]: SQLTable._execute_insert = _execute_insert

In [10]: reload(pd)

In [11]: %timeit df.to_sql('test_default', engine, if_exists='replace', chunksize=10)
```
|
Also, I agree that adding keyword parameters is risky. However, the multirow feature seems pretty fundamental. Also, 'monkey-patching' is probably not more robust to API changes than keyword parameters. |
It's as I suspected. Monkey patching isn't the solution I was suggesting; rather, that we ship a number of performance-oriented subclasses that the informed user could use through the OO interface (to avoid loading the functional API with too many options). |
As per the initial ticket title, I don't think this approach is going to be preferable in all cases, so I wouldn't make it the default. However, without it, pandas to_sql is unusably slow over a high-latency connection. In short, my recommendation would be to add it as a keyword, with some helpful commentary about how to use it. This wouldn't be the first time a keyword was used to select an implementation (see: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html), but that perhaps isn't the best example, as I haven't the first idea about what its implementation-selecting keyword actually does. |
As suggested here - pandas-dev#8953
Hi guys, I am trying to insert around 200K rows using to_sql, but it takes forever and consumes a huge amount of memory! Using chunksize helps with the memory, but the speed is still very slow. My impression, looking at the MSSQL database trace, is that the insertion is actually performed one row at a time. The only viable approach now is to dump to a csv file on a shared folder and use BULK INSERT. But it is very annoying and inelegant! |
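For reference, the CSV + BULK INSERT workaround being described looks roughly like this (a hedged sketch: the DSN, UNC path, and table name are placeholders, and the SQL Server service account must be able to read the file):

```python
# Hedged sketch of the CSV + BULK INSERT workaround; paths and names are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")
df = pd.DataFrame({"a": range(200000), "b": 1.0})

# Dump to a share the database server can see, then bulk load it server-side.
df.to_csv(r"\\fileserver\share\staging.csv", index=False)

with engine.begin() as conn:
    conn.execute(text(
        r"BULK INSERT dbo.my_table FROM '\\fileserver\share\staging.csv' "
        "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2)"
    ))
```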
@andreacassioli You can use odo to insert a DataFrame into an SQL database through an intermediary CSV file. See Loading CSVs into SQL Databases. I don't think you can come even close to that performance using plain INSERT statements. |
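A hedged sketch of the odo route (odo is no longer actively maintained; the connection URI and table name are placeholders):

```python
# Loading a DataFrame via odo, which goes through a CSV and the database's
# native bulk-load path (e.g. COPY for PostgreSQL) rather than row-wise INSERTs.
import pandas as pd
from odo import odo

df = pd.DataFrame({"a": range(100000), "b": 1.5})
odo(df, "postgresql://user:password@localhost/mydb::my_table")
```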
@ostrokach thank you, indeed I am using csv files now. If I could get close, I would trade a bit of time for simplicity! |
I thought this might help somebody: |
@indera pandas does not use the ORM, only sqlalchemy Core (which is what the doc entry there suggests to use for large inserts) |
Is there any consensus on how to work around this in the meantime? I'm inserting several million rows into postgres and it takes forever. Is CSV / odo the way to go? |
@russlamb a practical way to solve this problem is simply to bulk upload. This is somewhat db-specific, though. |
For sqlserver I used the FreeTDS driver (http://www.freetds.org/software.html and https://github.com/mkleehammer/pyodbc) with SQLAlchemy entities, which resulted in very fast inserts (20K rows per data frame):
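The snippet was lost; a rough sketch of that kind of setup follows. The DSN, mapped entity, and use of bulk_insert_mappings are assumptions, not the commenter's actual code.

```python
# Rough sketch only: FreeTDS itself is configured in freetds.conf / odbc.ini,
# and the DSN, entity, and column names here are illustrative.
import pandas as pd
from sqlalchemy import create_engine, Column, Integer, Float
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Measurement(Base):        # hypothetical mapped entity
    __tablename__ = "measurements"
    id = Column(Integer, primary_key=True)
    value = Column(Float)

engine = create_engine("mssql+pyodbc://user:password@my_freetds_dsn")
Base.metadata.create_all(engine)

df = pd.DataFrame({"id": range(20000), "value": 0.0})

session = sessionmaker(bind=engine)()
# Bulk ORM insert: batches rows into large executemany calls instead of
# flushing them one at a time.
session.bulk_insert_mappings(Measurement, df.to_dict(orient="records"))
session.commit()
```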
|
This solution will almost always be faster I think, regardless of the multi-row / chunksize settings. But, @russlamb, it is always interesting to hear whether such a multi-row keyword would be an improvement in your case. See eg #8953 (comment) on a way to easily test this out. I think there is agreement that we want to have a way to specify this (without necessarily changing the default). So if somebody wants to make a PR for this, that is certainly welcome. |
@jorisvandenbossche The document I linked above mentions "Alternatively, the SQLAlchemy ORM offers the Bulk Operations suite of methods, which provide hooks into subsections of the unit of work process in order to emit Core-level INSERT and UPDATE constructs with a small degree of ORM-based automation." What I am suggesting is to implement a sqlserver-specific version of "to_sql" which under the hood uses the SQLAlchemy core for speedups. |
This was proposed before. The way to go is to implement a pandas SQL class optimised for a backend. I posted a gist in the past for using the postgres COPY FROM command, which is much faster. However, something similar is now available in odo, and built in a more robust way. There isn't much point IMHO in duplicating work from odo.
|
Also, I noticed you mentioned SQLAlchemy Core. Unless something has changed a lot, only SQLAlchemy Core is used in any case, no ORM. If you want to speed up more than using Core, you have to go to lower-level, db-specific optimisation.
|
Is this getting fixed/taken care of? As of now inserting pandas dataframes into a SQL db is extremely slow unless it's a toy dataframe. Let's decide on a solution and push it forward? |
I found |
I've been using the Monkey Patch Solution:
for some time now, but now I'm getting an error:
Is anyone else getting this? I'm on Python 3.6.5 (Anaconda) and pandas==0.23.0 |
Is this getting fixed? Currently, df.to_sql is extremely slow and can't be used at all for many practical use cases. The odo project seems to have been abandoned already.
|
ENH: to_sql() add parameter "method" to control insertions method (#8953) (#21401)

* ENH: to_sql() add parameter "method" to control insertions method (#8953)
* ENH: to_sql() add parameter "method". Fix docstrings (#8953)
* ENH: to_sql() add parameter "method". Improve docs based on reviews (#8953)
* ENH: to_sql() add parameter "method". Fix unit-test (#8953)
* doc clean-up
* additional doc clean-up
* use dict(zip()) directly
* clean up merge
* default --> None
* Remove stray default
* Remove method kwarg
* change default to None
* test copy insert snippit
* print debug
* index=False
* Add reference to documentation
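For anyone reading this thread now: the change merged above shipped in pandas 0.24 as the method argument to DataFrame.to_sql. A short usage sketch (the connection string is a placeholder):

```python
import numpy as np
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@localhost/mydb")
df = pd.DataFrame(np.random.randn(50000, 10))

# Default (method=None): one INSERT per row, sent as an executemany.
df.to_sql("test_default", engine, if_exists="replace")

# method="multi": one INSERT ... VALUES (...), (...) per chunk, which saves
# round trips on high-latency links; keep chunksize under the backend's
# packet / bind-parameter limits.
df.to_sql("test_multi", engine, if_exists="replace", method="multi", chunksize=1000)
```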
Hey, I'm getting an error when I try to perform a multi-insert into a SQLite database. This is my code, and this is the error I get:
Why is this happening? I'm using Python 3.7.3 (Anaconda), pandas 0.24.2 and sqlite3 2.6.0. Thank you very much in advance! |
@jconstanzo can you open this as a new issue? |
@jconstanzo Having the same issue here. Using Unfortunately I can't really provide an example dataframe because my dataset is huge, that's the reason I'm using |
I'm sorry for the delay. I just opened an issue for this problem: #29921 |
How to hack this? @maxgrenderjones
|
I have been trying to insert ~30k rows into a mysql database using pandas-0.15.1, oursql-0.9.3.1 and sqlalchemy-0.9.4. Because the machine is across the Atlantic from me, calling data.to_sql was taking >1 hr to insert the data. On inspecting with wireshark, the issue is that it is sending an insert for every row, then waiting for the ACK before sending the next, and, long story short, the ping times are killing me. However, following the instructions from SQLAlchemy, I changed the insert to a single multi-row statement, and the entire operation completes in less than a minute. (To save you a click, the difference is between multiple calls to insert into foo (columns) values (rowX) and one massive insert into foo (columns) VALUES (row1), (row2), (row3).) Given how often people are likely to use pandas to insert large volumes of data, this feels like a huge win that would be great to be included more widely.

Some challenges:

The easiest way to do this would be to add a multirow= boolean parameter (default False) to the to_sql function, and then leave the user responsible for setting the chunksize, but perhaps there's a better way? Thoughts?
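The two code snippets originally attached to this description did not survive; the change being described is roughly the following difference at the SQLAlchemy level (the engine URL and table definition are illustrative only, not the original snippets):

```python
from sqlalchemy import create_engine, MetaData, Table, Column, Integer, String

engine = create_engine("mysql+pymysql://user:password@remote-host/mydb")  # placeholder
metadata = MetaData()
foo = Table(
    "foo", metadata,
    Column("id", Integer, primary_key=True),  # auto-incrementing key
    Column("name", String(50)),
)
metadata.create_all(engine)

rows = [{"name": "row%d" % i} for i in range(30000)]

with engine.begin() as conn:
    # Before: executemany style; many drivers send one INSERT per row, so every
    # row pays a full network round trip.
    conn.execute(foo.insert(), rows)

    # After: a single multi-row INSERT ... VALUES (row1), (row2), ... statement,
    # one round trip for the whole batch.
    conn.execute(foo.insert().values(rows))
```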