Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_sql() performance regression (#19664) when DF contains many columns #21146

Closed
schettino72 opened this issue May 21, 2018 · 2 comments
Closed
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@schettino72
Copy link
Contributor

After #19664 to_sql() performance is highly dependent on the number of columns.

Problem description

For a single column 0.23 is faster than 0.22. Adding more columns has little influence on 0.22 but quickly degrades the performance on 0.23. I.e. for 20 columns 0.23 takes 1.5 times longer than 0.22

Performance degradation happens regardless of parameter chunksize adjustment.

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine


start = time.time()

N_COLS = 20
df = pd.DataFrame({n: np.arange(0,20_000,1) for n in range(N_COLS)})

#create the engine to connect pandas with sqlite3
engine = create_engine('postgresql://user:@localhost/db')
#create connection
conn = engine.connect()

# convert df to sql table
df.to_sql('test',engine, if_exists='replace',chunksize=1_000)
result = conn.execute("select * from test")
conn.close()
print('WRITE: {}'.format(time.time() - start))

Tested using linux, postgres 9.6, python 3.6, SQLAlchemy 1.2.2

@gfyoung gfyoung added Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version labels May 21, 2018
@gfyoung gfyoung added this to the 0.23.1 milestone May 21, 2018
@gfyoung
Copy link
Member

gfyoung commented May 21, 2018

@schettino72 : Thanks for tracking this down.

cc @danfrankj who authored #19664

@jorisvandenbossche
Copy link
Member

So the original change was reverted in #21355 for 0.23.1, which fixes this immediate problem.
There is PR #21199 is make it optional in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants