to_sql() performance regression (#19664) when DF contains many columns #21146

schettino72 · 2018-05-21T05:37:14Z

After #19664, to_sql() performance is highly dependent on the number of columns.

Problem description

For a single column, 0.23 is faster than 0.22. Adding more columns has little influence on 0.22 but quickly degrades performance on 0.23. For example, with 20 columns 0.23 takes 1.5 times longer than 0.22.

The performance degradation happens regardless of how the chunksize parameter is adjusted.

import time

import numpy as np
import pandas as pd
from sqlalchemy import create_engine


start = time.time()

N_COLS = 20
df = pd.DataFrame({n: np.arange(0,20_000,1) for n in range(N_COLS)})

# create the engine to connect pandas to PostgreSQL
engine = create_engine('postgresql://user:@localhost/db')
# create connection
conn = engine.connect()

# write df to a SQL table, then read it back
df.to_sql('test', engine, if_exists='replace', chunksize=1_000)
result = conn.execute("select * from test")
conn.close()
print('WRITE: {}'.format(time.time() - start))

Tested using Linux, PostgreSQL 9.6, Python 3.6, SQLAlchemy 1.2.2.
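The change referenced as #19664 is titled "enable multivalues insert" in this thread. To illustrate the two insert styles without a PostgreSQL server, here is a minimal sketch using only the standard library's sqlite3 module. It is not the pandas implementation, and SQLite timings will not match the PostgreSQL numbers above. The table name `t`, the column names `c0..c19`, the row count, and the `CHUNK` size are assumptions chosen for illustration.

```python
import sqlite3
import time

N_ROWS = 2_000
N_COLS = 20
# Assumed chunk size: 40 rows * 20 columns = 800 bound parameters per statement,
# kept below the 999-variable default of older SQLite builds.
CHUNK = 40

rows = [tuple(range(i, i + N_COLS)) for i in range(N_ROWS)]
col_defs = ", ".join(f"c{n} INTEGER" for n in range(N_COLS))
row_placeholder = "(" + ", ".join("?" * N_COLS) + ")"


def insert_executemany(conn, rows):
    """Single-row INSERT statement, executed once per row via executemany."""
    conn.executemany(f"INSERT INTO t VALUES {row_placeholder}", rows)


def insert_multi_values(conn, rows, chunk=CHUNK):
    """One multi-values INSERT per chunk; every cell becomes a bound parameter."""
    for i in range(0, len(rows), chunk):
        batch = rows[i:i + chunk]
        sql = "INSERT INTO t VALUES " + ", ".join([row_placeholder] * len(batch))
        flat = [value for row in batch for value in row]
        conn.execute(sql, flat)


def timed(insert):
    """Run one insert strategy against a fresh in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE t ({col_defs})")
    start = time.perf_counter()
    insert(conn, rows)
    conn.commit()
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count


results = {f.__name__: timed(f) for f in (insert_executemany, insert_multi_values)}
for name, (elapsed, count) in results.items():
    print(f"{name}: {count} rows in {elapsed:.4f}s")
```

In the multi-values form, the statement text and the number of bound parameters both grow with rows times columns, so changing `N_COLS` is a quick way to see how each style scales on a given backend.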


gfyoung · 2018-05-21T06:55:35Z

@schettino72 : Thanks for tracking this down.

cc @danfrankj who authored #19664

jorisvandenbossche · 2018-06-07T21:28:02Z

The original change was reverted in #21355 for 0.23.1, which fixes the immediate problem.
PR #21199 proposes making it optional in the future.

schettino72 mentioned this issue May 21, 2018

"too many SQL variables" Error with pandas 0.23 - enable multivalues insert #19664 issue #21103

Closed

gfyoung added the Performance (memory or execution speed performance) and Regression (functionality that used to work in a prior pandas version) labels May 21, 2018

gfyoung added this to the 0.23.1 milestone May 21, 2018

schettino72 mentioned this issue May 28, 2018

ENH: 'to_sql()' add param 'method' to control insert statement (#21103) #21199

Closed


jorisvandenbossche closed this as completed Jun 7, 2018
