memory leak inserting with df.to_sql using nvarchar(max) and fastexecutemany #1099

Closed
opme opened this issue Sep 9, 2022 · 8 comments · Fixed by #1108

Comments

@opme

opme commented Sep 9, 2022

Environment

  • Python: 3.8.5
  • pyodbc: 4.0.34; pandas: 1.4.2; SQLAlchemy: 1.4.32
  • OS: Ubuntu 20 (WSL2) and a Docker container
  • DB: Azure SQL Database
  • driver: ODBC Driver 18 for SQL Server (same with 17)

Issue

Using nvarchar(max) and fast_executemany, a memory leak is observed. There was a similar leak in the past (issue #854, fixed by PR #832), but this looks to be a different issue.

Turning off fast_executemany, or using turbodbc instead, makes the issue disappear. I also tried other column types and no leak was observed.

Note: whether the leak occurs depends on how the DataFrame is created before the call to to_sql. If the DataFrame is built from csv.DictReader() or pd.read_sql(), the leak occurs; a DataFrame created with pd.read_csv() does not leak.

Code to reproduce:

import csv
from io import StringIO
from urllib.parse import quote_plus

import pandas as pd
import psutil
import sqlalchemy as sa
from sqlalchemy import MetaData, Table, Column, Unicode

def create_data_table_single(engine, table_name):
    # Unicode with no length maps to nvarchar(max) on SQL Server
    metadata = MetaData(engine)
    mltable = Table(
        table_name,
        metadata,
        Column("join_type", Unicode, nullable=True),
    )
    try:
        metadata.create_all()
    except Exception:
        print("failed to create table")

def printmem(msg):
    rss = process.memory_info().rss / 1048576
    vms = process.memory_info().vms / 1048576
    print(f"{msg}: rss: {rss:0.1f} MiB, vms: {vms:0.1f} MiB")

def drop_table(conn, table):
    conn.execute("DROP TABLE IF EXISTS " + table)

def writedata(filename, num):
    with open(filename, "w") as f:
        f.write("join_type\n")
        for i in range(num):
            f.write("explicit\n")

database_connection_string = "Driver={ODBC Driver 18 for SQL Server};Server=tcp:xxx.database.windows.net,1433;Database=database;Uid=xxxx;Pwd=xxxx;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=45;"
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % quote_plus(database_connection_string), fast_executemany=True)
process = psutil.Process()

printmem("startup")
filename = "data.txt"
writedata(filename, 10000)
printmem("data written")
table = "dftest1"
with engine.connect() as conn:
    drop_table(conn, table)
create_data_table_single(engine, table)

for iteration in range(100):
    # build the DataFrame via csv.DictReader (the "leaky" path)
    with open(filename, "r") as f:
        batch = f.read()
    reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader]
    df = pd.DataFrame(rows)
    df.to_sql(table, engine, index=False, if_exists="append")
    printmem(f"iteration {iteration}")

Table creation DDL (the code above creates this):

CREATE TABLE [dbo].[dftest1](
	[join_type] [nvarchar](max) NULL
)
GO

Results of execution:

(base) root@aadcfd1ba406:/opt/app# python testrepo.py
startup: rss: 114.1 MiB, vms: 501.8 MiB
iteration 0: rss: 125.5 MiB, vms: 518.4 MiB
iteration 1: rss: 127.5 MiB, vms: 520.5 MiB
iteration 2: rss: 129.5 MiB, vms: 522.3 MiB
iteration 3: rss: 130.9 MiB, vms: 523.8 MiB
iteration 4: rss: 132.8 MiB, vms: 525.7 MiB
iteration 5: rss: 134.2 MiB, vms: 527.0 MiB
iteration 6: rss: 135.8 MiB, vms: 528.7 MiB
iteration 7: rss: 137.0 MiB, vms: 530.0 MiB
iteration 8: rss: 138.0 MiB, vms: 531.0 MiB
iteration 9: rss: 139.0 MiB, vms: 532.2 MiB
iteration 10: rss: 140.3 MiB, vms: 533.2 MiB
iteration 11: rss: 141.4 MiB, vms: 534.5 MiB
iteration 12: rss: 142.4 MiB, vms: 535.5 MiB
iteration 13: rss: 143.7 MiB, vms: 536.5 MiB
iteration 14: rss: 144.7 MiB, vms: 537.7 MiB
iteration 15: rss: 145.8 MiB, vms: 538.7 MiB
iteration 16: rss: 146.8 MiB, vms: 539.7 MiB
iteration 17: rss: 147.8 MiB, vms: 541.0 MiB
iteration 18: rss: 149.1 MiB, vms: 542.0 MiB
iteration 19: rss: 150.1 MiB, vms: 543.2 MiB
…
iteration 80: rss: 217.2 MiB, vms: 610.3 MiB
iteration 81: rss: 218.5 MiB, vms: 611.3 MiB
iteration 82: rss: 219.5 MiB, vms: 612.6 MiB
iteration 83: rss: 220.5 MiB, vms: 613.6 MiB
iteration 84: rss: 221.6 MiB, vms: 614.6 MiB
iteration 85: rss: 222.9 MiB, vms: 615.8 MiB
iteration 86: rss: 223.9 MiB, vms: 616.8 MiB
iteration 87: rss: 224.9 MiB, vms: 617.8 MiB
iteration 88: rss: 226.2 MiB, vms: 619.1 MiB
iteration 89: rss: 227.2 MiB, vms: 620.1 MiB
iteration 90: rss: 228.3 MiB, vms: 621.3 MiB
iteration 91: rss: 229.3 MiB, vms: 622.3 MiB
iteration 92: rss: 230.6 MiB, vms: 623.3 MiB
iteration 93: rss: 231.6 MiB, vms: 624.6 MiB
iteration 94: rss: 232.7 MiB, vms: 625.6 MiB
iteration 95: rss: 233.7 MiB, vms: 626.6 MiB
iteration 96: rss: 235.0 MiB, vms: 627.8 MiB
iteration 97: rss: 236.0 MiB, vms: 628.8 MiB
iteration 98: rss: 237.0 MiB, vms: 630.1 MiB
iteration 99: rss: 238.1 MiB, vms: 631.1 MiB
@opme opme changed the title memory leak inserting dataframe using varchar(max) and fastexecutemany memory leak inserting dataframe using nvarchar(max) and fastexecutemany Sep 9, 2022
@opme opme changed the title memory leak inserting dataframe using nvarchar(max) and fastexecutemany memory leak inserting with df.to_sql using nvarchar(max) and fastexecutemany Sep 9, 2022
@gordthompson
Collaborator

gordthompson commented Sep 9, 2022

I am able to reproduce this issue with the latest master branch, i.e.,

pip install git+https://github.com/mkleehammer/pyodbc

Thanks for the report and the great MCVE!

A workaround would be to use this

https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205

along with

df.to_sql(table, engine, index=False, if_exists="append", method=mssql_insert_json)

I just tested it by tweaking the MCVE and it does not leak memory. Note that this method does not need fast_executemany=True, so that setting is irrelevant. (That is, it is not required; if specified, it doesn't help, but it also doesn't hurt.)
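
For reference, the idea behind that gist is roughly the following (a simplified sketch, not the gist itself: it treats every column as NVARCHAR(MAX) and skips the type handling the real version does). Each batch is serialized to a single JSON string and unpacked server-side with OPENJSON, so only one parameter is bound per statement and fast_executemany never comes into play:

import json
import sqlalchemy as sa

def mssql_insert_json(table, conn, keys, data_iter):
    # table: pandas SQLTable; conn: SQLAlchemy connection; keys: column names
    tbl = f"[{table.schema}].[{table.name}]" if table.schema else f"[{table.name}]"
    col_list = ", ".join(f"[{k}]" for k in keys)
    # assume NVARCHAR(MAX) for every column; the real gist derives the types
    with_clause = ", ".join(f"[{k}] NVARCHAR(MAX) '$.\"{k}\"'" for k in keys)
    payload = json.dumps([dict(zip(keys, row)) for row in data_iter])
    conn.execute(
        sa.text(
            f"INSERT INTO {tbl} ({col_list}) "
            f"SELECT {col_list} FROM OPENJSON(:payload) WITH ({with_clause})"
        ),
        {"payload": payload},
    )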

@opme
Author

opme commented Sep 13, 2022

Thanks for the reply. I checked out the workaround gist and did a quick performance test: I was able to insert 10k rows in 10 seconds with fast_executemany, versus 15 seconds with the workaround gist, which is about the same as turbodbc. fast_executemany is still the performance king and very desirable to those making many inserts.

@v-chojas
Contributor

> Note: whether the leak occurs depends on how the DataFrame is created before the call to to_sql. If the DataFrame is built from csv.DictReader() or pd.read_sql(), the leak occurs; a DataFrame created with pd.read_csv() does not leak.

Could you post an ODBC trace to compare the two?

@gordthompson
Collaborator

gordthompson commented Sep 14, 2022

I compared the ODBC trace logs for the two methods of creating the DataFrame, namely "leaky"

f = open(filename, "r")
batch = f.read()
f.close()
reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
rows = [row for row in reader]
df = pd.DataFrame(rows)

and "not leaky"

df = pd.read_csv(filename)

and they were identical. df.info() reports identical results for the DataFrames created by the two methods.

I also found that if I moved the DataFrame creation out of the loop (creating it only once), the leaking stopped.

With DataFrame creation inside the loop, if I commented out the .to_sql() call, there was no leak in either case.

So it seems to be something about that particular way of creating the DataFrame (using csv.DictReader) interacting with .to_sql() that triggers the leak.
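
(One hypothetical way to check, at the Python level, which allocations are accumulating is the standard-library tracemalloc module; the leaked references keep ordinary Python strings alive, so growth should show up attributed to wherever the DataFrame cells were created. make_dataframe below is a placeholder for either creation path above.)

import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

for iteration in range(10):
    df = make_dataframe()  # placeholder: csv.DictReader path or pd.read_csv
    df.to_sql(table, engine, index=False, if_exists="append")

# allocation sites that grew the most since the baseline
for stat in tracemalloc.take_snapshot().compare_to(baseline, "lineno")[:10]:
    print(stat)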

@v-chojas
Contributor

> I also found that if I moved the DataFrame creation out of the loop (creating it only once), the leaking stopped.

That suggests it might not be an issue inside pyODBC itself. I'm not familiar with the external libraries in use here, but if the way pyODBC is being called from them can be reproduced in a script that uses only pyODBC, that would either show its innocence or provide a good repro of the issue.
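
(A minimal sketch of what such a pyodbc-only repro could look like, assuming the same dftest1 table and connection string as above. Each pass builds fresh, non-interned str objects, mirroring what csv.DictReader produces; with fast_executemany, a str bound to an nvarchar(max) column should go through the same data-at-execution path.)

import psutil
import pyodbc

conn_str = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:xxx.database.windows.net,1433;Database=database;"
    "Uid=xxxx;Pwd=xxxx;Encrypt=yes;TrustServerCertificate=no;"
)
process = psutil.Process()

cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.fast_executemany = True
for iteration in range(100):
    # fresh str objects each pass; leaked references would keep them all alive
    rows = [("explicit " + str(i),) for i in range(10000)]
    crsr.executemany("INSERT INTO dftest1 (join_type) VALUES (?)", rows)
    rss = process.memory_info().rss / 1048576
    print(f"iteration {iteration}: rss: {rss:0.1f} MiB")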

@nsavoire

nsavoire commented Sep 21, 2022

The Py_INCREF(cell) here looks suspicious, since it is encoded, not cell, that gets stored in pParam (encoded.Detach() already hands over the reference it owns, so the extra incref on cell has no matching decref and leaks one reference per value):

// DAE
DAEParam *pParam = (DAEParam*)*outbuf;
Py_INCREF(cell);
pParam->cell = encoded.Detach();

This looks like a copy/paste from here, where cell itself is passed into pParam and the incref is therefore correct:

Py_INCREF(cell);
DAEParam *pParam = (DAEParam*)*outbuf;
pParam->cell = cell;

@v-chojas
Contributor

Good find. Have you tried removing it?

@gordthompson
Collaborator

Commenting out the Py_INCREF(cell); line, i.e.,

@@ -344,11 +344,11 @@ static int PyToCType(Cursor *cur, unsigned char **outbuf, PyObject *cell, ParamI
             len = PyBytes_GET_SIZE(encoded);
             if (!pi->ColumnSize)
             {
                 // DAE
                 DAEParam *pParam = (DAEParam*)*outbuf;
-                Py_INCREF(cell);
+                // Py_INCREF(cell);
                 pParam->cell = encoded.Detach();
                 pParam->maxlen = cur->cnxn->GetMaxLength(pi->ValueType);
                 *outbuf += sizeof(DAEParam);
                 ind = cur->cnxn->need_long_data_len ? SQL_LEN_DATA_AT_EXEC((SQLLEN)len) : SQL_DATA_AT_EXEC;
             }

does not completely stop the leak, but it does slow it down considerably.

Before patch:

iteration 0: rss: 81.0 MiB, vms: 275.7 MiB
iteration 1: rss: 84.0 MiB, vms: 278.5 MiB
iteration 2: rss: 85.7 MiB, vms: 280.4 MiB
iteration 3: rss: 87.5 MiB, vms: 282.2 MiB
iteration 4: rss: 89.1 MiB, vms: 283.7 MiB
iteration 5: rss: 90.7 MiB, vms: 285.4 MiB
iteration 6: rss: 91.7 MiB, vms: 286.6 MiB
iteration 7: rss: 93.0 MiB, vms: 287.8 MiB
iteration 8: rss: 94.0 MiB, vms: 288.8 MiB
iteration 9: rss: 95.1 MiB, vms: 289.8 MiB
iteration 10: rss: 96.1 MiB, vms: 291.1 MiB
iteration 11: rss: 97.4 MiB, vms: 292.1 MiB
iteration 12: rss: 98.4 MiB, vms: 293.1 MiB
iteration 13: rss: 99.5 MiB, vms: 294.3 MiB
iteration 14: rss: 100.5 MiB, vms: 295.3 MiB
iteration 15: rss: 101.8 MiB, vms: 296.6 MiB
iteration 16: rss: 102.8 MiB, vms: 297.6 MiB
iteration 17: rss: 103.8 MiB, vms: 298.6 MiB
iteration 18: rss: 104.9 MiB, vms: 299.8 MiB
iteration 19: rss: 106.2 MiB, vms: 300.8 MiB
…
iteration 80: rss: 173.0 MiB, vms: 367.8 MiB
iteration 81: rss: 174.3 MiB, vms: 369.0 MiB
iteration 82: rss: 175.3 MiB, vms: 370.0 MiB
iteration 83: rss: 176.3 MiB, vms: 371.0 MiB
iteration 84: rss: 177.4 MiB, vms: 372.3 MiB
iteration 85: rss: 178.9 MiB, vms: 373.6 MiB
iteration 86: rss: 179.9 MiB, vms: 374.9 MiB
iteration 87: rss: 181.0 MiB, vms: 375.9 MiB
iteration 88: rss: 182.3 MiB, vms: 376.9 MiB
iteration 89: rss: 183.3 MiB, vms: 378.1 MiB
iteration 90: rss: 184.3 MiB, vms: 379.1 MiB
iteration 91: rss: 185.6 MiB, vms: 380.1 MiB
iteration 92: rss: 186.7 MiB, vms: 381.4 MiB
iteration 93: rss: 187.8 MiB, vms: 382.4 MiB
iteration 94: rss: 188.8 MiB, vms: 383.6 MiB
iteration 95: rss: 189.8 MiB, vms: 384.6 MiB
iteration 96: rss: 191.1 MiB, vms: 385.6 MiB
iteration 97: rss: 192.1 MiB, vms: 386.9 MiB
iteration 98: rss: 193.2 MiB, vms: 387.9 MiB
iteration 99: rss: 194.4 MiB, vms: 388.9 MiB

After patch:

iteration 0: rss: 80.9 MiB, vms: 275.7 MiB
iteration 1: rss: 81.8 MiB, vms: 276.5 MiB
iteration 2: rss: 83.2 MiB, vms: 277.9 MiB
iteration 3: rss: 84.7 MiB, vms: 279.4 MiB
iteration 4: rss: 85.1 MiB, vms: 279.9 MiB
iteration 5: rss: 85.1 MiB, vms: 279.9 MiB
iteration 6: rss: 85.3 MiB, vms: 280.1 MiB
iteration 7: rss: 85.2 MiB, vms: 280.1 MiB
iteration 8: rss: 85.2 MiB, vms: 280.1 MiB
iteration 9: rss: 85.2 MiB, vms: 280.1 MiB
iteration 10: rss: 85.2 MiB, vms: 280.1 MiB
iteration 11: rss: 85.2 MiB, vms: 280.1 MiB
iteration 12: rss: 85.2 MiB, vms: 280.1 MiB
iteration 13: rss: 85.2 MiB, vms: 280.1 MiB
iteration 14: rss: 85.5 MiB, vms: 280.1 MiB
iteration 15: rss: 85.4 MiB, vms: 280.3 MiB
iteration 16: rss: 85.4 MiB, vms: 280.5 MiB
iteration 17: rss: 85.4 MiB, vms: 280.5 MiB
iteration 18: rss: 85.4 MiB, vms: 280.5 MiB
iteration 19: rss: 85.7 MiB, vms: 280.5 MiB
…
iteration 80: rss: 86.3 MiB, vms: 281.2 MiB
iteration 81: rss: 86.3 MiB, vms: 281.2 MiB
iteration 82: rss: 86.3 MiB, vms: 281.2 MiB
iteration 83: rss: 86.3 MiB, vms: 281.2 MiB
iteration 84: rss: 86.3 MiB, vms: 281.2 MiB
iteration 85: rss: 86.3 MiB, vms: 281.2 MiB
iteration 86: rss: 86.3 MiB, vms: 281.2 MiB
iteration 87: rss: 86.3 MiB, vms: 281.2 MiB
iteration 88: rss: 86.3 MiB, vms: 281.2 MiB
iteration 89: rss: 86.3 MiB, vms: 281.2 MiB
iteration 90: rss: 86.3 MiB, vms: 281.2 MiB
iteration 91: rss: 86.3 MiB, vms: 281.2 MiB
iteration 92: rss: 86.3 MiB, vms: 281.4 MiB
iteration 93: rss: 86.3 MiB, vms: 281.4 MiB
iteration 94: rss: 86.6 MiB, vms: 281.4 MiB
iteration 95: rss: 86.6 MiB, vms: 281.4 MiB
iteration 96: rss: 86.6 MiB, vms: 281.4 MiB
iteration 97: rss: 86.6 MiB, vms: 281.4 MiB
iteration 98: rss: 86.6 MiB, vms: 281.4 MiB
iteration 99: rss: 86.6 MiB, vms: 281.4 MiB
