memory leak inserting with df.to_sql using nvarchar(max) and fastexecutemany #1099
Comments
I am able to reproduce this issue with the latest master branch.
Thanks for the report and the great MCVE! A workaround would be to use this https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205 along with `df.to_sql(table, engine, index=False, if_exists="append", method=mssql_insert_json)`. I just tested it by tweaking the MCVE and it does not leak memory. Note that this method does not need `fast_executemany`.
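For readers without access to the gist, a hypothetical sketch in the same spirit (not a copy of it) looks like this: a callable matching the signature pandas expects for `method=`, issuing one `INSERT ... SELECT ... FROM OPENJSON(?)` statement per batch so `executemany` is never involved. Names and types here are illustrative assumptions.

```python
# Sketch of a JSON-based insert callable for df.to_sql; column names/types
# are assumptions, and every column is treated as nvarchar(max) for brevity.
import json

def mssql_insert_json(table, conn, keys, data_iter):
    # pandas calls this with its table wrapper, a SQLAlchemy connection
    # (1.4+ assumed for exec_driver_sql), the column names, and an iterator
    # of row tuples.
    columns = ", ".join(f"[{k}]" for k in keys)
    # Real code would map proper SQL types and quote unusual column names.
    with_clause = ", ".join(f"[{k}] nvarchar(max)" for k in keys)
    payload = json.dumps([dict(zip(keys, row)) for row in data_iter], default=str)
    sql = (
        f"INSERT INTO [{table.name}] ({columns}) "
        f"SELECT {columns} FROM OPENJSON(?) WITH ({with_clause})"
    )
    # One statement, one JSON parameter per batch -- no executemany.
    conn.exec_driver_sql(sql, (payload,))
```

It is then passed to pandas exactly as in the comment above: `df.to_sql(table, engine, index=False, if_exists="append", method=mssql_insert_json)`.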
Thanks for the reply. I checked out the workaround gist and ran a quick performance check. I was able to insert 10k rows in 10 seconds with `fast_executemany`, and it took 15 seconds with the workaround gist — around the same as turbodbc. `fast_executemany` is still the performance king and very desirable to those making many inserts.
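A timing comparison of that sort can be as simple as wrapping the `to_sql` call (a sketch only; `engine`, `df`, the table name, and the workaround callable are assumed to exist):

```python
import time

def timed_to_sql(df, engine, method=None, label=""):
    # Crude wall-clock comparison of the two insert paths.
    start = time.perf_counter()
    df.to_sql("test_table", engine, index=False, if_exists="append", method=method)
    print(f"{label}: {time.perf_counter() - start:.1f} s for {len(df)} rows")

# timed_to_sql(df, engine, method=None, label="fast_executemany")
# timed_to_sql(df, engine, method=mssql_insert_json, label="JSON workaround")
```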
Could you post an ODBC trace to compare the two?
I compared the ODBC trace logs for the two methods of creating the DataFrame, namely the "leaky" one

```python
f = open(filename, "r")
batch = f.read()
f.close()
reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
rows = [row for row in reader]
df = pd.DataFrame(rows)
```

and the "not leaky" one

```python
df = pd.read_csv(filename)
```

and they were identical. I also found that if I moved the DataFrame creation out of the loop (and only created it once) then the leaking stopped. With DataFrame creation inside the loop, if I commented out the … So it seems to be something about that particular way of creating the DataFrame (using `csv.DictReader`).
That suggests it might not be an issue inside pyODBC itself; I'm not familiar with the external libraries in use here, but if the way pyODBC is being called from them can be reproduced in a script which only uses pyODBC, that would either show its innocence or provide a good repro of the issue.
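A pyODBC-only loop along those lines might look like the following (a sketch, not the reporter's script; the DSN, table name, and data volumes are assumptions): it inserts long strings into an `nvarchar(max)` column with `fast_executemany` enabled so process memory can be watched across iterations.

```python
import pyodbc

# Placeholder DSN; any SQL Server connection string will do.
cnxn = pyodbc.connect("DSN=mssql_test", autocommit=True)
crsr = cnxn.cursor()
crsr.execute("DROP TABLE IF EXISTS leak_test")          # SQL Server 2016+ syntax
crsr.execute("CREATE TABLE leak_test (txt nvarchar(max))")

crsr.fast_executemany = True
params = [("x" * 5000,) for _ in range(1000)]  # long strings, so the DAE path is used

for i in range(100):
    crsr.executemany("INSERT INTO leak_test (txt) VALUES (?)", params)
    crsr.execute("TRUNCATE TABLE leak_test")
    # Check the process RSS here each pass; if it grows without bound,
    # the leak is reproducible with pyodbc alone.
```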
```cpp
// DAE
DAEParam *pParam = (DAEParam*)*outbuf;
Py_INCREF(cell);
pParam->cell = encoded.Detach();
```

This looks like a copy-paste from here, where

```cpp
Py_INCREF(cell);
DAEParam *pParam = (DAEParam*)*outbuf;
pParam->cell = cell;
```
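Why an extra `Py_INCREF` matters can be illustrated from pure Python (a conceptual sketch only, not pyodbc code): an unbalanced increment keeps the parameter object alive for the life of the process.

```python
import ctypes
import sys

big = "x" * 10_000_000                 # stand-in for a large nvarchar(max) value
print(sys.getrefcount(big))            # baseline (getrefcount itself adds one)

ctypes.pythonapi.Py_IncRef(ctypes.py_object(big))  # unbalanced INCREF, like the stray one above
print(sys.getrefcount(big))            # one higher, and nothing will ever drop it

del big  # our name goes away, but the refcount never reaches zero,
         # so the ~10 MB string is never freed
```

In the first snippet, ownership of `encoded` is already transferred via `Detach()`, so the additional `Py_INCREF(cell)` seemingly has no matching decref.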
Good find. Have you tried removing it?
Commenting out the `Py_INCREF`

```diff
@@ -344,11 +344,11 @@ static int PyToCType(Cursor *cur, unsigned char **outbuf, PyObject *cell, ParamI
             len = PyBytes_GET_SIZE(encoded);
             if (!pi->ColumnSize)
             {
                 // DAE
                 DAEParam *pParam = (DAEParam*)*outbuf;
-                Py_INCREF(cell);
+                // Py_INCREF(cell);
                 pParam->cell = encoded.Detach();
                 pParam->maxlen = cur->cnxn->GetMaxLength(pi->ValueType);
                 *outbuf += sizeof(DAEParam);
                 ind = cur->cnxn->need_long_data_len ? SQL_LEN_DATA_AT_EXEC((SQLLEN)len) : SQL_DATA_AT_EXEC;
             }
```

does not completely stop the leak, but it does slow it down considerably.

Before patch:

After patch:
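Per-iteration memory figures of that kind can be collected by sampling the process RSS inside the insert loop, e.g. with the third-party `psutil` package (a sketch; the loop body is whatever insert call is being tested):

```python
import os

import psutil

proc = psutil.Process(os.getpid())

def rss_mib():
    # Resident set size of the current process, in MiB.
    return proc.memory_info().rss / (1024 * 1024)

# for i in range(n_iterations):
#     ...run the executemany / to_sql call under test...
#     print(f"iteration {i}: {rss_mib():.1f} MiB")
```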
Environment
Issue
Using nvarchar(max) and `fast_executemany`, a memory leak is observed. There was a similar leak in the past (#854, pull #832) that was fixed, but this looks to be a different issue.
Turning off `fast_executemany` or using turbodbc makes the issue go away. Other column types were also tried, and no leak was observed.
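For context, `fast_executemany` can be toggled either directly on a pyodbc cursor or, in recent SQLAlchemy versions, when creating the engine that pandas uses, so switching it off for comparison is a one-line change (connection details below are placeholders):

```python
import pyodbc
from sqlalchemy import create_engine

# Plain pyodbc: a per-cursor flag.
cnxn = pyodbc.connect("DSN=mssql_test")
crsr = cnxn.cursor()
crsr.fast_executemany = True      # set to False to fall back to the slow path

# SQLAlchemy mssql+pyodbc dialect: an engine-wide flag, picked up by df.to_sql.
engine = create_engine(
    "mssql+pyodbc://user:password@mssql_test",
    fast_executemany=True,        # omit or set to False to turn it off
)
```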
Note: whether there is a leak depends on how the DataFrame is created before the call to `to_sql`. If the DataFrame is created with `csv.DictReader()` or `pd.read_sql()`, the leak occurs; creating the DataFrame with `pd.read_csv()` does not leak.
Code to reproduce.
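A loop of the general shape described in the comments, with placeholder connection string, file, and table names rather than the original script, would be:

```python
# Placeholder reconstruction, not the original repro script.
import csv
from io import StringIO

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://user:password@mssql_test", fast_executemany=True
)

filename = "data.csv"

for i in range(100):
    with open(filename, "r") as f:
        batch = f.read()
    reader = csv.DictReader(StringIO(batch), delimiter=";", quoting=csv.QUOTE_NONE)
    df = pd.DataFrame([row for row in reader])          # the "leaky" construction
    df.to_sql("test_table", engine, index=False, if_exists="append")
    # With an nvarchar(max) column and fast_executemany on, memory reportedly
    # grows every pass.
```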
Table creation DDL (the code creates this):
Results of execution: