Read_csv leaks memory when used in multiple threads #19941

mrocklin · 2018-02-28T19:47:36Z

When using read_csv in threads it appears that the Python process leaks a little memory.

This is coming from this dask-focused stack overflow question: https://stackoverflow.com/questions/48954080/why-is-dask-read-csv-from-s3-keeping-so-much-memory

I've reduced it to a problem with pandas.read_csv and a concurrent.futures.ThreadPoolExecutor

Code Sample, a copy-pastable example if possible

# imports
import pandas as pd
import numpy as np
import time
import psutil
from concurrent.futures import ThreadPoolExecutor

# prep
process = psutil.Process()
e = ThreadPoolExecutor(8)

# prepare csv file, only need to run once
pd.DataFrame(np.random.random((100000, 50))).to_csv('large_random.csv')

# baseline computation making pandas dataframes with threasds.  This works fine

def f(_):
    return pd.DataFrame(np.random.random((1000000, 50)))

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(f, range(8)))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')

before: 57.0 MB
after: 56.0 MB

# example with read_csv, this leaks memory
print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')

before: 58.0 MB
after: 323.0 MB

Output of `pd.show_versions()`

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.28a0
numpy: 1.14.1
scipy: 0.19.0
pyarrow: 0.8.0
xarray: 0.8.2-264-g0b2424a
IPython: 5.1.0
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.2.1
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.0.9
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

chris-b1 · 2018-02-28T22:17:06Z

I was able to repro the original SO question, but not your example, for what that's worth - only difference seems to be I'm on win64.

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 125.0 MB
after: 129.0 MB

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.2
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.1
fastparquet: 0.1.0
pandas_gbq: None
pandas_datareader: 0.5.0

jeremycg · 2018-03-01T14:35:13Z

Thanks a million for tracking this down (I was the asker of the original SO question)!

I can repeat this on my setup:
before: 67.0 MB
after: 66.0 MB
before: 66.0 MB
after: 297.0 MB

But, it looks like this is not the problem in the original question - using this modification on my real data, it fixes the problem with dask:

initial memory: 68.98046875
data in memory: 11390.87109375
data frame usage: 11079.813480377197
After function call: 11649.90234375

I'll make a reproducible example, and file this against dask

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.3.final.0 python-bits: 64 OS: Linux OS-release: 4.9.77-31.58.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.26
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.4.0
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None

birdsarah · 2018-03-09T20:11:03Z

I am using read_parquet (via dask) and also have things lurking around in memory.

mrocklin · 2018-03-09T22:15:46Z

@birdsarah it would be useful to isolate this problem to either dask or pandas by running your computation under both the single-threaded scheduler and the multi-threaded scheduler

dask.set_options(get=dask.local.get_sync)
dask.set_options(get=dask.threaded.get)

And then measure the amount of memory that your process is taking up

import psutil
psutil.Process().memory_info().rss

If it takes up a fair amount of memory when using the threaded scheduler but not when using the single threaded scheduler then I think it would be likely that we could isolate this to Pandas memory management.

mrocklin · 2018-03-09T22:16:06Z

Assuming you have time of course, which I realize may be a large assumption.

birdsarah · 2018-03-09T23:39:35Z

Think this worked. Here's a gist. Threaded seems to take up more memory. https://gist.github.com/birdsarah/ea0b4978f25f0bb1e2389cd04b4bf287

kuraga · 2018-05-29T10:55:16Z

I don't know if it's the same but:

# 400 MB RAM usage (in htop, all the system)
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 2 GB
del df
gc.collect()
# 1.15 GB

mrocklin · 2018-05-29T15:23:26Z

Thanks @kuraga . This example would be more useful if people here could reproduce it easily, ideally without downloading a particular file. Are you able to create a self-contained example, similar to the example given in the original post that demonstrates this issue?

kuraga · 2018-05-29T16:48:57Z

@mrocklin Hm... Seems like I've found a magic line...

with open('df.csv', 'wt') as f:
    f.write('item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability')
    for n in range(4000):
        f.write("""ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ray, USB. Если настроить, то работает смарт тв /
Торг",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a8713f112c67e29bb42,3032.0,0.43177""")

import pandas as pd
df = pd.read_csv('df.csv')

import gc
del df
gc.collect()

And reading is slow...

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.14.19-calculate machine: x86_64 processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz byteorder: little LC_ALL: None LANG: ru_RU.utf8 LOCALE: ru_RU.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

jilongliao · 2018-07-04T00:34:12Z

same problem running in a docker container to load 14gb data, however, it exceeds my 64gb memory limit very quickly..

vanerpool · 2018-07-11T12:39:19Z

also have same problem as @little-eyes
Docker + 12GB data

# 80 MB RAM usage
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 12.6 GB
del df
gc.collect()
# 6.1 GB

pandas: 0.23.1, docker: 17.12.1-ce

birdsarah · 2018-07-30T10:52:18Z

@mrocklin, I was playing with this to see if I could track anything further down.

I noticed that if I run without multithreading, I still appear to get a memory leak:

    process = psutil.Process()
    print('before:', process.memory_info().rss // 1e6, 'MB')
    for i in range(8):
        pd.read_csv(test_data, engine='python')
    time.sleep(2)
    print('after:', process.memory_info().rss // 1e6, 'MB')

(test_data is the csv written to disk by your original code)

Result 1 - engine='python':

before: 71.0 MB
after: 113.0 MB

Result 2 - engine='c':

before: 72.0 MB
after: 119.0 MB

This is on Linux (Fedora)

$ conda list pandas

# Name                    Version                   Build  Channel
pandas                    0.23.2           py36h04863e7_0

Edit: This may be nothing. If I wait longer and garbage collect it seems to clear up.

birdsarah · 2018-08-15T02:08:01Z

Relevant discussion: dask/dask#3530

setting MALLOC_MMAP_THRESHOLD_=16384 results in a significant improvement using the original code that @mrocklin posted.

TomAugspurger added the IO CSV read_csv, to_csv label Feb 28, 2018

mrocklin mentioned this issue Mar 5, 2018

Dask holding on to memory after garbage collection dask/dask#3247

Open

birdsarah mentioned this issue Mar 13, 2018

Slow running process dying at the last hurdle dask/distributed#1836

Open

mrocklin mentioned this issue May 7, 2018

client.restart does not restart connection between scheduler and workers dask/distributed#1946

Closed

TomAugspurger mentioned this issue Oct 10, 2018

Fixing memory leaks in read_csv #23072

Merged

TomAugspurger mentioned this issue Jan 16, 2019

Memory (leak) aggregation after multiple runs with .compute() dask/distributed#2464

Closed

TomAugspurger mentioned this issue Apr 30, 2019

Dask read_csv inside a loop leads threads and RAM to increase in every step dask/dask#3670

Closed

TomAugspurger mentioned this issue Oct 22, 2020

memory leak using Dask DataFrame dask/dask#6762

Open

mroeschke added the Performance Memory or execution speed performance label Jun 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Read_csv leaks memory when used in multiple threads #19941

Read_csv leaks memory when used in multiple threads #19941

mrocklin commented Feb 28, 2018

INSTALLED VERSIONS

chris-b1 commented Feb 28, 2018

jeremycg commented Mar 1, 2018

birdsarah commented Mar 9, 2018

mrocklin commented Mar 9, 2018

mrocklin commented Mar 9, 2018

birdsarah commented Mar 9, 2018

kuraga commented May 29, 2018 •

edited

Loading

mrocklin commented May 29, 2018

kuraga commented May 29, 2018 •

edited

Loading

jilongliao commented Jul 4, 2018

vanerpool commented Jul 11, 2018 •

edited

Loading

birdsarah commented Jul 30, 2018 •

edited

Loading

birdsarah commented Aug 15, 2018 •

edited

Loading

Read_csv leaks memory when used in multiple threads #19941

Read_csv leaks memory when used in multiple threads #19941

Comments

mrocklin commented Feb 28, 2018

Code Sample, a copy-pastable example if possible

Output of pd.show_versions()

INSTALLED VERSIONS

chris-b1 commented Feb 28, 2018

jeremycg commented Mar 1, 2018

birdsarah commented Mar 9, 2018

mrocklin commented Mar 9, 2018

mrocklin commented Mar 9, 2018

birdsarah commented Mar 9, 2018

kuraga commented May 29, 2018 • edited Loading

mrocklin commented May 29, 2018

kuraga commented May 29, 2018 • edited Loading

jilongliao commented Jul 4, 2018

vanerpool commented Jul 11, 2018 • edited Loading

birdsarah commented Jul 30, 2018 • edited Loading

birdsarah commented Aug 15, 2018 • edited Loading

Output of `pd.show_versions()`

kuraga commented May 29, 2018 •

edited

Loading

kuraga commented May 29, 2018 •

edited

Loading

vanerpool commented Jul 11, 2018 •

edited

Loading

birdsarah commented Jul 30, 2018 •

edited

Loading

birdsarah commented Aug 15, 2018 •

edited

Loading