Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read_csv leaks memory when used in multiple threads #19941

Open
mrocklin opened this issue Feb 28, 2018 · 13 comments
Open

Read_csv leaks memory when used in multiple threads #19941

mrocklin opened this issue Feb 28, 2018 · 13 comments
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance

Comments

@mrocklin
Copy link
Contributor

When using read_csv in threads it appears that the Python process leaks a little memory.

This is coming from this dask-focused stack overflow question: https://stackoverflow.com/questions/48954080/why-is-dask-read-csv-from-s3-keeping-so-much-memory

I've reduced it to a problem with pandas.read_csv and a concurrent.futures.ThreadPoolExecutor

Code Sample, a copy-pastable example if possible

# imports
import pandas as pd
import numpy as np
import time
import psutil
from concurrent.futures import ThreadPoolExecutor

# prep
process = psutil.Process()
e = ThreadPoolExecutor(8)
# prepare csv file, only need to run once
pd.DataFrame(np.random.random((100000, 50))).to_csv('large_random.csv')
# baseline computation making pandas dataframes with threasds.  This works fine

def f(_):
    return pd.DataFrame(np.random.random((1000000, 50)))

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(f, range(8)))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 57.0 MB
after: 56.0 MB
# example with read_csv, this leaks memory
print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 58.0 MB
after: 323.0 MB

Output of pd.show_versions()

In [2]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.28a0
numpy: 1.14.1
scipy: 0.19.0
pyarrow: 0.8.0
xarray: 0.8.2-264-g0b2424a
IPython: 5.1.0
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.2.1
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.0.9
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@TomAugspurger TomAugspurger added the IO CSV read_csv, to_csv label Feb 28, 2018
@chris-b1
Copy link
Contributor

I was able to repro the original SO question, but not your example, for what that's worth - only difference seems to be I'm on win64.

print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
before: 125.0 MB
after: 129.0 MB

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.6.1.final.0 python-bits: 64 OS: Windows OS-release: 7 machine: AMD64 processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel byteorder: little LC_ALL: None LANG: None LOCALE: None.None

pandas: 0.22.0
pytest: 3.2.1
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.2
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.7.1
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.6.3
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.0.0
xlwt: None
xlsxwriter: 0.9.6
lxml: 3.8.0
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.11
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.1
fastparquet: 0.1.0
pandas_gbq: None
pandas_datareader: 0.5.0

@jeremycg
Copy link

jeremycg commented Mar 1, 2018

Thanks a million for tracking this down (I was the asker of the original SO question)!

I can repeat this on my setup:
before: 67.0 MB
after: 66.0 MB
before: 66.0 MB
after: 297.0 MB

But, it looks like this is not the problem in the original question - using this modification on my real data, it fixes the problem with dask:

initial memory: 68.98046875
data in memory: 11390.87109375
data frame usage: 11079.813480377197
After function call: 11649.90234375

I'll make a reproducible example, and file this against dask

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.3.final.0 python-bits: 64 OS: Linux OS-release: 4.9.77-31.58.amzn1.x86_64 machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.1
pytest: None
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.26
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: 0.4.0
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: 0.1.2
pandas_gbq: None
pandas_datareader: None

@birdsarah
Copy link
Contributor

I am using read_parquet (via dask) and also have things lurking around in memory.

@mrocklin
Copy link
Contributor Author

mrocklin commented Mar 9, 2018

@birdsarah it would be useful to isolate this problem to either dask or pandas by running your computation under both the single-threaded scheduler and the multi-threaded scheduler

dask.set_options(get=dask.local.get_sync)
dask.set_options(get=dask.threaded.get)

And then measure the amount of memory that your process is taking up

import psutil
psutil.Process().memory_info().rss

If it takes up a fair amount of memory when using the threaded scheduler but not when using the single threaded scheduler then I think it would be likely that we could isolate this to Pandas memory management.

@mrocklin
Copy link
Contributor Author

mrocklin commented Mar 9, 2018

Assuming you have time of course, which I realize may be a large assumption.

@birdsarah
Copy link
Contributor

Think this worked. Here's a gist. Threaded seems to take up more memory. https://gist.github.com/birdsarah/ea0b4978f25f0bb1e2389cd04b4bf287

@kuraga
Copy link

kuraga commented May 29, 2018

I don't know if it's the same but:

# 400 MB RAM usage (in htop, all the system)
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 2 GB
del df
gc.collect()
# 1.15 GB

@mrocklin
Copy link
Contributor Author

Thanks @kuraga . This example would be more useful if people here could reproduce it easily, ideally without downloading a particular file. Are you able to create a self-contained example, similar to the example given in the original post that demonstrates this issue?

@kuraga
Copy link

kuraga commented May 29, 2018

@mrocklin Hm... Seems like I've found a magic line...

with open('df.csv', 'wt') as f:
    f.write('item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability')
    for n in range(4000):
        f.write("""ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ray, USB. Если настроить, то работает смарт тв /
Торг",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a8713f112c67e29bb42,3032.0,0.43177""")
import pandas as pd
df = pd.read_csv('df.csv')

import gc
del df
gc.collect()

And reading is slow...

INSTALLED VERSIONS ------------------ commit: None python: 3.6.3.final.0 python-bits: 64 OS: Linux OS-release: 4.14.19-calculate machine: x86_64 processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz byteorder: little LC_ALL: None LANG: ru_RU.utf8 LOCALE: ru_RU.UTF-8

pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 39.1.0
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jilongliao
Copy link

same problem running in a docker container to load 14gb data, however, it exceeds my 64gb memory limit very quickly..

@vanerpool
Copy link

vanerpool commented Jul 11, 2018

also have same problem as @little-eyes
Docker + 12GB data

# 80 MB RAM usage
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 12.6 GB
del df
gc.collect()
# 6.1 GB

pandas: 0.23.1, docker: 17.12.1-ce

@birdsarah
Copy link
Contributor

birdsarah commented Jul 30, 2018

@mrocklin, I was playing with this to see if I could track anything further down.

I noticed that if I run without multithreading, I still appear to get a memory leak:

    process = psutil.Process()
    print('before:', process.memory_info().rss // 1e6, 'MB')
    for i in range(8):
        pd.read_csv(test_data, engine='python')
    time.sleep(2)
    print('after:', process.memory_info().rss // 1e6, 'MB')

(test_data is the csv written to disk by your original code)

Result 1 - engine='python':

before: 71.0 MB
after: 113.0 MB

Result 2 - engine='c':

before: 72.0 MB
after: 119.0 MB

This is on Linux (Fedora)

$ conda list pandas

# Name                    Version                   Build  Channel
pandas                    0.23.2           py36h04863e7_0 

Edit: This may be nothing. If I wait longer and garbage collect it seems to clear up.

@birdsarah
Copy link
Contributor

birdsarah commented Aug 15, 2018

Relevant discussion: dask/dask#3530

setting MALLOC_MMAP_THRESHOLD_=16384 results in a significant improvement using the original code that @mrocklin posted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Performance Memory or execution speed performance
Projects
None yet
Development

No branches or pull requests

9 participants