read_csv leaks memory when used in multiple threads #19941
I was able to repro the original SO question, but not your example, for what that's worth. The only difference seems to be that I'm on win64.

```python
print('before:', process.memory_info().rss // 1e6, 'MB')
list(e.map(pd.read_csv, ['large_random.csv'] * 8))
time.sleep(1)  # let things settle
print('after:', process.memory_info().rss // 1e6, 'MB')
```

before: 125.0 MB
after: 129.0 MB
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.22.0 |
Thanks a million for tracking this down (I was the asker of the original SO question)! I can repeat this on my setup. But it looks like this is not the problem in the original question: using this modification on my real data fixes the problem with dask (initial memory: 68.98046875). I'll make a reproducible example and file this against dask.
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.77-31.58.amzn1.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.1 |
I am using read_parquet (via dask) and also have things lurking around in memory. |
@birdsarah it would be useful to isolate this problem to either dask or pandas by running your computation under both the single-threaded scheduler and the multi-threaded scheduler, and then measuring the amount of memory that your process is taking up:

```python
import psutil
psutil.Process().memory_info().rss
```

If it takes up a fair amount of memory when using the threaded scheduler but not when using the single-threaded scheduler, then I think it would be likely that we could isolate this to pandas memory management. |
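Where psutil isn't installed, a rough stand-in using only the standard library is possible; this is a sketch, and note the units of `ru_maxrss` are platform-dependent (KiB on Linux, bytes on macOS):

```python
import resource

# Peak resident set size of the current process. This is a coarse,
# platform-dependent number (KiB on Linux, bytes on macOS), but it is
# enough to compare runs under the two schedulers.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(peak > 0)
```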
Assuming you have time of course, which I realize may be a large assumption. |
Think this worked. Here's a gist. Threaded seems to take up more memory. https://gist.github.com/birdsarah/ea0b4978f25f0bb1e2389cd04b4bf287 |
I don't know if it's the same but:

```python
# 400 MB RAM usage (in htop, across the whole system)
import pandas as pd
import gc
df = pd.read_csv('df.csv')
# 2 GB
del df
gc.collect()
# 1.15 GB
```
|
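One way to tell whether memory like this is genuinely leaked or merely retained by the allocator is to compare Python-level allocations against what htop reports. A minimal sketch using the stdlib `tracemalloc`, with a plain list of buffers standing in for the parsed DataFrame:

```python
import gc
import tracemalloc

tracemalloc.start()

# Stand-in for a freshly parsed DataFrame: ~10 MB of Python-tracked memory.
data = [bytearray(1024) for _ in range(10_000)]
before, _ = tracemalloc.get_traced_memory()

del data
gc.collect()
after, _ = tracemalloc.get_traced_memory()

# If Python-tracked memory drops but the process RSS in htop does not,
# the allocator is holding freed pages rather than the object graph leaking.
print(after < before)
```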
Thanks @kuraga. This example would be more useful if people here could reproduce it easily, ideally without downloading a particular file. Are you able to create a self-contained example, similar to the one in the original post, that demonstrates this issue? |
@mrocklin Hm... Seems like I've found a magic line...

```python
with open('df.csv', 'wt') as f:
    f.write('item_id,user_id,region,city,parent_category_name,category_name,param_1,param_2,param_3,title,description,price,item_seq_number,activation_date,user_type,image,image_top_1,deal_probability\n')
    for n in range(4000):
        f.write("""ba83aefab5dc,91e2f88dd6e3,Ростовская область,Ростов-на-Дону,Бытовая электроника,Аудио и видео,"Видео, DVD и Blu-ray плееры",,,Philips bluray,"В хорошем состоянии, домашний кинотеатр с blu ray, USB. Если настроить, то работает смарт тв /
Торг",4000.0,9,2017-03-20,Private,b7f250ee3f39e1fedd77c141f273703f4a9be59db4b48a8713f112c67e29bb42,3032.0,0.43177
""")

import pandas as pd
import gc

df = pd.read_csv('df.csv')
del df
gc.collect()
```

And reading is slow...
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.14.19-calculate
machine: x86_64
processor: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
byteorder: little
LC_ALL: None
LANG: ru_RU.utf8
LOCALE: ru_RU.UTF-8
pandas: 0.23.0 |
Same problem here, running in a Docker container to load 14 GB of data; it exceeds my 64 GB memory limit very quickly. |
Also seeing the same problem as @little-eyes.
pandas: 0.23.1, docker: 17.12.1-ce |
@mrocklin, I was playing with this to see if I could track anything further down. I noticed that if I run without multithreading, I still appear to get a memory leak:

```python
process = psutil.Process()
print('before:', process.memory_info().rss // 1e6, 'MB')
for i in range(8):
    pd.read_csv(test_data, engine='python')
time.sleep(2)
print('after:', process.memory_info().rss // 1e6, 'MB')
```

(test_data is the csv written to disk by your original code)

Result 1 -
Result 2 -

This is on Linux (Fedora).

Edit: This may be nothing. If I wait longer and garbage collect, it seems to clear up. |
Relevant discussion: dask/dask#3530 setting |
When using `read_csv` in threads it appears that the Python process leaks a little memory. This is coming from this dask-focused Stack Overflow question: https://stackoverflow.com/questions/48954080/why-is-dask-read-csv-from-s3-keeping-so-much-memory

I've reduced it to a problem with `pandas.read_csv` and a `concurrent.futures.ThreadPoolExecutor`.

Code Sample, a copy-pastable example if possible
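The original code sample did not survive the scrape. A minimal harness of the shape the report describes would be something like the following sketch; the file name, pool size, and row count are assumed, and `csv.reader` stands in for `pd.read_csv` so the sketch stays stdlib-only:

```python
import concurrent.futures
import csv
import os
import tempfile

# Write a small throwaway CSV (stand-in for the larger file in the report).
path = os.path.join(tempfile.mkdtemp(), 'large_random.csv')
with open(path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['a', 'b'])
    writer.writerows([[i, i * 2] for i in range(1000)])

def load(p):
    # The report used pd.read_csv here; any parse of the whole file
    # exercises the same map-over-threads pattern.
    with open(p, newline='') as f:
        return list(csv.reader(f))

# Read the same file repeatedly across a thread pool, as described,
# then compare process RSS before and after.
with concurrent.futures.ThreadPoolExecutor(4) as e:
    frames = list(e.map(load, [path] * 8))

print(len(frames))
```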
Output of `pd.show_versions()`
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-26-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.28a0
numpy: 1.14.1
scipy: 0.19.0
pyarrow: 0.8.0
xarray: 0.8.2-264-g0b2424a
IPython: 5.1.0
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: 1.5.1
bottleneck: None
tables: 3.3.0
numexpr: 2.6.2
feather: None
matplotlib: 2.0.0
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.2.1
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: 0.0.9
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None