pandas.read_csv leaks memory while opening massive files with chunksize & iterator=True #21516
Thanks for replying :) @gfyoung I mean that with `trajectory = [pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000, low_memory=True) for f in filelist]` the memory usage does not change compared to the case without the `low_memory=True` option.
From #21353, I tracked the memory usage:

```python
from sys import argv  # file names are passed on the command line

import psutil
import pandas as pd

traj = []
i = 0
for f in argv[1:]:
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    traj.append(a)
    if not i % 100:
        print('%s th file, memory: ' % (i), psutil.Process().memory_info().rss / 1024**2)
    i += 1
```

and the output:

```
0 th file, memory: 61.96484375
100 th file, memory: 214.66015625
200 th file, memory: 367.32421875
300 th file, memory: 520.046875
400 th file, memory: 674.76953125
500 th file, memory: 829.5
600 th file, memory: 982.22265625
700 th file, memory: 1134.9453125
800 th file, memory: 1287.66796875
900 th file, memory: 1442.3828125
1000 th file, memory: 1597.109375
1100 th file, memory: 1749.84765625
1200 th file, memory: 1932.57421875
1300 th file, memory: 2122.796875
1400 th file, memory: 2313.01953125
1500 th file, memory: 2503.2421875
...
4600 th file, memory: 8414.0234375
4700 th file, memory: 8604.24609375
4800 th file, memory: 8794.4765625
4900 th file, memory: 8984.6953125
5000 th file, memory: 9174.921875
5100 th file, memory: 9367.14453125
5200 th file, memory: 9557.37109375
5300 th file, memory: 9747.59375
5400 th file, memory: 9937.81640625
5500 th file, memory: 10128.04296875
5600 th file, memory: 10320.26953125
```

It turns out that the memory increases by ~1.9 MB per file, while each file used in this test is only about 800 kB. I also tried:

```python
from sys import argv
from ctypes import cdll, CDLL

import psutil
import pandas as pd

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

traj = []
i = 0
for f in argv[1:]:
    libc.malloc_trim(0)
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    traj.append(a)
    if not i % 100:
        print('%s th file, memory: ' % (i), psutil.Process().memory_info().rss / 1024**2)
    i += 1
```

The results are the same as above; the memory usage still increases quickly.
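For comparison, here is a minimal sketch (not from the original report) of the same measurement that keeps no reference to the readers, so each `TextFileReader` and its file handle can be garbage collected; `argv[1:]` is still assumed to hold the file paths:

```python
from sys import argv

import psutil
import pandas as pd

for i, f in enumerate(argv[1:]):
    reader = pd.read_csv(f, header=None, delim_whitespace=True,
                         chunksize=10000, comment='#')
    first_chunk = next(reader)   # consume only the first 10000 rows
    del reader, first_chunk      # drop the references so the memory can be reclaimed
    if not i % 100:
        print('%s th file, memory: %.2f MB'
              % (i, psutil.Process().memory_info().rss / 1024**2))
```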
Hmm... admittedly, this is the first time I've seen so many of these issues regarding memory leakage in `read_csv`. cc @jreback
@Shirui816 You're appending the result of `pd.read_csv` to a list. Objects kept in a list can't be garbage collected, so you're keeping thousands of file handles and their related iterator objects open, and we would expect memory use to grow. I've confirmed that memory does not grow if you remove the append.

If the issue is that memory use is growing faster than you expect from the file sizes (per your comment, "the memory increases ~1.9 MB per file. The files used in this test are about 800 kB each."), note that the call above does not actually read the whole file. Because you pass the `chunksize` parameter, it creates a persistent iterator and file handle on the file. If you only want the first 10000 lines of each file, use `nrows=10000` instead of `chunksize=10000`. This throws away the handle and the rest of the iterator object and keeps just your data.
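To make the distinction concrete, here is a small sketch (not from the thread; `data.txt` is a placeholder file name) contrasting the two calls:

```python
import pandas as pd

# With chunksize, read_csv returns a TextFileReader: a lazy iterator that
# keeps the underlying file handle open until it is exhausted or closed.
reader = pd.read_csv('data.txt', header=None, delim_whitespace=True, chunksize=10000)
first_chunk = next(reader)   # a DataFrame with (up to) the first 10000 rows
reader.close()               # release the file handle explicitly

# With nrows, read_csv eagerly reads at most the first 10000 rows into a
# DataFrame and returns; no iterator or open handle is kept afterwards.
df = pd.read_csv('data.txt', header=None, delim_whitespace=True, nrows=10000)
```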
@Liam3851 Thank you very much for the explanation. I increased the file size and re-ran the test; the memory gain per file was still about 1.9 MB. This reader object is apparently much larger than a plain file handle.
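One way to quantify that per-reader overhead (not part of the original thread; `some_file.txt` is a placeholder) is to look at the RSS delta around creating a single reader:

```python
import psutil
import pandas as pd

proc = psutil.Process()

before = proc.memory_info().rss
reader = pd.read_csv('some_file.txt', header=None, delim_whitespace=True,
                     chunksize=10000, comment='#')
after = proc.memory_info().rss

# Rough estimate of the memory retained by one TextFileReader (parser buffers,
# file handle, ...); compare with the ~1.9 MB/file growth reported above.
print('per-reader overhead: %.2f MB' % ((after - before) / 1024**2))
reader.close()
```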
I am using Anaconda and my pandas version is 0.23.1. When dealing with a single large file, setting `chunksize` or `iterator=True` works fine and memory usage stays low. The problem arises when I deal with 5000+ files (file names are in `filelist`): the memory usage rises very quickly and soon exceeds 20 GB. However, `trajectory = [open(f, 'r')....]` and reading 10000 lines from each file works fine. I also tried the `low_memory=True` option, but it does not help. Both the `engine='python'` and `memory_map=<some file>` options solve the memory problem, but then, when I use the data, the multi-threading of MKL-FFT no longer works.
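For what it's worth, a pattern like the following sketch keeps at most one reader open at a time, so memory stays bounded even across thousands of files (this is not from the thread; `filelist` and `process` are placeholders):

```python
import pandas as pd

filelist = ['traj_0001.txt', 'traj_0002.txt']   # placeholder file names

def process(chunk):
    """Placeholder for whatever analysis is done on each 10000-row chunk."""
    return chunk.mean()

results = []
for f in filelist:
    reader = pd.read_csv(f, delim_whitespace=True, header=None,
                         chunksize=10000, comment='#')
    for chunk in reader:      # chunks are read lazily, one at a time
        results.append(process(chunk))
    reader.close()            # at most one file handle is open at any moment
```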