BUG: Memory leak after pd.read_csv() with default parameters #51667
Comments
Hi, thanks for your report. Contributions are welcome.
Hi @phofl, thank you for triaging this issue; it was my pleasure to report it. Could you point out which area of contribution you are suggesting? Further investigation shows that the parquet interface also suffers from a similar issue:
# Credits: https://github.com/pandas-dev/pandas/issues/21353
import gc
import os.path
import psutil
import sys
import pandas as pd
from memory_profiler import profile


@profile
def read_from_file():
    memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_before_read: ", memory_before_read)

    df = pd.read_parquet(file_name, engine="fastparquet")

    print("df.shape: ", df.shape)
    memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_read: ", memory_after_read)

    del df

    # Attempt to trace memory leak
    gc.collect()

    memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_gc: ", memory_after_gc)
    print("memory leak: ", memory_after_gc - memory_before_read)

    if len(gc.garbage) > 0:
        # Inspect the output of the garbage collector
        print("-" * 120)
        print("ERROR: gc.garbage:")
        print("-" * 120)
        print(gc.garbage)
        print()
    '''
    The output of the garbage collector will show you the objects that were not successfully freed up by the
    garbage collector. These objects are likely the source of the memory leak.

    Once you have identified the objects that are causing the memory leak, you can inspect your code to
    determine why these objects are not being garbage collected properly. Common causes of memory leaks
    include circular references, which occur when objects reference each other in a way that prevents them
    from being garbage collected, and forgetting to close file handles or database connections.

    You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
    '''


if __name__ == '__main__':
    '''
    Usage: python ./memory_leak_when_read_parquet.py 10000000 20
    '''
    m = int(sys.argv[1])
    n = int(sys.argv[2])
    print("pd.__version__: ", pd.__version__)
    file_name = 'df_{}_{}.parquet'.format(m, n)
    if not os.path.exists(file_name):
        # Create the dataframe
        df = pd.DataFrame({'c{}'.format(i): 1 for i in range(n)}, index=[0])
        df.index.name = 'row'
        df = pd.concat([df] * m, ignore_index=True)
        df.to_parquet(file_name, engine="fastparquet", compression="snappy")
    read_from_file()

Sample run output, indicating 17.16015625 MB of leaked memory with a similar dataframe:
python.exe memory_leak_when_read_parquet.py 10000000 20
pd.__version__: 1.5.3
memory_before_read: 1561.9375
df.shape: (10000000, 20)
memory_after_read: 3089.140625
memory_after_gc: 1579.09765625
memory leak: 17.16015625
Filename: memory_leak_when_read_parquet.py
Line # Mem usage Increment Occurrences Line Contents
=============================================================
13 1561.9 MiB 1561.9 MiB 1 @profile
14 def read_from_file():
15 1561.9 MiB 0.1 MiB 1 memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
16 1562.0 MiB 0.0 MiB 1 print("memory_before_read: ", memory_before_read)
17
18 3089.1 MiB 1527.2 MiB 1 df = pd.read_parquet(file_name, engine="fastparquet")
19
20 3089.1 MiB 0.0 MiB 1 print("df.shape: ", df.shape)
21 3089.1 MiB 0.0 MiB 1 memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
22 3089.1 MiB 0.0 MiB 1 print("memory_after_read: ", memory_after_read)
23
24 1563.3 MiB -1525.9 MiB 1 del df
25
26 # Attempt to trace memory leak
27 1579.1 MiB 15.8 MiB 1 gc.collect()
28
29 1579.1 MiB 0.0 MiB 1 memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
30 1579.1 MiB 0.0 MiB 1 print("memory_after_gc: ", memory_after_gc)
31 1579.1 MiB 0.0 MiB 1 print("memory leak: ", memory_after_gc - memory_before_read)
32
33 1579.1 MiB 0.0 MiB 1 if len(gc.garbage) > 0:
34 # Inspect the output of the garbage collector
35 print("-" * 120)
36 print("ERROR: gc.garbage:")
37 print("-" * 120)
38 print(gc.garbage)
39 print()
40 '''
41 The output of the garbage collector will show you the objects that were not successfully freed up by the
42 garbage collector. These objects are likely the source of the memory leak.
43
44 Once you have identified the objects that are causing the memory leak, you can inspect your code to
45 determine why these objects are not being garbage collected properly. Common causes of memory leaks
46 include circular references, which occur when objects reference each other in a way that prevents them
47 from being garbage collected, and forgetting to close file handles or database connections.
48
49 You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
50 '''
Process finished with exit code 0
The memory leak with parquet files looks less severe than with CSV files, but there is still a leak.
I probably have the same issue; I am using parquet with pyathena. I would be happy to contribute if you have something to start from.
Hi, the CSV file that I'm reading is relatively small, but after hours of running my machine runs out of memory. After spending some time with memory_profiler I found that pandas.read_csv() is the culprit. I'm using pandas 2.2.0; is there a fix for this issue already?
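One way to check whether the retained memory is a genuine leak or just glibc allocator caching is to force the allocator to release freed pages after dropping the frame. A minimal sketch, assuming a Linux/glibc system; the file path is a placeholder and malloc_trim is a diagnostic, not a pandas fix:

import ctypes
import gc

import pandas as pd
import psutil


def rss_mb():
    # Resident set size of the current process, in MiB
    return psutil.Process().memory_info().rss / 1024 ** 2


print("before read:", rss_mb())
df = pd.read_csv("data.csv")  # placeholder path; default parameters
del df
gc.collect()
print("after gc:", rss_mb())

# Ask glibc to hand freed arena memory back to the OS (Linux/glibc only).
# If RSS drops noticeably here, the growth is allocator caching/fragmentation
# rather than memory that pandas still references.
ctypes.CDLL("libc.so.6").malloc_trim(0)
print("after malloc_trim:", rss_mb())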
Hi, I walked the same road as @OmerMajNition: I had a small memory leak and it took hours to trace the cause to pandas.read_csv(). Update:
I had a similar problem with pandas 2.2.0 and Python 3.11. A simple script demonstrating the problem (a sketch along these lines is shown below) resulted in steadily increasing memory usage. Using gc.collect() or sleep() doesn't change the memory usage. After downgrading from 2.2.0 to 2.1.4, as @objective-jh suggested, the memory usage stays stable.
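For reference, a loop of the kind described above might look like this (a sketch; the file path and iteration count are illustrative):

import gc

import pandas as pd
import psutil

FILE_NAME = "data.csv"  # placeholder path

for i in range(20):
    df = pd.read_csv(FILE_NAME)
    del df
    gc.collect()
    rss_mb = psutil.Process().memory_info().rss / 1024 ** 2
    print(f"iteration {i}: rss = {rss_mb:.1f} MiB")

On an affected version the printed RSS keeps climbing across iterations instead of returning to a stable baseline.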
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
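A minimal sketch of such a reproduction, mirroring the parquet script above but using pd.read_csv() with default parameters; the file name is a placeholder for any sufficiently large CSV:

import gc

import pandas as pd
import psutil

FILE_NAME = "large.csv"  # placeholder path


def rss_mb():
    return psutil.Process().memory_info().rss / 1024 ** 2


memory_before_read = rss_mb()
df = pd.read_csv(FILE_NAME)  # default parameters (C engine)
print("df.shape: ", df.shape)
memory_after_read = rss_mb()

del df
gc.collect()
memory_after_gc = rss_mb()

print("memory_before_read: ", memory_before_read)
print("memory_after_read: ", memory_after_read)
print("memory_after_gc: ", memory_after_gc)
print("memory leak: ", memory_after_gc - memory_before_read)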
Issue Description
Memory still leaks from pd.read_csv(), despite the fix from https://github.com/pandas-dev/pandas/pull/24837/commits still being intact on the main branch.
Sample output in my run:
In the run above, 27.25390625 MB of memory is leaked. IMHO, that is a large leak for a CSV file of 391 MB.
Expected Behavior
There should be no memory leak after
del df
and garbage collection. This memory leak was originally detected via a VPS crash in the cloud. After some digging, I found that the leak occurs on both Linux and Windows, so you should be able to reproduce it on any platform.
The leak most likely originates from a C-level buffer that is either over-allocated or under-freed. IMHO you could consider running a memory leak detection and management tool of your choice to eliminate memory leaks from the C code entirely.
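Before reaching for C-level tools, a sketch like the following can at least confirm whether the retained memory is visible to the Python allocator at all (tracemalloc only tracks allocations made through Python); if it is not, that points the search at the C parser buffers or the libc allocator. The file path is a placeholder:

import gc
import tracemalloc

import pandas as pd
import psutil

tracemalloc.start()
baseline, _ = tracemalloc.get_traced_memory()

df = pd.read_csv("large.csv")  # placeholder path
del df
gc.collect()

traced, peak = tracemalloc.get_traced_memory()
rss = psutil.Process().memory_info().rss / 1024 ** 2
# If traced memory is back near the baseline while process RSS stays high, the
# retained memory is held below the Python allocator (C buffers or libc arenas).
print("traced (MiB):", (traced - baseline) / 1024 ** 2)
print("peak traced (MiB):", peak / 1024 ** 2)
print("process rss (MiB):", rss)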
Related issues FYR:
#21353
#37031
#49582
I hope you will help prioritize this issue, as I expect more users will run into it as time passes.
Installed Versions