
BUG: Memory leak after pd.read_csv() with default parameters #51667

Open · 3 tasks done
viper7882 opened this issue Feb 27, 2023 · 6 comments
Labels
Bug · IO CSV (read_csv, to_csv) · Performance (memory or execution speed)

Comments

@viper7882

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Credits: https://github.com/pandas-dev/pandas/issues/21353
import gc
import os.path

import psutil
import sys

import pandas as pd

from memory_profiler import profile


@profile
def read_from_file():
    memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_before_read: ", memory_before_read)

    df = pd.read_csv(file_name)

    print("df.shape: ", df.shape)
    memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_read: ", memory_after_read)

    del df

    # Attempt to trace memory leak
    gc.collect()

    memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_gc: ", memory_after_gc)
    print("memory leak: ", memory_after_gc - memory_before_read)

    if len(gc.garbage) > 0:
        # Inspect the output of the garbage collector
        print("-" * 120)
        print("ERROR: gc.garbage:")
        print("-" * 120)
        print(gc.garbage)
        print()
        '''
        The output of the garbage collector will show you the objects that were not successfully freed up by the
        garbage collector. These objects are likely the source of the memory leak.

        Once you have identified the objects that are causing the memory leak, you can inspect your code to
        determine why these objects are not being garbage collected properly. Common causes of memory leaks
        include circular references, which occur when objects reference each other in a way that prevents them
        from being garbage collected, and forgetting to close file handles or database connections.

        You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
        '''


if __name__ == '__main__':
    '''
    Usage: python ./memory_leak_when_read_csv.py 10000000 20
    '''
    m = int(sys.argv[1])
    n = int(sys.argv[2])

    print("pd.__version__: ", pd.__version__)

    file_name = 'df_{}_{}.csv'.format(m, n)
    if not os.path.exists(file_name):
        mode = "wt"
        with open(file_name, mode) as f:
            for i in range(n - 1):
                f.write('c' + str(i) + ',')
            f.write('c' + str(n - 1) + '\n')
            for j in range(m):
                for i in range(n - 1):
                    f.write('1,')
                f.write('1\n')

    read_from_file()
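
Note that on Python 3, gc.garbage stays empty in normal operation, so the inspection branch above will not fire by itself. A minimal sketch of how to make it trigger, assuming you want every unreachable object saved for inspection rather than freed:

import gc

# DEBUG_SAVEALL makes the collector append every unreachable object to
# gc.garbage instead of freeing it, so the inspection branch can run.
gc.set_debug(gc.DEBUG_SAVEALL)
gc.collect()
print(len(gc.garbage), "unreachable objects saved for inspection")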

Issue Description

Memory still leaks after pd.read_csv() with default parameters, even though the fix from https://github.com/pandas-dev/pandas/pull/24837/commits is still intact on the main branch.

Sample output from my run:

python memory_leak_when_read_csv.py 10000000 20 
pd.__version__: 1.5.3
memory_before_read:  13.765625
df.shape:  (10000000, 20)
memory_after_read:  1549.0234375
memory_after_gc:  41.01953125
memory leak:  27.25390625
Filename: memory_leak_when_read_csv.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    13     13.7 MiB     13.7 MiB           1   @profile
    14                                         def read_from_file():
    15     13.8 MiB      0.1 MiB           1       memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    16     13.8 MiB      0.0 MiB           1       print("memory_before_read: ", memory_before_read)
    17                                         
    18   1549.0 MiB   1535.2 MiB           1       df = pd.read_csv(file_name)
    19                                         
    20   1549.0 MiB      0.0 MiB           1       print("df.shape: ", df.shape)
    21   1549.0 MiB      0.0 MiB           1       memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    22   1549.0 MiB      0.0 MiB           1       print("memory_after_read: ", memory_after_read)
    23                                         
    24     23.1 MiB  -1525.9 MiB           1       del df
    25                                         
    26                                             # Attempt to trace memory leak
    27     41.0 MiB     17.9 MiB           1       gc.collect()
    28                                         
    29     41.0 MiB      0.0 MiB           1       memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    30     41.0 MiB      0.0 MiB           1       print("memory_after_gc: ", memory_after_gc)
    31     41.0 MiB      0.0 MiB           1       print("memory leak: ", memory_after_gc - memory_before_read)
    32                                         
    33     41.0 MiB      0.0 MiB           1       if len(gc.garbage) > 0:
    34                                                 # Inspect the output of the garbage collector
    35                                                 print("-" * 120)
    36                                                 print("ERROR: gc.garbage:")
    37                                                 print("-" * 120)
    38                                                 print(gc.garbage)
    39                                                 print()
    40                                                 '''
    41                                                 The output of the garbage collector will show you the objects that were not successfully freed up by the
    42                                                 garbage collector. These objects are likely the source of the memory leak.
    43                                         
    44                                                 Once you have identified the objects that are causing the memory leak, you can inspect your code to
    45                                                 determine why these objects are not being garbage collected properly. Common causes of memory leaks
    46                                                 include circular references, which occur when objects reference each other in a way that prevents them
    47                                                 from being garbage collected, and forgetting to close file handles or database connections.
    48                                         
    49                                                 You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
    50                                                 '''



Process finished with exit code 0

In the run above, 27.25390625 MB of memory is leaked. IMHO that is a huge leak for a 391 MB CSV file.

Expected Behavior

There should be no memory leak after del df once the garbage collector has run.

This memory leak was originally detected when a VPS crashed in the cloud. After some digging, I discovered that the leak occurs on both Linux and Windows, so logically you should be able to reproduce it on any platform.

The leak most likely originates from a C-level buffer that is either over-allocated or under-freed. IMHO you could consider running a memory-leak detection tool of your choice against the C code to eliminate the leak completely.
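
To separate a genuine native-side leak from Python-level retention, one option is to compare tracemalloc's accounting against the process RSS. A rough sketch, reusing the df_10000000_20.csv file generated by the script above:

import gc
import tracemalloc

import pandas as pd
import psutil

tracemalloc.start()
rss_before = psutil.Process().memory_info().rss

df = pd.read_csv("df_10000000_20.csv")
del df
gc.collect()

current, peak = tracemalloc.get_traced_memory()
rss_after = psutil.Process().memory_info().rss
# If the RSS growth far exceeds the Python-level figure, the retained
# memory was allocated outside the traced Python heap (e.g. by the C
# parser) rather than held by leaked Python objects.
print("python-level bytes still allocated:", current)
print("rss growth in bytes:", rss_after - rss_before)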

Related issues FYR:
#21353
#37031
#49582

I hope you will prioritize this issue, as I trust more users will run into it as time passes.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 2e218d1
python           : 3.10.2.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22621
machine          : AMD64
processor        : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_united states.1252
pandas           : 1.5.3
numpy            : 1.24.2
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 67.0.0
pip              : 23.0.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : None
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : None
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
zstandard        : None
tzdata           : None
@viper7882 viper7882 added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 27, 2023
@phofl phofl added IO CSV read_csv, to_csv and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 27, 2023
@phofl
Member

phofl commented Feb 27, 2023

Hi, thanks for your report. Contributions are welcome.

@viper7882
Author

Hi @phofl,

Thank you for triaging this issue; it is my pleasure to report it. Could you clarify which area of contribution you are suggesting?


Further investigation shows that the parquet interface suffers from a similar issue:

# Credits: https://github.com/pandas-dev/pandas/issues/21353
import gc
import os.path

import psutil
import sys

import pandas as pd

from memory_profiler import profile


@profile
def read_from_file():
    memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_before_read: ", memory_before_read)

    df = pd.read_parquet(file_name, engine="fastparquet")

    print("df.shape: ", df.shape)
    memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_read: ", memory_after_read)

    del df

    # Attempt to trace memory leak
    gc.collect()

    memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    print("memory_after_gc: ", memory_after_gc)
    print("memory leak: ", memory_after_gc - memory_before_read)

    if len(gc.garbage) > 0:
        # Inspect the output of the garbage collector
        print("-" * 120)
        print("ERROR: gc.garbage:")
        print("-" * 120)
        print(gc.garbage)
        print()
        '''
        The output of the garbage collector will show you the objects that were not successfully freed up by the
        garbage collector. These objects are likely the source of the memory leak.

        Once you have identified the objects that are causing the memory leak, you can inspect your code to
        determine why these objects are not being garbage collected properly. Common causes of memory leaks
        include circular references, which occur when objects reference each other in a way that prevents them
        from being garbage collected, and forgetting to close file handles or database connections.

        You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
        '''


if __name__ == '__main__':
    '''
    Usage: python ./memory_leak_when_read_parquet.py 10000000 20
    '''
    m = int(sys.argv[1])
    n = int(sys.argv[2])

    print("pd.__version__: ", pd.__version__)

    file_name = 'df_{}_{}.parquet'.format(m, n)
    if not os.path.exists(file_name):
        # Create the dataframe
        df = pd.DataFrame({'c{}'.format(i): 1 for i in range(n)}, index=[0])
        df.index.name = 'row'
        df = pd.concat([df] * m, ignore_index=True)
        df.to_parquet(file_name, engine="fastparquet", compression="snappy")

    read_from_file()

Sample output indicating a 17.16015625 MB leak with a similar dataframe:

python.exe memory_leak_when_read_parquet.py 10000000 20 
pd.__version__:  1.5.3
memory_before_read:  1561.9375
df.shape:  (10000000, 20)
memory_after_read:  3089.140625
memory_after_gc:  1579.09765625
memory leak:  17.16015625
Filename: memory_leak_when_read_parquet.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    13   1561.9 MiB   1561.9 MiB           1   @profile
    14                                         def read_from_file():
    15   1561.9 MiB      0.1 MiB           1       memory_before_read = psutil.Process().memory_info().rss / 1024 ** 2
    16   1562.0 MiB      0.0 MiB           1       print("memory_before_read: ", memory_before_read)
    17                                         
    18   3089.1 MiB   1527.2 MiB           1       df = pd.read_parquet(file_name, engine="fastparquet")
    19                                         
    20   3089.1 MiB      0.0 MiB           1       print("df.shape: ", df.shape)
    21   3089.1 MiB      0.0 MiB           1       memory_after_read = psutil.Process().memory_info().rss / 1024 ** 2
    22   3089.1 MiB      0.0 MiB           1       print("memory_after_read: ", memory_after_read)
    23                                         
    24   1563.3 MiB  -1525.9 MiB           1       del df
    25                                         
    26                                             # Attempt to trace memory leak
    27   1579.1 MiB     15.8 MiB           1       gc.collect()
    28                                         
    29   1579.1 MiB      0.0 MiB           1       memory_after_gc = psutil.Process().memory_info().rss / 1024 ** 2
    30   1579.1 MiB      0.0 MiB           1       print("memory_after_gc: ", memory_after_gc)
    31   1579.1 MiB      0.0 MiB           1       print("memory leak: ", memory_after_gc - memory_before_read)
    32                                         
    33   1579.1 MiB      0.0 MiB           1       if len(gc.garbage) > 0:
    34                                                 # Inspect the output of the garbage collector
    35                                                 print("-" * 120)
    36                                                 print("ERROR: gc.garbage:")
    37                                                 print("-" * 120)
    38                                                 print(gc.garbage)
    39                                                 print()
    40                                                 '''
    41                                                 The output of the garbage collector will show you the objects that were not successfully freed up by the
    42                                                 garbage collector. These objects are likely the source of the memory leak.
    43                                         
    44                                                 Once you have identified the objects that are causing the memory leak, you can inspect your code to
    45                                                 determine why these objects are not being garbage collected properly. Common causes of memory leaks
    46                                                 include circular references, which occur when objects reference each other in a way that prevents them
    47                                                 from being garbage collected, and forgetting to close file handles or database connections.
    48                                         
    49                                                 You can also use third-party tools like memory_profiler or objgraph to help you track down memory leaks.
    50                                                 '''



Process finished with exit code 0

Looks like the memory leak with the parquet file is less severe than with the CSV file, but there is still a leak.
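
One way to narrow down whether the residue on the parquet side comes from pandas itself or from the engine is to run the same read with both engines and compare the leftover RSS. A rough sketch, assuming both fastparquet and pyarrow are installed and the df_10000000_20.parquet file from the script above exists:

import gc

import pandas as pd
import psutil

proc = psutil.Process()
for engine in ("fastparquet", "pyarrow"):
    gc.collect()
    before = proc.memory_info().rss
    df = pd.read_parquet("df_10000000_20.parquet", engine=engine)
    del df
    gc.collect()
    after = proc.memory_info().rss
    # Residual growth that differs strongly by engine points at the
    # engine rather than at pandas itself; note that allocator caching
    # can still inflate both figures somewhat.
    print(engine, "residual MB:", (after - before) / 1024 ** 2)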

@innicoder

I probably have the same issue; I am using parquet with PyAthena:
laughingman7743/PyAthena#417

I would be happy to contribute if you have a starting point.

@OmerMajNition

Hi,

The CSV file I'm reading is relatively small, yet after hours of running my machine runs out of memory. After spending some time with memory_profiler, I found that pandas.read_csv() is the culprit.

I'm using pandas 2.2.0; is there a fix for this issue already?

@objective-jh

objective-jh commented Feb 23, 2024

Hi,

I walked the same road as @OmerMajNition: I had a small memory leak, and it took hours to trace it to pandas.read_csv().
I am also on pandas 2.2.0, but Python 3.9, so maybe the Python version is a factor?
I ended up downgrading pandas to 2.1.4, which fixed the leak.

Update:
I ran the same script on Python 3.11; it leaks there as well.

@zorka5

zorka5 commented Apr 17, 2024

I had a similar problem with pandas 2.2.0, Python 3.11.

Simple script demonstrating the problem:

import gc
import time

import pandas as pd
import psutil

print("pandas version: ", pd.__version__)

filepath = "testfile.csv"  # ~190 kB

for i in range(1, 30001):
    df = pd.read_csv(filepath)
    process = psutil.Process()
    if i % 1000 == 0:
        # gc.collect()
        print(
            f"{i}: {process.memory_info().rss} b / {process.memory_info().rss/1e6} mb / {process.memory_info().rss/1e9} gb"
        )
        if process.memory_info().rss / 1e9 > 2:
            break
        # time.sleep(5)

This resulted in steadily increasing memory usage:

pandas version:  2.2.0
1000: 277692416 b / 277.692416 mb / 0.277692416 gb
2000: 488382464 b / 488.382464 mb / 0.488382464 gb
3000: 698818560 b / 698.81856 mb / 0.69881856 gb
4000: 909266944 b / 909.266944 mb / 0.909266944 gb
5000: 1119588352 b / 1119.588352 mb / 1.119588352 gb
6000: 1330180096 b / 1330.180096 mb / 1.330180096 gb
7000: 1540481024 b / 1540.481024 mb / 1.540481024 gb
8000: 1750970368 b / 1750.970368 mb / 1.750970368 gb
9000: 1961480192 b / 1961.480192 mb / 1.961480192 gb
10000: 2171985920 b / 2171.98592 mb / 2.17198592 gb

Using gc.collect() or sleep() doesn't change the memory usage. After downgrading from 2.2.0 to 2.1.4 as @objective-jh suggested:

pandas version:  2.1.4
1000: 65118208 b / 65.118208 mb / 0.065118208 gb
2000: 65605632 b / 65.605632 mb / 0.065605632 gb
3000: 65556480 b / 65.55648 mb / 0.06555648 gb
4000: 65605632 b / 65.605632 mb / 0.065605632 gb
5000: 65617920 b / 65.61792 mb / 0.06561792 gb
6000: 65343488 b / 65.343488 mb / 0.065343488 gb
7000: 65425408 b / 65.425408 mb / 0.065425408 gb
8000: 65466368 b / 65.466368 mb / 0.065466368 gb
9000: 65466368 b / 65.466368 mb / 0.065466368 gb
10000: 65437696 b / 65.437696 mb / 0.065437696 gb
11000: 65650688 b / 65.650688 mb / 0.065650688 gb
12000: 65736704 b / 65.736704 mb / 0.065736704 gb
13000: 65937408 b / 65.937408 mb / 0.065937408 gb
14000: 66002944 b / 66.002944 mb / 0.066002944 gb
15000: 66142208 b / 66.142208 mb / 0.066142208 gb
16000: 66248704 b / 66.248704 mb / 0.066248704 gb
17000: 66306048 b / 66.306048 mb / 0.066306048 gb
18000: 66170880 b / 66.17088 mb / 0.06617088 gb
19000: 66187264 b / 66.187264 mb / 0.066187264 gb
20000: 66400256 b / 66.400256 mb / 0.066400256 gb
21000: 66506752 b / 66.506752 mb / 0.066506752 gb
22000: 66555904 b / 66.555904 mb / 0.066555904 gb
23000: 66670592 b / 66.670592 mb / 0.066670592 gb
24000: 66555904 b / 66.555904 mb / 0.066555904 gb
25000: 66572288 b / 66.572288 mb / 0.066572288 gb
26000: 66662400 b / 66.6624 mb / 0.0666624 gb
27000: 66711552 b / 66.711552 mb / 0.066711552 gb
28000: 66760704 b / 66.760704 mb / 0.066760704 gb
29000: 66809856 b / 66.809856 mb / 0.066809856 gb
30000: 66760704 b / 66.760704 mb / 0.066760704 gb
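
A follow-up check that might help localize this: rerun the same loop with each of pandas' built-in parser engines, since growth that appears for only one engine would point at that parser rather than at the shared DataFrame machinery. A rough sketch using the same testfile.csv:

import gc

import pandas as pd
import psutil

proc = psutil.Process()
filepath = "testfile.csv"

for engine in ("c", "python"):
    gc.collect()
    before = proc.memory_info().rss
    for _ in range(2000):
        pd.read_csv(filepath, engine=engine)
    gc.collect()
    after = proc.memory_info().rss
    # Growth seen only for one engine localizes the leak to that parser.
    print(engine, "growth MB:", (after - before) / 1e6)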

@mroeschke mroeschke added the Performance Memory or execution speed performance label Apr 17, 2024