Leaks memory when input is not a numpy array #201

Closed
batterseapower opened this issue Jan 2, 2019 · 15 comments

Comments

@batterseapower

If you run the following program you can see that nansum leaks all the memory it is given when passed a Pandas object. If it is passed the ndarray underlying the Pandas object instead, there is no leak:

import gc
import os

import bottleneck
import numpy as np
import pandas as pd
import psutil

def f():
    x = np.zeros(10*1024*1024, dtype='f4')

    # Leaks 40MB/iteration
    bottleneck.nansum(pd.Series(x))
    # No leak:
    #bottleneck.nansum(x)

process = psutil.Process(os.getpid())

def _get_usage():
    gc.collect()
    # .private is Windows-only; use .rss on Linux/macOS
    return process.memory_info().private / (1024*1024)

last_usage = _get_usage()
print(last_usage)

for _ in range(10):
    f()
    usage = _get_usage()
    print(usage - last_usage)
    last_usage = usage

This affects not just nansum, but apparently all the reduction functions (with or without axis specified), and at least some other functions like move_max.

I'm not completely sure why this happens, but maybe it's because PyArray_FROM_O allocates a new array in this case and its reference count is never decremented by anyone? https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/src/reduce_template.c#L1237
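
(To illustrate what I mean, here is just a sketch of the pattern I suspect, with a made-up wrapper name rather than the actual template code:)

#include <Python.h>
#include <numpy/arrayobject.h>

/* PyArray_FROM_O returns a NEW reference. If the input is not already an
 * ndarray, a fresh array is allocated to hold the converted data, and
 * unless Py_DECREF is called on it before returning, that copy is never
 * freed. */
static PyObject *
leaky_nansum(PyObject *self, PyObject *obj)   /* hypothetical wrapper */
{
    PyObject *a = PyArray_FROM_O(obj);
    if (a == NULL) {
        return NULL;
    }
    double total = 0.0;
    /* ... iterate over the data of `a` and accumulate into `total` ... */

    /* BUG: returning without Py_DECREF(a); when `obj` was a pandas Series,
     * a list, etc., the freshly allocated copy leaks. */
    return PyFloat_FromDouble(total);
}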

I'm using Bottleneck 1.2.1 with Pandas 0.23.1. sys.version is 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)].

@kwgoodman
Collaborator

Is there a memory leak if instead of pd.Series(x) you use x.tolist()?

@batterseapower
Author

Yes. In fact, it leaks twice as much memory for some reason.

@kwgoodman
Collaborator

The memory leak is a big find. Thank you!

@batterseapower
Author

Yeah, I'm very happy I finally worked out why my Jupyter sessions have required a weekly restart for the last 6 months :-)

kwgoodman changed the title from "Leaks memory when given Pandas objects" to "Leaks memory when input is not a numpy array" on Jan 4, 2019
@kwgoodman
Collaborator

kwgoodman commented Jan 4, 2019

@batterseapower could you try the leak_201 branch? In it I think I fixed the memory leak (but only for reduce functions). Your code doesn't run for me (maybe because I am on py2.7?). If you could check both numpy-array and non-numpy inputs, that would be great. My quick checks (watching htop) looked good.

kwgoodman added a commit that referenced this issue Jan 4, 2019
@kwgoodman
Collaborator

@shoyer thanks for the suggestion. Did I implement it correctly? This change will touch every function, so I am wondering whether I should make it right before a release. A second set of eyes will give me confidence.

@batterseapower
Author

You might not be able to run the code if you aren't on Windows: I tried the sample on my Mac and it looks like the private member is not available there (I guess you can use rss as a replacement).

Anyway, your fix seems to have worked: neither Pandas nor numpy objects leak (and neither do Python lists from .tolist()). Thanks for the quick response!

@kwgoodman
Collaborator

kwgoodman commented Jan 4, 2019

@shoyer what if the input is a numpy array and an error occurs after the Py_INCREF but before the function returns? That would be a memory leak whereas the original fix would not leak. (Edit: Oh, the original fix would leak if an error occurred with a non-numpy array)
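
(To spell out the scenario, a rough sketch, not the real template code; some_check_fails is just a placeholder for any error path:)

PyObject *a;
if (PyArray_Check(obj)) {
    a = obj;
    Py_INCREF(a);                  /* we now own a reference */
} else {
    a = PyArray_FROM_O(obj);       /* new reference */
    if (a == NULL) {
        return NULL;
    }
}
if (some_check_fails(a)) {         /* placeholder: any failure after this point */
    return NULL;                   /* leak: we still own a reference to a */
}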

@batterseapower thanks for the checks. I'm on linux so will try the rss.

@kwgoodman
Collaborator

OK, so here is the proposed fix for the memory leak: https://github.com/kwgoodman/bottleneck/compare/leak_201

Comments welcome. If it all looks good then I will apply the fix to the other functions (nonreduce, moving window, etc.).

kwgoodman added a commit that referenced this issue Jan 4, 2019
kwgoodman added a commit that referenced this issue Jan 4, 2019
@shoyer
Member

shoyer commented Jan 4, 2019

@kwgoodman could you kindly open a pull request so I can comment inline?

@shoyer
Member

shoyer commented Jan 4, 2019

what if the input is a numpy array and an error occurs after the Py_INCREF but before the function returns? That would be a memory leak whereas the original fix would not leak.

Yes, you should call Py_DECREF or Py_XDECREF before returning.

A common style you'll see in NumPy is to use goto for error handling, e.g.,

    PyObject *result = NULL;
    PyObject *x;
    x = something_else();
    if (x == NULL) {
        result = NULL;
        goto cleanup;
    }
    result = other_stuff(x);
cleanup:
    Py_XDECREF(x);
    return result;

@kwgoodman
Collaborator

I had forgotten about Py_XDECREF(a). I see that Py_DECREF(a) is faster but will crash if a is NULL. So it's a good thing, @batterseapower, that you suggested removing the Py_DECREF for the case where a is NULL.
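
Something like this, then (a simplified sketch combining the suggestions above, not the actual diff in the branch):

static PyObject *
reduce_one(PyObject *self, PyObject *obj)   /* hypothetical wrapper */
{
    PyObject *result = NULL;
    PyObject *a = PyArray_FROM_O(obj);      /* new reference, even for ndarray input */
    if (a == NULL) {
        goto cleanup;
    }
    /* ... do the reduction; on any error, leave result NULL and goto cleanup ... */
    result = PyFloat_FromDouble(0.0);       /* stand-in for the real result */
cleanup:
    Py_XDECREF(a);                          /* safe even when a is NULL */
    return result;
}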

Thank you both for the review. I'll make the changes to the rest of the functions.

If anyone thinks I shouldn't include these changes right before a release, let me know.

@shoyer
Member

shoyer commented Jan 4, 2019

I would suggest opening the pull request from your branch first. Then I can review the changes for one function before you do it for everything :)

kwgoodman added a commit that referenced this issue Jan 4, 2019
kwgoodman added a commit that referenced this issue Jan 5, 2019
@kwgoodman
Collaborator

OK, I merged the memory leak fix into master.

@tensionhead

Sorry, it's actually fine. Just in case someone stumbles over this again, I'll add this here:
I underestimated how much memory np.sum and np.nansum use temporarily. Here is a profile of both sum operations, with either only numpy arrays or a mix of one array and one h5py.Dataset, like np.sum([arr, dset]). A single array/dataset is 256 MB, and we always create/operate on two of those:

[Screenshot from 2022-11-16: memory profile of the two sum operations]
