Leaks memory when input is not a numpy array #201

Closed
batterseapower opened this issue Jan 2, 2019 · 15 comments

Comments

@batterseapower

If you run the following program you can see that nansum leaks all the memory it is given when passed a Pandas object. If it is passed the ndarray underlying the Pandas object instead, there is no leak:

import gc
import os

import bottleneck
import numpy as np
import pandas as pd
import psutil

def f():
    x = np.zeros(10*1024*1024, dtype='f4')

    # Leaks 40MB/iteration
    bottleneck.nansum(pd.Series(x))
    # No leak:
    #bottleneck.nansum(x)

process = psutil.Process(os.getpid())

def _get_usage():
    gc.collect()
    # .private is Windows-only; use .rss on Linux/macOS
    return process.memory_info().private / (1024*1024)

last_usage = _get_usage()
print(last_usage)

for _ in range(10):
    f()
    usage = _get_usage()
    print(usage - last_usage)
    last_usage = usage

This affects not just nansum, but apparently all the reduction functions (with or without axis specified), and at least some other functions like move_max.

I'm not completely sure why this happens, but maybe it's because PyArray_FROM_O allocates a new array in this case and its reference count is never decremented by anyone? https://github.com/kwgoodman/bottleneck/blob/master/bottleneck/src/reduce_template.c#L1237
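
(To illustrate what I mean, here is just a sketch of the pattern I suspect, with a made-up wrapper name rather than the actual template code:)

#include <Python.h>
#include <numpy/arrayobject.h>

/* PyArray_FROM_O returns a NEW reference. If the input is not already an
 * ndarray, a fresh array is allocated to hold the converted data, and
 * unless Py_DECREF is called on it before returning, that copy is never
 * freed. */
static PyObject *
leaky_nansum(PyObject *self, PyObject *obj)   /* hypothetical wrapper */
{
    PyObject *a = PyArray_FROM_O(obj);
    if (a == NULL) {
        return NULL;
    }
    double total = 0.0;
    /* ... iterate over the data of `a` and accumulate into `total` ... */

    /* BUG: returning without Py_DECREF(a); when `obj` was a pandas Series,
     * a list, etc., the freshly allocated copy leaks. */
    return PyFloat_FromDouble(total);
}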

I'm using Bottleneck 1.2.1 with Pandas 0.23.1. sys.version is 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)].

@kwgoodman
Collaborator

Is there a memory leak if instead of pd.Series(x) you use x.tolist()?

@batterseapower
Author

Yes. In fact, it leaks twice as much memory for some reason.

@kwgoodman
Collaborator

The memory leak is a big find. Thank you!

@batterseapower
Author

Yeah, I'm very happy I finally worked out why my Jupyter sessions have required a weekly restart for the last 6 months :-)

kwgoodman changed the title from "Leaks memory when given Pandas objects" to "Leaks memory when input is not a numpy array" on Jan 4, 2019
@kwgoodman
Collaborator

kwgoodman commented Jan 4, 2019

@batterseapower could you try the leak_201 branch? In it I think I fixed the memory leak (but only for reduce functions). Your code doesn't run for me (maybe because I am on py2.7?). If you could check both numpy-array and non-numpy inputs, that would be great. My quick checks (watching htop) looked good.

kwgoodman added a commit that referenced this issue Jan 4, 2019
@kwgoodman
Collaborator

@shoyer thanks for the suggestion. Did I implement it correctly? This change will touch every function, so I am wondering whether I should make it right before a release. A second set of eyes will give me confidence.

@batterseapower
Author

You might not be able to run the code if you aren't on Windows: I tried the sample on my Mac and it looks like the private member is not available there (I guess you can use rss as a replacement).

Anyway, your fix seems to have worked: neither Pandas nor numpy objects leak (and neither do Python lists from .tolist()). Thanks for the quick response!

@kwgoodman
Collaborator

kwgoodman commented Jan 4, 2019

@shoyer what if the input is a numpy array and an error occurs after the Py_INCREF but before the function returns? That would be a memory leak whereas the original fix would not leak. (Edit: Oh, the original fix would leak if an error occurred with a non-numpy array)
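
(To spell out the scenario, a rough sketch, not the real template code; some_check_fails is just a placeholder for any error path:)

PyObject *a;
if (PyArray_Check(obj)) {
    a = obj;
    Py_INCREF(a);                  /* we now own a reference */
} else {
    a = PyArray_FROM_O(obj);       /* new reference */
    if (a == NULL) {
        return NULL;
    }
}
if (some_check_fails(a)) {         /* placeholder: any failure after this point */
    return NULL;                   /* leak: we still own a reference to a */
}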

@batterseapower thanks for the checks. I'm on linux so will try the rss.

@kwgoodman
Collaborator

OK, so here is the proposed fix for the memory leak: https://github.com/kwgoodman/bottleneck/compare/leak_201

Comments welcome. If it all looks good then I will apply the fix to the other functions (nonreduce, moving window, etc.).

kwgoodman added a commit that referenced this issue Jan 4, 2019
kwgoodman added a commit that referenced this issue Jan 4, 2019
@shoyer
Member

shoyer commented Jan 4, 2019

@kwgoodman could you kindly open a pull request so I can comment inline?

@shoyer
Member

shoyer commented Jan 4, 2019

what if the input is a numpy array and an error occurs after the Py_INCREF but before the function returns? That would be a memory leak whereas the original fix would not leak.

Yes, you should call Py_DECREF or Py_XDECREF before returning.

A common style you'll see in NumPy is to use goto for error handling, e.g.,

    PyObject *result = NULL;
    PyObject *x;
    x = something_else();
    if (x == NULL) {
        result = NULL;
        goto cleanup;
    }
    result = other_stuff(x);
cleanup:
    Py_XDECREF(x);
    return result;

@kwgoodman
Collaborator

I had forgotten about Py_XDECREF(a). I see that Py_DECREF(a) is faster but will crash if a is NULL. So it's a good thing, @batterseapower, that you suggested removing the Py_DECREF for the case where a is NULL.
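
Something like this, then (a simplified sketch combining the suggestions above, not the actual diff in the branch):

static PyObject *
reduce_one(PyObject *self, PyObject *obj)   /* hypothetical wrapper */
{
    PyObject *result = NULL;
    PyObject *a = PyArray_FROM_O(obj);      /* new reference, even for ndarray input */
    if (a == NULL) {
        goto cleanup;
    }
    /* ... do the reduction; on any error, leave result NULL and goto cleanup ... */
    result = PyFloat_FromDouble(0.0);       /* stand-in for the real result */
cleanup:
    Py_XDECREF(a);                          /* safe even when a is NULL */
    return result;
}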

Thank you both for the review. I'll make the changes to the rest of the functions.

If anyone thinks I shouldn't include these changes right before a release, let me know.

@shoyer
Member

shoyer commented Jan 4, 2019

I would suggest opening the pull request from your branch first. Then I can review the changes for one function before you do it for everything :)

kwgoodman added a commit that referenced this issue Jan 4, 2019
kwgoodman added a commit that referenced this issue Jan 5, 2019
@kwgoodman
Collaborator

OK, I merged the memory leak fix into master.

@tensionhead

Sorry, it's actually fine. Just in case someone stumbles over this again, I'll add this here:
I underestimated how much memory np.sum and np.nansum use temporarily. Here is a profile of both sum operations, with either only numpy arrays or a mix of one array and one h5py.Dataset, like np.sum([arr, dset]). A single array/dataset is 256 MB, and we always create/operate on two of those:

[Screenshot from 2022-11-16: memory profile of the two sum operations]
