Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Memory leak when mdarray is used in a file that is used by cython/python interface #740

Closed
rg20 opened this issue Jul 13, 2022 · 3 comments · Fixed by #764
Closed
Assignees
Labels
bug Something isn't working

Comments

@rg20
Copy link

rg20 commented Jul 13, 2022

Describe the bug
When using mdarray in C++ code alone, it works as expected. However, when using the python interface, we get memory issues such as:

  • Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
  • munmap_chunk(): invalid pointer, Aborted (core dumped)
  • corrupted double-linked list. Aborted (core dumped)

Steps/Code to reproduce bug
A PR in cuOpt that reproduces the bug is here: https://github.com/rapidsai/cuopt/pull/605. In this PR, mdarray is declared as a member variable in a class and not at all used (null op)

Expected behavior
The C++ version is working just fine. The mdarray (in this case) is not part of the cython or python class, so it should work fine.

Environment details (please complete the following information):
cuOpt needs to be installed to reproduce the bug. There may be other projects (like cuGraph) that can be used to reproduce the bug.

Additional context
Add any other context about the problem here.

@rg20 rg20 added the bug Something isn't working label Jul 13, 2022
@afender
Copy link
Member

afender commented Jul 14, 2022

We are getting close to release and impacted authors will go on PTO soon. Unless this is a quick fix that can happen this week, cuOpt should fall back on rmm for 22.08.

@tfeher
Copy link
Contributor

tfeher commented Jul 27, 2022

Note that cuML had the same issue: segfault during the destruction of objects, if we were wrapping a class with an mdarray class member (integration PR that was failing befor the workaround was implemented rapidsai/cuml#4823).

Here is the ivf_flat object in RAFT that had mdarray member
https://github.com/achirkin/raft/blob/7f640a98217eca04bfc8c1b87feb998232464697/cpp/include/raft/spatial/knn/ivf_flat_types.hpp#L89

The error appeared when we tried to run the following cuML unit test:

pytest -s -vv python/cuml/tests/test_nearest_neighbors.py::test_ann_distances_metrics[sqeuclidean-ivfflat]

This is the relevant part of the test to reproduce the issue:
https://github.com/rapidsai/cuml/blob/2ea114abbec2ceb8a90c0d2611c577dbd4d0adfe/python/cuml/tests/test_nearest_neighbors.py#L289-L297

    X, y = make_blobs(n_samples=500, centers=2,
                      n_features=128, random_state=0)

    cu_knn = cuKNN(algorithm="ivfflat", metric="sqeuclidean")
    cu_knn.fit(X)
    
    del cu_knn
    gc.collect()

What happens is during fitting we construct a new knnIndex object: https://github.com/rapidsai/cuml/blob/2ea114abbec2ceb8a90c0d2611c577dbd4d0adfe/python/cuml/neighbors/nearest_neighbors.pyx#L404-L405

This is later destroyed here https://github.com/rapidsai/cuml/blob/2ea114abbec2ceb8a90c0d2611c577dbd4d0adfe/python/cuml/neighbors/nearest_neighbors.pyx#L907

But the destruction order is possibly not correct when it is triggered from the Python side. In contrast, everything works if we are just using the object from C++.

The current workround is to replace the device_mdarrays with device_uvectors achirkin@8b26750

@tfeher
Copy link
Contributor

tfeher commented Jul 27, 2022

This looks like a similar issue: rapidsai/rmm#931

rapids-bot bot pushed a commit that referenced this issue Jul 28, 2022
Close #740 .

Found by @achirkin , the size of `mapping` is non-deterministic. This is caused by the macro definition around whether `no_unique_address` is supported with nvcc. This PR uses the standard cpp version to define the macro regardless of the compiler being used.

Authors:
  - Jiaming Yuan (https://github.com/trivialfis)

Approvers:
  - Mark Hoemmen (https://github.com/mhoemmen)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #764
@rapids-bot rapids-bot bot closed this as completed in #764 Jul 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants