
MG get_two_hop_neighbors fails with OOM on certain start_vertices regardless of graph size #3746

Closed
rlratzel opened this issue Jul 26, 2023 · 0 comments · Fixed by #3758

rlratzel (Contributor) commented:
This failed for both a smaller Graph500 graph and even Karate Club on an MG system with 80+GB per GPU. A minimal repro script, to be run on a machine with more than one GPU, is attached. The output for Karate Club is shown below:

(cugraph_dev-23.08) user@machine ~ $ CUDA_VISIBLE_DEVICES=0,1 python repro.py
<Client: 'tcp://127.0.0.1:34469' processes=2 threads=2, memory=503.24 GiB>
   src  dst
0    1    0
1    2    0
2    3    0
3    4    0
4    5    0
getting 2-hop neighbors for vertex=0...
getting 2-hop neighbors for vertex=1...
2023-07-26 17:16:43,222 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-6405c9228f50b400e0280aa31a604b98
Function:  _call_plc_two_hop_neighbors
args:      (b'\x84\xef"\xa1\xde\xfaA\x12\x8el\xc7\xb0\x92p\x1b\x95', <pylibcugraph.graphs.MGGraph object at 0x7f15981aaa70>, 0    1
dtype: int32)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /home/user/miniconda3/envs/cugraph_dev-23.08/include/rmm/mr/device/cuda_memory_resource.hpp')"

Attached: repro.py script

The above was also attempted using the MG APIs in a single-GPU environment (using CUDA_VISIBLE_DEVICES=0), but that did not reproduce the issue, so it appears to require more than one GPU.
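For reference, here is a rough sketch of what a minimal MG repro along these lines could look like. This is not the attached repro.py: the cluster setup, the Karate Club CSV path, and the get_two_hop_neighbors call are assumptions based on the usual cugraph.dask workflow, and exact names may differ.

```python
# Hypothetical sketch of a minimal MG repro (not the attached repro.py).
# Assumes a dask_cuda LocalCUDACluster, cugraph.dask comms, and a Karate Club
# edge list on disk; adjust the path/format to whatever the real script uses.
import cudf
import dask_cudf
import cugraph
from cugraph.dask.comms import comms as Comms
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()          # one worker per visible GPU
    client = Client(cluster)
    Comms.initialize(p2p=True)            # NCCL comms required for MG algorithms
    print(client)

    # Karate Club edge list; "karate.csv" is a placeholder path
    edges = cudf.read_csv("karate.csv", sep=" ",
                          names=["src", "dst"], dtype=["int32", "int32"])
    ddf = dask_cudf.from_cudf(edges, npartitions=2)
    print(ddf.head())

    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst")

    # Query each vertex individually; in the log above the failure hits at vertex=1
    for v in range(G.number_of_vertices()):
        print(f"getting 2-hop neighbors for vertex={v}...")
        G.get_two_hop_neighbors(start_vertices=cudf.Series([v], dtype="int32"))

    Comms.destroy()
    client.close()
    cluster.close()
```

Running a script like this with CUDA_VISIBLE_DEVICES=0,1 should mirror the two-worker output shown above.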

rlratzel added the bug label on Jul 26, 2023
rapids-bot pushed a commit that referenced this issue on Jul 31, 2023:
A customer identified an issue while trying to run Jaccard: in MG calls they were seeing failed memory allocations.

Vertices were being shuffled incorrectly in the C API, so vertices were processed on the wrong GPU, resulting in out-of-bounds memory references.

Moved the shuffle to before renumbering, which places vertices on the proper GPU.

Closes #3746

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)

URL: #3758
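The shuffle-before-renumber ordering described in the fix above can be pictured with a small toy example. This is plain Python, not cuGraph's C API: owning_gpu() is a stand-in for the real hash-based vertex partitioning, and the point is only that hashing the internal (renumbered) id instead of the external id sends vertices to GPUs that do not hold their adjacency data, which is where the out-of-bounds references come from.

```python
# Toy illustration of the shuffle/renumber ordering described in the fix.
# NOT cuGraph code; owning_gpu() stands in for the partitioning hash that
# assigns each vertex to the GPU holding its adjacency data.
def owning_gpu(vertex_id: int, n_gpus: int) -> int:
    return hash(vertex_id) % n_gpus

start_vertices = [0, 1, 7, 33]   # external ids passed by the caller
n_gpus = 2

# Correct order: shuffle by the external id, then renumber locally per GPU.
correct = {g: [v for v in start_vertices if owning_gpu(v, n_gpus) == g]
           for g in range(n_gpus)}

# Incorrect order: renumber first, then shuffle by the internal id.
# The assignment no longer matches the GPU that owns each vertex.
renumber = {v: i for i, v in enumerate(sorted(start_vertices))}
incorrect = {g: [v for v in start_vertices if owning_gpu(renumber[v], n_gpus) == g]
             for g in range(n_gpus)}

print("shuffle-then-renumber:", correct)
print("renumber-then-shuffle:", incorrect)
```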