
MG get_two_hop_neighbors fails with OOM on certain start_vertices regardless of graph size #3746

Closed
rlratzel opened this issue Jul 26, 2023 · 0 comments · Fixed by #3758

rlratzel (Contributor) commented:
This failed for both a smaller Graph500 graph and even Karate Club on an MG system with 80+GB per GPU. A minimal repro script, to be run on a machine with more than one GPU, is attached. The output for Karate Club is shown below:

(cugraph_dev-23.08) user@machine ~ $ CUDA_VISIBLE_DEVICES=0,1 python repro.py
<Client: 'tcp://127.0.0.1:34469' processes=2 threads=2, memory=503.24 GiB>
   src  dst
0    1    0
1    2    0
2    3    0
3    4    0
4    5    0
getting 2-hop neighbors for vertex=0...
getting 2-hop neighbors for vertex=1...
2023-07-26 17:16:43,222 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-6405c9228f50b400e0280aa31a604b98
Function:  _call_plc_two_hop_neighbors
args:      (b'\x84\xef"\xa1\xde\xfaA\x12\x8el\xc7\xb0\x92p\x1b\x95', <pylibcugraph.graphs.MGGraph object at 0x7f15981aaa70>, 0    1
dtype: int32)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /home/user/miniconda3/envs/cugraph_dev-23.08/include/rmm/mr/device/cuda_memory_resource.hpp')"

Attached: repro.py script

The above was also attempted using the MG APIs in a single-GPU environment (using CUDA_VISIBLE_DEVICES=0), but that did not reproduce the issue, so it appears to require more than one GPU.
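For reference, here is a rough sketch of what a minimal MG repro along these lines could look like. This is not the attached repro.py: the cluster setup, the Karate Club CSV path, and the get_two_hop_neighbors call are assumptions based on the usual cugraph.dask workflow, and exact names may differ.

```python
# Hypothetical sketch of a minimal MG repro (not the attached repro.py).
# Assumes a dask_cuda LocalCUDACluster, cugraph.dask comms, and a Karate Club
# edge list on disk; adjust the path/format to whatever the real script uses.
import cudf
import dask_cudf
import cugraph
from cugraph.dask.comms import comms as Comms
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

if __name__ == "__main__":
    cluster = LocalCUDACluster()          # one worker per visible GPU
    client = Client(cluster)
    Comms.initialize(p2p=True)            # NCCL comms required for MG algorithms
    print(client)

    # Karate Club edge list; "karate.csv" is a placeholder path
    edges = cudf.read_csv("karate.csv", sep=" ",
                          names=["src", "dst"], dtype=["int32", "int32"])
    ddf = dask_cudf.from_cudf(edges, npartitions=2)
    print(ddf.head())

    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst")

    # Query each vertex individually; in the log above the failure hits at vertex=1
    for v in range(G.number_of_vertices()):
        print(f"getting 2-hop neighbors for vertex={v}...")
        G.get_two_hop_neighbors(start_vertices=cudf.Series([v], dtype="int32"))

    Comms.destroy()
    client.close()
    cluster.close()
```

Running a script like this with CUDA_VISIBLE_DEVICES=0,1 should mirror the two-worker output shown above.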

rlratzel added the bug label on Jul 26, 2023
rapids-bot pushed a commit that referenced this issue on Jul 31, 2023:
A customer identified an issue while trying to run Jaccard: in MG calls they were seeing failed memory allocations.

Vertices were being shuffled incorrectly in the C API, so vertices were processed on the wrong GPU, resulting in out-of-bounds memory references.

Moved the shuffle to before renumbering, which places vertices on the proper GPU.

Closes #3746

Authors:
  - Chuck Hastings (https://github.com/ChuckHastings)

Approvers:
  - Seunghwa Kang (https://github.com/seunghwak)

URL: #3758
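The shuffle-before-renumber ordering described in the fix above can be pictured with a small toy example. This is plain Python, not cuGraph's C API: owning_gpu() is a stand-in for the real hash-based vertex partitioning, and the point is only that hashing the internal (renumbered) id instead of the external id sends vertices to GPUs that do not hold their adjacency data, which is where the out-of-bounds references come from.

```python
# Toy illustration of the shuffle/renumber ordering described in the fix.
# NOT cuGraph code; owning_gpu() stands in for the partitioning hash that
# assigns each vertex to the GPU holding its adjacency data.
def owning_gpu(vertex_id: int, n_gpus: int) -> int:
    return hash(vertex_id) % n_gpus

start_vertices = [0, 1, 7, 33]   # external ids passed by the caller
n_gpus = 2

# Correct order: shuffle by the external id, then renumber locally per GPU.
correct = {g: [v for v in start_vertices if owning_gpu(v, n_gpus) == g]
           for g in range(n_gpus)}

# Incorrect order: renumber first, then shuffle by the internal id.
# The assignment no longer matches the GPU that owns each vertex.
renumber = {v: i for i, v in enumerate(sorted(start_vertices))}
incorrect = {g: [v for v in start_vertices if owning_gpu(renumber[v], n_gpus) == g]
             for g in range(n_gpus)}

print("shuffle-then-renumber:", correct)
print("renumber-then-shuffle:", incorrect)
```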