Jaccard, Sorensen, and Overlap OOM in bench_algos.py #4510

Closed
nv-rliu opened this issue Jun 28, 2024 · 0 comments · Fixed by #4524

nv-rliu commented Jun 28, 2024

Version

24.08

Which installation method(s) does this occur on?

Source

Describe the bug.

The three similarity algorithm calls (Jaccard, Overlap, Sorensen) encounter OOM errors when called inside cugraph/benchmarks/cugraph/pytest-based/bench_algos.py.
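
For context, this is the same call pattern visible in the traceback below. A minimal sketch of it outside the benchmark harness, assuming G is the multi-GPU Graph built by the rmat_mg_20_16 fixture (graph construction and Dask cluster/Comms setup are elided):

# Sketch only: `G` is assumed to be the MG Graph from the rmat_mg_20_16 fixture.
import cugraph.dask as dask_cugraph

# With no vertex_pair argument, jaccard first calls G.get_two_hop_neighbors()
# (jaccard.py:120 in the traceback); materializing every two-hop vertex pair of
# a scale-20, edge-factor-16 RMAT graph is what exhausts device memory.
result = dask_cugraph.jaccard(G)

# Passing an explicit, bounded vertex_pair dask_cudf DataFrame skips that
# enumeration entirely -- shown here only to illustrate where the memory goes,
# not as the intended benchmark behavior:
# result = dask_cugraph.jaccard(G, vertex_pair=bounded_pairs_ddf)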

Minimum reproducible example

pytest --import-mode=append -v bench_algos.py
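
For quicker iteration, collection can be narrowed to just the three affected benchmarks instead of the full 720-item suite. A minimal sketch using pytest's Python entry point; the -k expression assumes the Sorensen and Overlap tests follow the same naming pattern as bench_jaccard:

# narrow_run.py -- hypothetical helper, equivalent to adding -k on the command line
import sys
import pytest

sys.exit(pytest.main([
    "--import-mode=append",
    "-v",
    "-k", "jaccard or sorensen or overlap",  # substring match on collected test names
    "bench_algos.py",
]))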

Relevant log output

06/28/24-11:18:48.121397660_UTC>>>> NODE 0: ******** STARTING BENCHMARK FROM: ./bench_algos.py::bench_jaccard, using 8 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/benchmarks
configfile: pytest.ini
plugins: cov-5.0.0, benchmark-4.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 720 items / 719 deselected / 1 selected

bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True] [1719573540.937835] [rno1-m02-b07-dgx1-012:2540068:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.937835] [rno1-m02-b07-dgx1-012:2540068:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.938259] [rno1-m02-b07-dgx1-012:2540068:0]            sock.c:470  UCX  ERROR bind(fd=143 addr=0.0.0.0:38102) failed: Address already in use
[1719573540.936350] [rno1-m02-b07-dgx1-012:2540076:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.936350] [rno1-m02-b07-dgx1-012:2540076:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.936189] [rno1-m02-b07-dgx1-012:2540072:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.936189] [rno1-m02-b07-dgx1-012:2540072:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.935299] [rno1-m02-b07-dgx1-012:2540083:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.935299] [rno1-m02-b07-dgx1-012:2540083:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.936662] [rno1-m02-b07-dgx1-012:2540091:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.936662] [rno1-m02-b07-dgx1-012:2540091:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.935887] [rno1-m02-b07-dgx1-012:2540088:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.935887] [rno1-m02-b07-dgx1-012:2540088:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.932334] [rno1-m02-b07-dgx1-012:2540079:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.932334] [rno1-m02-b07-dgx1-012:2540079:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.937876] [rno1-m02-b07-dgx1-012:2540063:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.937876] [rno1-m02-b07-dgx1-012:2540063:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-1bafe1bfa67fc2c637d3403830024c4d
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14fca5ffd070>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-a312245088c4f7bc20c6efd7fee56857
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14815013ddf0>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-abd0f84409465f528bca749e8e9d0f7e
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x1466dcfecb70>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-d1d386a168f6ca0c34a56daaf94a2831
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14caa966b930>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,314 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-5438dfeb9412a4bd2811bfdaaa092a34
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14bfec3c09b0>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,315 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-419d254dd61fb79c335a271bdc2d48be
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x150972dd36f0>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,315 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-847d9e6fe8f13c5357c4d58857183ffd
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14a18f4b5330>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"

2024-06-28 04:19:18,317 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_two_hop_neighbors-88f1c795207268843e983b567c0e2e32
Function:  _call_plc_two_hop_neighbors
args:      (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14e48792a5b0>, None)
kwargs:    {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"


Dask client/cluster created using LocalCUDACluster
FAILED/opt/conda/lib/python3.10/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Not saving anything, no benchmarks have been run!
  warner(PytestBenchmarkWarning(text))

Dask client closed.


=================================== FAILURES ===================================
_______________ bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True] _______________

gpubenchmark = <rapids_pytest_benchmark.plugin.GPUBenchmarkFixture object at 0x14a611538040>
unweighted_graph = <cugraph.structure.graph_classes.Graph object at 0x14a65fa8b0d0>

    def bench_jaccard(gpubenchmark, unweighted_graph):
        G = unweighted_graph
        jaccard = dask_cugraph.jaccard if is_graph_distributed(G) else cugraph.jaccard
>       gpubenchmark(jaccard, G)

bench_algos.py:333: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:125: in __call__
    return self._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/rapids_pytest_benchmark/plugin.py:322: in _raw
    function_result = super()._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:147: in _raw
    duration, iterations, loops_range = self._calibrate_timer(runner)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:275: in _calibrate_timer
    duration = runner(loops_range)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:90: in runner
    function_to_benchmark(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/dask/link_prediction/jaccard.py:120: in jaccard
    vertex_pair = input_graph.get_two_hop_neighbors()
/opt/conda/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py:948: in get_two_hop_neighbors
    ddf = dask_cudf.from_delayed(cudf_result).persist()
/opt/conda/lib/python3.10/site-packages/dask_expr/io/_delayed.py:104: in from_delayed
    meta = delayed(make_meta)(dfs[0]).compute()
/opt/conda/lib/python3.10/site-packages/dask/base.py:375: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/opt/conda/lib/python3.10/site-packages/dask/base.py:661: in compute
    results = schedule(dsk, keys, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py:871: in _call_plc_two_hop_neighbors
    results_ = pylibcugraph_get_two_hop_neighbors(
two_hop_neighbors.pyx:101: in pylibcugraph.two_hop_neighbors.get_two_hop_neighbors
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   RuntimeError: non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp

utils.pyx:53: RuntimeError
------------------------------ Captured log setup ------------------------------
INFO     distributed.http.proxy:proxy.py:85 To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO     distributed.scheduler:scheduler.py:1711 State start
INFO     distributed.scheduler:scheduler.py:4072   Scheduler at:     tcp://127.0.0.1:34703
INFO     distributed.scheduler:scheduler.py:4087   dashboard at:  http://127.0.0.1:8787/status
INFO     distributed.scheduler:scheduler.py:7874 Registering Worker plugin shuffle
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:37923'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:44233'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:45047'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:36609'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:34709'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:38253'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:32781'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:34771'
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:41117', name: 7, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:41117
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49920
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:34081', name: 1, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:34081
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49924
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:38575', name: 6, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:38575
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49922
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:46873', name: 4, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:46873
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49930
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:45979', name: 0, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:45979
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49928
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:42671', name: 2, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:42671
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49932
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:44085', name: 3, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:44085
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49934
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:33685', name: 5, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:33685
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49926
INFO     distributed.scheduler:scheduler.py:5686 Receive client connection: Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49936
INFO     distributed.worker:worker.py:3178 Run out-of-band function '_func_set_scheduler_as_nccl_root'
INFO     distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
INFO     distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
---------------------------- Captured log teardown -----------------------------
INFO     distributed.worker:worker.py:3178 Run out-of-band function '_func_destroy_scheduler_session'
INFO     distributed.scheduler:scheduler.py:5730 Remove client Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49936; closing.
INFO     distributed.scheduler:scheduler.py:5730 Remove client Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO     distributed.scheduler:scheduler.py:5722 Close client connection: Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO     distributed.scheduler:scheduler.py:7317 Retire worker addresses (0, 1, 2, 3, 4, 5, 6, 7)
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:37923'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:44233'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:45047'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:36609'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:34709'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:38253'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:32781'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:34771'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49928; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49924; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49932; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49934; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49930; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:45979', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.838292')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:34081', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8387687')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:42671', name: 2, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8392265')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:44085', name: 3, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8396282')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:46873', name: 4, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8400137')
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49926; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49922; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:33685', name: 5, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8412776')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:38575', name: 6, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8414953')
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49920; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:41117', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8423226')
INFO     distributed.scheduler:scheduler.py:5331 Lost all workers
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:34703 remote=tcp://127.0.0.1:49926>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:34703 remote=tcp://127.0.0.1:49922>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.scheduler:scheduler.py:4146 Scheduler closing due to unknown reason...
INFO     distributed.scheduler:scheduler.py:4164 Scheduler closing all comms
=============================== warnings summary ===============================
cugraph/pytest-based/bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True]
  /opt/conda/lib/python3.10/site-packages/cudf/core/reshape.py:350: FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
    warnings.warn(

cugraph/pytest-based/bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True]
  cugraph/pytest-based/bench_algos.py:330: PytestBenchmarkWarning: Benchmark fixture was not used at all in this test!

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /gpfs/fs1/projects/sw_rapids/users/rratzel/cugraph-results/latest/benchmarks/8-GPU/bench_jaccard/test_results.json -
=========================== short test summary info ============================
FAILED bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True] - Run...
================ 1 failed, 719 deselected, 2 warnings in 33.72s ================
06/28/24-11:19:23.452942759_UTC>>>> NODE 0: pytest exited with code: 1, run-py-tests.sh overall exit code is: 1
06/28/24-11:19:23.561210764_UTC>>>> NODE 0: remaining python processes: [ 2535211 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
06/28/24-11:19:23.586115056_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

These errors can be reproduced on dgx18; the benchmarks are typically run inside the nightly cugraph containers.

Other/Misc.

@nv-rliu is working on a fix.
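
Not necessarily related to the eventual fix in #4524, but for anyone reproducing this locally: one knob that sometimes avoids this class of device OOM (at a performance cost) is enabling RMM managed memory on the benchmark's LocalCUDACluster, so oversized allocations can page to host. A minimal sketch, with an illustrative pool size that is not tuned for dgx18:

# Hypothetical cluster setup with RMM managed memory enabled; whether this is
# acceptable for the benchmarks (vs. fixing the two-hop enumeration itself) is
# a separate question from this issue.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster(
    rmm_managed_memory=True,  # allow device allocations to oversubscribe into host memory
    rmm_pool_size="24GB",     # illustrative, not tuned
)
client = Client(cluster)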

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@nv-rliu added the graph-devops and benchmarks labels on Jun 28, 2024
@nv-rliu added this to the 24.08 milestone on Jun 28, 2024
@nv-rliu self-assigned this on Jun 28, 2024
@rapids-bot closed this as completed in #4524 on Jul 9, 2024
@rapids-bot closed this as completed in 4464504 on Jul 9, 2024