Describe the bug.
The three algorithm calls (Jaccard, Overlap, Sorensen) are encountering out-of-memory (OOM) errors when called inside cugraph/benchmarks/cugraph/pytest-based/bench_algos.py.
Minimum reproducible example
pytest --import-mode=append -v bench_algos.py
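For context, the sketch below exercises the same call path outside pytest. It is only a hedged approximation of what the benchmark does: the edge-list path, column names, and cluster sizing are placeholder assumptions, while the failing run builds an rmat_mg_20_16 (RMAT scale 20, edge factor 16) multi-GPU graph through the benchmark fixtures. The relevant point is that dask_cugraph.jaccard with no vertex_pair argument first calls G.get_two_hop_neighbors(), which is where the failure in the log below originates.

```python
# Hedged standalone sketch; the edge-list path and column names are
# placeholders, not the benchmark's rmat_mg_20_16 fixture.
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

import cugraph
import cugraph.dask as dask_cugraph
from cugraph.dask.comms import comms as Comms

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # one worker per visible GPU
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Placeholder edge list; the failing run uses an RMAT scale-20,
    # edge-factor-16 multi-GPU graph instead.
    edges = dask_cudf.read_csv(
        "edges.csv", names=["src", "dst"], dtype=["int64", "int64"]
    )
    G = cugraph.Graph(directed=False)
    G.from_dask_cudf_edgelist(edges, source="src", destination="dst")

    # With no vertex_pair argument, jaccard first calls
    # G.get_two_hop_neighbors(); that is where the std::bad_alloc /
    # out_of_memory error is raised in the log below.
    print(dask_cugraph.jaccard(G).head())

    Comms.destroy()
    client.close()
    cluster.close()
```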
Relevant log output
06/28/24-11:18:48.121397660_UTC>>>> NODE 0: ******** STARTING BENCHMARK FROM: ./bench_algos.py::bench_jaccard, using 8 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/benchmarks
configfile: pytest.ini
plugins: cov-5.0.0, benchmark-4.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 720 items / 719 deselected / 1 selected
bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True] [1719573540.937835] [rno1-m02-b07-dgx1-012:2540068:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.937835] [rno1-m02-b07-dgx1-012:2540068:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.938259] [rno1-m02-b07-dgx1-012:2540068:0] sock.c:470 UCX ERROR bind(fd=143 addr=0.0.0.0:38102) failed: Address already in use
[1719573540.936350] [rno1-m02-b07-dgx1-012:2540076:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.936350] [rno1-m02-b07-dgx1-012:2540076:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.936189] [rno1-m02-b07-dgx1-012:2540072:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.936189] [rno1-m02-b07-dgx1-012:2540072:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.935299] [rno1-m02-b07-dgx1-012:2540083:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.935299] [rno1-m02-b07-dgx1-012:2540083:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.936662] [rno1-m02-b07-dgx1-012:2540091:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.936662] [rno1-m02-b07-dgx1-012:2540091:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.935887] [rno1-m02-b07-dgx1-012:2540088:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.935887] [rno1-m02-b07-dgx1-012:2540088:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.932334] [rno1-m02-b07-dgx1-012:2540079:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.932334] [rno1-m02-b07-dgx1-012:2540079:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573540.937876] [rno1-m02-b07-dgx1-012:2540063:0] parser.c:2036 UCX WARN unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573540.937876] [rno1-m02-b07-dgx1-012:2540063:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-1bafe1bfa67fc2c637d3403830024c4d
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14fca5ffd070>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-a312245088c4f7bc20c6efd7fee56857
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14815013ddf0>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-abd0f84409465f528bca749e8e9d0f7e
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x1466dcfecb70>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,253 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-d1d386a168f6ca0c34a56daaf94a2831
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14caa966b930>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,314 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-5438dfeb9412a4bd2811bfdaaa092a34
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14bfec3c09b0>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,315 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-419d254dd61fb79c335a271bdc2d48be
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x150972dd36f0>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,315 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-847d9e6fe8f13c5357c4d58857183ffd
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14a18f4b5330>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
2024-06-28 04:19:18,317 - distributed.worker - WARNING - Compute Failed
Key: _call_plc_two_hop_neighbors-88f1c795207268843e983b567c0e2e32
Function: _call_plc_two_hop_neighbors
args: (b'7\x1a\xd3\x85h\xe0O-\x8b\xe9\xa1\x85!^J_', <pylibcugraph.graphs.MGGraph object at 0x14e48792a5b0>, None)
kwargs: {}
Exception: "RuntimeError('non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp')"
Dask client/cluster created using LocalCUDACluster
FAILED/opt/conda/lib/python3.10/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Not saving anything, no benchmarks have been run!
warner(PytestBenchmarkWarning(text))
Dask client closed.
=================================== FAILURES ===================================
_______________ bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True] _______________
gpubenchmark = <rapids_pytest_benchmark.plugin.GPUBenchmarkFixture object at 0x14a611538040>
unweighted_graph = <cugraph.structure.graph_classes.Graph object at 0x14a65fa8b0d0>
def bench_jaccard(gpubenchmark, unweighted_graph):
G = unweighted_graph
jaccard = dask_cugraph.jaccard if is_graph_distributed(G) else cugraph.jaccard
> gpubenchmark(jaccard, G)
bench_algos.py:333:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:125: in __call__
return self._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/rapids_pytest_benchmark/plugin.py:322: in _raw
function_result = super()._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:147: in _raw
duration, iterations, loops_range = self._calibrate_timer(runner)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:275: in _calibrate_timer
duration = runner(loops_range)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:90: in runner
function_to_benchmark(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/dask/link_prediction/jaccard.py:120: in jaccard
vertex_pair = input_graph.get_two_hop_neighbors()
/opt/conda/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py:948: in get_two_hop_neighbors
ddf = dask_cudf.from_delayed(cudf_result).persist()
/opt/conda/lib/python3.10/site-packages/dask_expr/io/_delayed.py:104: in from_delayed
meta = delayed(make_meta)(dfs[0]).compute()
/opt/conda/lib/python3.10/site-packages/dask/base.py:375: in compute
(result,) = compute(self, traverse=False, **kwargs)
/opt/conda/lib/python3.10/site-packages/dask/base.py:661: in compute
results = schedule(dsk, keys, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py:871: in _call_plc_two_hop_neighbors
results_ = pylibcugraph_get_two_hop_neighbors(
two_hop_neighbors.pyx:101: in pylibcugraph.two_hop_neighbors.get_two_hop_neighbors
???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>???
E RuntimeError: non-success value returned from two_hop_neighbors: CUGRAPH_UNKNOWN_ERROR std::bad_alloc: out_of_memory: CUDA error at: /opt/conda/include/rmm/mr/device/cuda_memory_resource.hpp
utils.pyx:53: RuntimeError
------------------------------ Captured log setup ------------------------------
INFO distributed.http.proxy:proxy.py:85 To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO distributed.scheduler:scheduler.py:1711 State start
INFO distributed.scheduler:scheduler.py:4072 Scheduler at: tcp://127.0.0.1:34703
INFO distributed.scheduler:scheduler.py:4087 dashboard at: http://127.0.0.1:8787/status
INFO distributed.scheduler:scheduler.py:7874 Registering Worker plugin shuffle
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:37923'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:44233'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:45047'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:36609'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:34709'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:38253'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:32781'
INFO distributed.nanny:nanny.py:368 Start Nanny at: 'tcp://127.0.0.1:34771'
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:41117', name: 7, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:41117
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49920
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:34081', name: 1, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:34081
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49924
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:38575', name: 6, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:38575
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49922
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:46873', name: 4, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:46873
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49930
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:45979', name: 0, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:45979
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49928
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:42671', name: 2, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:42671
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49932
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:44085', name: 3, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:44085
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49934
INFO distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:33685', name: 5, status: init, memory: 0, processing: 0>
INFO distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:33685
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49926
INFO distributed.scheduler:scheduler.py:5686 Receive client connection: Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:49936
INFO distributed.worker:worker.py:3178 Run out-of-band function'_func_set_scheduler_as_nccl_root'
INFO distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
INFO distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
---------------------------- Captured log teardown -----------------------------
INFO distributed.worker:worker.py:3178 Run out-of-band function'_func_destroy_scheduler_session'
INFO distributed.scheduler:scheduler.py:5730 Remove client Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49936; closing.
INFO distributed.scheduler:scheduler.py:5730 Remove client Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO distributed.scheduler:scheduler.py:5722 Close client connection: Client-37e66edd-3540-11ef-81dd-a81e84c35e35
INFO distributed.scheduler:scheduler.py:7317 Retire worker addresses (0, 1, 2, 3, 4, 5, 6, 7)
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:37923'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:44233'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:45047'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:36609'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:34709'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:38253'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:32781'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:34771'. Reason: nanny-close
INFO distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49928; closing.
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49924; closing.
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49932; closing.
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49934; closing.
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49930; closing.
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:45979', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.838292')
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:34081', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8387687')
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:42671', name: 2, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8392265')
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:44085', name: 3, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8396282')
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:46873', name: 4, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8400137')
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49926; closing.
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49922; closing.
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:33685', name: 5, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8412776')
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:38575', name: 6, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8414953')
INFO distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:49920; closing.
INFO distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:41117', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573559.8423226')
INFO distributed.scheduler:scheduler.py:5331 Lost all workers
INFO distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:34703 remote=tcp://127.0.0.1:49926>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
nbytes = yield coro
File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
raise CommClosedError()
distributed.comm.core.CommClosedError
INFO distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:34703 remote=tcp://127.0.0.1:49922>
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
nbytes = yield coro
File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
value = future.result()
File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
raise CommClosedError()
distributed.comm.core.CommClosedError
INFO distributed.scheduler:scheduler.py:4146 Scheduler closing due to unknown reason...
INFO distributed.scheduler:scheduler.py:4164 Scheduler closing all comms
=============================== warnings summary ===============================
cugraph/pytest-based/bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True]
/opt/conda/lib/python3.10/site-packages/cudf/core/reshape.py:350: FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
warnings.warn(
cugraph/pytest-based/bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True]
cugraph/pytest-based/bench_algos.py:330: PytestBenchmarkWarning: Benchmark fixture was not used at all in this test!
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /gpfs/fs1/projects/sw_rapids/users/rratzel/cugraph-results/latest/benchmarks/8-GPU/bench_jaccard/test_results.json -
=========================== short test summary info ============================
FAILED bench_algos.py::bench_jaccard[ds:rmat_mg_20_16-mm:False-pa:True] - Run...
================ 1 failed, 719 deselected, 2 warnings in 33.72s ================
06/28/24-11:19:23.452942759_UTC>>>> NODE 0: pytest exited with code: 1, run-py-tests.sh overall exit code is: 1
06/28/24-11:19:23.561210764_UTC>>>> NODE 0: remaining python processes: [ 2535211 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
06/28/24-11:19:23.586115056_UTC>>>> NODE 0: remaining dask processes: [ ]
Environment details
These can be reproduced on dgx18 and are typically run inside the nightly cugraph containers.
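When reproducing this, it can also help to record how much device memory each Dask-CUDA worker has free right before the similarity call. A small hedged helper sketch (it assumes cupy is importable on the workers, which it is in the RAPIDS containers):

```python
# Hedged diagnostic helper: report free/total device memory per Dask worker.
# Each worker queries its own current CUDA device via the runtime API.
from dask.distributed import Client


def per_worker_gpu_memory(client: Client):
    def _free_total():
        import cupy

        free, total = cupy.cuda.runtime.memGetInfo()
        return {"free_bytes": int(free), "total_bytes": int(total)}

    # Returns a dict keyed by worker address.
    return client.run(_free_total)
```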
Version
24.08
Which installation method(s) does this occur on?
Source
Other/Misc.
@nv-rliu is working on a fix.
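Until that fix lands, one possible mitigation (a sketch only, not the planned fix, and it changes what the benchmark measures) is to pass an explicit, bounded vertex_pair so the MG jaccard call skips the full get_two_hop_neighbors() expansion seen in the traceback. The edge_ddf argument and the "first"/"second" column names below are assumptions:

```python
# Hedged mitigation sketch, not the planned fix: score only a sampled set of
# vertex pairs instead of letting jaccard expand every two-hop pair of the
# rmat_mg_20_16 graph. `edge_ddf` is assumed to be the dask_cudf edge list
# the MG graph was built from.
import cugraph.dask as dask_cugraph


def jaccard_on_edge_sample(G, edge_ddf, frac=0.001, seed=42):
    pairs = (
        edge_ddf[["src", "dst"]]
        .sample(frac=frac, random_state=seed)
        .rename(columns={"src": "first", "dst": "second"})
        .compute()  # collected to a single cudf.DataFrame (assumed type)
    )
    # With vertex_pair supplied, jaccard does not call
    # G.get_two_hop_neighbors(), which is the call that OOMs above.
    return dask_cugraph.jaccard(G, vertex_pair=pairs)
```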
Code of Conduct