
cugraph python graph symmetrization is prohibitively inefficient, needs optimization #3500

Open
Tracked by #3304
rlratzel opened this issue Apr 19, 2023 · 1 comment
Labels: Story (tracking issues at the story level; likely to exist within a single release)

rlratzel commented Apr 19, 2023

Who: cugraph users
What: improve memory efficiency when creating undirected graphs to reduce OOM errors
Why: undirected graphs require a symmetrization step that is currently prohibitively inefficient, resulting in OOM errors in cases where none should occur. This prevents users from running cugraph on inputs that, given the input size and GPU memory size, should fit (see the sketch below).
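
For context, a simplified sketch of what symmetrization does at the edge-list level (illustrative only; the actual cugraph implementation lives in its symmetrize utilities and differs in detail):

```python
import cudf

def symmetrize_edgelist(df, src="src", dst="dst"):
    # Illustrative sketch: an undirected graph stores each edge (u, v)
    # in both directions. Concatenating the reversed edge list roughly
    # doubles the dataframe's memory footprint before deduplication,
    # which is where the extra GPU memory pressure comes from.
    reversed_df = df.rename(columns={src: dst, dst: src})
    combined = cudf.concat([df, reversed_df], ignore_index=True)
    # Drop edges that the input already contained in both directions.
    return combined.drop_duplicates(subset=[src, dst]).reset_index(drop=True)
```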

rlratzel added the Story label on Apr 19, 2023
kelly-grizzle-sp commented:

Hi @rlratzel, do you have any idea when this might be fixed?

Some background: I am using dask cugraph to perform Louvain community detection on large graphs (300M to 1B edges). This was originally implemented on 22.04 and worked well on a multi-GPU system with 64GB of GPU memory. After upgrading to 22.12, we started getting OOM errors for graphs that previously succeeded. The OOM usually happens in compute_renumber_edge_list() when creating the NumberMap.
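
For reference, a minimal sketch of the workflow (the data path is illustrative and cluster setup with LocalCUDACluster/Client is elided; the column names match the traceback below):

```python
import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph

# Load the edge list as a dask_cudf dataframe (path is illustrative).
edges_df = dask_cudf.read_parquet("/data/edges.parquet")

# Creating an undirected graph triggers symmetrization and, as of 22.12,
# eager renumbering; this is where the OOM occurs.
graph = cugraph.Graph(directed=False)
graph.from_dask_cudf_edgelist(
    edges_df, source="from", destination="to", edge_attr="weight", renumber=True
)

# Multi-GPU Louvain community detection.
parts, modularity = dask_cugraph.louvain(graph)
```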

After some investigation, I believe this is due to the graph symmetrization added in #2247, which grows the edge dataframes. Shortly after that change was merged, another PR (#2394) made renumbering mandatory and stored the PLC (pylibcugraph) graph in simpleDistributedGraphImpl.__from_edgelist.

The following traceback shows where the OOM occurs:

  File "/louvain/louvain.py", line 37, in detect_communities
    graph.from_dask_cudf_edgelist(edges_df, source="from", destination="to", edge_attr="weight", renumber=True)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_classes.py", line 299, in from_dask_cudf_edgelist
    self._Impl._simpleDistributedGraphImpl__from_edgelist(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 236, in __from_edgelist
    self.compute_renumber_edge_list(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/graph_implementation/simpleDistributedGraph.py", line 937, in compute_renumber_edge_list
    ) = NumberMap.renumber_and_segment(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/structure/number_map.py", line 595, in renumber_and_segment
    data = get_distributed_data(df)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/dask/common/input_utils.py", line 244, in get_distributed_data
    data = DistributedDataHandler.create(data=ddf)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/cugraph/dask/common/input_utils.py", line 106, in create
    gpu_futures = client.sync(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 339, in sync
    return sync(
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 406, in sync
    raise exc.with_traceback(tb)
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/distributed/utils.py", line 379, in f
    result = yield future
  File "/opt/conda/envs/rapids/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
RuntimeError: coroutine raised StopIteration
