Refactor python code for similarity algos to use latest CAPI #3828

Merged
merged 21 commits into from
Sep 21, 2023
Changes from 3 commits
38d917a
Refactor python code for similarity algos to use latest CAPI, get rid…
Aug 26, 2023
052e7bc
Remove legacy jaccard and overlap .cu files
Aug 26, 2023
28d422b
Pull changes from branch-23.10
Aug 26, 2023
2af9b90
Update CMakeLists.txt and fixes copyright issues
Aug 28, 2023
69a036f
Add test for sorensen
Aug 28, 2023
3b3c35b
Use variable for column names
Aug 28, 2023
c11e064
Add test coverage for sorensen_w
Aug 28, 2023
cbb414a
Update tests for sorensen and overlap, add more test coverage
Aug 28, 2023
0f15b97
Change test names
Aug 28, 2023
8cd7866
Update doc strings, keep APIs backward compatible
Aug 28, 2023
a0c4bad
checkout pre-commit-config.yaml from 23.10
Aug 28, 2023
555e632
stlye fix
Aug 28, 2023
0dc6faf
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into refac…
Aug 30, 2023
4ca2b0c
Update MG similarity tests
Aug 30, 2023
a7862d8
Fix doc test errors
naimnv Aug 30, 2023
34cdea1
Merge branch 'branch-23.10' into refactor_similarity_algos
naimnv Aug 30, 2023
09f1ac0
Address PR comments - update doc strings and tests, remove a few test…
naimnv Sep 19, 2023
1b3205a
Merge branch 'refactor_similarity_algos' of github.com:naimnv/cugraph…
naimnv Sep 19, 2023
7138a55
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into refac…
naimnv Sep 19, 2023
2252e7b
Merge branch 'branch-23.10' of github.com:rapidsai/cugraph into refac…
naimnv Sep 20, 2023
70c0b5b
Updates a few comments, warnings and doc-strings
naimnv Sep 20, 2023
16 changes: 16 additions & 0 deletions python/cugraph/cugraph/experimental/__init__.py
Expand Up @@ -51,3 +51,19 @@
from cugraph.gnn.data_loading import EXPERIMENTAL__BulkSampler

BulkSampler = experimental_warning_wrapper(EXPERIMENTAL__BulkSampler)


from cugraph.link_prediction.jaccard import jaccard, jaccard_coefficient

jaccard = promoted_experimental_warning_wrapper(jaccard)
jaccard_coefficient = promoted_experimental_warning_wrapper(jaccard_coefficient)

from cugraph.link_prediction.sorensen import sorensen, sorensen_coefficient

sorensen = promoted_experimental_warning_wrapper(sorensen)
sorensen_coefficient = promoted_experimental_warning_wrapper(sorensen_coefficient)

from cugraph.link_prediction.overlap import overlap, overlap_coefficient

overlap = promoted_experimental_warning_wrapper(overlap)
overlap_coefficient = promoted_experimental_warning_wrapper(overlap_coefficient)
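The hunk above re-exports the stable similarity functions through the experimental namespace, wrapped so that callers are warned. The diff does not show how `promoted_experimental_warning_wrapper` is implemented, so the following is only a minimal sketch of the general pattern. The names `promoted_warning_wrapper_sketch` and `jaccard_stub`, the warning category, and the message text are all assumptions made for illustration, not cuGraph's actual code.

```python
import functools
import warnings


def promoted_warning_wrapper_sketch(func):
    """Hypothetical stand-in: warn that `func` has left the experimental namespace."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Assumed category and wording; the real helper may differ.
        warnings.warn(
            f"{func.__name__} has been promoted out of the experimental "
            "namespace; import it from its stable location instead.",
            FutureWarning,
            stacklevel=2,
        )
        # Delegate to the stable implementation unchanged.
        return func(*args, **kwargs)

    return wrapper


def jaccard_stub(pairs):
    # Placeholder for a real algorithm; returns the pairs untouched.
    return list(pairs)


# Same shape as the assignments in the hunk: wrap, then re-export.
experimental_jaccard = promoted_warning_wrapper_sketch(jaccard_stub)
```

Calling `experimental_jaccard(...)` in this sketch returns exactly what `jaccard_stub(...)` returns and emits one warning per call, so old experimental imports keep working while pointing users to the stable path.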
87 changes: 52 additions & 35 deletions python/cugraph/cugraph/link_prediction/jaccard.py
Expand Up @@ -18,12 +18,23 @@
)
import cudf
import warnings
from typing import Union, Iterable

from pylibcugraph import (
jaccard_coefficients as pylibcugraph_jaccard_coefficients,
)
from pylibcugraph import ResourceHandle

from cugraph.structure import Graph
from cugraph.utilities.utils import import_optional

# FIXME: the networkx.Graph type used in the type annotation for
# jaccard_coefficient() is specified using a string literal to avoid depending on
# and importing networkx. Instead, networkx is imported optionally, which may
# cause a problem for a type checker if run in an environment where networkx is
# not installed.
networkx = import_optional("networkx")


# FIXME: Move this function to the utility module so that it can be
# shared by other algos
Expand All @@ -43,13 +54,12 @@ def ensure_valid_dtype(input_graph, vertex_pair):
return vertex_pair


# FIXME:
# 1. Be consistent with column names for output/result DataFrame
# 2. Enforce that 'vertex_pair' is a cudf Dataframe with only columns for vertex pairs
# 3. We need to add support for multi-column vertices


def jaccard(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=False):
def jaccard(
input_graph: Graph,
vertex_pair: cudf.DataFrame = None,
do_expensive_check: bool = False, # deprecated
use_weight: bool = False,
):
"""
Compute the Jaccard similarity between each pair of vertices connected by
an edge, or between arbitrary pairs of vertices specified by the user.
Expand Down Expand Up @@ -99,12 +109,11 @@ def jaccard(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=
----------
input_graph : cugraph.Graph
cuGraph Graph instance, should contain the connectivity information
as an edge list (edge weights are not supported yet for this algorithm). The
graph should be undirected where an undirected edge is represented by a
directed edge in both direction. The adjacency list will be computed if
not already present.
as an edge list. The graph should be undirected where an undirected
edge is represented by a directed edge in both directions. The adjacency
list will be computed if not already present.

This implementation only supports undirected, unweighted Graph.
This implementation only supports undirected, non-multi Graphs.

vertex_pair : cudf.DataFrame, optional (default=None)
A GPU dataframe consisting of two columns representing pairs of
Expand All @@ -114,12 +123,15 @@ def jaccard(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=
adjacent vertices in the graph.

do_expensive_check : bool, optional (default=False)
Deprecated, no longer needed
Deprecated.
Originally, when set to True, the jaccard implementation checked if
the vertices in the graph are (re)numbered from 0 to V-1 where
V is the total number of vertices.

use_weight : bool, optional (default=False)
Flag to indicate whether to compute weighted jaccard (if use_weight==True)
or un-weighted jaccard (if use_weight==False).
'input_graph' must be wighted if 'use_weight=True'.
'input_graph' must be weighted if 'use_weight=True'.


Returns
Expand Down Expand Up @@ -149,8 +161,9 @@ def jaccard(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=
"""
if do_expensive_check:
warnings.warn(
"do_expensive_check is deprecated since it is no longer needed",
DeprecationWarning,
"do_expensive_check is deprecated since vertex IDs are no longer "
"required to be consecutively numbered",
FutureWarning,
)

if input_graph.is_directed():
Expand Down Expand Up @@ -202,28 +215,38 @@ def jaccard(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=
return df


def jaccard_coefficient(G, ebunch=None, do_expensive_check=False):
def jaccard_coefficient(
G: Union[Graph, "networkx.Graph"],
ebunch: Union[cudf.DataFrame, Iterable[Union[int, str, float]]] = None,
do_expensive_check: bool = False, # deprecated
):
"""
For NetworkX Compatibility. See `jaccard`

Parameters
----------
G : cugraph.Graph
cuGraph Graph instance, should contain the connectivity information
as an edge list (edge weights are not supported yet for this algorithm). The
graph should be undirected where an undirected edge is represented by a
directed edge in both direction. The adjacency list will be computed if
not already present.
G : cugraph.Graph or NetworkX.Graph
cuGraph or NetworkX Graph instance, should contain the connectivity
information as an edge list. The graph should be undirected where an
undirected edge is represented by a directed edge in both direction.
The adjacency list will be computed if not already present.

This implementation only supports undirected, non-multi Graphs.

ebunch : cudf.DataFrame, optional (default=None)
ebunch : cudf.DataFrame or iterable of node pairs, optional (default=None)
A GPU dataframe consisting of two columns representing pairs of
vertices. If provided, the jaccard coefficient is computed for the
given vertex pairs. If the vertex_pair is not provided then the
current implementation computes the jaccard coefficient for all
adjacent vertices in the graph.
vertices or iterable of 2-tuples (u, v) where u and v are nodes in
the graph.

If provided, the jaccard coefficient is computed for the given vertex
pairs. Otherwise, the current implementation computes the jaccard
coefficient for all adjacent vertices in the graph.

do_expensive_check : bool, optional (default=False)
Deprecated, longer needed
Deprecated.
Originally, when set to True, the jaccard implementation checked if
the vertices in the graph are (re)numbered from 0 to V-1 where
V is the total number of vertices.

Returns
-------
Expand All @@ -250,12 +273,6 @@ def jaccard_coefficient(G, ebunch=None, do_expensive_check=False):
>>> df = jaccard_coefficient(G)

"""
if do_expensive_check:
warnings.warn(
"do_expensive_check is deprecated since it is no longer needed",
DeprecationWarning,
)

vertex_pair = None

G, isNx = ensure_cugraph_obj_for_nx(G)
Expand Down
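For readers checking the docstrings in jaccard.py against the expected numbers, here is a small pure-Python reference, offered as a sketch rather than as the library's implementation. It computes |N(u) ∩ N(v)| / |N(u) ∪ N(v)| over an undirected edge list and reuses the deprecation message from the diff. The function name `jaccard_reference`, the dictionary return type, and the choice to score each adjacent pair once are assumptions made for illustration; cuGraph itself works on GPU dataframes through pylibcugraph.

```python
import warnings


def jaccard_reference(edges, vertex_pairs=None, do_expensive_check=False):
    """Pure-Python reference: |N(u) & N(v)| / |N(u) | N(v)| per pair."""
    if do_expensive_check:
        # Mirrors the deprecation pattern in the diff; the flag does nothing.
        warnings.warn(
            "do_expensive_check is deprecated since vertex IDs are no longer "
            "required to be consecutively numbered",
            FutureWarning,
        )

    # Build symmetric adjacency sets: each undirected edge counts both ways.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    if vertex_pairs is None:
        # Default: score every adjacent pair once, ordered for determinism.
        vertex_pairs = sorted({(min(u, v), max(u, v)) for u, v in edges})

    scores = {}
    for u, v in vertex_pairs:
        nu, nv = neighbors.get(u, set()), neighbors.get(v, set())
        union = nu | nv
        # Two isolated vertices have an empty union; report 0.0 for them.
        scores[(u, v)] = len(nu & nv) / len(union) if union else 0.0
    return scores


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
scores = jaccard_reference(edges)
```

On the switch from `DeprecationWarning` to `FutureWarning`: Python's default filters hide `DeprecationWarning` unless it is triggered by code running in `__main__`, while `FutureWarning` is shown by default, so the new category is more likely to reach end users who call the function from their own modules.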
86 changes: 52 additions & 34 deletions python/cugraph/cugraph/link_prediction/overlap.py
Expand Up @@ -18,12 +18,23 @@
)
import cudf
import warnings
from typing import Union, Iterable

from pylibcugraph import (
overlap_coefficients as pylibcugraph_overlap_coefficients,
)
from pylibcugraph import ResourceHandle

from cugraph.structure import Graph
from cugraph.utilities.utils import import_optional

# FIXME: the networkx.Graph type used in the type annotation for
# overlap_coefficient() is specified using a string literal to avoid depending on
# and importing networkx. Instead, networkx is imported optionally, which may
# cause a problem for a type checker if run in an environment where networkx is
# not installed.
networkx = import_optional("networkx")


# FIXME: Move this function to the utility module so that it can be
# shared by other algos
Expand All @@ -43,28 +54,38 @@ def ensure_valid_dtype(input_graph, vertex_pair):
return vertex_pair


def overlap_coefficient(G, ebunch=None, do_expensive_check=False):
def overlap_coefficient(
G: Union[Graph, "networkx.Graph"],
ebunch: Union[cudf.DataFrame, Iterable[Union[int, str, float]]] = None,
do_expensive_check: bool = False, # deprecated
):
"""
For NetworkX Compatability. See `overlap`
Compute overlap coefficient.

Parameters
----------
G : cugraph.Graph
cuGraph Graph instance, should contain the connectivity information
as an edge list (edge weights are not supported yet for this algorithm). The
graph should be undirected where an undirected edge is represented by a
directed edge in both direction. The adjacency list will be computed if
not already present.
G : cugraph.Graph or NetworkX.Graph
cuGraph or NetworkX Graph instance, should contain the connectivity
information as an edge list. The graph should be undirected where an
undirected edge is represented by a directed edge in both direction.
The adjacency list will be computed if not already present.

ebunch : cudf.DataFrame, optional (default=None)
This implementation only supports undirected, non-multi-edge Graphs.

ebunch : cudf.DataFrame or iterable of node pairs, optional (default=None)
A GPU dataframe consisting of two columns representing pairs of
vertices. If provided, the Overlap coefficient is computed for the
given vertex pairs. If the vertex_pair is not provided then the
current implementation computes the overlap coefficient for all
adjacent vertices in the graph.
vertices or iterable of 2-tuples (u, v) where u and v are nodes in
the graph.

If provided, the Overlap coefficient is computed for the given vertex
pairs. Otherwise, the current implementation computes the overlap
coefficient for all adjacent vertices in the graph.

do_expensive_check : bool, optional (default=False)
Deprecated, no longer needed
Deprecated.
Originally, when set to True, the overlap implementation checked if
the vertices in the graph are (re)numbered from 0 to V-1 where
V is the total number of vertices.

Returns
-------
Expand All @@ -90,11 +111,6 @@ def overlap_coefficient(G, ebunch=None, do_expensive_check=False):
>>> G = karate.get_graph(download=True, ignore_weights=True)
>>> df = overlap_coefficient(G)
"""
if do_expensive_check:
warnings.warn(
"do_expensive_check is deprecated since it is no longer needed",
DeprecationWarning,
)
vertex_pair = None

G, isNx = ensure_cugraph_obj_for_nx(G)
Expand All @@ -114,13 +130,12 @@ def overlap_coefficient(G, ebunch=None, do_expensive_check=False):
return df


# FIXME:
# 1. Be consistent with column names for output/result DataFrame
# 2. Enforce that 'vertex_pair' is a cudf Dataframe with only columns for vertex pairs
# 3. We need to add support for multi-column vertices


def overlap(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=False):
def overlap(
input_graph: Graph,
vertex_pair: cudf.DataFrame = None,
do_expensive_check: bool = False, # deprecated
use_weight: bool = False,
):
"""
Compute the Overlap Coefficient between each pair of vertices connected by
an edge, or between arbitrary pairs of vertices specified by the user.
Expand All @@ -142,23 +157,25 @@ def overlap(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=
----------
input_graph : cugraph.Graph
cuGraph Graph instance, should contain the connectivity information
as an edge list (edge weights are not supported yet for this algorithm). The
adjacency list will be computed if not already present.

This implementation only supports undirected, unweighted Graph.
as an edge list. The adjacency list will be computed if not already
present.

This implementation only supports undirected, non-multi-edge Graphs.
vertex_pair : cudf.DataFrame, optional (default=None)
A GPU dataframe consisting of two columns representing pairs of
vertices. If provided, the overlap coefficient is computed for the
given vertex pairs, else, it is computed for all vertex pairs.

do_expensive_check : bool, optional (default=False)
Deprecated, no longer needed
Deprecated.
Originally, when set to True, the overlap implementation checked if
the vertices in the graph are (re)numbered from 0 to V-1 where
V is the total number of vertices.

use_weight : bool, optional (default=False)
Flag to indicate whether to compute weighted overlap (if use_weight==True)
or un-weighted overlap (if use_weight==False).
'input_graph' must be wighted if 'use_weight=True'.
'input_graph' must be weighted if 'use_weight=True'.



Expand Down Expand Up @@ -189,8 +206,9 @@ def overlap(input_graph, vertex_pair=None, do_expensive_check=False, use_weight=
"""
if do_expensive_check:
warnings.warn(
"do_expensive_check is deprecated since it is no longer needed",
DeprecationWarning,
"do_expensive_check is deprecated since vertex IDs are no longer "
"required to be consecutively numbered",
FutureWarning,
)

if input_graph.is_directed():
Expand Down
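The updated `overlap_coefficient` docstring accepts `ebunch` either as a two-column dataframe or as an iterable of 2-tuples. As a rough illustration of that contract, the sketch below splits an iterable of pairs into two parallel columns and then computes |N(u) ∩ N(v)| / min(|N(u)|, |N(v)|) for each pair. The helper names `normalize_pairs` and `overlap_reference`, the use of plain lists in place of cudf columns, and the 0.0 result for vertices without neighbors are assumptions for illustration, not what the cuGraph code in this PR does.

```python
def normalize_pairs(ebunch):
    """Split an iterable of (u, v) 2-tuples into two parallel column lists."""
    first, second = [], []
    for u, v in ebunch:
        first.append(u)
        second.append(v)
    return first, second


def overlap_reference(edges, ebunch):
    """Pure-Python reference: |N(u) & N(v)| / min(|N(u)|, |N(v)|)."""
    # Build symmetric adjacency sets from the undirected edge list.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    first, second = normalize_pairs(ebunch)
    scores = []
    for u, v in zip(first, second):
        nu, nv = neighbors.get(u, set()), neighbors.get(v, set())
        smaller = min(len(nu), len(nv))
        # Guard against division by zero when a vertex has no neighbors.
        scores.append(len(nu & nv) / smaller if smaller else 0.0)
    return first, second, scores


edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
first, second, scores = overlap_reference(edges, [(0, 1), (2, 3), (0, 3)])
```

Unlike the Jaccard reference, the denominator here is the smaller neighborhood, so a pair such as (0, 3), where one neighborhood is contained in the other, scores 1.0 even though the neighborhoods differ in size.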