
Add new all-pairs similarity algorithm #4158

Merged

Conversation

ChuckHastings (Collaborator):

Added a new entry point for similarity that combines the functionality of k_hop_nbrs and similarity.

This entry point allows us to compute similarity for all pairs of vertices in the graph in a single call. We also add the optional parameter topk which, if specified, returns only the vertex pairs with the highest scores. If topk is specified on an all-pairs call, we compute the scores in batches and extract the top k as we go, keeping the memory footprint low.

This PR also addresses a FIXME in the C++ similarity test. That test was written before we had a k_hop_nbrs call, so it contained inefficient code to compute two-hop neighbors; it has been refactored to use k_hop_nbrs.

Supersedes PR #4134
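
For a rough picture of how the new entry point is intended to be used, here is a sketch of an all-pairs call with topk. The function name, template parameters, header paths, and argument order below are assumptions modeled on cuGraph's existing per-pair similarity API and may not match the merged signature exactly.

#include <cugraph/algorithms.hpp>    // header paths are approximate
#include <cugraph/graph_view.hpp>
#include <raft/core/device_span.hpp>
#include <raft/core/handle.hpp>

#include <optional>

template <typename vertex_t, typename edge_t, typename weight_t, bool multi_gpu>
auto top100_jaccard_pairs(
  raft::handle_t const& handle,
  cugraph::graph_view_t<vertex_t, edge_t, false, multi_gpu> const& graph_view)
{
  // No seed list: consider pairs drawn from the two-hop neighborhoods of all vertices.
  std::optional<raft::device_span<vertex_t const>> vertices{std::nullopt};

  // Keep only the 100 highest-scoring pairs; scores are computed in batches and only the
  // running top-k is retained, so the full all-pairs result is never materialized at once.
  std::optional<size_t> topk{100};

  // Returns a tuple of device vectors: (first vertex, second vertex, similarity score).
  return cugraph::jaccard_all_pairs_coefficients<vertex_t, edge_t, weight_t, multi_gpu>(
    handle, graph_view, std::nullopt /* unweighted */, vertices, topk);
}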

@ChuckHastings added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on Feb 8, 2024
@ChuckHastings ChuckHastings marked this pull request as ready for review February 8, 2024 22:03
@ChuckHastings ChuckHastings requested review from a team as code owners February 8, 2024 22:03
@ChuckHastings ChuckHastings self-assigned this Feb 8, 2024
/**
* @brief Compute Jaccard all pairs similarity coefficient
*
* Similarity is computed for all pairs of vertices. If the vertices
Contributor:

the vertices variable => @p vertices to make sure that vertices here refers to the input parameter.

Collaborator Author:

Addressed in next push

* of these seeds. If the vertices variable is not specified it will be
* all pairs of all two hop neighbors.
*
* If topk is specified only the top scoring vertex pairs will be returned,
Contributor:

only the top scoring => only the top @p topk scoring?

Collaborator Author:

Addressed in next push

* @param topk optional specification of how many of the top scoring vertex pairs should be
* returned
* @param do_expensive_check A flag to run expensive checks for input arguments (if set to `true`).
* @return tuple containing the tuples (t1, t2, similarity score)
Contributor:

Have we explained what t1 and t2 are?

Collaborator Author:

Tried a new description in the latest push.
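
One possible phrasing for that description (a hedged suggestion only; not necessarily the wording that was actually pushed):

 * @return A tuple of three device vectors (v1, v2, score), where v1[i] and v2[i] are the
 *         two vertices of the i-th returned pair and score[i] is the similarity coefficient
 *         computed for that pair.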

* all pairs of all two hop neighbors.
*
* If topk is specified only the top scoring vertex pairs will be returned,
* if not specified then all vertex pairs will be returned.
Contributor:

Are we returning all vertex pairs (for a graph with V vertices, V^2 pairs), or only the pairs derived from the two-hop neighbors of the entire vertex set?

The current wording can be read as the first interpretation.

Contributor:

And should we warn more explicitly that the return value size can be prohibitive for large graphs if @p topk is not specified?

Collaborator Author:

Addressed in next push

/**
* @brief Perform All-Pairs Jaccard similarity computation
*
* Compute the similarity for all vertex pairs derived from an optional specified
Contributor:

derived from "the two-hop neighbors of" an optional specified ... ?

Collaborator Author:

Addressed in next push


// Let's compute the maximum size of the 2-hop neighborhood of each vertex
// FIXME: If sources is specified, this could be done on a subset of the vertices
//
Contributor:

Yeah... so we need to update per_v_transform_reduce_incoming|outgoing_e to (optionally) take a vertex frontier.

size_t current_pos{0};
size_t next_pos{0};

while (true) {
Contributor:

We may use https://github.com/rapidsai/cugraph/blob/branch-24.04/cpp/include/cugraph/utilities/misc_utils.cuh#L40

I think vertex_t and edge_t in template <typename vertex_t, typename edge_t> should be renamed, but we can get all the chunk boundaries at once with this function.

Collaborator Author:

Refactored to use this function in next push.
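
For reference, computing all of the chunk boundaries at once amounts to a sequence of binary searches into the offsets array. Below is a host-side sketch of that idea; the compute_chunk_boundaries helper is hypothetical and only illustrative, while the actual cuGraph utility works on device memory via thrust.

#include <algorithm>
#include <cstddef>
#include <vector>

// Given an offsets array of size num_seeds + 1 (the exclusive scan of per-seed element
// counts) and an approximate chunk size, return the seed indices at which each chunk
// ends, so that every chunk covers roughly approx_chunk_size elements.
std::vector<size_t> compute_chunk_boundaries(std::vector<size_t> const& offsets,
                                             size_t approx_chunk_size)
{
  size_t total      = offsets.back();
  size_t num_chunks = (total + approx_chunk_size - 1) / approx_chunk_size;
  if (num_chunks == 0) { num_chunks = 1; }

  std::vector<size_t> boundaries(num_chunks + 1);
  boundaries.front() = 0;
  for (size_t i = 1; i < num_chunks; ++i) {
    // First seed whose starting offset reaches the i-th chunk's target position.
    boundaries[i] =
      std::lower_bound(offsets.begin(), offsets.end(), i * approx_chunk_size) - offsets.begin();
  }
  boundaries.back() = offsets.size() - 1;  // one past the last seed
  return boundaries;
}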

Comment on lines 244 to 252
// We can reduce memory footprint by doing work in batches and
// computing/updating topk with each batch
rmm::device_uvector<vertex_t> top_v1(0, handle.get_stream());
rmm::device_uvector<vertex_t> top_v2(0, handle.get_stream());
rmm::device_uvector<weight_t> top_score(0, handle.get_stream());

top_v1.reserve(*topk, handle.get_stream());
top_v2.reserve(*topk, handle.get_stream());
top_score.reserve(*topk, handle.get_stream());
Contributor:

We can defer defining/allocating these variables till right before the beginning of the while loop to minimize the variable scopes.

Collaborator Author:

Fixed in next push

Comment on lines 374 to 378
thrust::sort_by_key(handle.get_thrust_policy(),
score.begin(),
score.end(),
thrust::make_zip_iterator(v1.begin(), v2.begin()),
thrust::greater<weight_t>{});
Contributor:

If top_v1.size() >= topk (in multi-GPU, the aggregate top_v1.size() >= topk), we already know the minimum score a pair must have to be included in the top k, so we can run thrust::remove_if first to avoid sorting a large array.

Collaborator Author:

Added logic to use remove_if here in next push.
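
A minimal sketch of that filtering step, assuming the threshold is the smallest score currently held in the running top-k (variable names follow the snippet above; this is illustrative, not the exact code that was pushed):

#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// A candidate pair can never enter the top-k if its score is below the smallest score
// already present in the top-k.
struct below_threshold {
  float threshold;
  __host__ __device__ bool operator()(thrust::tuple<int, int, float> const& t) const
  {
    return thrust::get<2>(t) < threshold;
  }
};

// v1, v2, score hold one batch of candidate pairs; threshold is only meaningful once at
// least topk results have been accumulated.
void filter_then_sort_batch(thrust::device_vector<int>& v1,
                            thrust::device_vector<int>& v2,
                            thrust::device_vector<float>& score,
                            float threshold)
{
  auto pair_first = thrust::make_zip_iterator(v1.begin(), v2.begin(), score.begin());

  // Drop hopeless candidates first so the (expensive) sort runs on a smaller array.
  auto pair_last =
    thrust::remove_if(pair_first, pair_first + score.size(), below_threshold{threshold});
  auto new_size = static_cast<size_t>(thrust::distance(pair_first, pair_last));
  v1.resize(new_size);
  v2.resize(new_size);
  score.resize(new_size);

  // Sort the survivors by descending score, carrying the vertex pairs along.
  thrust::sort_by_key(score.begin(),
                      score.end(),
                      thrust::make_zip_iterator(v1.begin(), v2.begin()),
                      thrust::greater<float>{});
}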

thrust::copy(handle.get_thrust_policy(),
score.begin(),
score.begin() + top_v1.size(),
top_score.begin());
Contributor:

In multi-GPU, shouldn't we return topk pairs in aggregate?

Collaborator Author:

Yes. I thought I had written that logic. It was probably in my first attempt that I abandoned and not here. I will add it here.

Collaborator Author:

Back in next push.

@seunghwak (Contributor) left a comment:

Looks good aside from the comments below about code cosmetics.

* a score of 0.
*
* If @p vertices is specified we will compute similarity on two hop
* neighbors the @p vertices. If @p vertices is not specified it will
Contributor:

neighbors the @p vertices=>neighbors "of" @p vertices?

edge_t num_edges,
offset_t const* offsets,
data_t num_offsets,
offset_t num_elements,
Contributor:

Should we use raw pointers, or better, switch to device_span?

And we need to double-check whether the offsets array size is num_vertices or num_vertices + 1 (I assume it should be num_vertices + 1). In that case, the offsets array size here is really num_offsets + 1, which is a bit misleading.

Contributor:

What about something like

template <typename major_idx_t, typename minor_idx_t>
std::tuple<std::vector<major_idx_t>, std::vector<minor_idx_t>> compute_offset_aligned_compressed_sparse_array_chunks(
  raft::handle_t const& handle,
  raft::device_span<minor_idx_t> offsets,  // num_majors == offsets.size() - 1
  minor_idx_t num_minors,
  size_t approx_minor_chunk_size)

Collaborator Author:

Pushing an update that makes the input a device_span. I named it a bit differently than you suggested. My use of the function seems (at least to me) less like major/minor, so I left things as data_t and offset_t, but I tweaked the names to get rid of the notion of vertex and edge. Let me know what you think.

Contributor:

So, here we are partitioning/chunking a compressed sparse array, right?

In the case of the all-pairs similarity algorithm, we have seeds and their two-hop degree offsets (used to index the array that stores each seed's two-hop neighbors).

offset_t sounds right, but my concern with data_t or element_t is that it is not clear whether it indexes the seeds or the two-hop neighbors.

I'm not sure what the right name is for the seeds/vertices when chunking a compressed sparse array.

Maybe just use vertex_t, as seeds are still vertices?

thrust::exclusive_scan(handle.get_thrust_policy(),
two_hop_degrees.begin(),
two_hop_degrees.end(),
two_hop_degrees.begin());
Contributor:

auto two_hop_degree_offsets = std::move(two_hop_degrees)? two_hop_degrees is a misnomer from this point.

Collaborator Author:

OK, I added this. It isn't used much from here on, but the move is essentially free and perhaps a bit clearer.
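
For illustration, the change boils down to the snippet below, a host-side sketch with std::exclusive_scan standing in for the thrust call (the actual code operates on an rmm::device_uvector):

#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

int main()
{
  // Per-seed two-hop degrees, e.g. seed 0 has 3 two-hop neighbors, seed 1 has 5, ...
  std::vector<std::size_t> two_hop_degrees{3, 5, 2, 4};

  // In-place exclusive scan: the buffer now holds each seed's starting offset {0, 3, 8, 10}.
  std::exclusive_scan(
    two_hop_degrees.begin(), two_hop_degrees.end(), two_hop_degrees.begin(), std::size_t{0});

  // Essentially free rename so that later code reads as offsets rather than degrees.
  auto two_hop_degree_offsets = std::move(two_hop_degrees);
  return two_hop_degree_offsets.empty() ? 1 : 0;
}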


edge_partition.dcs_nzd_vertices()
? (*segment_offsets)[detail::num_sparse_segments_per_vertex_partition] +
*(edge_partition.dcs_nzd_vertex_count())
: edge_partition.major_range_size())},
Contributor:

Shouldn't we add + 1 to the size of the device_span?

Collaborator Author:

fixed

(*renumber_map_label_offsets).data(),
static_cast<size_t>((*renumber_map_label_offsets).size() - 1),
raft::device_span<size_t const>{(*renumber_map_label_offsets).data(),
(*renumber_map_label_offsets).size() - 1},
Contributor:

Shouldn't we remove -1 here?

Collaborator Author:

Fixed

@ChuckHastings (Collaborator Author):

/merge

@rapids-bot rapids-bot bot merged commit 5b5061e into rapidsai:branch-24.04 Mar 7, 2024
137 checks passed
@naimnv (Contributor) left a comment:

Sorry that I missed reviewing this interesting PR.

Labels: CMake, cuGraph, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)