
Add new all-pairs similarity algorithm #4158

Merged

Conversation

ChuckHastings (Collaborator):

Added a new entry point for similarity that combines the functionality of k_hop_nbrs and similarity.

This entry point allows us to compute similarity for all pairs of vertices in the graph in a single call. We also add the optional parameter topk which, if specified, returns only the vertex pairs with the highest scores. If topk is specified on an all-pairs call, we compute the scores in batches and extract the top k as we go, keeping the memory footprint low.

This PR also addresses a FIXME in the C++ similarity test. That test was written before we had a k_hop_nbrs call, so it contained inefficient code to compute two-hop neighbors; it has been refactored to use k_hop_nbrs.

Supersedes PR #4134
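
For a rough picture of how the new entry point is intended to be used, here is a sketch of an all-pairs call with topk. The function name, template parameters, header paths, and argument order below are assumptions modeled on cuGraph's existing per-pair similarity API and may not match the merged signature exactly.

#include <cugraph/algorithms.hpp>    // header paths are approximate
#include <cugraph/graph_view.hpp>
#include <raft/core/device_span.hpp>
#include <raft/core/handle.hpp>

#include <optional>

template <typename vertex_t, typename edge_t, typename weight_t, bool multi_gpu>
auto top100_jaccard_pairs(
  raft::handle_t const& handle,
  cugraph::graph_view_t<vertex_t, edge_t, false, multi_gpu> const& graph_view)
{
  // No seed list: consider pairs drawn from the two-hop neighborhoods of all vertices.
  std::optional<raft::device_span<vertex_t const>> vertices{std::nullopt};

  // Keep only the 100 highest-scoring pairs; scores are computed in batches and only the
  // running top-k is retained, so the full all-pairs result is never materialized at once.
  std::optional<size_t> topk{100};

  // Returns a tuple of device vectors: (first vertex, second vertex, similarity score).
  return cugraph::jaccard_all_pairs_coefficients<vertex_t, edge_t, weight_t, multi_gpu>(
    handle, graph_view, std::nullopt /* unweighted */, vertices, topk);
}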

@ChuckHastings added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels on Feb 8, 2024
@ChuckHastings ChuckHastings marked this pull request as ready for review February 8, 2024 22:03
@ChuckHastings ChuckHastings requested review from a team as code owners February 8, 2024 22:03
@ChuckHastings ChuckHastings self-assigned this Feb 8, 2024
/**
* @brief Compute Jaccard all pairs similarity coefficient
*
* Similarity is computed for all pairs of vertices. If the vertices
Contributor:

the vertices variable => @p vertices to make sure that vertices here refers to the input parameter.

Collaborator Author:

Addressed in next push

* of these seeds. If the vertices variable is not specified it will be
* all pairs of all two hop neighbors.
*
* If topk is specified only the top scoring vertex pairs will be returned,
Contributor:

only the top scoring => only the top @p topk scoring?

Collaborator Author:

Addressed in next push

* @param topk optional specification of how many of the top scoring vertex pairs should be
* returned
* @param do_expensive_check A flag to run expensive checks for input arguments (if set to `true`).
* @return tuple containing the tuples (t1, t2, similarity score)
Contributor:

Have we explained what t1 and t2 are?

Collaborator Author:

Tried a new description in the latest push.
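
One possible phrasing for that description (a hedged suggestion only; not necessarily the wording that was actually pushed):

 * @return A tuple of three device vectors (v1, v2, score), where v1[i] and v2[i] are the
 *         two vertices of the i-th returned pair and score[i] is the similarity coefficient
 *         computed for that pair.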

* all pairs of all two hop neighbors.
*
* If topk is specified only the top scoring vertex pairs will be returned,
* if not specified then all vertex pairs will be returned.
Contributor:

Are we returning all vertex pairs (for a graph with V vertices, V^2 pairs), or only the pairs derived from the two-hop neighbors of the entire vertex set?

The current wording can be read as the first interpretation.

Contributor:

And should we warn more explicitly that the return value size can be prohibitive for large graphs if @p topk is not specified?

Collaborator Author:

Addressed in next push

/**
* @brief Perform All-Pairs Jaccard similarity computation
*
* Compute the similarity for all vertex pairs derived from an optional specified
Contributor:

derived from "the two-hop neighbors of" an optional specified ... ?

Collaborator Author:

Addressed in next push


// Let's compute the maximum size of the 2-hop neighborhood of each vertex
// FIXME: If sources is specified, this could be done on a subset of the vertices
//
Contributor:

Yeah... so we need to update per_v_transform_reduce_incoming|outgoing_e to (optionally) take a vertex frontier.

size_t current_pos{0};
size_t next_pos{0};

while (true) {
Contributor:

We may use https://github.com/rapidsai/cugraph/blob/branch-24.04/cpp/include/cugraph/utilities/misc_utils.cuh#L40

I think vertex_t and edge_t in template <typename vertex_t, typename edge_t> should be renamed, but we can get all the chunk boundaries at once with this function.

Collaborator Author:

Refactored to use this function in next push.
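
For reference, computing all of the chunk boundaries at once amounts to a sequence of binary searches into the offsets array. Below is a host-side sketch of that idea; the compute_chunk_boundaries helper is hypothetical and only illustrative, while the actual cuGraph utility works on device memory via thrust.

#include <algorithm>
#include <cstddef>
#include <vector>

// Given an offsets array of size num_seeds + 1 (the exclusive scan of per-seed element
// counts) and an approximate chunk size, return the seed indices at which each chunk
// ends, so that every chunk covers roughly approx_chunk_size elements.
std::vector<size_t> compute_chunk_boundaries(std::vector<size_t> const& offsets,
                                             size_t approx_chunk_size)
{
  size_t total      = offsets.back();
  size_t num_chunks = (total + approx_chunk_size - 1) / approx_chunk_size;
  if (num_chunks == 0) { num_chunks = 1; }

  std::vector<size_t> boundaries(num_chunks + 1);
  boundaries.front() = 0;
  for (size_t i = 1; i < num_chunks; ++i) {
    // First seed whose starting offset reaches the i-th chunk's target position.
    boundaries[i] =
      std::lower_bound(offsets.begin(), offsets.end(), i * approx_chunk_size) - offsets.begin();
  }
  boundaries.back() = offsets.size() - 1;  // one past the last seed
  return boundaries;
}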

Comment on lines 244 to 252
// We can reduce memory footprint by doing work in batches and
// computing/updating topk with each batch
rmm::device_uvector<vertex_t> top_v1(0, handle.get_stream());
rmm::device_uvector<vertex_t> top_v2(0, handle.get_stream());
rmm::device_uvector<weight_t> top_score(0, handle.get_stream());

top_v1.reserve(*topk, handle.get_stream());
top_v2.reserve(*topk, handle.get_stream());
top_score.reserve(*topk, handle.get_stream());
Contributor:

We can defer defining/allocating these variables till right before the beginning of the while loop to minimize the variable scopes.

Collaborator Author:

Fixed in next push

Comment on lines 374 to 378
thrust::sort_by_key(handle.get_thrust_policy(),
score.begin(),
score.end(),
thrust::make_zip_iterator(v1.begin(), v2.begin()),
thrust::greater<weight_t>{});
Contributor:

If top_v1.size() >= topk (in multi-GPU, the aggregate top_v1.size() >= topk), we already know the minimum score a pair must have to be included in the top k, so we can run thrust::remove_if first to avoid sorting a large array.

Collaborator Author:

Added logic to use remove_if here in next push.
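
A minimal sketch of that filtering step, assuming the threshold is the smallest score currently held in the running top-k (variable names follow the snippet above; this is illustrative, not the exact code that was pushed):

#include <cstddef>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/tuple.h>

// A candidate pair can never enter the top-k if its score is below the smallest score
// already present in the top-k.
struct below_threshold {
  float threshold;
  __host__ __device__ bool operator()(thrust::tuple<int, int, float> const& t) const
  {
    return thrust::get<2>(t) < threshold;
  }
};

// v1, v2, score hold one batch of candidate pairs; threshold is only meaningful once at
// least topk results have been accumulated.
void filter_then_sort_batch(thrust::device_vector<int>& v1,
                            thrust::device_vector<int>& v2,
                            thrust::device_vector<float>& score,
                            float threshold)
{
  auto pair_first = thrust::make_zip_iterator(v1.begin(), v2.begin(), score.begin());

  // Drop hopeless candidates first so the (expensive) sort runs on a smaller array.
  auto pair_last =
    thrust::remove_if(pair_first, pair_first + score.size(), below_threshold{threshold});
  auto new_size = static_cast<size_t>(thrust::distance(pair_first, pair_last));
  v1.resize(new_size);
  v2.resize(new_size);
  score.resize(new_size);

  // Sort the survivors by descending score, carrying the vertex pairs along.
  thrust::sort_by_key(score.begin(),
                      score.end(),
                      thrust::make_zip_iterator(v1.begin(), v2.begin()),
                      thrust::greater<float>{});
}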

thrust::copy(handle.get_thrust_policy(),
score.begin(),
score.begin() + top_v1.size(),
top_score.begin());
Contributor:

In multi-GPU, shouldn't we return topk pairs in aggregate?

Collaborator Author:

Yes. I thought I had written that logic. It was probably in my first attempt that I abandoned and not here. I will add it here.

Collaborator Author:

Back in next push.

@seunghwak (Contributor) left a comment:

Looks good aside from the comments below about code cosmetics.

* a score of 0.
*
* If @p vertices is specified we will compute similarity on two hop
* neighbors the @p vertices. If @p vertices is not specified it will
Contributor:

neighbors the @p vertices=>neighbors "of" @p vertices?

edge_t num_edges,
offset_t const* offsets,
data_t num_offsets,
offset_t num_elements,
Contributor:

Should we use raw pointers, or better, switch to device_span?

And we need to double-check whether the offsets array size is num_vertices or num_vertices + 1 (I assume it should be num_vertices + 1). In that case, the offsets array size here is really num_offsets + 1, which is a bit misleading.

Contributor:

What about something like

template <typename major_idx_t, typename minor_idx_t>
std::tuple<std::vector<major_idx_t>, std::vector<minor_idx_t>> compute_offset_aligned_compressed_sparse_array_chunks(
  raft::handle_t const& handle,
  raft::device_span<minor_idx_t> offsets,  // num_majors == offsets.size() - 1
  minor_idx_t num_minors,
  size_t approx_minor_chunk_size)

Collaborator Author:

Pushing an update that makes the input a device_span. I named it a bit differently than you suggested. My use of the function seems (at least to me) less like major/minor, so I left things as data_t and offset_t, but I tweaked the names to get rid of the notion of vertex and edge. Let me know what you think.

Contributor:

So, here we are partitioning/chunking a compressed sparse array, right?

In the case of the all-pairs similarity algorithm, we have seeds and their two-hop degree offsets (used to index the array that stores each seed's two-hop neighbors).

offset_t sounds right, but my concern with data_t or element_t is that it is not clear whether it indexes the seeds or the two-hop neighbors.

I'm not sure what the right name is for the seeds/vertices when chunking a compressed sparse array.

Maybe just use vertex_t, as seeds are still vertices?

thrust::exclusive_scan(handle.get_thrust_policy(),
two_hop_degrees.begin(),
two_hop_degrees.end(),
two_hop_degrees.begin());
Contributor:

auto two_hop_degree_offsets = std::move(two_hop_degrees)? two_hop_degrees is a misnomer from this point.

Collaborator Author:

OK, I added this. It isn't used much from here on, but the move is essentially free and perhaps a bit clearer.
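
For illustration, the change boils down to the snippet below, a host-side sketch with std::exclusive_scan standing in for the thrust call (the actual code operates on an rmm::device_uvector):

#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

int main()
{
  // Per-seed two-hop degrees, e.g. seed 0 has 3 two-hop neighbors, seed 1 has 5, ...
  std::vector<std::size_t> two_hop_degrees{3, 5, 2, 4};

  // In-place exclusive scan: the buffer now holds each seed's starting offset {0, 3, 8, 10}.
  std::exclusive_scan(
    two_hop_degrees.begin(), two_hop_degrees.end(), two_hop_degrees.begin(), std::size_t{0});

  // Essentially free rename so that later code reads as offsets rather than degrees.
  auto two_hop_degree_offsets = std::move(two_hop_degrees);
  return two_hop_degree_offsets.empty() ? 1 : 0;
}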


edge_partition.dcs_nzd_vertices()
? (*segment_offsets)[detail::num_sparse_segments_per_vertex_partition] +
*(edge_partition.dcs_nzd_vertex_count())
: edge_partition.major_range_size())},
Contributor:

Shouldn't we add + 1 to the size of the device_span?

Collaborator Author:

fixed

(*renumber_map_label_offsets).data(),
static_cast<size_t>((*renumber_map_label_offsets).size() - 1),
raft::device_span<size_t const>{(*renumber_map_label_offsets).data(),
(*renumber_map_label_offsets).size() - 1},
Contributor:

Shouldn't we remove -1 here?

Collaborator Author:

Fixed

@ChuckHastings (Collaborator Author):

/merge

@rapids-bot rapids-bot bot merged commit 5b5061e into rapidsai:branch-24.04 Mar 7, 2024
137 checks passed
@naimnv (Contributor) left a comment:

Sorry that I missed reviewing this interesting PR.

Labels: CMake, cuGraph, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)