-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new all-pairs similarity algorithm #4158
Add new all-pairs similarity algorithm #4158
Conversation
cpp/include/cugraph/algorithms.hpp
Outdated
/** | ||
* @brief Compute Jaccard all pairs similarity coefficient | ||
* | ||
* Similarity is computed for all pairs of vertices. If the vertices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the vertices variable => @p vertices to make sure that vertices here refers to the input parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in next push
cpp/include/cugraph/algorithms.hpp
Outdated
* of these seeds. If the vertices variable is not specified it will be | ||
* all pairs of all two hop neighbors. | ||
* | ||
* If topk is specified only the top scoring vertex pairs will be returned, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only the top scoring => only the top @p topk scoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in next push
cpp/include/cugraph/algorithms.hpp
Outdated
* @param topk optional specification of the how many of the top scoring vertex pairs should be | ||
* returned | ||
* @param do_expensive_check A flag to run expensive checks for input arguments (if set to `true`). | ||
* @return tuple containing the tuples (t1, t2, similarity score) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we explained what t1 and t2 are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried a new description in the latest push.
cpp/include/cugraph/algorithms.hpp
Outdated
* all pairs of all two hop neighbors. | ||
* | ||
* If topk is specified only the top scoring vertex pairs will be returned, | ||
* if not specified then all vertex pairs will be returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we returning for all vertex pairs (for a graph with V vertices, V^2 pairs) or the entire set of vertex pairs derived from two-neighbors of the entire set of vertices?
This can be mistaken as the first interpretation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And should we better warn that the return value size can be prohibited for large graphs if @p topk is not specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in next push
/** | ||
* @brief Perform All-Pairs Jaccard similarity computation | ||
* | ||
* Compute the similarity for all vertex pairs derived from an optional specified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
derived from "the two-hop neighbors of" an optional specified ... ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in next push
|
||
// Let's compute the maximum size of the 2-hop neighborhood of each vertex | ||
// FIXME: If sources is specified, this could be done on a subset of the vertices | ||
// |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... so we need to update per_v_transform_reduce_incoming|outgoing_e to (optionally) take a vertex frontier.
size_t current_pos{0}; | ||
size_t next_pos{0}; | ||
|
||
while (true) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think vertex_t and edge_t in (template <typename vertex_t, typename edge_t>) should better be renamed, but we can get all the chunk boundaries at once with this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored to use this function in next push.
// We can reduce memory footprint by doing work in batches and | ||
// computing/updating topk with each batch | ||
rmm::device_uvector<vertex_t> top_v1(0, handle.get_stream()); | ||
rmm::device_uvector<vertex_t> top_v2(0, handle.get_stream()); | ||
rmm::device_uvector<weight_t> top_score(0, handle.get_stream()); | ||
|
||
top_v1.reserve(*topk, handle.get_stream()); | ||
top_v2.reserve(*topk, handle.get_stream()); | ||
top_score.reserve(*topk, handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can defer defining/allocating these variables till right before the beginning of the while loop to minimize the variable scopes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next push
thrust::sort_by_key(handle.get_thrust_policy(), | ||
score.begin(), | ||
score.end(), | ||
thrust::make_zip_iterator(v1.begin(), v2.begin()), | ||
thrust::greater<weight_t>{}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case top_v1.size() >= topk (in case of multi-gpu, aggregate top_v1.size() >= topk), we know the minimum threshold to be included in topk, and we can run thrust::remove_if first to avoid sorting a large array.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added logic to use remove_if
here in next push.
thrust::copy(handle.get_thrust_policy(), | ||
score.begin(), | ||
score.begin() + top_v1.size(), | ||
top_score.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In multi-GPU, shouldn't we return topk pairs in aggregate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I thought I had written that logic. It was probably in my first attempt that I abandoned and not here. I will add it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Back in next push.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good beside the below comments about code cosmetics.
* a score of 0. | ||
* | ||
* If @p vertices is specified we will compute similarity on two hop | ||
* neighbors the @p vertices. If @p vertices is not specified it will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edge_t num_edges, | ||
offset_t const* offsets, | ||
data_t num_offsets, | ||
offset_t num_elements, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use raw pointers or better switch to device_span?
And we need to double check whether the offset array size is num_vertices or num_vertices + 1 (I assume it should be num_vertices + 1). In this case, offsets array size is now num_offsets + 1. I guess this is a bit misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like
template <typename major_idx_t, typename minor_idx_t>
std::tuple<std::vector<major_idx_t>, std::vector<minor_idx_t>> compute_offset_aligned_compressed_sparse_array_chunks(
raft::handle_t const& handle,
raft::device_span<minor_idx_t> offsets, // num_majors == offsets.size() - 1
minor_idx_t num_minors,
size_t approx_minor_chunk_size)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushing an update that makes the input a device_span. I named it a bit differently than you suggested. My use of the function seems (at least to me) less like major/minor, so I left things as data_t
and offset_t
, but I tweaked the names to get rid of the notion of vertex and edge. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, here we are partitioning/chunking a compressed sparse array, right?
In case of the all-pairs similarity algorithm, we have seeds and their two hop degree offsets (to index an array to store each seed's two-hop neighbors).
offset_t
sounds right, but my concern with data_t or element_t is that it is not clear whether this is to index the seeds or the two-hop neighbors.
Not sure what should be the right name for seeds or vertices in chunking compressed sparse array.
Maybe just use vertex_t as seeds are still vertices?
thrust::exclusive_scan(handle.get_thrust_policy(), | ||
two_hop_degrees.begin(), | ||
two_hop_degrees.end(), | ||
two_hop_degrees.begin()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto two_hop_degree_offsets = std::move(two_hop_degrees)? two_hop_degrees
is a misnomer from this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I added this. Don't use it very much from here, but it's essentially free to move and perhaps a bit clearer.
edge_t num_edges, | ||
offset_t const* offsets, | ||
data_t num_offsets, | ||
offset_t num_elements, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, here we are partitioning/chunking a compressed sparse array, right?
In case of the all-pairs similarity algorithm, we have seeds and their two hop degree offsets (to index an array to store each seed's two-hop neighbors).
offset_t
sounds right, but my concern with data_t or element_t is that it is not clear whether this is to index the seeds or the two-hop neighbors.
Not sure what should be the right name for seeds or vertices in chunking compressed sparse array.
Maybe just use vertex_t as seeds are still vertices?
edge_partition.dcs_nzd_vertices() | ||
? (*segment_offsets)[detail::num_sparse_segments_per_vertex_partition] + | ||
*(edge_partition.dcs_nzd_vertex_count()) | ||
: edge_partition.major_range_size())}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we add + 1 to the size of the device_span?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
(*renumber_map_label_offsets).data(), | ||
static_cast<size_t>((*renumber_map_label_offsets).size() - 1), | ||
raft::device_span<size_t const>{(*renumber_map_label_offsets).data(), | ||
(*renumber_map_label_offsets).size() - 1}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we remove -1 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
/merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that I missed reviewing this interesting PR.
Added a new entry point for similarity functionality that combines the functionality of k_hop_nbrs and similarity.
This entry point allows us to compute similarity for all pairs of vertices in the graph in a single call. We also add the optional parameter topk which, if specified, will only return the vertices that have the highest scores. If topk is specified on an all pairs call, we compute the scores for pairs in batches and extract the topk as we go along to keep the memory footprint low.
This PR also updates a FIXME in the C++ similarity test. The C++ similarity test had been written before we had a k_hop_nbrs call, so there was some inefficient test code to compute that. Now that we have a k_hop_nbrs call, the test code was refactored to use that call.
Supersedes PR #4134