
Refactor Uniform Neighborhood Sampling #2258

Conversation

@ChuckHastings (Collaborator) commented Apr 30, 2022

This PR will refactor the Uniform Neighborhood Sampling implementation to meet the new C API.

Major elements:

  • Moved old implementation details into cugraph::detail::original
  • Edge IDs will be passed in as the edge weight, to allow them to be controlled by the caller. The edge weight will be an integer type, but we will treat int32_t * as float * (or int64_t * as double *). The algorithms will be flagged so that they won't do computations on the weight if it is an edge ID (a minimal sketch follows this list)
  • Adding an SG implementation (only partially done as of the creation of this PR; it will be finished before this PR is ready for review)
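
As a rough illustration of the edge-id-as-weight bullet (a sketch only, not the PR's actual code; the function name is hypothetical), the edge-id buffer can be handed to the weight parameter by reinterpreting the pointer, relying on int32_t/float (or int64_t/double) having the same size and on the algorithm never computing on the values:

#include <cstddef>
#include <cstdint>

void run_sampling_with_edge_ids(std::int32_t* edge_ids, std::size_t num_edges)
{
  static_assert(sizeof(std::int32_t) == sizeof(float), "edge id and weight types must match in size");

  // Treat the edge-id array as the weight array; the sampling code is flagged so it only
  // carries these values through to the output and never does arithmetic on them.
  float* weights_as_edge_ids = reinterpret_cast<float*>(edge_ids);

  // ... pass weights_as_edge_ids wherever the sampling call expects edge weights ...
  (void)weights_as_edge_ids;
  (void)num_edges;
}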

@ChuckHastings requested review from a team as code owners on April 30, 2022 00:56
@ChuckHastings self-assigned this on Apr 30, 2022
@ChuckHastings added the labels 2 - In Progress, improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) on Apr 30, 2022
@ChuckHastings added this to the 22.06 milestone on Apr 30, 2022
@ChuckHastings changed the title from "[skip-ci] Refactor Uniform Neighborhood Sampling" to "Refactor Uniform Neighborhood Sampling" on May 11, 2022

@seunghwak (Contributor) left a comment:

Review Part 1.

* @param with_replacement boolean flag specifying if random sampling is done with replacement
* (true); or, without replacement (false); default = true;
* @return tuple of tuple of device vectors and counts:
* ((vertex_t source_vertex, vertex_t destination_vertex, int rank, edge_t index), rx_counts)

Contributor:

I guess this comment is an outdated copy-and-paste from the previous implementation. I assume we are returning a tuple of edge source, edge destination, and edge weight vectors (the last may actually be an edge ID at the moment?).

Collaborator (Author):

Fixed in next push

* @return tuple of tuple of device vectors and counts:
* ((vertex_t source_vertex, vertex_t destination_vertex, int rank, edge_t index), rx_counts)
*/
template <typename graph_view_t>

Contributor:

Yeah... and we are sort of mixing

template <typename graph_view_t> with using typename graph_view_t::vertex_type, ...
and
template <typename vertex_t, typename edge_t, typename weight_t, bool store_transposed, bool multi_gpu> with graph_view_t<vertex_t, edge_t, weight_t, store_transposed, multi_gpu>.

I think we'd better be consistent. Any preference for one over the other?
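
For reference, a minimal sketch of the two styles being compared (the graph_view_t stub and function names below are illustrative, not cugraph's actual declarations):

// Stand-in for cugraph's graph view; only the nested typedef matters for this sketch.
template <typename vertex_t, typename edge_t, typename weight_t, bool store_transposed, bool multi_gpu>
struct graph_view_t {
  using vertex_type = vertex_t;
};

// Style 1: a single template parameter; vertex/edge/weight types come from nested
// typedefs, so the signature survives changes to graph_view_t's parameter list.
template <typename graph_view_type>
void algorithm_style1(graph_view_type const& graph_view)
{
  using vertex_t = typename graph_view_type::vertex_type;  // used by the real implementation
  (void)graph_view;
}

// Style 2: spell out the parameters and instantiate graph_view_t explicitly; this only
// works with this specific graph view type, matching the fact that the implementation
// already depends on its member functions.
template <typename vertex_t, typename edge_t, typename weight_t, bool store_transposed, bool multi_gpu>
void algorithm_style2(graph_view_t<vertex_t, edge_t, weight_t, store_transposed, multi_gpu> const& graph_view)
{
  (void)graph_view;
}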

Collaborator (Author):

No strong preference for me.

There is, I think, an advantage to the template <typename graph_view_t> approach in that if we change the implementation of graph_view (adding or removing a template parameter), as long as typename graph_view_t::vertex_type is still defined, the API works without modification. I believe Andrei copied this from my Louvain definition, which uses this approach. I implemented Louvain this way so that I could support both the legacy graph and the graph_t with the same API.

But the syntax is a bit cleaner with your original approach. I don't think it's likely that we will frequently change the template signature of the API, and we will eventually get rid of the legacy graph class.

I'd be happy to change this back to your original approach, or if we like the template <typename graph_view_t> approach better I can add that to the list of things to gradually update in the code.

Contributor:

Yeah... I don't have a strong preference either, but I do have a strong preference for consistency.

I am also using the template <typename graph_view_t> approach for primitives, but I am wondering whether I should use graph_view_t<vertex_t, edge_t, weight_t, store_transposed, multi_gpu> instead.

I am getting more inclined toward the graph_view_t<vertex_t, edge_t, weight_t, store_transposed, multi_gpu> approach, as this code does not work for a general graph view type but only with our graph_view_t (e.g., the implementation depends on multiple member functions that exist only in graph_view_t).

And hopefully we can eliminate the legacy code sooner rather than later; at that point, I slightly prefer graph_view_t<vertex_t, edge_t, weight_t, store_transposed, multi_gpu>, even though this will have very minimal impact on the end-user experience.

Collaborator (Author):

Sounds good. I will make those changes in the next push. I will leave Louvain as it is for now; I plan to create a PR to add Louvain to the C API and will refactor the Louvain API in that PR.

uniform_nbr_sample(raft::handle_t const& handle,
graph_view_t const& graph_view,
raft::device_span<typename graph_view_t::vertex_type> d_starting_vertices,
raft::host_span<const int> h_fan_out,

Contributor:

I guess d_ and h_ here are a bit redundant (especially with device_span and host_span). Or we should use this naming convention in all the functions in the public API. My current practice is to use d_ and h_ only when we have both host and device vectors with the same name, but I'm open to discussion.

Contributor:

Yeah.... and this API is way more intuitive than the previous one!!!

Collaborator (Author):

I love how the span variants clean up the API. I'll drop the extra prefixes in the next push.
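
To make the suggestion concrete, here is a sketch of what the prefix-free, span-based declaration could look like (the return type is a guess based on the earlier discussion of returning source/destination/weight vectors, and the raft/rmm/std includes are omitted; this is not the merged code):

template <typename graph_view_t>
std::tuple<rmm::device_uvector<typename graph_view_t::vertex_type>,
           rmm::device_uvector<typename graph_view_t::vertex_type>,
           rmm::device_uvector<typename graph_view_t::weight_type>>
uniform_nbr_sample(raft::handle_t const& handle,
                   graph_view_t const& graph_view,
                   raft::device_span<typename graph_view_t::vertex_type> starting_vertices,
                   raft::host_span<const int> fan_out,
                   bool with_replacement = true);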

@@ -1536,6 +1537,32 @@ uniform_nbr_sample(raft::handle_t const& handle,
std::vector<int> const& h_fan_out,
bool with_replacement = true);

/**
* @brief Multi-GPU Uniform Neighborhood Sampling.

Contributor:

Is this really a Multi-GPU-only thing, or is it for both SG & MG?

Collaborator (Author):

Both. Updated the comment.

handle.get_stream());

return d_rx_vertices;
}

template <typename vertex_t>
rmm::device_uvector<vertex_t> shuffle_vertices_by_gpu_id(raft::handle_t const& handle,

Contributor:

Would it be better to rename this to shuffle_ext_vertices_by_gpu_id?

Collaborator (Author):

Done in next push

@@ -47,6 +48,22 @@ struct compute_gpu_id_from_vertex_t {
}
};

template <typename vertex_t>
struct compute_gpu_id_from_int_vertex_t {

Contributor:

Would it be better to rename the other functors working on external vertex IDs to use ext_vertex_t and ext_edge_t?

Collaborator (Author):

Done for vertex in the next push.

Do we ever try to use these functors on an int_edge_t? I'm inclined not to add the ext to the name unless we need to distinguish.

Contributor:

Gotcha, agreed.

template <typename vertex_t>
struct compute_gpu_id_from_int_vertex_t {
vertex_t const* vertex_partition_range_lasts;
size_t num_vertex_partitions;

Contributor:

Yeah... maybe just a FIXME statement, but we should eventually replace these (pointer, size) pairs with raft::device_span.

Collaborator (Author):

Changed to span in the next push.
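
As a sketch of the span-based version (also reflecting the ext_ rename discussed above; the operator() body is illustrative and may not match the actual functor):

#include <thrust/binary_search.h>
#include <thrust/distance.h>
#include <thrust/execution_policy.h>
// raft::device_span include omitted; its header location varies across RAFT versions.

template <typename vertex_t>
struct compute_gpu_id_from_ext_vertex_t {
  raft::device_span<vertex_t const> vertex_partition_range_lasts{};

  __device__ int operator()(vertex_t v) const
  {
    // The owning GPU is the index of the first partition whose last vertex ID exceeds v.
    return static_cast<int>(thrust::distance(
      vertex_partition_range_lasts.begin(),
      thrust::upper_bound(thrust::seq,
                          vertex_partition_range_lasts.begin(),
                          vertex_partition_range_lasts.end(),
                          v)));
  }
};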

zip_iter,
zip_iter + d_vertices.size(),
zip_iter,
[] __device__(auto pair) { return thrust::get<1>(pair) > 0; });

Contributor:

FYI: https://github.com/NVIDIA/thrust/issues/1302
Maybe do the copy_if in chunks, or add a check on d_vertices.size() and throw an exception if it overflows a 32-bit integer (if you expect this is unlikely to happen and we'd better wait for the thrust folks to fix it).

Contributor:

And I guess thrust::remove_if is more intuitive than copy_if here (unless you're willing to copy in chunks). You may look at https://github.com/rapidsai/cugraph/pull/2253/files#diff-ce8c8b8ffdc670a97313ca4ce20de7bf8a18daa81f5a1fde50f3b162bf75b75bR1238

#if 1  // FIXME: work-around for the 32 bit integer overflow issue in thrust::remove,
       // thrust::remove_if, and thrust::copy_if (https://github.com/NVIDIA/thrust/issues/1302)
    rmm::device_uvector<vertex_t> tmp_indices(
      thrust::count_if(handle.get_thrust_policy(),
                       nbr_intersection_indices.begin(),
                       nbr_intersection_indices.end(),
                       detail::not_equal_t<vertex_t>{invalid_vertex_id<vertex_t>::value}),
      handle.get_stream());
    size_t num_copied{0};
    size_t num_scanned{0};
    while (num_scanned < nbr_intersection_indices.size()) {
      size_t this_scan_size = std::min(
        size_t{1} << 30,
        static_cast<size_t>(thrust::distance(nbr_intersection_indices.begin() + num_scanned,
                                             nbr_intersection_indices.end())));
      num_copied += static_cast<size_t>(thrust::distance(
        tmp_indices.begin() + num_copied,
        thrust::copy_if(handle.get_thrust_policy(),
                        nbr_intersection_indices.begin() + num_scanned,
                        nbr_intersection_indices.begin() + num_scanned + this_scan_size,
                        tmp_indices.begin() + num_copied,
                        detail::not_equal_t<vertex_t>{invalid_vertex_id<vertex_t>::value})));
      num_scanned += this_scan_size;
    }
    nbr_intersection_indices = std::move(tmp_indices);
#else
    nbr_intersection_indices.resize(
      thrust::distance(nbr_intersection_indices.begin(),
                       thrust::remove(handle.get_thrust_policy(),
                                      nbr_intersection_indices.begin(),
                                      nbr_intersection_indices.end(),
                                      invalid_vertex_id<vertex_t>::value)),
      handle.get_stream());
#endif

Collaborator (Author):

Switched to remove_if.

Seems unlikely to have an overflow issue, at least with current memory sizes, as the number of elements in a vertex array on each partition is likely to be < 2^31-1. But I added a FIXME so we can remember.

Contributor:

Maybe add CUGRAPH_EXPECTS(d_vertices.size() < std::numeric_limits<int32_t>::max()) as well. I agree that this is unlikely to happen, but if it does happen on the user side or in large-scale testing, it would be very difficult for us to figure out that it is actually due to the overflow. With the check, it will be much easier to figure out what went awry.

Collaborator (Author):

Added the CUGRAPH_EXPECTS here and in the other two places where I call remove_if.
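
For concreteness, a sketch of such a guard placed just before the remove_if call (CUGRAPH_EXPECTS is cugraph's existing error-checking macro; the message text here is illustrative and assumes <limits> is included):

CUGRAPH_EXPECTS(
  d_vertices.size() <= static_cast<size_t>(std::numeric_limits<int32_t>::max()),
  "Invalid input: d_vertices.size() exceeds the 32-bit element count currently assumed by "
  "thrust::remove_if (https://github.com/NVIDIA/thrust/issues/1302).");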

@seunghwak (Contributor) left a comment:

Review Part 2

namespace detail {

/**
* @brief Compute local out degrees of the majors belonging to the adjacency matrices

Contributor:

I need to double-check, but I guess this computes out-degrees if major == source and in-degrees if major == destination.

Collaborator (Author):

Yes, I think that's correct. The sampling code forces store_transposed=false, so this function assumes that.

I'm not sure that's a good long-term assumption (feels like sampling on incoming vertices would be a reasonable thing to do). But at the moment this is sufficient.

Perhaps a FIXME to address this later?

Collaborator (Author):

I added a FIXME near the beginning of these function definitions to reflect that we should revisit this.

* @param handle RAFT handle object to encapsulate resources (e.g. CUDA stream, communicator, and
* handles to various CUDA libraries) to run graph algorithms.
* @param graph_view Non-owning graph object.
* @return A single vector containing the local out degrees of the majors belong to the adjacency

Contributor:

out degrees may not be accurate here.

Collaborator (Author):

Same observation as above: store_transposed=false for the sampling algorithms.

const rmm::device_uvector<typename GraphViewType::edge_type>& global_out_degrees);

/**
* @brief Gather active majors across gpus in a column communicator

Contributor:

Is this a gather or an allgather (will the results be stored only in the root, or in every process in the column communicator)? If it is an allgather, better to rename it to avoid confusion.

Collaborator (Author):

Fixed in next push

rmm::device_uvector<vertex_t>&& d_in);

/**
* @brief Return global out degrees of active majors

Contributor:

I need to double-check that "out" degrees is correct here.

Collaborator (Author):

Sampling forces store_transposed=false


template <typename vertex_t>
rmm::device_uvector<vertex_t> gather_active_majors(raft::handle_t const& handle,
rmm::device_uvector<vertex_t>&& d_in)

Contributor:

OK, this is using allgatherv, so this function should be renamed to "allgather_active_majors".

Collaborator (Author):

Fixed in next push
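
For clarity, the renamed declaration would simply mirror the quoted one (a sketch; only the name changes):

template <typename vertex_t>
rmm::device_uvector<vertex_t> allgather_active_majors(raft::handle_t const& handle,
                                                      rmm::device_uvector<vertex_t>&& d_in);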

template <typename GraphViewType>
rmm::device_uvector<typename GraphViewType::edge_type> compute_local_major_degrees(
raft::handle_t const& handle, GraphViewType const& graph_view)
{

Contributor:

Collaborator (Author):

I'll add a FIXME.

I actually think much of this logic should be moved into the graph_view; we assume too much about the implementation in these functions.


auto compacted_length = thrust::distance(
input_iter,
thrust::remove_if(

Contributor:

FYI: the current version of thrust::remove_if does not work properly if minors.size() overflows a 32-bit integer.

Collaborator (Author):

Added a FIXME to both of these remove_if calls in this file (both branches of the if).

thrust::make_optional(rmm::device_uvector<weight_t>(0, handle.get_stream()));

size_t level{0};
size_t num_rows{1};

Contributor:

Better to rename this to row_comm_size (this is more of a consistency thing).

Collaborator (Author):

Done

@codecov-commenter commented May 12, 2022

Codecov Report

Merging #2258 (bad27dc) into branch-22.06 (e906c98) will decrease coverage by 5.93%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06    #2258      +/-   ##
================================================
- Coverage         69.91%   63.97%   -5.94%     
================================================
  Files               175      100      -75     
  Lines             11503     4436    -7067     
================================================
- Hits               8042     2838    -5204     
+ Misses             3461     1598    -1863     
Impacted Files Coverage Δ
python/pylibcugraph/pylibcugraph/_version.py 0.00% <0.00%> (ø)
python/cugraph/cugraph/tests/test_hypergraph.py
...ugraph/cugraph/tests/test_maximum_spanning_tree.py
python/cugraph/cugraph/tests/test_core_number.py
python/cugraph/cugraph/tests/mg/test_mg_hits.py
...hon/cugraph/cugraph/tests/test_k_truss_subgraph.py
.../cugraph/cugraph/tests/test_subgraph_extraction.py
python/cugraph/cugraph/tests/test_ecg.py
...ython/cugraph/cugraph/tests/test_triangle_count.py
...thon/pylibcugraph/pylibcugraph/tests/test_utils.py
... and 71 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e906c98...bad27dc.

@seunghwak (Contributor) left a comment:

Looks good to me except for a few minor complaints about documentation.

* @param handle RAFT handle object to encapsulate resources (e.g. CUDA stream, communicator, and
* handles to various CUDA libraries) to run graph algorithms.
* @param graph_view Graph View object to generate NBR Sampling on.
* @param d_starting_vertices Device span of starting vertex IDs for the NBR Sampling.

Contributor:

d_starting_vertices => starting_vertices, as we renamed the input parameters.

* @param graph_view Graph View object to generate NBR Sampling on.
* @param d_starting_vertices Device span of starting vertex IDs for the NBR Sampling.
* @param h_fan_out Host span defining branching out (fan-out) degree per source vertex for each
* level

Contributor:

h_fan_out => fan_out as well.
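
A sketch of the corrected doc lines with the renamed parameters (wording taken from the quoted snippets above; not the final merged text):

 * @param starting_vertices Device span of starting vertex IDs for the NBR Sampling.
 * @param fan_out Host span defining branching out (fan-out) degree per source vertex for
 * each level.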

@@ -350,6 +365,10 @@ void partially_decompress_edge_partition_to_fill_edgelist(
thrust::fill(
thrust::seq, majors + major_offset, majors + major_offset + local_degree, major);
thrust::copy(thrust::seq, indices, indices + local_degree, minors + major_offset);
if (weights)

Contributor:

This can lead to thread divergence if local_degree values vary significantly across the threads in a single warp. Maybe add a FIXME statement. I have the same issue in the Triangle Counting implementation (https://github.com/rapidsai/cugraph/pull/2253/files#diff-ce8c8b8ffdc670a97313ca4ce20de7bf8a18daa81f5a1fde50f3b162bf75b75bR434).

You may add a similar FIXME. Later, we may address this together by adding something like a (delayed) segmented_copy (or fill).

@ChuckHastings (Collaborator, Author) commented:

@gpucibot merge

@rapids-bot (bot) merged commit 4a6263a into rapidsai:branch-22.06 on May 13, 2022
@ChuckHastings deleted the fea_uniform_neighborhood_sampling_refactor branch on August 4, 2022 18:26