
Integrate accumulate_into_selected from ANN utils into linalg::reduce_rows_by_keys #909

Merged · 8 commits merged into rapidsai:branch-22.12 on Oct 19, 2022

Conversation

@Nyrio (Contributor) commented Oct 10, 2022:

accumulate_into_selected achieves much better performance than the previous implementation of reduce_rows_by_keys for large nkeys (sum_rows_by_key_large_nkeys_kernel_rowmajor). According to the benchmark that I added for this primitive, the difference is a factor of 240x for sizes relevant to IVF-Flat (and a factor of ~10x for smaller nkeys, e.g. 64).

This is mostly because the legacy implementation, probably in an attempt to reduce atomic conflicts, assigned a key and a tile of the matrix to each block, and each block only reduced the rows corresponding to its assigned key. With a very large number of keys, e.g. 1k, this results in blocks iterating over a large number of rows (possibly tens of thousands) while reading and accumulating only 1 row in 1k.
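For intuition, here is a minimal sketch of the atomic-accumulation idea (illustrative only, not the actual accumulate_into_selected code; all names are made up): each thread handles one input element and atomically adds it into the output row selected by that row's key, so every row a block reads is actually used.

```cpp
// Illustrative sketch only -- not the real RAFT kernel.
// One thread per input element; the key selects the output row.
__global__ void sum_rows_by_key_atomic(
  const float* in, const int* keys, float* out, int nrows, int ncols)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= nrows * ncols) { return; }
  int row = idx / ncols;
  int col = idx % ncols;
  // Atomic accumulation: conflicts are possible, but every element read
  // contributes, unlike the per-key tiling of the legacy kernel.
  atomicAdd(&out[keys[row] * ncols + col], in[idx]);
}
```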

This PR:

  • Replaces sum_rows_by_key_large_nkeys_rowmajor with accumulate_into_selected (I didn't find any cases in which the old kernel performed better).
  • Removes accumulate_into_selected from ann_utils.cuh.
  • Fixes support for custom iterators in reduce_rows_by_keys.
  • Uses the raft prims in calc_centers_and_sizes.

Perf notes:

  • The original kmeans gets a 15-20% speedup for large numbers of clusters.
  • The performance of ivf_flat::build stays the same as before.
  • There are a few extra steps since I separated the cluster-size count from the reduction by key, but they are negligible in comparison.

Question: this change breaks support for host-side-only arrays in calc_centers_and_sizes. Is that an actual use case? Should I add a branch that avoids the raft prims when all arrays are host-side?

cc @achirkin @tfeher @cjnolet

@Nyrio Nyrio requested review from a team as code owners October 10, 2022 12:26
@Nyrio Nyrio added the 3 - Ready for Review, improvement, non-breaking, and CMake labels and removed the CMake label on Oct 10, 2022
@cjnolet (Member) left a comment:

Thanks again @Nyrio for this consolidation and optimization! Minor things again.

Review thread on cpp/bench/CMakeLists.txt (resolved).
On this hunk of the diff:

```
@@ -383,6 +317,8 @@ __global__ void map_along_rows_kernel(
 * @brief Map a binary function over a matrix and a vector element-wise, broadcasting the vector
 * values along rows: `m[i, j] = op(m[i,j], v[i])`
 *
 * @todo(lsugy): replace with matrix_vector_op
```
Member:

Can we add a GitHub issue for this and reference it here just to make sure we are tracking it?

Contributor Author:

Oops, I didn't mean to leave that comment here, but I already have a WIP PR for this: #911

On this excerpt of the reduce_rows_by_key signature diff:

```
  int nkeys,
  DataIteratorT* d_sums,
  cudaStream_t stream)
  IdxT nrows,
```
Member:

We are going to be deprecating the raw-pointer APIs soon in favor of the new (more self-documenting) mdspan APIs. Do you see any reason why we should prefer to keep the iterator-based APIs over the mdspan APIs?

Contributor Author:

Iterator-based APIs combined with fancy iterators such as cub::TransformInputIterator avoid unnecessary steps when the input needs to be converted (e.g. int-to-float mapping, key-value to key-only or value-only, etc.). This comes at the expense of more template instantiations, and it forgoes some optimizations that are only possible with raw pointers (though we can use if constexpr to account for that).
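For example, a minimal sketch of that pattern (the names here are illustrative, not RAFT code): the int-to-float conversion is applied lazily as the consuming kernel reads each element, so no temporary float array is ever materialized.

```cpp
#include <cub/iterator/transform_input_iterator.cuh>

// Functor applied on the fly to each element as it is read.
struct int_to_float {
  __host__ __device__ float operator()(int x) const { return static_cast<float>(x); }
};

void example(const int* d_in)
{
  // Behaves like a `const float*`-style random-access input iterator,
  // but dereferencing converts the underlying int on demand.
  cub::TransformInputIterator<float, int_to_float, const int*> itr(d_in, int_to_float{});
  // itr can now be passed to any primitive that accepts input iterators.
}
```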

Member:

Yeah, I get the point of the "streamability" of the iterators vs raw pointers. Given that the mdspan is really just a very lightweight wrapper around any type (it can be but doesn't have to be a raw pointer) and we're just forwarding the underlying data_handle() to the function in the detail namespace, can't we also wrap the mdspan around an iterator?

Contributor Author:

> mdspan is really just a very lightweight wrapper around any type (it can be but doesn't have to be a raw pointer)

Not exactly. As far as I understand, span or mdspan can only wrap data that exists in memory (and contiguously). std::span even has a data() member which returns a pointer to the memory location of the first element. So it can wrap simple iterators like vector::begin, but not cub::TransformInputIterator.
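To illustrate the contiguity point, a small sketch (assuming C++20 std::span):

```cpp
#include <span>
#include <vector>

void span_example()
{
  std::vector<float> v{1.f, 2.f, 3.f};
  std::span<float> s(v.begin(), v.size());  // ok: vector iterators are contiguous
  float* p = s.data();                      // pointer to the first element
  // A cub::TransformInputIterator could not be used in place of v.begin():
  // std::span's iterator constructor requires std::contiguous_iterator.
  (void)p;
}
```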

Member:

mdspan and span are a little different in both form and function; we have been investing in mdspan more than span for our public APIs. mdspan itself does not require contiguity, though we do enforce that in many of the public APIs for the simple reason that we are using it as a facade for raw pointers. If we are expecting iterators, we could relax that constraint a bit when the underlying data_handle() is in fact an iterator. Also, @mhoemmen can correct me if I'm wrong, but I don't believe the data backing an mdspan needs to exist in memory: I think it could even be a file pointer or a pointer to a remote data buffer, so long as the accessor knows how to materialize it and it allows for random access.

I'm not trying to discourage the use of iterators in our public API, and I definitely see the benefits of using them to apply functors lazily upon materialization to avoid copies / additional allocations. I'm hoping we can incorporate acceptance of iterators more consistently and broadly across our APIs and avoid having only a couple of functions throughout the codebase that accept iterators while most do not.

Member:

I'm also okay accepting this PR and adding a todo to figure out how we can accept iterator types more broadly across the new APIs so we can consolidate the dev experience a bit for our end users.

Contributor Author:

Oh, I thought of mdspan as a multi-dimensional std::span, but it seems it's a bit more versatile.
So, to make sure I understand: would each custom iterator require a different accessor? And would the mdspan type for iterator-accepting arguments still be a template parameter, while arguments expecting raw pointers can use matrix or vector views?

Contributor:

> I don't believe the data backing an mdspan needs to exist in memory: I think it could even be a file pointer or a pointer to a remote data buffer, so long as the accessor knows how to materialize it and it allows for random access.

@cjnolet is correct (as usual!). We say "mdspan doesn't need a backing span."

  1. Elements don't need to exist in memory;
  2. data_handle_type doesn't need to be ElementType*;
  3. even if it is, or even if data_handle_type presents the syntax of an iterator, it doesn't need to act like one;
  4. reference doesn't need to be element_type&; and
  5. access(p, i) doesn't need to return p[i].

One example of (3) is MPI_Win (which could be void*, but it's not a pointer to an array), a handle to remote memory.
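To make points (2) and (5) concrete, here is a minimal sketch (assuming C++23 std::mdspan; not RAFT code) of an accessor policy whose data handle is an arbitrary random-access iterator rather than a raw pointer:

```cpp
#include <cstddef>
#include <iterator>
#include <mdspan>  // C++23 (or the reference kokkos/mdspan implementation)

// Accessor whose data_handle_type is a random-access iterator.
template <class Iterator>
struct iterator_accessor {
  using element_type     = typename std::iterator_traits<Iterator>::value_type;
  using reference        = typename std::iterator_traits<Iterator>::reference;
  using data_handle_type = Iterator;
  using offset_policy    = iterator_accessor;

  constexpr reference access(data_handle_type it, std::size_t i) const
  {
    return it[i];  // element access goes through the iterator, not a pointer
  }
  constexpr data_handle_type offset(data_handle_type it, std::size_t i) const
  {
    return it + i;  // advancing the handle is plain iterator arithmetic
  }
};
```

An mdspan instantiated with such an accessor, e.g. `std::mdspan<float, std::dextents<int, 1>, std::layout_right, iterator_accessor<Itr>>`, would view the iterator's elements without requiring them to live contiguously in memory.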

Contributor:

My personal view is that mdspan is a multipass multidimensional iterator. That is, it's the preferred interface for viewing a multidimensional range.

If you really want an iterator range for a generic rank-1 mdspan `x`, `std::views::iota(0, x.extent(0)) | std::views::transform([x] (auto index) { /* function using x[index] */ })` works fine.
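Spelled out as a self-contained host-side example (assuming a standard library that ships `<mdspan>`, e.g. GCC 14's):

```cpp
#include <iostream>
#include <mdspan>
#include <ranges>
#include <vector>

int main()
{
  std::vector<double> storage{1.0, 2.0, 3.0, 4.0};
  std::mdspan x{storage.data(), 4};  // rank-1 mdspan over the vector

  // Lazy iterator range over the mdspan's elements.
  auto doubled = std::views::iota(0, static_cast<int>(x.extent(0)))
               | std::views::transform([x](auto i) { return 2.0 * x[i]; });

  for (double v : doubled) { std::cout << v << ' '; }  // prints: 2 4 6 8
}
```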

Contributor:

Also, it looks like all of RAFT's rank-1 mdspan would let one use `data_handle(), data_handle() + extent(0)` as an iterator range.

On this part of the diff:

```cpp
cub::TransformInputIterator<float, utils::mapping<float>, const T*> mapping_itr(dataset,
                                                                                mapping_op);

// todo(lsugy): use iterator from KV output of fusedL2NN
```
Member:

Can you create an issue for this and reference it here for tracking? Do you see any reason it would be more beneficial to use iterators here over the mdspan API?

Contributor Author:

The point of using an iterator here would be to avoid an extra step after fusedL2NN to extract and cast the key (fusedL2NN can output key-value or value-only, but not key-only).
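For illustration, a hypothetical sketch of that idea (the pair type and variable names are stand-ins, not the actual fusedL2NN output types):

```cpp
#include <cub/iterator/transform_input_iterator.cuh>
#include <cub/util_type.cuh>  // cub::KeyValuePair

// Read the key component on the fly, instead of materializing a key-only array.
struct extract_key {
  __host__ __device__ int operator()(const cub::KeyValuePair<int, float>& kv) const
  {
    return kv.key;
  }
};

void example(const cub::KeyValuePair<int, float>* d_min_cluster_and_dist)
{
  cub::TransformInputIterator<int, extract_key, const cub::KeyValuePair<int, float>*>
    keys_itr(d_min_cluster_and_dist, extract_key{});
  // keys_itr now behaves like a `const int*` over the keys only.
}
```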

@tfeher (Contributor) left a comment:

Thanks @Nyrio for the PR! I confirm that there is no need to handle host-only pointers in calc_centers_and_sizes: that was only necessary historically, and all the calculation is now done in device or managed memory for the IVF methods. Some nitpicks below, otherwise it looks good.

Review thread on cpp/include/raft/linalg/detail/reduce_rows_by_key.cuh (outdated, resolved).
@Nyrio (Contributor Author) commented Oct 17, 2022:

Thanks @tfeher for confirming that!

@Nyrio Nyrio requested a review from tfeher October 17, 2022 13:42
@tfeher (Contributor) left a comment:

Thanks Louis for the update, it looks good to me!

@Nyrio (Contributor Author) commented Oct 19, 2022:

@cjnolet If that's ok with you, can we merge this and mdspanify later? Mdspanifying custom iterators requires helpers and types that we don't have yet.

@cjnolet (Member) commented Oct 19, 2022:

@Nyrio yeah, I think we can push that change off until later. Can you create a quick GitHub issue for it so that it doesn't get lost?

@cjnolet (Member) commented Oct 19, 2022:

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0de9ece into rapidsai:branch-22.12 on Oct 19, 2022.
@tfeher tfeher mentioned this pull request on Oct 27, 2022.