Use fusedL2NN in ANN kmeans #821
Conversation
That's a very nice speedup! Two main points:
- Would you consider templatizing `fusedL2NN` to allow unsigned labels instead? It feels like more internals of ann_kmeans would still need to be changed to properly switch to `LabelT` if we go this way.
- Out of curiosity, did you check if `argmin_along_rows` could be optimized to achieve the same speedup without fusing? I remember it looked awfully slow in some settings in the nsys timeline when I tested it...
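For reference, the unfused pipeline under discussion can be sketched on the CPU as follows (illustrative code only, not the actual RAFT kernels; names are hypothetical). It makes explicit the two steps the fused primitive collapses: materializing the full distance matrix and then taking the argmin along each row.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// CPU sketch of the unfused label-prediction pipeline:
// (1) compute the full [n_rows, n_clusters] distance matrix (on GPU this is
//     GEMM-shaped via the expansion ||x-c||^2 = ||x||^2 + ||c||^2 - 2<x,c>),
// (2) argmin along each row. The fused primitive never materializes step (1).
std::vector<uint32_t> predict_labels_unfused(const std::vector<float>& dataset,  // [n_rows, dim]
                                             const std::vector<float>& centers,  // [n_clusters, dim]
                                             size_t n_rows, size_t n_clusters, size_t dim)
{
  std::vector<float> dist(n_rows * n_clusters);
  for (size_t i = 0; i < n_rows; i++) {
    for (size_t j = 0; j < n_clusters; j++) {
      float d = 0;
      for (size_t k = 0; k < dim; k++) {
        float diff = dataset[i * dim + k] - centers[j * dim + k];
        d += diff * diff;
      }
      dist[i * n_clusters + j] = d;  // step 1: materialized distance matrix
    }
  }
  std::vector<uint32_t> labels(n_rows);
  for (size_t i = 0; i < n_rows; i++) {  // step 2: argmin_along_rows
    float best = std::numeric_limits<float>::max();
    for (size_t j = 0; j < n_clusters; j++) {
      if (dist[i * n_clusters + j] < best) {
        best      = dist[i * n_clusters + j];
        labels[i] = static_cast<uint32_t>(j);
      }
    }
  }
  return labels;
}
```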
inline void predict_float_core(const handle_t& handle,
                               const float* centers,
                               uint32_t n_clusters,
                               uint32_t dim,
                               const float* dataset,
                               const float* dataset_norm,
                               size_t n_rows,
Should `n_rows` and all variables/arrays referring to the dataset indices/sizes be the same type as the indices, i.e. `IdxT`?
For now, I left `dim` and `n_clusters` in `uint32_t`, replaced `uint64_t` with `IdxT`, and solved a few overflows (I've tested with very large datasets to debug overflows).

Which internals of ann kmeans still need to be changed? The IVF-Flat tests are working at least.

I'll look into this, but by my quick calculation, I would have to speed up

It's the things like

Yes, it's not always clear which index types should be used, and I tried not to change all the types in this PR, but only those related to the labels.
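To illustrate the class of overflow being fixed here (a hypothetical minimal example, not the actual RAFT code): with 32-bit sizes, a flattened offset like `row * dim` wraps around once it exceeds 2^32, so the multiplication must happen in the wide index type `IdxT` before use.

```cpp
#include <cassert>
#include <cstdint>

// Buggy variant: the multiply happens in 32 bits and wraps *before*
// the widening assignment to the 64-bit return type.
inline uint64_t flat_offset_buggy(uint32_t row, uint32_t dim)
{
  return row * dim;  // 32-bit multiply: silently wraps for large datasets
}

// Fixed variant: the multiply itself happens in the wide index type
// (e.g. IdxT = uint64_t), as done throughout this PR.
template <typename IdxT>
inline IdxT flat_offset(IdxT row, IdxT dim)
{
  return row * dim;
}
```

With, say, 3 million rows of dimension 2048, the flattened offset (6,144,000,000) exceeds 2^32, and the buggy variant returns a wrapped value.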
# Conflicts:
#	cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh
#	cpp/include/raft/spatial/knn/detail/ann_utils.cuh
Thanks, Louis, for addressing the issues so far. I have a few more nitpicks; otherwise it looks good.
Thanks, Louis, for addressing the issues; the PR looks good to me!
@achirkin could you also have another look at the PR?
I think this PR looks good so far. I'm just reviewing as comments for now, since I have a few questions but I don't want to hold up progress.
 * @param n_rows number of samples in the `dataset`
 * @param[out] labels output predictions [n_rows]
 * @param metric
 * @param stream
 * @param mr (optional) memory resource to use for temporary allocations
 */
-template <typename T>
+template <typename T, typename IdxT, typename LabelT>
 void predict(const handle_t& handle,
What would it take at this point to move the train and predict functions from `raft/spatial/knn/` over to `raft/cluster` and provide the balanced kmeans either as an option in the existing kmeans or as an additional separate API (maybe `raft/cluster/kmeans_balanced.cuh`)? Do you think that would be a significant amount of work, or would it be pretty trivial once the two are using similar primitives?

The CuOPT team has been interested in using the balanced k-means, and I'm sure there are other users who could find it very useful.
Moving it as a separate API sounds feasible. Merging the two APIs requires first using similar primitives, for which I have some ongoing work. I'm not sure which path will be preferred, but if we expose it as a separate API now, it's an API we'll also have to maintain for a while once we merge it with the existing kmeans.
@@ -689,21 +821,40 @@ auto build_fine_clusters(const handle_t& handle,
       "Number of fine clusters must be non-zero for a non-empty mesocluster");
   }

-  utils::copy_selected(
-    mesocluster_sizes[i], dim, dataset_mptr, mc_trainset_ids, dim, mc_trainset, dim, stream);
+  utils::copy_selected((IdxT)mesocluster_sizes[i],
I know you didn't add this function to the implementation originally, but how does `copy_selected` differ from `raft::matrix::copy_rows`, which accepts an array of selected ids? I ask because if they are similar enough, we could add a note to consolidate them.
This one is used all over the place in both new ivf-* implementations, and it uses the `mapping` type to transform between input and output along the way (e.g. cast `uint8_t` to `float` and then divide by 256).

However, in the case of ann_kmeans_balanced, I think we'd better use `raft::matrix::copy_rows`, because we copy indices here rather than values, and logically it has nothing to do with the ann mapping of the value types.
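To make the distinction concrete, here is a CPU sketch of the kind of on-the-fly value mapping described above (the functor and helper names are illustrative, not the actual ann_utils types). A plain row copy such as `matrix::copy_rows` cannot apply such a transform, which is why `copy_selected` carries the mapping.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative mapping in the spirit of the one mentioned above:
// cast uint8_t to float and divide by 256.
struct u8_to_float_mapping {
  float operator()(uint8_t v) const { return static_cast<float>(v) / 256.0f; }
};

// CPU sketch of a row-selecting copy that applies a mapping on the fly:
// out[i] = map(in[row_ids[i]]) element-wise, for each selected row.
template <typename InT, typename OutT, typename Mapping>
void copy_selected_cpu(const size_t* row_ids, size_t n_selected, size_t dim,
                       const InT* in, OutT* out, Mapping map)
{
  for (size_t i = 0; i < n_selected; i++)
    for (size_t k = 0; k < dim; k++)
      out[i * dim + k] = map(in[row_ids[i] * dim + k]);
}
```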
> However, in case of ann_kmeans_balanced, I think we'd better use raft::matrix::copy_rows, because we copy indices here rather than values, and logically it has nothing to do with the ann mapping of the value types.

What do you mean by "we copy indices"? The two calls of `copy_selected` in `ann_kmeans_balanced` copy parts of the dataset and the norm, respectively. While the norm is `float` and can easily make use of `matrix::copyRows` (or a better kernel, since `n_cols = 1`), the dataset can be `int8`/`uint8` and definitely needs the mapping.
But maybe we could adapt `copyRows` to accept an iterator, and then branch inside `matrix::detail::copyRows` to select the most efficient implementation based on whether the iterator is a simple pointer or not.
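A minimal sketch of that dispatch idea, assuming C++17 (function and path names here are hypothetical, not RAFT APIs): accept any input iterator and branch at compile time on whether it is a raw pointer, in which case a faster contiguous-memory path can be taken.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Returns true if the pointer fast path was taken, false for the generic
// iterator path. On device, the fast path could use vectorized loads or
// cudaMemcpy-style copies; the generic path handles transforming iterators.
template <typename InIter, typename OutT>
bool copy_rows_dispatch(InIter in, OutT* out, size_t n)
{
  if constexpr (std::is_pointer_v<InIter>) {
    // fast path: contiguous memory
    for (size_t i = 0; i < n; i++) out[i] = in[i];
    return true;
  } else {
    // generic path: dereference the (possibly transforming) iterator
    for (size_t i = 0; i < n; i++) out[i] = *(in + i);
    return false;
  }
}
```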
> What do you mean "we copy indices"?

Oh, sorry, I looked the wrong way. Indeed, you're right: we do copy values here and make use of the mapping `T -> float`. Which, strictly speaking, is a part of our implementation of the ANN methods so far.
Yes, but in that regard, it matters little whether that hardcoded mapping type is given through an iterator or directly in the kernel, as in `copy_selected_kernel`. I'd even argue I prefer the former, because the primitive doesn't have to know about the mapping.
Ah, yes, I totally agree (we should just make sure we don't lose much in terms of performance). I meant to explain my issues with the other topic discussed above: moving ann_kmeans_balanced into the cluster namespace, whereas it currently contains ANN-implementation-specific logic for transforming the data.
@Nyrio, trying to use raft primitives wherever possible: would `raft::matrix::gather` work in place of `thrust::gather`?

Also, just FYI, we are in the process of migrating the RAFT APIs to accept mdspan everywhere to make them consistent across the board, cleaner, and easier to use. Maybe we could also use `raft::span` (which can construct an iterator)?
@cjnolet `matrix::gather` will be very inefficient for `D = 1`:

for (int i = threadIdx.x; i < D; i += TPB) {
  out[outRowStart + i] = in[inRowStart + i];
}

We could have an array/vector gather, but it would essentially be a simple wrapper around thrust.
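The 1-D gather in question is a very small operation; a CPU sketch of its semantics (on device this would just wrap `thrust::gather`, and the helper name here is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1-D gather: out[i] = in[map[i]]. This is what selecting entries of the
// precomputed norm vector amounts to (n_cols = 1, so a row-wise copy kernel
// with one thread block per "row" would waste almost all of its threads).
template <typename T, typename IdxT>
std::vector<T> gather_1d(const std::vector<T>& in, const std::vector<IdxT>& map)
{
  std::vector<T> out(map.size());
  for (size_t i = 0; i < map.size(); i++) out[i] = in[map[i]];
  return out;
}
```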
Note that I used `thrust::gather` only for the norm; we could however use `matrix::gather` to select rows in the dataset (in another PR). Its support for iterators is convenient: I could use that to do the int-float mapping.

Don't we have some redundancy between `copyRows` and `gather`? The latter seems to be a generalization of the former, am I right?
 * @param n_rows
 * @param n_cols
 * @param[in] a device pointer to the row-major matrix [n_rows, n_cols]
 * @param[out] out device pointer to the vector of selected indices [n_rows]
 * @param stream
 */
template <typename IdxT, typename LabelT>
inline void argmin_along_rows(
I think this is already in the todos for future work in the ann impls, but I want to point out again that this would be a great primitive for `raft::matrix` (along with most of the functions in this file). While I suppose `raft/cluster/hierarchical_kmeans.cuh` could still reference the `raft::spatial::knn::detail` namespace, it would also lend itself to better and cleaner separation if these were made more centralized within raft.
I've noted there in the docs that it would be nice to merge this into select-top-k as a special case:

 * TODO: specialize select_k for the case of `k == 1` and use that one instead.

I suggest I do this as soon as the ivf-pq implementation is merged. Here's the linked issue: #853
Argmin and argmax themselves are useful primitives. We do have an argmax primitive that currently uses cub underneath. We could create an argmin from that, or use this one. Either way, I think it's generally useful and could be moved to `raft::matrix::argmin`.
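A CPU reference for an argmin-along-rows primitive, written as a reduction over (index, value) key-value pairs in the style of cub's ArgMin operator (illustrative sketch, not the RAFT or CUB implementation). Because the winning index travels inside the pair, the reduction operator is commutative, which is exactly the property that makes CUB's commutative block-reduce variants safe here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using KVP = std::pair<uint32_t, float>;  // (key = column index, value)

// ArgMin-style reduction operator: smaller value wins, ties resolve to the
// lower index. Commutative and associative over well-ordered inputs.
inline KVP arg_min_op(const KVP& a, const KVP& b)
{
  if (b.second < a.second || (b.second == a.second && b.first < a.first)) return b;
  return a;
}

// Reference argmin over each row of a row-major [n_rows, n_cols] matrix.
std::vector<uint32_t> argmin_along_rows_ref(const std::vector<float>& a,
                                            size_t n_rows, size_t n_cols)
{
  std::vector<uint32_t> out(n_rows);
  for (size_t i = 0; i < n_rows; i++) {
    KVP best{0, a[i * n_cols]};
    for (size_t j = 1; j < n_cols; j++)
      best = arg_min_op(best, KVP{static_cast<uint32_t>(j), a[i * n_cols + j]});
    out[i] = best.first;
  }
  return out;
}
```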
Just to confirm, all of the k-means changes seem to be compatible with my ivf-pq PR and hence LGTM :)
# Conflicts:
#	cpp/include/raft/spatial/knn/detail/ann_kmeans_balanced.cuh
#	cpp/include/raft/spatial/knn/detail/ann_utils.cuh
@Nyrio, it looks like there are a couple of conflicts, but otherwise good to go. We can merge after the conflicts are resolved.

@cjnolet Yes, I am just running some tests on local changes (conflicts + using the new compiled wrapper around

@gpucibot merge
…ANN kmeans (#912)

This PR follows up on [a suggestion](#821 (comment)) from @cjnolet. The new `argmin` primitive is up to 5x faster than `argmin_along_rows` for dimensions relevant to ANN kmeans, and removes code duplication. The reasons why it is faster are:

- `argmin_along_rows` often misses on doing a sequential reduction before the tree reduction, especially as it uses large block sizes, as much as 1024.
- CUB has a better reduction algorithm than the basic shared-mem reduction used in `argmin_along_rows`.
- If we switch the `argmin` prim to using the `cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY` algorithm, we can get up to 30% further speedup! (I believe it's safe to use the commutative algorithm here since the offset is contained in the key-value pair, so the reduction operation is commutative.)

The speedup that I have measured for IVF-Flat build with the `InnerProduct` metric is around 15%.

Authors:
- Louis Sugy (https://github.com/Nyrio)

Approvers:
- Micka (https://github.com/lowener)
- Corey J. Nolet (https://github.com/cjnolet)
- Tamas Bela Feher (https://github.com/tfeher)

URL: #912
The bottleneck of `ivf_flat::build` is the label prediction, currently done with GEMM + argmin. This PR gives a 25-30% speedup by replacing it with the fused primitive.
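The GEMM + argmin formulation mentioned above relies on the expansion ||x - c||^2 = ||x||^2 + ||c||^2 - 2<x, c>, where the inner products form one large `dataset * centers^T` GEMM and the norms are precomputed (cf. the `dataset_norm` parameter quoted earlier). A CPU sketch of the expanded computation for a single point, under these assumptions (names are illustrative, not the RAFT internals):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Label prediction for one point via the expanded L2 form. The dominant
// per-(row, cluster) work is the inner product <x, c>, which across all rows
// is exactly the GEMM that fusedL2NN fuses with the argmin.
uint32_t predict_one_expanded(const float* x, float x_norm,                // point, ||x||^2
                              const float* centers, const float* c_norms,  // [n_clusters, dim], ||c_j||^2
                              size_t n_clusters, size_t dim)
{
  float best      = std::numeric_limits<float>::max();
  uint32_t label  = 0;
  for (size_t j = 0; j < n_clusters; j++) {
    float dot = 0;
    for (size_t k = 0; k < dim; k++) dot += x[k] * centers[j * dim + k];  // the GEMM term
    float dist = x_norm + c_norms[j] - 2.0f * dot;  // ||x||^2 + ||c||^2 - 2<x,c>
    if (dist < best) {
      best  = dist;
      label = static_cast<uint32_t>(j);
    }
  }
  return label;
}
```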