
Replace normalize_rows in ann_utils.cuh by a new rowNormalize prim and improve performance for thin matrices (small n_cols) #979

Merged: 10 commits into rapidsai:branch-22.12 on Nov 17, 2022

Conversation

@Nyrio (Contributor) commented Nov 2, 2022

This follows up on a discussion at #652 (comment). The main goal of this PR is to make this helper accessible as a raft primitive.

I also took the opportunity to look at the performance of this primitive, and have improved it for:

  • Thin matrices: fewer than 32 threads per row, using shuffle-based reductions (a rough sketch of this pattern follows below).
  • Thick matrices: a cub-based reduction handling one row per block.
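To make the thin-matrix strategy concrete, here is a minimal, hypothetical CUDA sketch of the shuffle-based pattern (an illustration only, not this PR's implementation, which supports arbitrary norm types and lives in cpp/include/raft/linalg/detail/normalize.cuh):

```cuda
// Hypothetical thin-matrix kernel: each row is handled by a group of
// ThreadsPerRow lanes (a power of two <= 32); squared values are reduced with
// warp shuffles, and every lane then divides its columns by the L2 norm.
// Rows with a zero norm are not handled, to keep the sketch short.
template <int ThreadsPerRow, typename Type, typename IdxType>
__global__ void thin_row_normalize_l2(Type* out, const Type* in, IdxType n_rows, IdxType n_cols)
{
  const IdxType tid = static_cast<IdxType>(blockIdx.x) * blockDim.x + threadIdx.x;
  const IdxType row = tid / ThreadsPerRow;
  const int lane    = static_cast<int>(tid % ThreadsPerRow);
  const bool valid  = row < n_rows;

  // Each lane accumulates a partial sum of squares over strided columns.
  Type acc = Type(0);
  if (valid) {
    for (IdxType c = lane; c < n_cols; c += ThreadsPerRow) {
      Type v = in[row * n_cols + c];
      acc += v * v;
    }
  }

  // Butterfly (xor) shuffle reduction within each group of ThreadsPerRow lanes.
  // All 32 lanes of the warp take part, but values only mix inside a group
  // because the offsets stay below ThreadsPerRow.
#pragma unroll
  for (int offset = ThreadsPerRow / 2; offset > 0; offset /= 2) {
    acc += __shfl_xor_sync(0xffffffff, acc, offset);
  }
  if (!valid) return;

  // Every lane of the group now holds the full sum of squares for its row.
  const Type norm = sqrt(acc);
  for (IdxType c = lane; c < n_cols; c += ThreadsPerRow) {
    out[row * n_cols + c] = in[row * n_cols + c] / norm;
  }
}
```

A launch would use something like 256 threads per block and ceil(n_rows * ThreadsPerRow / 256) blocks, with ThreadsPerRow chosen from n_cols by a heuristic.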

Here is an overview of the before/after performance on A100:

[Performance chart: 2022-11-11_normalize_perf_float_int32]

@Nyrio requested review from a team as code owners (Nov 2, 2022)
@Nyrio added labels: 3 - Ready for Review, cpp, CMake, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change); removed labels: cpp, CMake (Nov 2, 2022)
@Nyrio changed the title from "Replace normalize_rows in ann_utils.cuh by a new raft rowNormalize prim and improve performance for thin matrices (small n_cols)" to "Replace normalize_rows in ann_utils.cuh by a new rowNormalize prim and improve performance for thin matrices (small n_cols)" (Nov 2, 2022)
@achirkin (Contributor) commented Nov 3, 2022

Hi, @Nyrio, I'd like to have a look at this when I'm back from vacation on Monday.
In the meantime, would you also consider the case of an extremely thick matrix, i.e. running multiple blocks per row?

@Nyrio (Contributor, Author) commented Nov 3, 2022

> In the meantime, would you also consider the case of an extremely thick matrix, i.e. running multiple blocks per row?

Is that a realistic use case for this primitive? Do we have examples where the feature space is very large and the number of rows small? I'm asking because we will need to write another kernel for thick matrices, and I don't think it's worth doing if we don't have any use for it yet. We can always do it when the need arises.

@Nyrio (Contributor, Author) commented Nov 3, 2022

Note on thick matrices: we need to all-reduce the norm between blocks collaborating on the same row, so it's probably worth first using cub::BlockReduce; thread 0 would then synchronize with the other collaborating blocks and broadcast the value to all threads in its block so they can apply the division. With those synchronization overheads, this hypothetical kernel should only be used when the number of rows is really small. (A rough two-pass sketch of this idea follows below.)
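For illustration, here is a hedged two-pass sketch of the multiple-blocks-per-row idea (not the single-kernel scheme described above and not this PR's code; the kernel names, BlockSize, and blocks_per_row parameter are made up for the example):

```cuda
#include <cub/block/block_reduce.cuh>

// Pass 1: each block reduces its chunk of a row with cub::BlockReduce and
// contributes one atomicAdd to that row's sum of squares.
// row_sq_norms must be zero-initialized (e.g. with cudaMemsetAsync) beforehand.
template <int BlockSize, typename IdxType>
__global__ void partial_row_sq_norms(float* row_sq_norms,
                                     const float* in,
                                     IdxType n_rows,
                                     IdxType n_cols,
                                     IdxType blocks_per_row)
{
  const IdxType row   = blockIdx.x / blocks_per_row;
  const IdxType chunk = blockIdx.x % blocks_per_row;
  if (row >= n_rows) return;

  float acc = 0.f;
  for (IdxType c = chunk * BlockSize + threadIdx.x; c < n_cols;
       c += blocks_per_row * BlockSize) {
    float v = in[row * n_cols + c];
    acc += v * v;
  }

  using BlockReduce = cub::BlockReduce<float, BlockSize>;
  __shared__ typename BlockReduce::TempStorage tmp;
  const float block_sum = BlockReduce(tmp).Sum(acc);
  if (threadIdx.x == 0) { atomicAdd(&row_sq_norms[row], block_sum); }
}

// Pass 2: element-wise division by the finished row norms.
// Rows with a zero norm are not handled here, to keep the sketch short.
template <typename IdxType>
__global__ void divide_by_row_norms(float* out,
                                    const float* in,
                                    const float* row_sq_norms,
                                    IdxType n_rows,
                                    IdxType n_cols)
{
  const IdxType i = static_cast<IdxType>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i >= n_rows * n_cols) return;
  out[i] = in[i] / sqrtf(row_sq_norms[i / n_cols]);
}
```

Pass 1 would be launched with n_rows * blocks_per_row blocks of BlockSize threads; the extra pass and atomics are exactly the synchronization overheads that make this path worthwhile only when the number of rows is very small.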

@achirkin (Contributor) commented Nov 7, 2022

Ideally, I think it would be nice to have a public-facing generic row/col-normalize function somewhere in the distance namespace, which would also take the distance type as an argument. What do you think about this, @cjnolet?

Maybe even with a Python interface? I remember we had some issues with normalization being done slowly by cuDF in cuml/svm. @tfeher, do you remember if we could potentially have extreme matrix dimensions there (m >> n and n >> m)?

@achirkin (Contributor) commented Nov 7, 2022

> Is that a realistic use case for this primitive? Do we have examples where the feature space is very large and the number of rows small?

This could potentially be the case for linear methods when one wants to solve the dual problem (e.g. dual coordinate descent).
In any case, if we put this outside of the detail namespace, I think it's important to make sure performance doesn't degrade too much for any extreme matrix shape.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> it would be nice to have a public-facing generic row/col-normalize function somewhere in the distance namespace

Totally agree. I'll look into making the non-coalesced kernel support row/col normalization for both row- and col-major layouts, and I might add it to this PR or to a separate one. If we take the norm type as an argument, would you prefer to systematically apply the square root for L2, or to provide an option? I'm not sure there is any case where one would want to divide by the sum of squares rather than the actual L2 norm.

Regarding thick matrices, I'm implementing that for the coalesced reduction in a separate branch; if we think it's important for this PR too, I'll look into adding it here as well.

@achirkin (Contributor) commented Nov 7, 2022

Hmm, although we have both squared and normal L2 enum values in the DistanceType enumeration, I think we should divide by the actual (sqrt) norm in both cases. The guideline here is that if we apply normalization for any supported kind of distance and then compute the corresponding norm on the output, the value should be equal to 1.0 for every row.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> although we have both squared and normal L2 enum values in the DistanceType enumeration

@achirkin Please note that the norm / rowNorm / colNorm prims do not take a DistanceType but NormType defined as follows:

enum NormType { L1Norm = 0, L2Norm };

And while fin_op can be used to compute the square root, the default behavior for the L2 norm in that case is to compute the sum of squares; that is because it is used for fusedL2NN.

For normalize, I also think it's fine to apply the square root by default, but there might be some inconsistency if we reuse the same NormType enum with a different definition of the L2 norm. On the other hand, the square root is the actual definition of the L2 norm. (A small host-side illustration of the fin_op idea follows below.)
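To illustrate the fin_op point, here is a tiny host-side sketch (the function name and signature are made up for the example; this is not raft's API): the reduction accumulates the sum of squares, and a final element-wise transform decides whether the caller gets the squared norm (identity) or the actual L2 norm (sqrt).

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper, for illustration only: row-wise "L2" reduction with a
// caller-supplied final transform (fin_op), mirroring the discussion above.
template <typename FinOp>
std::vector<float> row_l2(const std::vector<float>& data, int n_rows, int n_cols, FinOp fin_op)
{
  std::vector<float> out(n_rows, 0.f);
  for (int r = 0; r < n_rows; ++r) {
    float acc = 0.f;
    for (int c = 0; c < n_cols; ++c) {
      acc += data[r * n_cols + c] * data[r * n_cols + c];
    }
    out[r] = fin_op(acc);  // identity -> sum of squares; sqrt -> actual L2 norm
  }
  return out;
}

// Usage:
//   auto sq_norms = row_l2(x, m, n, [](float v) { return v; });
//   auto l2_norms = row_l2(x, m, n, [](float v) { return std::sqrt(v); });
```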

@cjnolet (Member) commented Nov 7, 2022

> Is that a realistic use case for this primitive? Do we have examples where the feature space is very large and the number of rows small?

In general, we expect metric spaces to start breaking down around dimension 1024 and above; this phenomenon has been documented in a lot of literature. What happens is that the variance of the data points begins to decrease rapidly and they all end up converging into a single blob, lowering their overall discriminative capability in that space.

That's true for distances, anyway, but for normalizing a set of vectors, I've seen the dimensionality go up into the tens of thousands. As @achirkin points out, this might be used as a preprocessing step before a larger feature selection step is performed using something like lasso, for example. I do suggest that we consider moderately tall and wide datasets (thousands, not millions).

@cjnolet (Member) commented Nov 7, 2022

> Ideally, I think it would be nice to have a public-facing generic row/col-normalize function somewhere in the distance namespace, which would also take the distance type as an argument. What do you think about this, @cjnolet?

I suggest that we keep the normalization in the linalg namespace, because its use for distance computations is more of an implementation detail than the primary goal of performing the normalization. I could see an argument for putting it in the matrix namespace, but I think its representation as a standard norm followed by a matrix/vector division across rows makes it more suitable for linalg.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> I suggest that we keep the normalization in the linalg namespace

Oh, I missed that Artem suggested the distance namespace. If norm is in linalg, normalize belongs in linalg too.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> As @achirkin points out, this might be used as a preprocessing step before a larger feature selection step is performed using something like lasso, for example. I do suggest that we consider moderately tall and wide datasets (thousands, not millions).

@cjnolet I see, that makes sense. On a separate branch, I have a 3-kernel approach for the coalesced reduction (thin: shuffle-based reduction with multiple rows per block; medium: the current cub-based reduction with one block per row; thick: multiple blocks per row and atomics).

I am benchmarking and writing heuristics, and I can then adapt this for normalize, which is a slightly more complicated kernel due to the additional waiting on and broadcasting of the reduction result. (A toy dispatch sketch follows below.)
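As a toy illustration of what such a heuristic could look like (the thresholds below are made-up placeholders, not the benchmarked values being worked out here):

```cpp
// Hypothetical path selection between the three kernels described above.
// The thresholds are illustrative only; the real ones come from benchmarking.
enum class ReductionPath { Thin, Medium, Thick };

template <typename IdxType>
ReductionPath choose_reduction_path(IdxType n_rows, IdxType n_cols)
{
  if (n_cols <= 32) {
    return ReductionPath::Thin;    // shuffle-based reduction, several rows per block
  }
  if (n_rows >= 128) {
    return ReductionPath::Medium;  // cub-based reduction, one block per row
  }
  return ReductionPath::Thick;     // multiple blocks per row + atomics
}
```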

@cjnolet (Member) commented Nov 7, 2022

> @achirkin Please note that the norm / rowNorm / colNorm prims do not take a DistanceType but NormType defined as follows:

I propose we eventually remove the enum and just accept an integral "order" argument directly. This is what scipy does, for example.

Another option would be to add all the most widely used norm computations (L0, L1, L2, Linf) to the enum. L1 and L2 are widely used, but as pointed out in the semirings paper I referenced earlier, L0 (essentially the number of nonzero matrix elements), Linf, and Lmax are also important for computing a multitude of general distance measures (not necessarily just metrics). They also boil down to simple reduction/accumulation functors in the end (see the sketch below).
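As a rough sketch of that last point (illustrative functors only, not raft's own), each of these norms is just a per-element transform paired with a reduction/accumulation op, with an optional final sqrt for L2:

```cpp
#include <algorithm>
#include <cmath>

// Illustrative element-wise / accumulation functor pairs for the norms
// mentioned above (not raft's actual functors):
//   L0   -> count nonzeros:       main = (x != 0), reduce = sum
//   L1   -> sum of |x|:           main = |x|,      reduce = sum
//   L2   -> sum of x^2 (+ sqrt):  main = x*x,      reduce = sum
//   Linf -> max of |x|:           main = |x|,      reduce = max
struct L0Op   { float operator()(float x) const { return x != 0.f ? 1.f : 0.f; } };
struct L1Op   { float operator()(float x) const { return std::fabs(x); } };
struct L2Op   { float operator()(float x) const { return x * x; } };
struct LinfOp { float operator()(float x) const { return std::fabs(x); } };

struct SumReduceOp { float operator()(float a, float b) const { return a + b; } };
struct MaxReduceOp { float operator()(float a, float b) const { return std::max(a, b); } };
```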

@Nyrio added the 4 - Waiting on Author (Waiting for author to respond to review) label and removed the 3 - Ready for Review label (Nov 9, 2022)
@Nyrio (Contributor, Author) commented Nov 11, 2022

@cjnolet @achirkin I have consolidated the interface with arbitrary norm types, following Corey's suggestion to have both functor-based and enum-based APIs.

I have also improved the performance for thick matrices with a cub-based kernel. Unlike what I did in #1011, I think two code paths are enough for this one rather than three. My work on coalescedReduce showed that for anything more than a few dozen rows, using multiple blocks per row provides very little advantage, as we're memory-bound and already close to the speed of light with as little as one block per SM. Please note that the cub-based kernel already provides a 25x speedup over the ann prim for shallow, thick matrices (a hedged sketch of such a cub-based path follows below).

See the perf chart in the updated description.
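For reference, a minimal sketch of a cub-based one-block-per-row normalization (an illustration under my own assumptions, not the kernel merged in this PR):

```cuda
#include <cub/block/block_reduce.cuh>

// One thread block per row: strided accumulation of squares, cub::BlockReduce,
// broadcast of the norm through shared memory, then the division.
// Rows with a zero norm are not handled, to keep the sketch short.
template <int BlockSize, typename Type, typename IdxType>
__global__ void block_row_normalize_l2(Type* out, const Type* in, IdxType n_cols)
{
  const IdxType row = blockIdx.x;  // launched with n_rows blocks

  Type acc = Type(0);
  for (IdxType c = threadIdx.x; c < n_cols; c += BlockSize) {
    Type v = in[row * n_cols + c];
    acc += v * v;
  }

  using BlockReduce = cub::BlockReduce<Type, BlockSize>;
  __shared__ typename BlockReduce::TempStorage tmp;
  __shared__ Type s_norm;

  // Only thread 0 receives the valid reduction result, so it broadcasts it.
  const Type sum = BlockReduce(tmp).Sum(acc);
  if (threadIdx.x == 0) { s_norm = sqrt(sum); }
  __syncthreads();

  for (IdxType c = threadIdx.x; c < n_cols; c += BlockSize) {
    out[row * n_cols + c] = in[row * n_cols + c] / s_norm;
  }
}
```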

@Nyrio added the 4 - Waiting on Reviewer (Waiting for reviewer to review or respond) label and removed the 4 - Waiting on Author (Waiting for author to respond to review) label (Nov 11, 2022)
@Nyrio requested review from cjnolet and achirkin (Nov 11, 2022)
@tfeher (Contributor) left a comment

Thanks @louis for the PR! I have some questions, please see below.

Review threads (resolved):
  • cpp/include/raft/util/cuda_utils.cuh
  • cpp/test/linalg/normalize.cu
  • cpp/include/raft/linalg/detail/normalize.cuh
@Nyrio requested a review from tfeher (Nov 15, 2022)
@Nyrio removed their assignment (Nov 15, 2022)
@tfeher (Contributor) left a comment

Thanks Louis for addressing the issues, the PR looks good to me.

Review thread (resolved): cpp/test/linalg/normalize.cu
@cjnolet (Member) left a comment

LGTM. And thank you for using the new mdspan API!

@cjnolet (Member) commented Nov 16, 2022

@achirkin just waiting on your input here before I merge.

@achirkin (Contributor) left a comment

Sorry for holding this up! LGTM as well; I'm very glad to see prims like this being thoroughly optimized in raft!
Also, just a couple of small comments below.

```cpp
@@ -516,6 +516,16 @@ struct Nop {
  HDI Type operator()(Type in, IdxType i = 0) { return in; }
};

template <typename Type, typename IdxType = int>
struct SqrtOp {
```
@achirkin (Contributor) commented:

Here, and in other "Functor" structs: would you consider moving the template parameters onto the operator() where possible? This way, the template parameters will be inferred automatically: less typing and fewer opportunities to introduce bugs on the user side. A sketch of the intended shape follows below.
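For example, the suggestion amounts to something like this (a sketch of the intent only, assuming raft's HDI host/device macro; a plain sqrt stands in for the actual functor body):

```cpp
#ifndef HDI
#define HDI __host__ __device__ inline  // raft-style host/device macro (assumed)
#endif

// Template parameters moved onto operator() so they are deduced at the call
// site; sketch only, not the final raft code.
struct SqrtOp {
  template <typename Type, typename IdxType = int>
  HDI Type operator()(Type in, IdxType = 0) const
  {
    return sqrt(in);
  }
};
```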


```cpp
template <typename Type>
struct Max {
  HDI Type operator()(Type a, Type b) { return myMax(a, b); }
```
@achirkin (Contributor) commented:

For some reason, I had the impression that we wanted to deprecate the myXxx-style functions in raft. Is that the case, @cjnolet? Here we could use std::max, since it is constexpr.

@cjnolet (Member) commented:

@Nyrio do you want to create an issue for these and do them as a follow-on? I think since this PR has already been approved and run through CI (and since burndown starts tomorrow), we can go ahead and merge this as-is. What do you guys think?

@Nyrio (Contributor, Author) commented Nov 17, 2022

I have opened two issues since both remarks are outside of the scope of this PR and shouldn't delay merging it.

@cjnolet (Member) commented Nov 17, 2022

@gpucibot merge

@rapids-bot (bot) merged commit e14bcbd into rapidsai:branch-22.12 on Nov 17, 2022