
Replace normalize_rows in ann_utils.cuh by a new rowNormalize prim and improve performance for thin matrices (small n_cols) #979

Merged: 10 commits into rapidsai:branch-22.12 on Nov 17, 2022

Conversation

@Nyrio (Contributor) commented Nov 2, 2022

This follows up on a discussion at #652 (comment). The main goal of this PR is to make this helper accessible as a raft primitive.

I also took the opportunity to look at the performance of this primitive, and have improved it for:

  • Thin matrices: fewer than 32 threads per row, using shuffle-based reductions (a rough sketch of this pattern follows below).
  • Thick matrices: a cub-based reduction handling one row per block.
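To make the thin-matrix strategy concrete, here is a minimal, hypothetical CUDA sketch of the shuffle-based pattern (an illustration only, not this PR's implementation, which supports arbitrary norm types and lives in cpp/include/raft/linalg/detail/normalize.cuh):

```cuda
// Hypothetical thin-matrix kernel: each row is handled by a group of
// ThreadsPerRow lanes (a power of two <= 32); squared values are reduced with
// warp shuffles, and every lane then divides its columns by the L2 norm.
// Rows with a zero norm are not handled, to keep the sketch short.
template <int ThreadsPerRow, typename Type, typename IdxType>
__global__ void thin_row_normalize_l2(Type* out, const Type* in, IdxType n_rows, IdxType n_cols)
{
  const IdxType tid = static_cast<IdxType>(blockIdx.x) * blockDim.x + threadIdx.x;
  const IdxType row = tid / ThreadsPerRow;
  const int lane    = static_cast<int>(tid % ThreadsPerRow);
  const bool valid  = row < n_rows;

  // Each lane accumulates a partial sum of squares over strided columns.
  Type acc = Type(0);
  if (valid) {
    for (IdxType c = lane; c < n_cols; c += ThreadsPerRow) {
      Type v = in[row * n_cols + c];
      acc += v * v;
    }
  }

  // Butterfly (xor) shuffle reduction within each group of ThreadsPerRow lanes.
  // All 32 lanes of the warp take part, but values only mix inside a group
  // because the offsets stay below ThreadsPerRow.
#pragma unroll
  for (int offset = ThreadsPerRow / 2; offset > 0; offset /= 2) {
    acc += __shfl_xor_sync(0xffffffff, acc, offset);
  }
  if (!valid) return;

  // Every lane of the group now holds the full sum of squares for its row.
  const Type norm = sqrt(acc);
  for (IdxType c = lane; c < n_cols; c += ThreadsPerRow) {
    out[row * n_cols + c] = in[row * n_cols + c] / norm;
  }
}
```

A launch would use something like 256 threads per block and ceil(n_rows * ThreadsPerRow / 256) blocks, with ThreadsPerRow chosen from n_cols by a heuristic.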

Here is an overview of the before/after performance on A100:

[Performance chart: 2022-11-11_normalize_perf_float_int32]

@Nyrio requested review from a team as code owners (Nov 2, 2022)
@Nyrio added labels: 3 - Ready for Review, cpp, CMake, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change); removed labels: cpp, CMake (Nov 2, 2022)
@Nyrio changed the title from "Replace normalize_rows in ann_utils.cuh by a new raft rowNormalize prim and improve performance for thin matrices (small n_cols)" to "Replace normalize_rows in ann_utils.cuh by a new rowNormalize prim and improve performance for thin matrices (small n_cols)" (Nov 2, 2022)
@achirkin (Contributor) commented Nov 3, 2022

Hi, @Nyrio, I'd like to have a look at this when I'm back from vacation on Monday.
In the meantime, would you also consider the case of an extremely thick matrix, i.e. running multiple blocks per row?

@Nyrio (Contributor, Author) commented Nov 3, 2022

> In the meantime, would you also consider the case of an extremely thick matrix, i.e. running multiple blocks per row?

Is that a realistic use case for this primitive? Do we have examples where the feature space is very large and the number of rows small? I'm asking because we will need to write another kernel for thick matrices, and I don't think it's worth doing if we don't have any use for it yet. We can always do it when the need arises.

@Nyrio (Contributor, Author) commented Nov 3, 2022

Note on thick matrices: we need to all-reduce the norm between blocks collaborating on the same row, so it's probably worth first using cub::BlockReduce; thread 0 would then synchronize with the other collaborating blocks and broadcast the value to all threads in its block so they can apply the division. With those synchronization overheads, this hypothetical kernel should only be used when the number of rows is really small. (A rough two-pass sketch of this idea follows below.)
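For illustration, here is a hedged two-pass sketch of the multiple-blocks-per-row idea (not the single-kernel scheme described above and not this PR's code; the kernel names, BlockSize, and blocks_per_row parameter are made up for the example):

```cuda
#include <cub/block/block_reduce.cuh>

// Pass 1: each block reduces its chunk of a row with cub::BlockReduce and
// contributes one atomicAdd to that row's sum of squares.
// row_sq_norms must be zero-initialized (e.g. with cudaMemsetAsync) beforehand.
template <int BlockSize, typename IdxType>
__global__ void partial_row_sq_norms(float* row_sq_norms,
                                     const float* in,
                                     IdxType n_rows,
                                     IdxType n_cols,
                                     IdxType blocks_per_row)
{
  const IdxType row   = blockIdx.x / blocks_per_row;
  const IdxType chunk = blockIdx.x % blocks_per_row;
  if (row >= n_rows) return;

  float acc = 0.f;
  for (IdxType c = chunk * BlockSize + threadIdx.x; c < n_cols;
       c += blocks_per_row * BlockSize) {
    float v = in[row * n_cols + c];
    acc += v * v;
  }

  using BlockReduce = cub::BlockReduce<float, BlockSize>;
  __shared__ typename BlockReduce::TempStorage tmp;
  const float block_sum = BlockReduce(tmp).Sum(acc);
  if (threadIdx.x == 0) { atomicAdd(&row_sq_norms[row], block_sum); }
}

// Pass 2: element-wise division by the finished row norms.
// Rows with a zero norm are not handled here, to keep the sketch short.
template <typename IdxType>
__global__ void divide_by_row_norms(float* out,
                                    const float* in,
                                    const float* row_sq_norms,
                                    IdxType n_rows,
                                    IdxType n_cols)
{
  const IdxType i = static_cast<IdxType>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i >= n_rows * n_cols) return;
  out[i] = in[i] / sqrtf(row_sq_norms[i / n_cols]);
}
```

Pass 1 would be launched with n_rows * blocks_per_row blocks of BlockSize threads; the extra pass and atomics are exactly the synchronization overheads that make this path worthwhile only when the number of rows is very small.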

@achirkin (Contributor) commented Nov 7, 2022

Ideally, I think it would be nice to have a public-facing generic row/col-normalize function somewhere in the distance namespace, which would also take the distance type as an argument. What do you think about this, @cjnolet?

Maybe even with a Python interface? I remember we had some issues with normalization being done slowly by cuDF in cuml/svm. @tfeher, do you remember if we could potentially have extreme matrix dimensions there (m >> n and n >> m)?

@achirkin (Contributor) commented Nov 7, 2022

> Is that a realistic use case for this primitive? Do we have examples where the feature space is very large and the number of rows small?

This could potentially be the case for linear methods when one wants to solve the dual problem (e.g. dual coordinate descent).
In any case, if we put this outside of the detail namespace, I think it's important to make sure performance doesn't degrade too much for any extreme matrix shape.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> it would be nice to have a public-facing generic row/col-normalize function somewhere in the distance namespace

Totally agree. I'll look into making the non-coalesced kernel support row/col normalization for both row- and col-major layouts, and I might add it to this PR or to a separate one. If we take the norm type as an argument, would you prefer to systematically apply the square root for L2, or to provide an option? I'm not sure there is any case where one would want to divide by the sum of squares rather than the actual L2 norm.

Regarding thick matrices, I'm implementing that for the coalesced reduction in a separate branch; if we think it's important for this PR too, I'll look into adding it here as well.

@achirkin (Contributor) commented Nov 7, 2022

Hmm, although we have both squared and normal L2 enum values in the DistanceType enumeration, I think we should divide by the actual (sqrt) norm in both cases. The guideline here is that if we apply normalization for any supported kind of distance and then compute the corresponding norm on the output, the value should be equal to 1.0 for every row.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> although we have both squared and normal L2 enum values in the DistanceType enumeration

@achirkin Please note that the norm / rowNorm / colNorm prims do not take a DistanceType but NormType defined as follows:

enum NormType { L1Norm = 0, L2Norm };

And while fin_op can be used to compute the square root, the default behavior for the L2 norm in that case is to compute the sum of squares; that is because it is used for fusedL2NN.

For normalize, I also think it's fine to apply the square root by default, but there might be some inconsistency if we reuse the same NormType enum with a different definition of the L2 norm. On the other hand, the square root is the actual definition of the L2 norm. (A small host-side illustration of the fin_op idea follows below.)
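To illustrate the fin_op point, here is a tiny host-side sketch (the function name and signature are made up for the example; this is not raft's API): the reduction accumulates the sum of squares, and a final element-wise transform decides whether the caller gets the squared norm (identity) or the actual L2 norm (sqrt).

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper, for illustration only: row-wise "L2" reduction with a
// caller-supplied final transform (fin_op), mirroring the discussion above.
template <typename FinOp>
std::vector<float> row_l2(const std::vector<float>& data, int n_rows, int n_cols, FinOp fin_op)
{
  std::vector<float> out(n_rows, 0.f);
  for (int r = 0; r < n_rows; ++r) {
    float acc = 0.f;
    for (int c = 0; c < n_cols; ++c) {
      acc += data[r * n_cols + c] * data[r * n_cols + c];
    }
    out[r] = fin_op(acc);  // identity -> sum of squares; sqrt -> actual L2 norm
  }
  return out;
}

// Usage:
//   auto sq_norms = row_l2(x, m, n, [](float v) { return v; });
//   auto l2_norms = row_l2(x, m, n, [](float v) { return std::sqrt(v); });
```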

@cjnolet (Member) commented Nov 7, 2022

> Is that a realistic use case for this primitive? Do we have examples where the feature space is very large and the number of rows small?

In general, we expect metric spaces to start breaking down around dimension 1024 and above; this phenomenon has been documented in a lot of literature. What happens is that the variance of the data points begins to decrease rapidly and they all end up converging into a single blob, lowering their overall discriminative capability in that space.

That's true for distances, anyway, but for normalizing a set of vectors, I've seen the dimensionality go up into the tens of thousands. As @achirkin points out, this might be used as a preprocessing step before a larger feature selection step is performed using something like lasso, for example. I do suggest that we consider moderately tall and wide datasets (thousands, not millions).

@cjnolet (Member) commented Nov 7, 2022

> Ideally, I think it would be nice to have a public-facing generic row/col-normalize function somewhere in the distance namespace, which would also take the distance type as an argument. What do you think about this, @cjnolet?

I suggest that we keep the normalization in the linalg namespace, because its use for distance computations is more of an implementation detail than the primary goal of performing the normalization. I could see an argument for putting it in the matrix namespace, but I think its representation as a standard norm followed by a matrix/vector division across rows makes it more suitable for linalg.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> I suggest that we keep the normalization in the linalg namespace

Oh, I missed that Artem suggested the distance namespace. If norm is in linalg, normalize belongs in linalg too.

@Nyrio (Contributor, Author) commented Nov 7, 2022

> As @achirkin points out, this might be used as a preprocessing step before a larger feature selection step is performed using something like lasso, for example. I do suggest that we consider moderately tall and wide datasets (thousands, not millions).

@cjnolet I see, that makes sense. On a separate branch, I have a 3-kernel approach for the coalesced reduction (thin: shuffle-based reduction with multiple rows per block; medium: the current cub-based reduction with one block per row; thick: multiple blocks per row and atomics).

I am benchmarking and writing heuristics, and I can then adapt this for normalize, which is a slightly more complicated kernel due to the additional waiting on and broadcasting of the reduction result. (A toy dispatch sketch follows below.)
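As a toy illustration of what such a heuristic could look like (the thresholds below are made-up placeholders, not the benchmarked values being worked out here):

```cpp
// Hypothetical path selection between the three kernels described above.
// The thresholds are illustrative only; the real ones come from benchmarking.
enum class ReductionPath { Thin, Medium, Thick };

template <typename IdxType>
ReductionPath choose_reduction_path(IdxType n_rows, IdxType n_cols)
{
  if (n_cols <= 32) {
    return ReductionPath::Thin;    // shuffle-based reduction, several rows per block
  }
  if (n_rows >= 128) {
    return ReductionPath::Medium;  // cub-based reduction, one block per row
  }
  return ReductionPath::Thick;     // multiple blocks per row + atomics
}
```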

@cjnolet (Member) commented Nov 7, 2022

> @achirkin Please note that the norm / rowNorm / colNorm prims do not take a DistanceType but NormType defined as follows:

I propose we eventually remove the enum and just accept an integral "order" argument directly. This is what scipy does, for example.

Another option would be to add all the most widely used norm computations (L0, L1, L2, Linf) to the enum. L1 and L2 are widely used, but as pointed out in the semirings paper I referenced earlier, L0 (essentially the number of nonzero matrix elements), Linf, and Lmax are also important for computing a multitude of general distance measures (not necessarily just metrics). They also boil down to simple reduction/accumulation functors in the end (see the sketch below).
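As a rough sketch of that last point (illustrative functors only, not raft's own), each of these norms is just a per-element transform paired with a reduction/accumulation op, with an optional final sqrt for L2:

```cpp
#include <algorithm>
#include <cmath>

// Illustrative element-wise / accumulation functor pairs for the norms
// mentioned above (not raft's actual functors):
//   L0   -> count nonzeros:       main = (x != 0), reduce = sum
//   L1   -> sum of |x|:           main = |x|,      reduce = sum
//   L2   -> sum of x^2 (+ sqrt):  main = x*x,      reduce = sum
//   Linf -> max of |x|:           main = |x|,      reduce = max
struct L0Op   { float operator()(float x) const { return x != 0.f ? 1.f : 0.f; } };
struct L1Op   { float operator()(float x) const { return std::fabs(x); } };
struct L2Op   { float operator()(float x) const { return x * x; } };
struct LinfOp { float operator()(float x) const { return std::fabs(x); } };

struct SumReduceOp { float operator()(float a, float b) const { return a + b; } };
struct MaxReduceOp { float operator()(float a, float b) const { return std::max(a, b); } };
```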

@Nyrio added the 4 - Waiting on Author (Waiting for author to respond to review) label and removed the 3 - Ready for Review label (Nov 9, 2022)
@Nyrio (Contributor, Author) commented Nov 11, 2022

@cjnolet @achirkin I have consolidated the interface with arbitrary norm types, following Corey's suggestion to have both functor-based and enum-based APIs.

I have also improved the performance for thick matrices with a cub-based kernel. Unlike what I did in #1011, I think two code paths are enough for this one rather than three. My work on coalescedReduce showed that for anything more than a few dozen rows, using multiple blocks per row provides very little advantage, as we're memory-bound and already close to the speed of light with as little as one block per SM. Please note that the cub-based kernel already provides a 25x speedup over the ann prim for shallow, thick matrices (a hedged sketch of such a cub-based path follows below).

See the perf chart in the updated description.
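For reference, a minimal sketch of a cub-based one-block-per-row normalization (an illustration under my own assumptions, not the kernel merged in this PR):

```cuda
#include <cub/block/block_reduce.cuh>

// One thread block per row: strided accumulation of squares, cub::BlockReduce,
// broadcast of the norm through shared memory, then the division.
// Rows with a zero norm are not handled, to keep the sketch short.
template <int BlockSize, typename Type, typename IdxType>
__global__ void block_row_normalize_l2(Type* out, const Type* in, IdxType n_cols)
{
  const IdxType row = blockIdx.x;  // launched with n_rows blocks

  Type acc = Type(0);
  for (IdxType c = threadIdx.x; c < n_cols; c += BlockSize) {
    Type v = in[row * n_cols + c];
    acc += v * v;
  }

  using BlockReduce = cub::BlockReduce<Type, BlockSize>;
  __shared__ typename BlockReduce::TempStorage tmp;
  __shared__ Type s_norm;

  // Only thread 0 receives the valid reduction result, so it broadcasts it.
  const Type sum = BlockReduce(tmp).Sum(acc);
  if (threadIdx.x == 0) { s_norm = sqrt(sum); }
  __syncthreads();

  for (IdxType c = threadIdx.x; c < n_cols; c += BlockSize) {
    out[row * n_cols + c] = in[row * n_cols + c] / s_norm;
  }
}
```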

@Nyrio added the 4 - Waiting on Reviewer (Waiting for reviewer to review or respond) label and removed the 4 - Waiting on Author (Waiting for author to respond to review) label (Nov 11, 2022)
@Nyrio requested review from cjnolet and achirkin (Nov 11, 2022)
@tfeher (Contributor) left a comment

Thanks @louis for the PR! I have some questions, please see below.

Review threads (resolved):
  • cpp/include/raft/util/cuda_utils.cuh
  • cpp/test/linalg/normalize.cu
  • cpp/include/raft/linalg/detail/normalize.cuh
@Nyrio requested a review from tfeher (Nov 15, 2022)
@Nyrio removed their assignment (Nov 15, 2022)
@tfeher (Contributor) left a comment

Thanks Louis for addressing the issues, the PR looks good to me.

Review thread (resolved): cpp/test/linalg/normalize.cu
@cjnolet (Member) left a comment

LGTM. And thank you for using the new mdspan API!

@cjnolet (Member) commented Nov 16, 2022

@achirkin just waiting on your input here before I merge.

@achirkin (Contributor) left a comment

Sorry for holding this up! LGTM as well; I'm very glad to see prims like this being thoroughly optimized in raft!
Also, just a couple of small comments below.

```cpp
@@ -516,6 +516,16 @@ struct Nop {
  HDI Type operator()(Type in, IdxType i = 0) { return in; }
};

template <typename Type, typename IdxType = int>
struct SqrtOp {
```
@achirkin (Contributor) commented:

Here, and in other "Functor" structs: would you consider moving the template parameters onto the operator() where possible? This way, the template parameters will be inferred automatically: less typing and fewer opportunities to introduce bugs on the user side. A sketch of the intended shape follows below.
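For example, the suggestion amounts to something like this (a sketch of the intent only, assuming raft's HDI host/device macro; a plain sqrt stands in for the actual functor body):

```cpp
#ifndef HDI
#define HDI __host__ __device__ inline  // raft-style host/device macro (assumed)
#endif

// Template parameters moved onto operator() so they are deduced at the call
// site; sketch only, not the final raft code.
struct SqrtOp {
  template <typename Type, typename IdxType = int>
  HDI Type operator()(Type in, IdxType = 0) const
  {
    return sqrt(in);
  }
};
```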


```cpp
template <typename Type>
struct Max {
  HDI Type operator()(Type a, Type b) { return myMax(a, b); }
```
@achirkin (Contributor) commented:

For some reason, I had the impression that we wanted to deprecate the myXxx-style functions in raft. Is that the case, @cjnolet? Here we could use std::max, since it is constexpr.

@cjnolet (Member) commented:

@Nyrio do you want to create an issue for these and do them as a follow-on? I think since this PR has already been approved and run through CI (and since burndown starts tomorrow), we can go ahead and merge this as-is. What do you guys think?

@Nyrio (Contributor, Author) commented Nov 17, 2022

I have opened two issues since both remarks are outside of the scope of this PR and shouldn't delay merging it.

@cjnolet (Member) commented Nov 17, 2022

@gpucibot merge

@rapids-bot (bot) merged commit e14bcbd into rapidsai:branch-22.12 on Nov 17, 2022