Faster glm ols-via-eigendecomposition algorithm #4201
Conversation
…s and split the work into two cuda streams.
Some pictures for reference. Tested on an AMD Ryzen 9 5950X and an RTX 3090. The green line is the result of sklearn patched by Intel's intelex project; the blue line is cuML. The benchmark does not include …
This PR now depends on rapidsai/raft#327 and rapidsai/rmm#870, so I can reduce the unnecessary/duplicate helper code.
There is also a weird accuracy drop when …
@achirkin, I've held off on posting a review of this PR because it's still marked as a draft. Let me know when you are ready and I can also give it a look over for you.
Thanks, @cjnolet. Before, I was waiting for my PRs to raft and rmm to get through. Now I've added the changes I wanted; if the CI checks pass, consider it ready for review.
rerun tests
The changes look great. Very minor things (and a couple comments in preparation for movement to raft).
cpp/src_prims/linalg/lstsq.cuh (Outdated)
    int algo,
    cudaStream_t stream)
void lstsqSvdQR(const raft::handle_t& handle,
                math_t* A,  // apparently, this must not be const, because cusolverDn<t>gesvd() says
Even if we immediately cast it away, it would be nice to be consistent and keep `A` as a const so that users know that it shouldn't be expected to change. Maybe also add a little note in the function comment. It would be nice to see consts used in the other `lstsq*` functions as well.
The problem is I'm not sure `A` stays unmodified. For example, `lstsqQR` is guaranteed to update it (it converts the problem to a triangular system). For `lstsqSvdJacobi`, cuSOLVER's `cusolverDn<t>gesvdj` says "On exit, the contents of A are destroyed.", and the parameter is not marked as const. The same story for `lstsqSvdQR`, except that `cusolverDn<t>gesvd` explicitly provides an option to write its outputs into A to save memory (although I disabled it). In both `cusolverDn<t>gesvd` and `cusolverDn<t>gesvdj` the parameter `A` is marked as [in/out], so we cannot guarantee it's not modified :( https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-gesvd
I could copy the const array A, but I'm not sure it's worth it.
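For illustration only, here is a tiny Python sketch of the "copy the array" option mentioned above. This is not cuML code; `destructive_solve` is a hypothetical stand-in for a routine that, like `cusolverDn<t>gesvd`, destroys the contents of its input.

```python
# Purely illustrative: a wrapper can offer a read-only contract for A by
# working on a copy, at the cost of one extra n_rows * n_cols buffer.
import numpy as np

def destructive_solve(A, b):
    """Hypothetical stand-in for a solver that overwrites A while solving."""
    A[:] = 0.0           # mimic "on exit, the contents of A are destroyed"
    return b.copy()

def lstsq_preserving(A, b):
    scratch = A.copy()   # protect the caller's matrix from the solver
    return destructive_solve(scratch, b)

A = np.eye(3)
lstsq_preserving(A, np.ones(3))
print(A)                 # unchanged: the copy absorbed the in-place damage
```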
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #4201 +/- ##
===============================================
Coverage ? 85.99%
===============================================
Files ? 231
Lines ? 19238
Branches ? 0
===============================================
Hits ? 16543
Misses ? 2695
Partials ? 0
Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
LGTM
@gpucibot merge
This PR makes some improvements to the GLM ordinary least squares (OLS) implementation:

1. Split `MLCommon::LinAlg::lstsq`, which solves OLS by computing the SVD of the feature matrix (**UΣV<sup>T</sup> = A**). Originally, this function provided two algorithms to do so (using cusolver QR decomposition and using eigenvalue decomposition).
2. Add a third algorithm for solving OLS via SVD - via cusolver Jacobi iterations.
3. Inline `raft::linalg::svdEig`. This function computes the SVD of the feature matrix **UΣV<sup>T</sup> = A** via the eigenvalue decomposition **QΛQ<sup>T</sup> = A<sup>T</sup>A**, which is faster, but requires additional manipulations to recover **Σ** and **U** (while **V = Q**). Instead, I use `raft::linalg::eigDC` directly to compute **w = (A<sup>T</sup>A)<sup>-1</sup>A<sup>T</sup>b = (QΛ<sup>-1</sup>Q<sup>T</sup>)(A<sup>T</sup>b)**. This also allows computing **A<sup>T</sup>b** concurrently (in another CUDA stream) with the rest of the operations, which often fit on my GPU at the same time when n_cols << n_rows (see the sketch below).
4. Just a small thing: reduce the number of memory allocations using `rmm::device_uvector`. Supposedly, this reduces the benchmarking noise when using the default memory allocator.
5. Force the SVD-Jacobi algorithm when n_cols > n_rows, because none of the other algorithms support this case, either in theory or due to cusolver limitations.
6. Update the python interface to allow choosing among all available algorithms.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4201
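To make the linear algebra in item 3 concrete, here is a minimal NumPy sketch of the eigendecomposition-based solve. It only illustrates the math, not the cuML/CUDA implementation; the synthetic `A`, `b`, problem sizes, and comparison are arbitrary choices for the demo.

```python
# OLS via the symmetric eigendecomposition of A^T A (plain NumPy illustration).
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 10_000, 50                    # the targeted case: n_cols << n_rows
A = rng.standard_normal((n_rows, n_cols))
b = A @ rng.standard_normal(n_cols) + 0.01 * rng.standard_normal(n_rows)

# In the CUDA implementation, A^T A and A^T b can be computed in separate streams;
# here they are simply two independent matrix products.
AtA = A.T @ A
Atb = A.T @ b

# A^T A = Q diag(lam) Q^T, hence w = (A^T A)^{-1} A^T b = Q diag(1/lam) Q^T (A^T b).
lam, Q = np.linalg.eigh(AtA)
w_eig = Q @ ((Q.T @ Atb) / lam)

# Compare against an SVD-based reference solution.
w_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_eig, w_ref))               # True for a well-conditioned A
```

At the Python level (item 6), the solver choice is exposed through the `algorithm` argument of cuML's `LinearRegression`; see the cuML documentation for the exact option names this PR adds.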