Faster glm ols-via-eigendecomposition algorithm #4201
Conversation
…s and split the work into two cuda streams.
Some pictures for reference. Tested on an AMD Ryzen 9 5950X and an RTX 3090. The green line is the result of sklearn patched by Intel's intelex project; the blue line is cuML. The benchmark does not include …
This PR now depends on rapidsai/raft#327 and rapidsai/rmm#870, so I can reduce the unnecessary/duplicate helper code.
There is also a weird accuracy drop when …
@achirkin, I've held off on posting a review of this PR because it's still marked as a draft. Let me know when you are ready and I can also give it a look over for you.
Thanks, @cjnolet. Before, I was waiting for my PRs to raft and rmm to get through. Now I've added the changes I wanted; if the CI checks pass, consider it ready for review.
rerun tests
The changes look great. Very minor things (and a couple comments in preparation for movement to raft).
cpp/src_prims/linalg/lstsq.cuh (Outdated)
    int algo,
    cudaStream_t stream)
void lstsqSvdQR(const raft::handle_t& handle,
                math_t* A,  // apparently, this must not be const, because cusolverDn<t>gesvd() says
Even if we immediately cast it away, it would be nice to be consistent and keep `A` as a const so that users know that it shouldn't be expected to change. Maybe also add a little note in the function comment. It would be nice to see consts used in the other `lstsq*` functions as well.
The problem is I'm not sure `A` stays unmodified. For example, `lstsqQR` is guaranteed to update it (it converts the problem to a triangular system). For `lstsqSvdJacobi`, cuSOLVER's `cusolverDn<t>gesvdj` says "On exit, the contents of A are destroyed.", and the parameter is not marked as const. The same story for `lstsqSvdQR`, except that `cusolverDn<t>gesvd` explicitly provides an option to write its outputs into A to save memory (although I disabled it). In both `cusolverDn<t>gesvd` and `cusolverDn<t>gesvdj` the parameter `A` is marked as [in/out], so we cannot guarantee it's not modified :( https://docs.nvidia.com/cuda/cusolver/index.html#cuSolverDN-lt-t-gt-gesvd
I could copy the const array A, but I'm not sure it's worth it.
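For illustration only, here is a tiny Python sketch of the "copy the array" option mentioned above. This is not cuML code; `destructive_solve` is a hypothetical stand-in for a routine that, like `cusolverDn<t>gesvd`, destroys the contents of its input.

```python
# Purely illustrative: a wrapper can offer a read-only contract for A by
# working on a copy, at the cost of one extra n_rows * n_cols buffer.
import numpy as np

def destructive_solve(A, b):
    """Hypothetical stand-in for a solver that overwrites A while solving."""
    A[:] = 0.0           # mimic "on exit, the contents of A are destroyed"
    return b.copy()

def lstsq_preserving(A, b):
    scratch = A.copy()   # protect the caller's matrix from the solver
    return destructive_solve(scratch, b)

A = np.eye(3)
lstsq_preserving(A, np.ones(3))
print(A)                 # unchanged: the copy absorbed the in-place damage
```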
rerun tests
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #4201 +/- ##
===============================================
Coverage ? 85.99%
===============================================
Files ? 231
Lines ? 19238
Branches ? 0
===============================================
Hits ? 16543
Misses ? 2695
Partials ? 0
Flags with carried forward coverage won't be shown. Continue to review the full report at Codecov.
LGTM
@gpucibot merge
This PR makes some improvements to the GLM ordinary least squares (OLS) implementation:

1. Split `MLCommon::LinAlg::lstsq`, which solves OLS by computing the SVD of the feature matrix (**UΣV<sup>T</sup> = A**). Originally, this function provided two algorithms to do so (using cusolver QR decomposition and using eigenvalue decomposition).
2. Add a third algorithm for solving OLS via SVD - via cusolver Jacobi iterations.
3. Inline `raft::linalg::svdEig`. This function computes the SVD of the feature matrix **UΣV<sup>T</sup> = A** via the eigenvalue decomposition **QΛQ<sup>T</sup> = A<sup>T</sup>A**, which is faster, but requires additional manipulations to recover **Σ** and **U** (while **V = Q**). Instead, I use `raft::linalg::eigDC` directly to compute **w = (A<sup>T</sup>A)<sup>-1</sup>A<sup>T</sup>b = (QΛ<sup>-1</sup>Q<sup>T</sup>)(A<sup>T</sup>b)**. This also allows computing **A<sup>T</sup>b** concurrently (in another CUDA stream) with the rest of the operations, which often fit on my GPU at the same time when n_cols << n_rows (see the sketch below).
4. Just a small thing: reduce the number of memory allocations using `rmm::device_uvector`. Supposedly, this reduces the benchmarking noise when using the default memory allocator.
5. Force the SVD-Jacobi algorithm when n_cols > n_rows, because none of the other algorithms support this case, either in theory or due to cusolver limitations.
6. Update the python interface to allow choosing among all available algorithms.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4201
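To make the linear algebra in item 3 concrete, here is a minimal NumPy sketch of the eigendecomposition-based solve. It only illustrates the math, not the cuML/CUDA implementation; the synthetic `A`, `b`, problem sizes, and comparison are arbitrary choices for the demo.

```python
# OLS via the symmetric eigendecomposition of A^T A (plain NumPy illustration).
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 10_000, 50                    # the targeted case: n_cols << n_rows
A = rng.standard_normal((n_rows, n_cols))
b = A @ rng.standard_normal(n_cols) + 0.01 * rng.standard_normal(n_rows)

# In the CUDA implementation, A^T A and A^T b can be computed in separate streams;
# here they are simply two independent matrix products.
AtA = A.T @ A
Atb = A.T @ b

# A^T A = Q diag(lam) Q^T, hence w = (A^T A)^{-1} A^T b = Q diag(1/lam) Q^T (A^T b).
lam, Q = np.linalg.eigh(AtA)
w_eig = Q @ ((Q.T @ Atb) / lam)

# Compare against an SVD-based reference solution.
w_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(w_eig, w_ref))               # True for a well-conditioned A
```

At the Python level (item 6), the solver choice is exposed through the `algorithm` argument of cuML's `LinearRegression`; see the cuML documentation for the exact option names this PR adds.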