Add multi-node-multi-gpu Logistic Regression in C++ #5477
Conversation
Rebased with the latest branch-23.08.
Thanks for this new feature @lijinf2! This is going to be useful not just in Spark but in Dask and other places as well!
I think overall the C++ code is structured pretty well. There are some general details around the communicator and handle, documentation, and input validation that should be done.
I think the Cython can be simplified further, and the Dask piece should be pretty straightforward to add. The Dask Linear Regression wrapper should be helpful there. Also, as discussed offline, we will need tests, and the Dask Linear Regression pytests should be helpful there. More specifically, you should be able to use the Dask make_classification data generator for that.
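As a rough illustration of the suggested test shape, here is a pure-Python sketch of generating a classification dataset and splitting it into per-worker partitions. The generator and partitioner are simplified stand-ins, not the actual Dask make_classification generator or the cuML test utilities:

```python
import random

def make_classification(n_samples, n_features, n_classes=2, seed=0):
    """Toy stand-in for a make_classification data generator."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.randrange(n_classes) for _ in range(n_samples)]
    return X, y

def partition(X, y, n_parts):
    """Split rows into n_parts chunks, one per worker/GPU."""
    step = (len(X) + n_parts - 1) // n_parts
    return [(X[i:i + step], y[i:i + step]) for i in range(0, len(X), step)]

# Generate a small dataset and split it across two hypothetical workers.
X, y = make_classification(n_samples=100, n_features=20)
parts = partition(X, y, n_parts=2)
assert sum(len(px) for px, _ in parts) == 100
```

The real test would build a Dask array per partition and compare the fitted `coef_` against the single-GPU estimator.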
Thanks again for this contribution!
# the cdef was copied from cuml.linear_model.qn
cdef extern from "cuml/linear_model/glm.hpp" namespace "ML::GLM" nogil:

    cdef enum qn_loss_type "ML::GLM::qn_loss_type":
Since this isn't MG specific, we should centralize it so the SG and MG can use the same one.
This will affect the single-GPU logistic_regression.pyx. Is it OK to do it in the next PR to avoid this one getting too long?
I agree with the logic of adding this change in a follow-up, but could you open a GH issue capturing this (and any other small follow-up items)?
Also, can you add a little TODO here and post the number of the GitHub issue? This helps us map it back to the code so it doesn't get lost/forgotten.
        QN_LOSS_ABS "ML::GLM::QN_LOSS_ABS"
        QN_LOSS_UNKNOWN "ML::GLM::QN_LOSS_UNKNOWN"

    cdef struct qn_params:
Same here: this isn't MG specific, so we should use the SG version.
                  int n_ranks) except +


class LogisticRegressionMG(LogisticRegression):
My suggestion is to use the design from LinearRegressionMG as much as possible to model the relationship between LogisticRegression and LogisticRegressionMG. There are some more classes and mixins that can reduce boilerplate further.
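The suggested relationship can be sketched in plain Python: the MG class inherits the single-GPU estimator and overrides only the fit path, adding communicator arguments. All class and argument names here are simplified stand-ins for the actual cuML classes:

```python
class LogisticRegression:
    """Toy stand-in for the single-GPU estimator."""
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.coef_ = None

    def fit(self, X, y):
        # A real implementation would run the single-GPU QN solver here.
        self.coef_ = [0.0] * len(X[0])
        return self

class LogisticRegressionMG(LogisticRegression):
    """Toy MG subclass: same hyperparameters, extra communicator arguments."""
    def fit(self, X, y, rank=0, n_ranks=1):
        # Each rank sees only its local partition of the data; the solver
        # would synchronize gradients across ranks via the handle's comms.
        self.coef_ = [0.0] * len(X[0])
        return self
```

The point of the pattern is that hyperparameters, validation, and attributes live once in the SG class, and the MG subclass only changes how fit distributes the work.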
I find it a bit challenging to inherit the mixins, for two reasons:
(1) min_lbfgs accepts float*, while the mixins' fit accepts vector<float*>.
(2) min_lbfgs supports multiple classes, so coef can be a matrix, while the mixins assume coef_ to be a vector of length D + params.fit_intercept.
Are there existing APIs to address (1) and (2) that would simplify the implementation?
(3) Another issue popped up that relates to mixins._fit and (1). It seems mixins.fit converts the input vectors and labels into a special type of points (i.e. X_arg, y_arg) using opg.build_data_t(X_arys). Is there a way to convert X_arg and y_arg to float* (min_lbfgs requires float*)?
Combined the design from 'LinearRegressionMG' and single-GPU 'LogisticRegression'. Specifically:
(1) Made the self._coef initialization support both the LinearRegressionMG style (i.e. a 1-D array) and the LogisticRegression style (i.e. a 2-D array).
(2) Converted the type of the input vectors and input labels from the LinearRegressionMG style (a vector of Matrix::Data) to the LogisticRegression style (float*).
(3) Set n_classes to 2 in this PR; multi-class support will come in follow-up PRs.
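Items (1) and (2) can be illustrated with a small pure-Python sketch: accept either coefficient layout and flatten per-partition buffers into one contiguous array, the Python analogue of turning a vector<float*> into a single float*. The helper names are hypothetical:

```python
def coef_as_1d(coef, n_classes, n_cols):
    """Accept a 1-D coef (LinearRegressionMG style) or a 2-D
    n_classes x n_cols coef (LogisticRegression style); return a flat list."""
    if coef and isinstance(coef[0], list):
        assert len(coef) == n_classes and len(coef[0]) == n_cols
        return [v for row in coef for v in row]   # row-major flatten
    return list(coef)

def concat_partitions(parts):
    """Python analogue of copying a vector of per-partition buffers
    (vector<float*>) into one contiguous buffer (float*)."""
    out = []
    for p in parts:
        out.extend(p)
    return out

assert coef_as_1d([[1.0, 2.0], [3.0, 4.0]], 2, 2) == [1.0, 2.0, 3.0, 4.0]
assert concat_partitions([[1.0], [2.0, 3.0]]) == [1.0, 2.0, 3.0]
```

In the actual C++ path the concatenation would be a device-memory copy rather than a list append, but the layout logic is the same.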
        self.solver_model._coef_ = CumlArray.zeros(
            coef_size, dtype=self.dtype, order='C')

    def fit(self, X, y, rank, n_ranks, n_samples, n_classes, convert_dtype=False) -> "LogisticRegressionMG":
As mentioned above, I think this could be simplified further to look more like this.
Revised the code, and _fit now looks exactly the same. fit looks the same except for one additional argument (i.e. order='F'). This is because LinearRegressionMG uses 'F' order but cuml.LogisticRegression uses 'C' order.
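The order difference comes down to how a 2-D array is flattened in memory. A tiny sketch of row-major ('C') versus column-major ('F') flattening of the same matrix:

```python
def flatten(matrix, order):
    """Flatten a 2-D list in row-major ('C') or column-major ('F') order."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if order == 'C':
        return [matrix[i][j] for i in range(n_rows) for j in range(n_cols)]
    return [matrix[i][j] for j in range(n_cols) for i in range(n_rows)]

m = [[1, 2, 3],
     [4, 5, 6]]
assert flatten(m, 'C') == [1, 2, 3, 4, 5, 6]   # rows are contiguous
assert flatten(m, 'F') == [1, 4, 2, 5, 3, 6]   # columns are contiguous
```

Passing the wrong order flag would silently transpose the data the solver sees, which is why the fit wrapper has to state it explicitly.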
Thanks so much for reviewing the PR, @cjnolet! Your comments and links are very helpful, and I will start working on the revision following your suggestions. One thing I feel uncertain about relates to the num_classes calculation. Is there an existing C++ API to calculate num_classes from the labels across GPUs? That would help move the num_classes variable into the C++ implementation and remove one argument from the Cython fit wrapper.
I don't know if we have a way to do this on multiple GPUs yet. We could use an …
/ok to test
Figured out it is possible with C++. Maybe in this PR we can assume num_classes to be 2 and remove num_classes from the Cython fit argument list. In a future PR we can implement the C++ num_classes calculation and a wrapper to use it in logistic_regression_mg.pyx.
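The shape of the distributed num_classes computation discussed here can be sketched as a union of per-rank label sets. This only illustrates the collective pattern (local uniques, then a set union playing the role of an allgather), not an actual cuML or RAFT API:

```python
def local_classes(labels):
    """Unique labels seen by one rank's local partition."""
    return set(labels)

def global_num_classes(per_rank_labels):
    """Union the per-rank label sets; in a real MNMG implementation this
    union would be an allgather/allreduce over the communicator."""
    seen = set()
    for labels in per_rank_labels:
        seen |= local_classes(labels)
    return len(seen)

# rank 0 only sees classes {0, 1}; rank 1 only sees {1, 2}
assert global_num_classes([[0, 1, 1], [1, 2]]) == 3
```

No single rank can compute the count alone, which is why the value either has to be passed into fit or derived collectively in C++.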
@cjnolet Thanks so much for the comments! They are very helpful. After addressing them, I find the implementation of the Dask class (cuml.dask.linear_model.LogisticRegression) and the pytests become much simpler. I would like to have your review again when you have time. The revised PR mainly includes two new files (dask/linear_model/LogisticRegression and tests/dask/test_dask_logistic_regression.py). The added pytests pass on my workstation with 2 GPUs. I will still need to run the full cuml pytest suite.
/ok to test
/ok to test
/ok to test
Changes look good from my side.
Implementation looks great and it's coming together. My remaining comments are very minor at this point. I think we'll be able to get this into 23.08.
Thanks so much for reviewing again! I have revised the PR and pushed the latest changes.
/ok to test
LGTM. Thanks!
Sorry @lijinf2, that approval was premature and meant for a different PR. I think this will be getting approved shortly, though.
/ok to test
The checks seem to fail on unrelated test cases (e.g. the error "FAILED test_hdbscan.py::test_all_points_membership_vectors_circles[1000-knn-leaf-0-True-0.5-500-5-1000] - TypeError: 'numpy.float64' object cannot be interpreted as an integer"). Any ideas? I can get test_hdbscan.py to pass locally on my workstation. Perhaps rerun "/ok to test" to reproduce the error.
We are currently experiencing some CI issues due to changes in our dependencies (see #5514). Once those are resolved, we can rerun tests here.
/merge
This is a follow-up PR for [PR 5477](#5477). This PR adds a predict API to MNMG logistic regression and tests to verify its correctness. Please review the code change from commit 171aef2 with the message "add predict operator". The implementation is trivial after the dependency PR 5477 is merged. Authors: - Jinfeng Li (https://github.com/lijinf2) Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: #5516
This PR enables multi-node-multi-GPU Logistic Regression, mostly reusing the existing code (i.e. GLMWithData and min_lbfgs) of single-GPU Logistic Regression. It makes no changes to any existing code.
Added Pytest code for a Spark cluster; the tests run successfully with 2 GPUs on a random dataset. The coef_ and intercept_ are the same as those from single-GPU cuml.LogisticRegression.fit. The Pytest code can be found here: https://github.com/lijinf2/spark-rapids-ml/blob/lr/python/tests/test_logistic_regression.py