Support standardization for sparse vectors in logistic regression MG #5806
Found some very minor things in a first review, but the code already looks good!
cpp/src/glm/qn_mg.cu
Outdated
Standardizer<T>* stder = NULL;

if (standardization)
  stder = new Standardizer(handle, X_simple, n_samples, mean_std_buff, vec_size);
stder reads too much like std err; maybe we could rename it?
Sure, renamed to std_obj.
assert array_equal(lron_coef_origin, sg.coef_, tolerance)
assert array_equal(lron_intercept_origin, sg.intercept_, tolerance)
Same as above, using unit_tol and total_tol will lead to less flakiness in the tests
unit_tol=tolerance,
done.
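To illustrate why the unit_tol / total_tol suggestion can reduce flakiness, here is a minimal sketch of a two-level tolerance check. It is not cuML's array_equal: the helper name arrays_close is hypothetical, and the semantics are an assumption (unit_tol is taken as a per-element tolerance, total_tol as the fraction of elements allowed to exceed it).

```python
def arrays_close(a, b, unit_tol=1e-4, total_tol=1e-4):
    """Hypothetical two-level comparison; NOT cuML's array_equal.

    Assumed semantics: two elements match if they differ by at most
    unit_tol, and two arrays match if the share of mismatching
    elements is at most total_tol.
    """
    if len(a) != len(b):
        return False
    # Count the elements whose difference exceeds the per-element tolerance.
    mismatches = sum(1 for x, y in zip(a, b) if abs(x - y) > unit_tol)
    # Accept the arrays if only a small fraction of elements mismatch.
    return mismatches <= total_tol * len(a)


coef_a = [0.5] * 1000
coef_b = [0.5] * 999 + [0.6]  # a single outlier coefficient

print(arrays_close(coef_a, coef_b, unit_tol=1e-3, total_tol=0.0))
print(arrays_close(coef_a, coef_b, unit_tol=1e-3, total_tol=0.01))
```

Under these assumptions, a strict check fails on one outlier, while a check with a small total_tol tolerates it. That is the kind of behavior that keeps a test from failing on isolated numerical noise.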
Force-pushed from a7a6793 to 527cbc1
@dantegd Thank you for the review! I have pushed the revised code. CI seems to fail with an "unrecognized arguments: --force" error associated with mamba. Is that expected?
I tested gemmb on a large dataset by multiplying a ones vector (1 x num_rows) with a sparse matrix (num_rows x 2) in which every row is [1. 0.5]. With num_rows at 20 million, gemmb returns [16,777,200 8,388,610] instead of the expected [20,000,000 10,000,000].
This PR therefore uses a chunk-based calculation: it splits the sparse matrix by rows, then aggregates the results over the chunks. This reduces the precision loss and returns the expected results; I have tested it with 20 million to 130 million rows. Let me know if the revised code looks OK or if you see any risk.
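The reported values are close to 2^24 = 16,777,216 and 2^23 = 8,388,608. That is what a single-precision running sum produces once adding 1 or 0.5 no longer changes the accumulator. The sketch below reproduces the effect and the chunked workaround at a smaller scale. It makes three assumptions: it uses half precision (via Python's struct format "e"), where the same stall occurs at 2^11 and 2^10, so that a pure-Python loop stays fast; it accumulates rows sequentially, which may not match how gemmb orders its additions; and the function names are hypothetical and not taken from the PR.

```python
import struct


def to_f16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def naive_column_sums(rows):
    """Accumulate every row into one running half-precision sum per column."""
    sums = [0.0] * len(rows[0])
    for row in rows:
        sums = [to_f16(s + v) for s, v in zip(sums, row)]
    return sums


def chunked_column_sums(rows, chunk_size):
    """Sum each chunk of rows separately, then aggregate the partial sums."""
    totals = [0.0] * len(rows[0])
    for start in range(0, len(rows), chunk_size):
        partial = naive_column_sums(rows[start:start + chunk_size])
        totals = [to_f16(t + p) for t, p in zip(totals, partial)]
    return totals


# Scaled-down analog of the 20-million-row test: every row is [1.0, 0.5].
rows = [[1.0, 0.5]] * 3000
print(naive_column_sums(rows))         # [2048.0, 1024.0]
print(chunked_column_sums(rows, 500))  # [3000.0, 1500.0]
```

The chunked version works here because each partial sum stays small enough to be exact, and the few partial sums are large enough that adding them is not rounded away. The right chunk size for the real float32 case depends on the data and is not specified in this thread.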
Force-pushed from 527cbc1 to c20610e
…are term, not tested yet
…ing one to a large number
Force-pushed from c20610e to a44edb9
…crease the stability of the tests
No description provided.