Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix worker streams in OLS-eig executing in an unsafe order #4539

Merged
merged 6 commits into from
Feb 10, 2022

Conversation

achirkin
Copy link
Contributor

The latest version of the "eig" OLS solver has a bug producing garbage results under some conditions. When at least one worker stream is used to run some operations concurrently, for sufficiently large workset sizes, the memory allocation in the main stream may finish later than the worker stream starts to use it.

This PR adds more ordering between the main and the worker streams, fixing this and some other theoretically possible edge cases.

@achirkin achirkin requested a review from a team as a code owner January 31, 2022 12:42
@achirkin achirkin added 3 - Ready for Review Ready for review by team bug Something isn't working non-breaking Non-breaking change and removed CUDA/C++ labels Jan 31, 2022
cjnolet
cjnolet previously approved these changes Jan 31, 2022
@@ -228,11 +252,15 @@ class OlsTest : public ::testing::TestWithParam<OlsInputs<T>> {
T intercept, intercept2, intercept3;
};

const std::vector<OlsInputs<float>> inputsf2 = {
{0.001f, 4, 2, 2, 0}, {0.001f, 4, 2, 2, 1}, {0.001f, 4, 2, 2, 2}};
const std::vector<OlsInputs<float>> inputsf2 = {{hconf::NON_BLOCKING_ONE, 0.001f, 4, 2, 2, 0},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the changes in this PR, are these assertions able to reliably reproduce the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this bug seems to be very elusive. I managed to reproduce it only under some specific conditions in python, but then it disappeared again after I did some further changes to optimize preProcessData (perhaps, due to changing the pattern of calls using the main stream / rmm resources).
Yet I hope the changes in these tests will help to find other streams-related bugs if there are any more.

@cjnolet cjnolet dismissed their stale review January 31, 2022 19:01

I meant to comment, not approve.

Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @achirkin for fixing this problem! It looks good, I just have a few smaller comment.

cpp/src_prims/linalg/lstsq.cuh Outdated Show resolved Hide resolved
cpp/src_prims/linalg/lstsq.cuh Outdated Show resolved Hide resolved
@dantegd dantegd added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Feb 3, 2022
@achirkin achirkin added 3 - Ready for Review Ready for review by team and removed 4 - Waiting on Author Waiting for author to respond to review labels Feb 3, 2022
@achirkin achirkin requested a review from tfeher February 3, 2022 09:01
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Artem for the update, the PR looks good to me.

@tfeher
Copy link
Contributor

tfeher commented Feb 3, 2022

If I understand correctly, this bug affects LinearRegression models with the default solver algorithm (eig). A workarround in existing releases would be to pass a Handle without stream pool:

LinearRegression(handle=Handle(), ...)

or use a different algorithm:

LinearRegression(algorithm='qr', ...)

@achirkin would the first option (passing handle) be the preferred workaround?

@achirkin
Copy link
Contributor Author

achirkin commented Feb 3, 2022

@achirkin would the first option (passing handle) be the preferred workaround?

Yes, I think it would make less impact on performance. Thanks for the suggestion!

Also I should note that I could only reproduce the problem when both fit_intercept and normalize are True. By default, normalize is disabled in this model. Though this does not guarantee the bug cannot pop up without normalize if the stars are properly aligned :)

@codecov-commenter
Copy link

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.04@9921c61). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-22.04    #4539   +/-   ##
===============================================
  Coverage                ?   85.73%           
===============================================
  Files                   ?      239           
  Lines                   ?    19585           
  Branches                ?        0           
===============================================
  Hits                    ?    16791           
  Misses                  ?     2794           
  Partials                ?        0           
Flag Coverage Δ
dask 46.18% <0.00%> (?)
non-dask 78.73% <0.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9921c61...5ee810b. Read the comment docs.

@cjnolet
Copy link
Member

cjnolet commented Feb 10, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit df6d7be into rapidsai:branch-22.04 Feb 10, 2022
vimarsh6739 pushed a commit to vimarsh6739/cuml that referenced this pull request Oct 9, 2023
…4539)

The latest version of the "eig" OLS solver has a bug producing garbage results under some conditions. When at least one worker stream is used to run some operations concurrently, for sufficiently large workset sizes, the memory allocation in the main stream may finish later than the worker stream starts to use it.

This PR adds more ordering between the main and the worker streams, fixing this and some other theoretically possible edge cases.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)

URL: rapidsai#4539
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working CUDA/C++ non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants