Do not reset CUDA context after UCX tests #8201
Conversation
Resetting CUDA contexts during a running process may have unintended consequences for third-party libraries -- e.g., CuPy -- that store state based on the context. Therefore, prevent destroying the CUDA context for now.

Additionally, fix a cuDF failure due to a FutureWarning.

Closes #8194

pre-commit run --all-files
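Purely as an illustration of the failure mode described above -- this is not the fixture code the PR touches -- here is a minimal sketch, assuming CuPy and a working GPU are available. cudaDeviceReset (exposed via cupy.cuda.runtime.deviceReset) only stands in for any component that resets the context mid-process:

# Illustrative sketch only -- not the test code changed in this PR.
import cupy

x = cupy.arange(10)              # CuPy allocates device memory and caches
                                 # handles tied to the current CUDA context.

cupy.cuda.runtime.deviceReset()  # stand-in for "something resets the CUDA
                                 # context mid-process"; cudaDeviceReset
                                 # destroys the device's primary context.

print(x.sum())                   # likely fails with a CUDA error: CuPy still
                                 # holds pointers from the destroyed context
                                 # and is unaware it was reset.

Keeping the context alive for the lifetime of the process sidesteps this class of problem, at the cost of not tearing the context down between tests.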
rerun tests
It seems gpuCI is failing to resolve github.com; I've raised an internal issue to report that.
Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

21 files ±0  21 suites ±0  10h 20m 37s ⏱️ -9m 0s

For more details on these failures, see this check.

Results for commit 119dc14. ± Comparison against base commit 2858930.

♻️ This comment has been updated with latest results.
rerun tests
This is now resolved. |
rerun tests

rerun tests
rerun tests
Thanks @pentschev! So it looks like gpuCI completed successfully three times in a row, is that right?
@@ -40,6 +40,8 @@ gpuci_logger "Activate conda env"
. /opt/conda/etc/profile.d/conda.sh
conda activate dask

mamba install -y 'aws-sdk-cpp<1.11'
Is this due to some upstream issue? The actual change here looks totally fine, but it might be worth a comment with extra context
Indeed, this is due to aws/aws-sdk-cpp#2681, but the change here was just to test/confirm this. For now we will likely resolve the issue via rapidsai/cudf#14173 and revert this change here.
Sorry for not commenting here earlier, I was trying to get this in before bringing people in to review the changes here. 🙂
No worries. I just noticed that rapidsai/cudf#14173 is merged -- does that mean we no longer need this here?
It does now; we first had to fix some issues (mark old cuDF packages as broken and build new gpuCI docker images). I reverted the aws-sdk-cpp changes now, so we just need to wait and see if everything passes, and then we should be good to merge this. 🙂
Yes, that's right. I was rerunning multiple times to confirm. 😄
Thanks for fixing @pentschev! I left one non-blocking comment about possibly removing the aws-sdk-cpp pin, but it's not critical.
This reverts commit 0f1c4da.
rerun tests
rerun tests
Alright, gpuCI tests have now passed 3 times in a row. I think this should be good to go @jrbourbeau @charlesbluca @quasiben @wence- . Given we're not changing anything that runs in the non-gpuCI tests, I don't think any of their failures are related to this PR. In case anybody sees a correlation I don't, the failing tests are summarized below:
Thanks @jrbourbeau for reporting/merging and @charlesbluca for the help in debugging and creating new gpuCI images for this. 🙂