
Do not reset CUDA context after UCX tests #8201

Merged: 4 commits merged into dask:main from ucx-test-no-cuda-context-reset on Sep 25, 2023

Conversation

@pentschev (Member) commented on Sep 21, 2023

Resetting CUDA contexts during a running process may have unintended consequences for third-party libraries (e.g., CuPy) that store state tied to the context. Therefore, avoid destroying the CUDA context for now.

Additionally, fix a cuDF test failure caused by a FutureWarning.

Closes #8194

  • Tests added / passed
  • Passes pre-commit run --all-files
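
A minimal sketch, not part of this PR's diff, of the failure mode described above. It assumes CuPy is installed and a GPU is available, and uses cudaDeviceReset as a stand-in for whatever tears down the context after the UCX tests:

import cupy as cp

# CuPy allocates this array from a memory pool tied to the current
# CUDA (primary) context.
x = cp.arange(10)

# Simulate a context reset happening elsewhere in the process;
# cudaDeviceReset destroys the primary context on the device.
cp.cuda.runtime.deviceReset()

# CuPy still holds pointers, handles, and pool state from the old
# context, so further work on `x` typically fails with a CUDA error;
# in the worst case the process can crash outright.
try:
    print((x + 1).sum())
except Exception as exc:
    print(f"CuPy call failed after the context was reset: {exc}")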

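On the FutureWarning fix mentioned in the description: the sketch below is illustrative only, not the actual change in this PR. It assumes the common setup where the test suite escalates warnings to errors, so a new FutureWarning from a cuDF (or pandas) call turns a previously passing test into a failure; the function names are hypothetical:

import warnings


def call_deprecated_api():
    # Hypothetical stand-in for a cuDF/pandas call that started emitting
    # a FutureWarning after an upstream release.
    warnings.warn("this behaviour will change", FutureWarning)
    return 42


def test_deprecated_api_still_passes():
    # Scope the suppression to the single deprecated call instead of
    # silencing FutureWarning for the whole session; migrating to the
    # replacement API is the longer-term fix.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", FutureWarning)
        assert call_deprecated_api() == 42
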
@pentschev requested a review from fjetter as a code owner on September 21, 2023, 11:39
@pentschev (Member, Author):

rerun tests

@pentschev (Member, Author):

It seems gpuCI is failing to resolve github.com; I've raised an internal issue to report that.

@github-actions bot (Contributor) commented on Sep 21, 2023:

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    21 files ±0      21 suites ±0      10h 20m 37s ⏱️ -9m 0s
 3 814 tests +1      3 702 ✔️ -1       107 💤 ±0      5 ❌ +2
36 865 runs +22     35 061 ✔️ +26    1 799 💤 -6      5 ❌ +2

For more details on these failures, see this check.

Results for commit 119dc14. ± Comparison against base commit 2858930.

♻️ This comment has been updated with latest results.

@pentschev (Member, Author):

rerun tests

@pentschev (Member, Author):

> It seems gpuCI is failing to resolve github.com; I've raised an internal issue to report that.

This is now resolved.

@pentschev (Member, Author):

rerun tests

@pentschev (Member, Author):

rerun tests

1 similar comment
@pentschev (Member, Author):

rerun tests

@jrbourbeau (Member) left a comment:

Thanks @pentschev! So looks like gpuCI completed successfully three times in a row, is that right?

@@ -40,6 +40,8 @@ gpuci_logger "Activate conda env"
. /opt/conda/etc/profile.d/conda.sh
conda activate dask

mamba install -y 'aws-sdk-cpp<1.11'
Reviewer (Member):

Is this due to some upstream issue? The actual change here looks totally fine, but it might be worth a comment with extra context.

@pentschev (Member, Author):

Indeed, this is due to aws/aws-sdk-cpp#2681, but the change here was just to test/confirm it. For now, we will likely resolve the issue via rapidsai/cudf#14173 and then revert this change.

Sorry for not commenting here earlier; I was trying to get this in before bringing people in to review the changes. 🙂

Reviewer (Member):

No worries. I just noticed that rapidsai/cudf#14173 is merged -- does that mean we no longer need this here?

@pentschev (Member, Author):

It does now; we first had to fix some issues (mark old cuDF packages as broken and build new gpuCI Docker images). I've reverted the aws-sdk-cpp changes now, so we just need to wait and see if everything passes, and then we should be good to merge this. 🙂

@pentschev (Member, Author):

> Thanks @pentschev! So looks like gpuCI completed successfully three times in a row, is that right?

Yes, that's right. I was rerunning multiple times to confirm. 😄

@jrbourbeau (Member) left a comment:

Thanks for fixing @pentschev! I left one non-blocking comment about possibly removing the aws-sdk-cpp pin, but it's not critical


@pentschev (Member, Author):

rerun tests

1 similar comment
@pentschev (Member, Author):

rerun tests

@pentschev (Member, Author):

Alright, gpuCI tests have now passed 3 times in a row. I think this should be good to go @jrbourbeau @charlesbluca @quasiben @wence-.

Given that we're not changing anything that runs in the non-gpuCI tests, I don't think any of the remaining failures are related to this PR. In case anybody sees a correlation I don't, the failing tests are summarized below:

FAILED tests/test_explicit_comms.py::test_dataframe_shuffle[ucx-cudf-3] - AssertionError: assert not 1
 +  where 1 = <SpawnProcess name='SpawnProcess-15' pid=8047 parent=808 stopped exitcode=1>.exitcode
FAILED distributed/shuffle/tests/test_shuffle.py::test_closed_worker_during_unpack - AssertionError: assert 'shuffle' not in 'shuffle-711...5ab91499-213'
  'shuffle' is contained here:
    shuffle-711a259b679cd132c6eb25b25ab91499-213
  ? +++++++
FAILED distributed/shuffle/tests/test_shuffle.py::test_crashed_worker_after_shuffle - UserWarning: Timed out waiting for sampling thread.
FAILED distributed/tests/test_variable.py::test_race - TimeoutError
FAILED distributed/tests/test_client.py::test_forward_logging - AssertionError: {<Nanny: None, threads: 2>: <Status.closed: 'closed'>, <Nanny: tcp://127.0.0.1:57566, threads: 1>: <Status.closing: 'closing'>}
assert False

@jrbourbeau merged commit e1387a0 into dask:main on Sep 25, 2023 (21 of 26 checks passed).
@pentschev (Member, Author):

Thanks @jrbourbeau for reporting/merging and @charlesbluca for the help in debugging and creating new gpuCI images for this. 🙂

@pentschev deleted the ucx-test-no-cuda-context-reset branch on September 26, 2023, 13:08.
Successfully merging this pull request may close the following issue: gpu CI failing pretty consistently with segfault (#8194)