Support CUDA 12.2 #161

Merged
merged 15 commits into from
Feb 10, 2024

jameslamb
Member

Description

  • switches to CUDA 12.2.2 for building conda packages and wheels
  • adds new tests running against CUDA 12.2.2

Notes for Reviewers

This is part of ongoing work to build and test packages against CUDA 12.2.2 across all of RAPIDS.

For more details see:

A second round of PRs is planned to point these references back at a proper branch-24.{nn} release branch of shared-workflows once rapidsai/shared-workflows#166 is merged.

(created with rapids-reviser)

@jameslamb jameslamb changed the title add CUDA 12.2 support for conda packages and wheels WIP: use CUDA 12.2 for building and testing wheels Jan 11, 2024
@jameslamb jameslamb changed the title WIP: use CUDA 12.2 for building and testing wheels WIP: add CUDA 12.2 support for conda packages and wheels Jan 11, 2024
@jameslamb jameslamb changed the title WIP: add CUDA 12.2 support for conda packages and wheels WIP: (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels Jan 11, 2024
@jameslamb
Member Author

conda builds and tests on CUDA 12.0.1 and 12.2.2 are segfaulting (build link).

I don't see similar errors on recent builds on branch-0.36 or other PRs.

@jameslamb jameslamb changed the title WIP: (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels Jan 12, 2024
@jameslamb jameslamb marked this pull request as ready for review January 12, 2024 21:22
@jameslamb jameslamb requested a review from a team as a code owner January 12, 2024 21:22
@pentschev
Member

conda builds and tests on CUDA 12.0.1 and 12.2.2 are segfaulting (build link).

I don't see similar errors on recent builds on branch-0.36 or other PRs.

Those are indeed flaky tests; they don't always fail in the same form. Let's first try rerunning them. Based on the log for 12.2, almost all tests passed, and the one that failed is one of the very last ones and likely to pass soon.

@jameslamb
Member Author

Ok thank you!

@jakirkham
Member

Thanks James & Peter! 🙏

Looks like that cleared things up and builds now pass

Though am going to mark this "do not merge" for the moment while we work through issues in the other PRs

@jakirkham jakirkham added the DO NOT MERGE Hold off on merging; see PR for details label Jan 13, 2024
@jameslamb jameslamb requested review from a team as code owners January 22, 2024 18:42
@jameslamb jameslamb changed the base branch from branch-0.36 to branch-0.37 January 22, 2024 18:42
@jameslamb jameslamb requested a review from a team as a code owner January 22, 2024 18:42
jameslamb and others added 7 commits January 22, 2024 12:45
* Move conda-only dependencies out of `pyproject` and `requirements` sections in `dependencies.yaml`
* Add `rmm`, `cudf`, and `cupy` matrices

Authors:
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#169
The C++ code was documented for some time but Doxygen build process was not included. This change now introduces Doxygen builds and fixes all documentation warnings.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: rapidsai#164
@jakirkham
Member

Looks like there is a conflict here

It is a generated file. So can just regenerate it with rapids-dependency-file-generator

@jameslamb
Member Author

Looks like there is a conflict here

fixed in 3dbdb61

rapids-bot bot pushed a commit that referenced this pull request Jan 25, 2024
…endencies.yaml (#174)

Contributes to rapidsai/build-planning#13.

Updates `update-version.sh` to correctly handle RAPIDS dependencies like `cudf-cu12==24.2.*`.

This also pulls in some dependency refactoring originally added in #161, which allows greater use of dependencies.yaml globs (and therefore less maintenance effort to support new CUDA versions).
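To illustrate why globs reduce maintenance, here is a minimal sketch of glob-based matrix matching. It is a toy model, not the actual `rapids-dependency-file-generator` logic: the `MATRIX_ENTRIES` table, the suffix values, and `select_suffix` are all hypothetical names introduced for illustration.

```python
from fnmatch import fnmatch

# Hypothetical matrix entries: a CUDA version glob mapped to a package suffix.
MATRIX_ENTRIES = [
    ({"cuda": "11.*"}, "-cu11"),
    ({"cuda": "12.*"}, "-cu12"),
]


def select_suffix(cuda_version: str) -> str:
    """Return the suffix of the first entry whose glob matches, or '' if none does."""
    for matrix, suffix in MATRIX_ENTRIES:
        if fnmatch(cuda_version, matrix["cuda"]):
            return suffix
    return ""
```

Under this model, a new minor version such as `"12.2"` is already covered by the existing `"12.*"` entry, so no new entry is needed when it is added to CI.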

### How I tested this

The portability of this updated `sed` command was tested here: rapidsai/cudf#14825 (comment).

In this repo, I ran the following:

```shell
./ci/release/update-version.sh '0.36.00'
git diff

./ci/release/update-version.sh '0.37.00'
git diff
```

Confirmed that the first `git diff` changed everything I expected, and that the second one showed 0 changes.
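The kind of pin rewrite described above, and the "second run changes nothing" check, can be sketched in Python. This is only an illustration under stated assumptions: the real script uses `sed`, and the package list, `PIN_PATTERN`, and `bump_rapids_pins` below are hypothetical.

```python
import re

# Hypothetical package names; the real script's list is not shown in this PR.
RAPIDS_PACKAGES = ("cudf", "rmm")

# Matches pins such as "cudf-cu12==24.2.*" or "rmm==24.2.*".
PIN_PATTERN = re.compile(
    r"\b(?P<name>(?:" + "|".join(RAPIDS_PACKAGES) + r")(?:-cu\d+)?)==\d+\.\d+\.\*"
)


def bump_rapids_pins(text: str, new_version: str) -> str:
    """Rewrite every matching pin to `<name>==<new_version>.*`."""
    return PIN_PATTERN.sub(
        lambda m: f"{m.group('name')}=={new_version}.*", text
    )
```

Because the replacement has the same shape as the pattern, applying the function twice with the same version gives the same result as applying it once, which mirrors the empty second `git diff`.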

Authors:
  - James Lamb (https://github.com/jameslamb)
  - Bradley Dice (https://github.com/bdice)
  - https://github.com/jakirkham

Approvers:
  - Jake Awe (https://github.com/AyodeAwe)
  - https://github.com/jakirkham

URL: #174
@jameslamb
Member Author

I see a new test failure on the v11.8.0 conda tests.

There are quite a few stacktraces and things that look like errors in the logs, but only a single unit test case failure.

I think these snippets summarize it well:

2024-01-25 20:38:01,988 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/tmp/distributed/distributed/core.py", line 1564, in _connect
    comm = await connect(
  File "/tmp/distributed/distributed/comm/core.py", line 342, in connect
    comm = await wait_for(
  File "/tmp/distributed/distributed/utils.py", line 1940, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "/opt/conda/envs/test/lib/python3.10/asyncio/tasks.py", line 432, in wait_for
    await waiter
asyncio.exceptions.CancelledError
...
distributed.comm.core.CommClosedError: ConnectionPool closing.
asyncio.exceptions.InvalidStateError: invalid state
asyncio.exceptions.InvalidStateError: invalid state
...
_____________________ ERROR at teardown of test_transpose ______________________

    @pytest.fixture(scope="function")
    def ucxx_loop():
...
        with check_thread_leak():
            yield loop
>           ucxx.reset()
...
>               raise UCXError(msg)
E               ucxx.UCXXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationContext: 
E                 <frame at 0x7fdfe400e4c0, file '/opt/conda/envs/test/lib/python3.10/site-packages/ucxx/_lib_async/listener.py', line 155, code _listener_handler_coroutine>
E                 <frame at 0x55ddca529910, file '/opt/conda/envs/test/lib/python3.10/site-packages/ucxx/_lib_async/listener.py', line 155, code _listener_handler_coroutine>
...
/opt/conda/envs/test/lib/python3.10/site-packages/ucxx/core.py:111: UCXXError
ERROR python/distributed-ucxx/distributed_ucxx/tests/test_ucxx.py::test_transpose - ucxx.UCXXError: Trying to reset UCX but not all Endpoints and/or Listeners are closed(). The following objects are still referencing ApplicationContext: 
  <frame at 0x7fdfe400e4c0, file '/opt/conda/envs/test/lib/python3.10/site-packages/ucxx/_lib_async/listener.py', line 155, code _listener_handler_coroutine>
  <frame at 0x55ddca529910, file '/opt/conda/envs/test/lib/python3.10/site-packages/ucxx/_lib_async/listener.py', line 155, code _listener_handler_coroutine>

(build link)

NOTE: as of #174, rapids-dask-dependency is getting pulled in, so the tests are getting the latest main branch of dask and distributed, not releases. Maybe this error lies in some change between the latest distributed / dask releases and current main branch of those projects.
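For readers unfamiliar with this failure mode, here is a toy model of a teardown that refuses to reset while something is still open. It does not use or reproduce the ucxx API; `ToyContext` and `toy_loop` are hypothetical stand-ins that only mimic the shape of the fixture in the log (yield, then reset).

```python
from contextlib import contextmanager


class ToyContext:
    """Toy stand-in for a communication context; not the ucxx API."""

    def __init__(self):
        self.open_listeners = set()

    def listen(self, name):
        self.open_listeners.add(name)

    def close(self, name):
        self.open_listeners.discard(name)

    def reset(self):
        # Refuse to reset while anything still references the context.
        if self.open_listeners:
            names = ", ".join(sorted(self.open_listeners))
            raise RuntimeError(f"not all listeners are closed: {names}")


@contextmanager
def toy_loop(ctx):
    """Hand out the context, then reset it at teardown, as the fixture does."""
    try:
        yield ctx
    finally:
        ctx.reset()
```

In this model, a body that opens a listener and never closes it passes its own checks but fails at teardown, which matches the "ERROR at teardown" pattern above.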

@jakirkham
Member

Vyas restarted CI as we believe this is an unrelated flaky test

@jameslamb jameslamb changed the title (DO NOT MERGE) add CUDA 12.2 support for conda packages and wheels (DO NOT MERGE) add CUDA 12.2 support for conda packages Jan 25, 2024
@jameslamb jameslamb changed the title (DO NOT MERGE) add CUDA 12.2 support for conda packages Support CUDA 12.2 Jan 25, 2024
Contributor

@vyasr vyasr left a comment

I'm guessing it's because this PR was started before #145, but it needs updating for the wheel jobs. The conda outputs look sensible.

I was going to ask some questions about the necessity of the CUDA dependency in ucx-py, since it's pure Python, but I realize those are less relevant here: ucxx is not a pure Python package, so there is an actual C++ dependency and we do need to compile against CUDA.

@bdice bdice requested review from vyasr and AyodeAwe February 9, 2024 22:31
@jakirkham
Member

/merge

@rapids-bot rapids-bot bot merged commit a027d1a into rapidsai:branch-0.37 Feb 10, 2024
47 checks passed
rapids-bot bot pushed a commit that referenced this pull request Feb 20, 2024
Follow-up to #161

For all GitHub Actions configs, replaces uses of the `test-cuda-12.2` branch on `shared-workflows`
with `branch-24.04`, now that rapidsai/shared-workflows#166 has been merged.

### Notes for Reviewers

This is part of ongoing work to build and test packages against CUDA 12.2 across all of RAPIDS.

For more details see:

* rapidsai/build-planning#7

*(created with `rapids-reviser`)*

Authors:
  - James Lamb (https://github.com/jameslamb)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Ray Douglass (https://github.com/raydouglass)

URL: #191
@jakirkham jakirkham mentioned this pull request Feb 20, 2024
rapids-bot bot pushed a commit that referenced this pull request Feb 21, 2024
`ucxx`'s CUDA compiler dependency added a `cuda-version` constraint at runtime, which meant packages could only be installed with the same CUDA version `ucxx` was built with, or newer. As a result, CUDA 12.2 builds of `ucxx` required CUDA 12.2+ at runtime.

However, as we use [CUDA Compatibility](https://docs.nvidia.com/deploy/cuda-compatibility/index.html) in RAPIDS, we know that even if we built with CUDA 12.2, we can still use the packages with other CUDA 12.x versions.

This was largely handled for other dependencies as part of PR #161. However, it wasn't handled for `ucxx`, likely in part because `ucxx` was handling the CUDA compiler dependency differently from the other packages here. For more history on `ucxx`'s CUDA compiler dependency, see PR #108.

This change aligns how the CUDA compiler is handled across packages to make this more consistent. It also ignores the CUDA compiler constraints added at runtime. In all cases, the packages handle this themselves by requiring `cuda-version` (properly constrained) and, where CUDA 11 is concerned, by adding `cudatoolkit`.

Thus, this change should fix the CI issues caused by the overly constrained `cuda-version` by relaxing that constraint.
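The difference between the two constraints can be sketched as follows. This is a simplified model, not conda's solver or the actual recipe: `parse`, `strict_ok`, and `relaxed_ok` are hypothetical helpers, and the strict check models only the "same version or newer" rule described above, without any upper bound.

```python
def parse(version: str) -> tuple:
    """Turn 'major.minor[.patch]' into a (major, minor) tuple of ints."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def strict_ok(build: str, runtime: str) -> bool:
    """Old behaviour: runtime CUDA must be the build version or newer."""
    return parse(runtime) >= parse(build)


def relaxed_ok(build: str, runtime: str) -> bool:
    """Relaxed behaviour: any runtime CUDA with the same major version."""
    return parse(runtime)[0] == parse(build)[0]
```

With a package built against CUDA 12.2, a CUDA 12.0 environment is rejected by `strict_ok` but accepted by `relaxed_ok`, which is the case the CI was tripping over.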

Authors:
  - https://github.com/jakirkham

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Ray Douglass (https://github.com/raydouglass)

URL: #195