use libucx wheels #1041
Conversation
@@ -4,6 +4,9 @@ build:
  os: "ubuntu-22.04"
  tools:
    python: "mambaforge-22.9"
  jobs:
    pre_install:
      - bash ci/build_docs_pre_install.sh
There is no pip-installable library `libucx`... it needs to be either `libucx-cu11` or `libucx-cu12`. Read the Docs builds are installing directly from the `pyproject.toml` checked into source control, which doesn't have those suffixes added.
Lines 8 to 11 in 03c864b:

python:
  install:
    - method: pip
      path: .
Resulting in this:
INFO: pip is looking at multiple versions of ucx-py to determine which version is compatible with other requirements. This could take a while.
ERROR: Could not find a version that satisfies the requirement libucx<1.16,>=1.15.0 (from ucx-py) (from versions: none)
ERROR: No matching distribution found for libucx<1.16,>=1.15.0
readthedocs doesn't allow customizing the `pip install` with arbitrary flags, e.g. by adding `--no-deps` (docs).
So I think this pre-install script to fix the suffixes is the best option to keep those builds working. For more details on how it works, see https://docs.readthedocs.io/en/stable/build-customization.html.
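As a rough illustration of what that suffix-fixing substitution does (the dependency string below is made up for the example, and a CUDA 12 environment is assumed; the actual script does this with `sed` against `pyproject.toml`):

```python
import re

# A made-up pyproject.toml dependency entry, standing in for the real one:
dependency_line = '"libucx>=1.15.0,<1.16"'

# The pre-install idea: rewrite the unsuffixed 'libucx' requirement to the
# CUDA-suffixed package name before `pip install .` resolves dependencies.
fixed = re.sub(r'"libucx([><=!~])', r'"libucx-cu12\1', dependency_line)

print(fixed)  # "libucx-cu12>=1.15.0,<1.16"
```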
Thanks for the suggestion, I hadn't considered trying to get this into the conda environment file that way!
But I don't think that will work. `conda` converts the `pip:` section in an `env.yaml` into a `requirements.txt`, and you can only supply arguments that'd be valid in a requirements file passed to `pip`.
various forms of that approach that I tried (click me)
There are some issues tracking the request "forward custom flags to pip in conda environments", but none have resulted in changes to `conda`:
- Passing options to pip installer conda/conda#10119
- Adding pip flags in environment.yml conda/conda#3763
- Which install flags should pip use? readthedocs/readthedocs.org#5545

That means it's not possible to include arguments like `--no-deps`.
# env.yaml
name: delete-me
channels:
- conda-forge
dependencies:
- pip
- pip:
- --no-deps -e ./
conda env create --name delete-me --file ./env.yaml
yields the following
Installing pip dependencies: - Ran pip subprocess with arguments:
['/raid/jlamb/miniforge/envs/delete-me/bin/python', '-m', 'pip', 'install', '-U', '-r', '/raid/jlamb/repos/ucx-py/condaenv.p9aa8j_k.requirements.txt', '--exists-action=b']
Pip subprocess output:
Pip subprocess error:
Usage: __main__.py [options]
ERROR: Invalid requirement: --no-deps -e ./
__main__.py: error: no such option: --no-deps
And without that `--no-deps`, `conda` tries to install all the dependencies of the package, resulting in
Pip subprocess error:
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [4 lines of output]
Collecting cython>=3.0.0
Using cached Cython-3.0.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
ERROR: Could not find a version that satisfies the requirement libucx==1.15.0 (from versions: none)
ERROR: No matching distribution found for libucx==1.15.0
[end of output]
The only other thing I can think of is setting the environment variable `PIP_NO_DEPS=true` in the build environment. I tried that using `conda`'s support for environment variables (docs), but I don't think those affect the build... just the environment's activation.
name: delete-me
channels:
- conda-forge
dependencies:
- pip
- pip:
- -e ./
variables:
  PIP_NO_DEPS: "true"
I see 2 options for this:
- just keep this script as-is
- have someone with admin rights on the `ucx-py` readthedocs project define environment variable `PIP_NO_DEPS=true` in the project's settings (https://docs.readthedocs.io/en/stable/guides/environment-variables.html)
I think that environment variable option would be preferable, actually. This project has a small number of top-level dependencies, so keeping the necessary ones in the conda environment file isn't a big lift. @vyasr what do you think?
I don't have a preference, but I can set `PIP_NO_DEPS=true` if you two agree that's the preferred way.
I agree, that seems a bit cleaner to me.
Did you try `PIP_NO_DEPENDENCIES`? I don't remember which is the right name (or maybe both work), and the behavior of true/false for pip's `*_NO_*` variables can be confusing, so it's worth double-checking all the options with the setting you proposed in the env.yaml before we set this on RTD. If all else fails, though, I'm fine with setting it on RTD.
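For what it's worth, pip documents a mechanical mapping from long options to environment variables: `PIP_` plus the option name uppercased, with dashes replaced by underscores. Under that convention `--no-deps` maps to `PIP_NO_DEPS`; as far as I know there is no `--no-dependencies` long option for `PIP_NO_DEPENDENCIES` to map back to. A tiny sketch of that convention:

```python
def pip_env_var(long_option: str) -> str:
    """Derive the environment-variable name for a pip long option,
    per pip's documented convention: PIP_<UPPER_LONG_NAME>, with
    dashes replaced by underscores."""
    return "PIP_" + long_option.lstrip("-").replace("-", "_").upper()

print(pip_env_var("--no-deps"))             # PIP_NO_DEPS
print(pip_env_var("--no-build-isolation"))  # PIP_NO_BUILD_ISOLATION
```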
Ahhhhh, I forgot that `PIP_NO_DEPENDENCIES` does not stop `pip install` from trying to download build dependencies.
Processing /home/docs/checkouts/readthedocs.org/user_builds/ucx-py/checkouts/1041
Installing build dependencies: started
Installing build dependencies: finished with status 'error'
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> [4 lines of output]
Collecting cython>=3.0.0
Downloading Cython-3.0.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.2 kB)
ERROR: Could not find a version that satisfies the requirement libucx==1.15.0 (from versions: none)
ERROR: No matching distribution found for libucx==1.15.0
[end of output]
I just added `PIP_NO_BUILD_ISOLATION="True"` to the environment as well... still didn't work 😞
I have one other idea, will post in a second.
To summarize... these things did not work:
- set env variable `PIP_NO_DEPS="True"` in readthedocs builds
- set env variable `PIP_NO_BUILD_ISOLATION="True"` in readthedocs builds
- set both of the above in readthedocs builds
- set any combination of those environment variables in a `variables:` block in the `builddocs.yml` conda env
- pass `pip: - --no-deps ../../` via the `builddocs.yml` conda env
- use a real, suffixed dependency name like `libucx-cu11>=1.15.0` in `pyproject.toml` + add `--extra-index-url=https://pypi.anaconda.org/rapidsai-wheels-nightly/simple` in the `builddocs.yml` conda env
That last one failing revealed a more significant issue, not limited to just docs building... calling `libucx.load_library()` unconditionally means `ucx-py` will not be importable unless the process has access to a GPU 😱
RuntimeError: The CUDA driver library libcuda.so.1 was not found on your system. This library cannot be provided by the libucx wheel and must be installed separately.
The docs DO BUILD successfully on the current state of this branch 🎉 (build link), but only because I added behavior to work around not having a GPU available.
So @pentschev @vyasr that's a design decision for you.
When someone tries to `import ucp` without access to a GPU, what should happen?
- the import should fail
- the import should warn but succeed
- something else
Summarizing our offline conversations:
- yes, `ucx-py` is expected to be usable without a GPU
- the `libucx` wheels should stop raising exceptions when `libucx.load_library()` is called in an environment without a GPU

In response, we published a new `libucx==1.15.0.post1` removing those exceptions: rapidsai/ucx-wheels#5
Tested it with `ucxx` on systems with and without a GPU, and it seemed to work well: rapidsai/ucx-wheels#5 (comment)
Those new wheels helped the docs builds here get further, but those builds exposed another issue... as of the changes in this PR, `ucx-py` was compiling against the system-installed headers (e.g. `/usr/include/ucm`) but linking against the wheel-provided shared libraries (e.g. `site-packages/libucx/lib/libucm.so.0`).
That showed up in "symbol not found" issues like this:
OSError: ${HOME}/conda/1041/lib/python3.12/site-packages/libucx/lib/libucs_signal.so: undefined symbol: ucs_debug_is_error_signal
I just pushed a change that fixes that: eba110f
And it looks like the docs builds are happy and the docs are rendering correctly (RTD build link) 🎉
Good catch on the headers! I forgot to check on those. Glad you found this issue.
Update for those finding this thread from search or other links... thanks to @bdice's suggestion in #1044 (comment), we were able to get this working without messing with environment variables or the conda environment YAML file, like this in `.readthedocs.yaml`:
post_create_environment:
- |
pip install \
--extra-index-url "https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/" \
-C rapidsai.disable-cuda=true \
-C rapidsai.matrix-entry="cuda=12.2" \
.
Proposes removing the build-time dependency on `tomli` for wheels and conda packages. Noticed that working on #1041. It doesn't appear to be used anywhere here.

```shell
git grep tomli
```

## Notes for Reviewers

That dependency was added back in #895. I'm not sure why, but I suspect it was related to the use of `versioneer` in this project at the time. Reference: python-versioneer/python-versioneer#338 (comment)

This project doesn't use `versioneer` any more (#931). I strongly suspect that the dependency on `tomli` can be removed.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- https://github.com/jakirkham
- Ray Douglass (https://github.com/raydouglass)

URL: #1042
Overall this looks fine to me. I'll say anyway that @vyasr should also give this a pass, as he's definitely going to be able to comment on details that would pass unnoticed by me.
Contributes to rapidsai/build-planning#57.

`libucx.load_library()` defined here tries to pre-load `libcuda.so` and `libnvidia-ml.so`, to raise an informative error (instead of a cryptic one from a linker) if someone attempts to use the libraries from this wheel on a system without a GPU. Some of the projects using these wheels, like `ucxx` and `ucx-py`, are expected to be usable on systems without a GPU. See rapidsai/ucx-py#1041 (comment).

To avoid those libraries needing to try-catch these errors, this proposes the following:

* removing those checks and deferring to downstream libraries to handle the non-GPU case
* modifying the build logic so we can publish patched versions of these wheels like `v1.15.0.post1`

### Notes for Reviewers

Proposing starting with `1.15.0.post1` right away, since that's the version that `ucx-py` will use. I'm proposing the following sequence of PRs here (assuming downstream testing goes well):

1. this one
2. another changing the version to `1.14.0.post1`
3. another changing the version to `1.16.0.post1`
ci/build_docs_pre_install.sh (Outdated)

set -euo pipefail

sed -r -i "s/libucx==(.*)\"/libucx-cu12==\1\"/g" ./pyproject.toml
I'm proposing bringing back this script for docs builds, but having it substitute in a real, suffixed name, `libucx-cu12`, so that docs builds will really install that.

I think that's necessary to meet these 3 constraints:
- docs builds run `pip install .` to install `ucx-py`
- fallback matrices in `dependencies.yaml` need to be the unsuffixed names (e.g. `libucx`, not `libucx-cu12`), per the offline discussion that came out of remove 'disable-cuda-suffix' config setting rapids-build-backend#29 (comment)
- the `pre-commit` hook running `rapids-dependency-file-generator` is going to put into `pyproject.toml` whatever is in the fallback matrices in `dependencies.yaml`
Follow-up to #5. Proposes publishing a `1.14.1.post1` version, identical to version `1.14.1` except that `load_library()` will no longer raise exceptions in non-GPU environments. ## Notes for Reviewers Just putting this up to get in the CI run. Should probably wait to merge it until testing on rapidsai/ucx-py#1041 is done.
/merge
Contributes to rapidsai/build-planning#57. Follow-up to #226.

Proposes the following changes for wheel builds:

* removing system-installed UCX *headers*
* making the code to remove system-installed UCX libraries a bit more specific - *(to minimize the risk of accidentally deleting some non-UCX thing whose name matches the pattern `libuc*`)*

## Notes for Reviewers

Before applying similar changes to `ucx-py`, I noticed it being compiled with the system-installed headers but then linking against the libraries provided by the `libucx` wheels: rapidsai/ucx-py#1041 (comment). This change should reduce the risk of that happening.

### How I tested this

Poked around the filesystem that `build_wheel.sh` runs in by pulling one of our standard wheel-building container images used in CI.

```shell
docker run \
    --rm \
    -v $(pwd):/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.10 \
    bash

find /usr -type f -name 'libucm*'
# /usr/lib64/libucm.la
# /usr/lib64/libucm.a
# /usr/lib64/libucm.so.0.0.0
# /usr/lib64/ucx/libucm_cuda.a
# /usr/lib64/ucx/libucm_cuda.la
# /usr/lib64/ucx/libucm_cuda.so.0.0.0

find /usr -type d -name 'uct'
# /usr/include/uct
```

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Ray Douglass (https://github.com/raydouglass)

URL: #230
With rapidsai/ucx-py#1041 merged, UCX wheels are now fixed and thus reenabling raft-dask wheel tests that require UCX-Py should be safe. Authors: - Peter Andreas Entschev (https://github.com/pentschev) Approvers: - James Lamb (https://github.com/jameslamb) - Jake Awe (https://github.com/AyodeAwe) URL: #2307
Contributes to rapidsai/build-planning#31
Contributes to rapidsai/dependency-file-generator#89
Contributes to rapidsai/build-planning#72

Proposes:

* introducing `rapids-build-backend` as this project's build backend, to reduce the complexity of various CI/build scripts
* making get-data-from-`pyproject.toml` code in conda recipe a bit stricter

## Notes for Reviewers

This reverts some of the workarounds introduced for docs builds in #1041. See this thread for context: #1041 (comment)

~To do that, I set 2 environment variables manually in the readthedocs console ([link](https://readthedocs.org/dashboard/ucx-py/edit/)):~

* ~`RAPIDS_DISABLE_CUDA=true` = used to prevent needing `nvcc` in the docs-building environment~
* ~`RAPIDS_MATRIX_ENTRY='cuda=12.2'` = used to ensure that `pip install .` tries to install a real package (e.g. `libucx-cu12`)~

~I think this is the first place we're using the `rapids-build-backend` environment variable support.~

see #1044 (comment)

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #1044
For rapidsai/build-planning#57, #1041 switched `ucx-py` over to `libucx` wheels. To test that that was working, it added some code to build scripts to remove system installations of UCX libraries. That should no longer be necessary as of rapidsai/ci-imgs#154.

This proposes removing that code for managing system dependencies of UCX libraries, to simplify those build scripts a bit.

Authors:
- James Lamb (https://github.com/jameslamb)

Approvers:
- Kyle Edwards (https://github.com/KyleFromNVIDIA)
- Peter Andreas Entschev (https://github.com/pentschev)

URL: #1053
Contributes to rapidsai/build-planning#57.

Similar to rapidsai/ucxx#226, proposes using the new UCX wheels from https://github.com/rapidsai/ucx-wheels, instead of vendoring system versions of libuc{m,p,s,t}.so.

Benefits of these changes:
- Allows users of `ucx-py` to avoid needing system installations of the UCX libraries.
- Shrinks the `ucx-py` wheels by 6.7 MB compressed (77%) and 19.1 MB uncompressed (73%).

how I calculated that (click me)
Mounting in a directory with a wheel built from this branch...
Compared to a recent nightly release.
Notes for Reviewers

Left some comments on the diff describing specific design choices.

The libraries from the `libucx` wheel are only used if a system installation isn't available. Built a wheel in a container using the same image used here in CI. Found that the libraries from the `libucx` wheel are correctly found at build time, and are later found at import time.

using 'rapidsai/citestwheel' image and LD_DEBUG (click me)
`rapidsai/citestwheel` does NOT have the UCX libraries installed at `/usr/lib*`. Installed the `ucx-py` wheel. In that output, saw that `ld` was finding `libucs.so` first. It searched all the system paths before finally finding it in the `libucx` wheel. Then the others (e.g. `libucm.so.0`) were found via the RPATH entries on `libucs.so`.

However, the libraries from the `libucx` wheel appear to be the last place `ld` searches. That means that if you use these wheels on a system with a system installation of libuc{m,p,s,t}, that system installation's libraries will be loaded instead.
, that system installation's libraries will be loaded instead.using 'rapidsai/ci-wheel' image and LD_DEBUG (click me)
docker run \
    --rm \
    --gpus 1 \
    -v $(pwd)/final_dist:/opt/work \
    -w /opt/work \
    -it rapidsai/ci-wheel:cuda12.2.2-rockylinux8-py3.10 \
    bash
`rapidsai/ci-wheel` has the UCX libraries installed at `/usr/lib64`. Installed a wheel and tried to import from it. In that situation, I saw the system libraries found before the one from the wheel.
In this case, when the system libraries are available, `site-packages/libucx/lib` isn't even searched.

To avoid any RAPIDS-specific stuff tricking me, I tried in a generic `python:3.10` image. Found that the library could be loaded and all the libuc{m,p,s,t} libraries from the `libucx` wheel are found 🎉.
wheel are found 🎉 .using 'python:3.10' wheel (click me)