Re-enable cucim and xgboost in CUDA 12 rapids builds. #669

Merged: 17 commits into rapidsai:branch-23.08, Jul 27, 2023

bdice
Contributor

@bdice bdice commented Jul 25, 2023

PR #664 temporarily disabled CUDA 12 packages for cucim and xgboost in rapids. This re-enables those.

This reverts commit cc272c4.

This can be merged once the following issues are closed:

@bdice bdice marked this pull request as ready for review July 26, 2023 13:43
@bdice bdice requested a review from a team as a code owner July 26, 2023 13:43
@bdice
Contributor Author

bdice commented Jul 26, 2023

I also enabled testing of conda packages. I think that even if we can't import GPU libraries, we should be able to ensure that the packages contained in rapids or rapids-xgboost are solvable. I'm going to cut down the tests (by commenting out any problems) until they pass. This will help keep us from merging broken metapackages in the future.

Member

@jakirkham jakirkham left a comment

Thanks Bradley! 🙏

Made a few minor comments below

conda/recipes/rapids-xgboost/meta.yaml (outdated, resolved)
conda/recipes/rapids-xgboost/meta.yaml (outdated, resolved)
conda/recipes/versions.yaml (resolved)
@bdice
Contributor Author

bdice commented Jul 26, 2023

We're going to have to disable tests for rapids since it depends on rapids-xgboost which is built separately. I would like to fix this in a follow-up by combining the two recipes into a single recipe with two outputs. For now, testing on rapids-xgboost found a bug that I fixed in d1aab9c. I am currently testing locally with rapids-xgboost removed from rapids (so that it can solve) but I am facing difficulty with solving custreamz. I thought this would be fixed by rapidsai/cudf#13754 but mamba doesn't seem to be finding the new packages...

@jakirkham
Member

Updating this thread: it sounds like we need PR ( rapidsai/cudf#13769 ) to fix a couple more build strings.

@jakirkham
Member

That PR has been merged and packages uploaded

Member

@jakirkham jakirkham left a comment

Thanks Bradley! 🙏

Think we may need --use-local to pick up locally built packages when testing

ci/build_python.sh (resolved)
ci/build_python.sh (outdated, resolved)
ci/build_python.sh (outdated, resolved)
Co-authored-by: jakirkham <[email protected]>
@jakirkham
Member

Seeing this on CI:

The reported errors are:
- Encountered problems while solving:
-   - nothing provides __cuda needed by xgboost-1.7.4-rapidsai_cuda112py38hf635370_1

Think we need to set the environment variable CONDA_OVERRIDE_CUDA to some value to "convince" Conda it is ok to install
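
For illustration only (this is not code from the PR): a minimal Python sketch of how a test driver might set `CONDA_OVERRIDE_CUDA` before asking conda to solve an environment. The helper names `solver_env` and `dry_run_command` and the environment name are hypothetical; the package specs and channels are the ones that appear in the commands later in this thread.

```python
import os


def solver_env(cuda_version, base_env=None):
    """Return a copy of the environment with CONDA_OVERRIDE_CUDA set.

    Conda uses this variable to decide which __cuda virtual package
    version to report, so the solver can proceed on a machine with no GPU.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CONDA_OVERRIDE_CUDA"] = cuda_version
    return env


def dry_run_command(env_name, specs, channels):
    """Build a `conda create --dry-run` command (solve only, install nothing)."""
    cmd = ["conda", "create", "--dry-run", "--name", env_name]
    for channel in channels:
        cmd += ["--channel", channel]
    return cmd + list(specs)


# Hypothetical usage; values mirror the commands quoted in this thread.
env = solver_env("12.0")
cmd = dry_run_command(
    "rapids-solve-check",  # hypothetical environment name
    ["rapids=23.08", "python=3.10", "cuda-version=12.0"],
    ["rapidsai-nightly", "conda-forge", "nvidia"],
)
# To actually run the solve: subprocess.run(cmd, env=env, check=True)
```

The command is only constructed here, not executed, so the sketch can be read and checked on a machine without conda installed.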

@jakirkham
Member

jakirkham commented Jul 26, 2023

Should we add this to both of these (as we've done with other recipes like cuDF)?

test:
  requires:
    - cuda-version ={{ cuda_version }}

Added suggestions below:

ci/build_python.sh (outdated, resolved)
@bdice
Contributor Author

bdice commented Jul 26, 2023

We’ll need to disable the import tests because this runs on CPU — but just solving the environment with no imports is better than no testing.

@vyasr
Contributor

vyasr commented Jul 26, 2023

Looks like we can't actually run the tests on the runner we're using because it doesn't have a working CUDA installation, but at least testing that the environment solves is helpful.

@vyasr
Contributor

vyasr commented Jul 26, 2023

We’ll need to disable the import tests because this runs on CPU — but just solving the environment with no imports is better than no testing.

Jinx

@@ -6,20 +6,21 @@ set -euo pipefail
source rapids-env-update

CONDA_CONFIG_FILE="conda/recipes/versions.yaml"
export CONDA_OVERRIDE_CUDA="${RAPIDS_CUDA_VERSION}"
Contributor Author

We should probably consider the implications here. Today, I think rapids is installable even on CPU-only machines. The new rapids-xgboost package design requires __cuda to install. This is important to support for cases like HPC systems with CPU login nodes and GPU worker nodes that use the same environment.

Member

Yeah they can also install by setting CONDA_OVERRIDE_CUDA to some value

In any event, this is coming from libxgboost. So we could move this just to the rapids-xgboost if we prefer

Contributor

The new rapids-xgboost package design

What is the relevant new change? Is it in libxgboost or in something about how rapids-xgboost is packaged?

Contributor Author

Okay. Well, maybe this requirement already existed. I'm not sure.

The question is whether __cuda should be a hard requirement for installation, which is coming from xgboost-related packages. I'm not sure if it was that way for the old xgboost packages we shipped in 23.06 or not. Regardless, it feels funny that no other RAPIDS package has this requirement besides xgboost.

Member

Dropping this in PR ( #673 ), which pulls in the new xgboost packages

Comment on lines -73 to -88
imports: # [linux64]
- cucim # [linux64]
- cudf # [linux64]
- cudf_kafka # [linux64]
- cugraph # [linux64]
- cuml # [linux64]
{% if cuda_major == "11" %}
- cusignal # [linux64]
{% endif %}
- cuspatial # [linux64]
- custreamz # [linux64]
- cuxfilter # [linux64]
- dask_cuda # [linux64]
- dask_cudf # [linux64]
- pylibcugraph # [linux64]
- rmm # [linux64]
Member

We could run these through pkgutil.find_loader in a run_test.py script in the recipe

This would let us test for their existence without needing to import them (and thus not need a GPU to test)
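
A rough sketch of what such a `run_test.py` could look like, offered as an assumption rather than anything in this PR. It uses `importlib.util.find_spec` instead of `pkgutil.find_loader`, since the latter is deprecated in newer Python versions. The function names are hypothetical, the module list is copied from the removed `imports:` section, and `cusignal` is left out because the recipe only listed it for CUDA 11.

```python
import importlib.util

# Top-level module names from the import list removed in this PR.
EXPECTED_MODULES = [
    "cucim", "cudf", "cudf_kafka", "cugraph", "cuml", "cuspatial",
    "custreamz", "cuxfilter", "dask_cuda", "dask_cudf", "pylibcugraph", "rmm",
]


def missing_modules(names):
    """Return the names that cannot be located on the import path.

    find_spec only locates a top-level module; it does not execute it,
    so no GPU is needed for this check.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_modules(names):
    """Raise SystemExit with a readable message if any module is missing."""
    missing = missing_modules(names)
    if missing:
        raise SystemExit("Modules not found: " + ", ".join(missing))
```

In a recipe test script, the last line would presumably be `check_modules(EXPECTED_MODULES)`. Note that this only confirms the packages are installed; it cannot catch errors that appear at import time.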

Contributor Author

I don't think this is necessary for this PR, maybe file an issue or PR with this proposal later on. I feel comfortable with the current level of testing, which is higher than what we had before.

Member

@jakirkham jakirkham left a comment

LGTM. Thanks all! 🙏

Had a few comments above, but none of them are blocking

@bdice
Contributor Author

bdice commented Jul 26, 2023

Potentially blocking issue with xgboost builds requiring __cuda:

Previous release installation did not require __cuda to install:

$ CONDA_OVERRIDE_CUDA="" mamba create -n rapids-23.06-cpu -c rapidsai -c conda-forge -c nvidia rapids=23.06 python=3.10 cudatoolkit=11.8

The above command succeeds.

The proposed changes here would require __cuda to install. This was previously not a constraint for users installing rapids and poses a challenge for users of (for example) HPC systems where login nodes do not have GPUs, and only worker nodes have GPUs.

$ CONDA_OVERRIDE_CUDA="" mamba create -n rapids-23.08-cpu -c rapidsai-nightly -c conda-forge -c nvidia rapids=23.08 python=3.10 cuda-version=12.0 'xgboost=1.7.4*=rapidsai_cuda*'

...

The following package could not be installed
└─ xgboost 1.7.4* rapidsai_cuda* is uninstallable because it requires
   └─ __cuda  , which is missing on the system.

The above command fails on CPU-only systems. This tests the current nightly rapids, which doesn't include xgboost, with xgboost added manually as the recipe in this PR does.

I think this issue is blocking for the 23.08 release, but not necessarily for this PR. I'd be fine with merging this PR and discussing this issue separately. I can revisit this tomorrow with others.

@jakirkham
Member

After discussion offline, we concluded the XGBoost issue is non-blocking for this PR. So it should be good to merge

We are evaluating options with stakeholders to fix the XGBoost packages so that they do not require __cuda. Once a fix is deployed, we can come back and remove the CONDA_OVERRIDE_CUDA line to test the fix and simplify things here.

--variant-config-files "${CONDA_CONFIG_FILE}" \
conda/recipes/rapids-xgboost

rapids-logger "Build rapids"

rapids-mamba-retry mambabuild \
--no-test \
--use-local \
Member

just want to mention that for other repos, we use --channel "${RAPIDS_CONDA_BLD_OUTPUT_DIR}": https://github.com/rapidsai/cudf/blob/abb59c83128f956c7edcb4d7744cb0faecf0026c/ci/build_python.sh#L18-L39

RAPIDS_CONDA_BLD_OUTPUT_DIR is set in our CI images: https://github.com/rapidsai/ci-imgs/blob/75cad918c44c6e00480001b24cb764e1b43fa0a5/Dockerfile#L108-L113

It looks like --use-local works, I'm just pointing this out in case we want consistency.

Member

Yeah --use-local will check the same things. Please see this list of paths that Conda checks when --use-local is set

@raydouglass
Member

/merge

@rapids-bot rapids-bot bot merged commit f0c7766 into rapidsai:branch-23.08 Jul 27, 2023
5 participants