[ci] fix CUDA 11.8 builds (fixes #6466) #6465

Merged
merged 25 commits into from
May 27, 2024

@jameslamb (Collaborator) commented May 26, 2024

Fixes #6466

The CUDA 11.8 wheel (gcc) CI job is failing like this:

E   OSError: libomp.so.5: cannot open shared object file: No such file or directory

This fixes that by switching all pip install --user calls in CI jobs to simply pip install.

Notes for Reviewers

In short, the problem is a mix of the following:

  • using pip install --user in CI jobs installs lightgbm into a location like /github/home/.local/lib/python3.10/site-packages
  • for the CUDA jobs, that directory persists across jobs, because /github/home is a volume mount from the self-hosted runner we use (see the comment on this PR showing how the container is run)
docker run \
    ...
    -v "/home/guoke/actions-runner/_work/_temp/_github_home":"/github/home" \
    ...
  • as a result, pip install in subsequent wheel CI jobs is refusing to install the newly-built-in-that-CI-run wheel, instead using one from a different job (which might use a different CUDA version or compiler)
Processing ./dist/lightgbm-4.3.0.99-py3-none-linux_x86_64.whl
Requirement already satisfied: numpy in /tmp/miniforge/envs/test-env/lib/python3.11/site-packages (from lightgbm==4.3.0.99) (1.26.4)
Requirement already satisfied: scipy in /tmp/miniforge/envs/test-env/lib/python3.11/site-packages (from lightgbm==4.3.0.99) (1.13.1)
lightgbm is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.

This PR fixes that by switching all uses of pip install --user to simply pip install in LightGBM's CI jobs. That results in pip installing lightgbm into ${CONDA}/envs/test-env/lib/python${ver}/site-packages. On CUDA jobs, ${CONDA} is at /tmp/miniforge, a location that doesn't have a mount back to the host... so nothing is left over from build-to-build 😁
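
To make the two install locations concrete, here is a minimal Python sketch (not part of this PR) that reports where pip install --user and plain pip install would place packages for the running interpreter. It uses only the standard library; the function name install_locations is made up for illustration, and the exact paths depend on the environment it runs in.

```python
import site
import sysconfig


def install_locations():
    """Return the per-user and per-environment package directories."""
    return {
        # Target of `pip install --user` (by default under $HOME/.local on Linux).
        "user": site.getusersitepackages(),
        # Target of plain `pip install` for this interpreter / environment.
        "environment": sysconfig.get_paths()["purelib"],
    }


if __name__ == "__main__":
    for kind, path in install_locations().items():
        print(f"{kind}: {path}")
```

With HOME=/github/home, the "user" entry would be expected to land under the mounted /github/home directory, while the "environment" entry of a conda environment lives under ${CONDA}.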

How do you know lightgbm was being installed in /github/home?

You can see the absolute path from which lightgbm is being loaded in the test logs, like this:

tests/python_package_test/test_utilities.py:7: in <module>
    import lightgbm as lgb
/github/home/.local/lib/python3.11/site-packages/lightgbm/__init__.py
    from .basic import Booster, Dataset, Sequence, register_logger

(example failed build)
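
The same check can be done without waiting for a traceback. The following is a small sketch, assuming only the standard library, that asks Python where it would import a module from; module_location is a hypothetical helper name, and json stands in for lightgbm so the example runs anywhere.

```python
import importlib.util


def module_location(name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    # "lightgbm" may not be installed here; "json" ships with Python.
    for name in ("lightgbm", "json"):
        print(name, "->", module_location(name))
```

A path under /github/home/.local for lightgbm would indicate the user-site copy is shadowing the one in the conda environment.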

@jameslamb (Collaborator, Author) commented

So, right before building a lightgbm wheel, here are all the libomp / libgomp I see in the environment.

pip

--- finding libomp.so (/usr) ...
/usr/lib/llvm-10/lib/libomp.so.5
/usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
--- finding libomp.so (/github/home) ...
--- finding libomp.so (/tmp/miniforge) ...
/tmp/miniforge/envs/test-env/lib/libgomp.so.1.0.0
/tmp/miniforge/lib/libgomp.so.1.0.0
/tmp/miniforge/pkgs/libgomp-13.2.0-h77fa898_7/lib/libgomp.so.1.0.0
/tmp/miniforge/pkgs/libgomp-13.2.0-h807b86a_5/lib/libgomp.so.1.0.0

wheel

--- finding libomp.so (/usr) ...
/usr/lib/x86_64-linux-gnu/libgomp.so.1.0.0
--- finding libomp.so (/github/home) ...
--- finding libomp.so (/tmp/miniforge) ...
/tmp/miniforge/envs/test-env/lib/libgomp.so.1.0.0
/tmp/miniforge/lib/libgomp.so.1.0.0
/tmp/miniforge/pkgs/libgomp-13.2.0-h77fa898_7/lib/libgomp.so.1.0.0
/tmp/miniforge/pkgs/libgomp-13.2.0-h807b86a_5/lib/libgomp.so.1.0.0

So there's only one difference... on the pip job (which is succeeding), it's finding this:

/usr/lib/llvm-10/lib/libomp.so.5

Seeing an error about libomp (not libgomp) in a job that's supposed to be using gcc makes me suspect that maybe our mechanism for setting the target compiler to gcc is not quite working. I just pushed another commit to get more logs.
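
The exact command that produced the listings above isn't shown in this comment. As a rough equivalent, the sketch below walks a directory tree and collects files whose names start with libomp.so or libgomp.so; find_openmp_libs is a made-up name, and unlike the CI output it would also list symlinks such as libgomp.so.1.

```python
import fnmatch
import os

PATTERNS = ("libomp.so*", "libgomp.so*")


def find_openmp_libs(root):
    """Return sorted paths under `root` whose file name matches PATTERNS."""
    matches = []
    # os.walk silently yields nothing if `root` does not exist.
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if any(fnmatch.fnmatchcase(filename, p) for p in PATTERNS):
                matches.append(os.path.join(dirpath, filename))
    return sorted(matches)


if __name__ == "__main__":
    for root in ("/usr", "/github/home", "/tmp/miniforge"):
        print(f"--- finding libomp.so ({root}) ...")
        for path in find_openmp_libs(root):
            print(path)
```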

@jameslamb (Collaborator, Author) commented

I see this in the logs of the wheel job:

Processing ./dist/lightgbm-4.3.0.99-py3-none-linux_x86_64.whl
Requirement already satisfied: numpy in /tmp/miniforge/envs/test-env/lib/python3.11/site-packages (from lightgbm==4.3.0.99) (1.26.4)
Requirement already satisfied: scipy in /tmp/miniforge/envs/test-env/lib/python3.11/site-packages (from lightgbm==4.3.0.99) (1.13.1)
lightgbm is already installed with the same version as the provided wheel. Use --force-reinstall to force an installation of the wheel.

I added a pip freeze near the beginning of test.sh and saw there was already a lightgbm installed (but from an sdist)!

lightgbm @ file:///__w/LightGBM/LightGBM/dist/lightgbm-4.3.0.99.tar.gz#sha256=53e9fa392e9a699f989a28469328b12f3578d6ad096ec1b27ac7d4cbe81dcdcc

So something is getting left behind, probably a result of the changes from #6458.
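
The "lightgbm @ file:///..." form in pip freeze comes from the direct_url.json file that pip records for installs from a local file or URL (PEP 610). As a hedged sketch, not something this PR adds, the same information can be read with importlib.metadata; installed_from is a hypothetical helper name.

```python
import json
from importlib import metadata


def installed_from(name):
    """Return (version, source_url) for an installed distribution, or None.

    source_url is None unless the installer recorded a direct_url.json,
    which pip does for installs from a local file or URL.
    """
    try:
        dist = metadata.distribution(name)
    except metadata.PackageNotFoundError:
        return None
    raw = dist.read_text("direct_url.json")  # None if the file is absent
    url = json.loads(raw).get("url") if raw else None
    return dist.version, url


if __name__ == "__main__":
    print(installed_from("lightgbm"))
```

Running this at the start of a job, before anything has been installed, is one way to spot a leftover package like the one above.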

Here's how the container is being run:

/usr/bin/docker create \
    --name 9c95b233bc0f4447a9bab8c6b539381a_nvcrionvidiacuda1180develubuntu2004_f9dcc0 \
    --label df5bb3 \
    --workdir /__w/LightGBM/LightGBM \
    --network github_network_e8b8b25724604f608a6bef89c6f3c7e8 \
    --gpus all \
    -e "CMAKE_BUILD_PARALLEL_LEVEL=4" \
    -e "COMPILER=gcc" \
    -e "CONDA=/tmp/miniforge" \
    -e "CONDA_ENV=test-env" \
    -e "DEBIAN_FRONTEND=noninteractive" \
    -e "METHOD=wheel" \
    -e "OS_NAME=linux" \
    -e "PYTHON_VERSION=3.11" \
    -e "TASK=cuda" \
    -e "HOME=/github/home" \
    -e GITHUB_ACTIONS=true \
    -e CI=true \
    -v "/var/run/docker.sock":"/var/run/docker.sock" \
    -v "/home/guoke/actions-runner/_work":"/__w" \
    -v "/home/guoke/actions-runner/externals":"/__e":ro \
    -v "/home/guoke/actions-runner/_work/_temp":"/__w/_temp" \
    -v "/home/guoke/actions-runner/_work/_actions":"/__w/_actions" \
    -v "/opt/hostedtoolcache":"/__t" \
    -v "/home/guoke/actions-runner/_work/_temp/_github_home":"/github/home" \
    -v "/home/guoke/actions-runner/_work/_temp/_github_workflow":"/github/workflow" \
    --entrypoint "tail" \
    nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 \
    "-f" "/dev/null"

Going to try cleaning this up at the beginning of builds:

/home/guoke/actions-runner/_work
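
The -v flags in the docker create command above decide which container paths are backed by the host and can therefore outlive a job. The sketch below is only an illustration: it hard-codes the container-side mount points from that command, and backing_mount is a made-up helper that reports which mount, if any, contains a given path.

```python
from pathlib import PurePosixPath

# Container-side mount points copied from the `docker create` command above.
MOUNT_POINTS = [
    "/var/run/docker.sock",
    "/__w",
    "/__e",
    "/__w/_temp",
    "/__w/_actions",
    "/__t",
    "/github/home",
    "/github/workflow",
]


def backing_mount(path, mount_points=MOUNT_POINTS):
    """Return the deepest mount point containing `path`, or None."""
    target = PurePosixPath(path)
    candidates = [
        PurePosixPath(m)
        for m in mount_points
        if PurePosixPath(m) == target or PurePosixPath(m) in target.parents
    ]
    if not candidates:
        return None
    # Prefer the most specific (deepest) mount.
    return str(max(candidates, key=lambda m: len(m.parts)))


if __name__ == "__main__":
    for p in (
        "/github/home/.local/lib/python3.11/site-packages",
        "/tmp/miniforge/envs/test-env/lib/python3.11/site-packages",
    ):
        print(p, "->", backing_mount(p))
```

Under these assumptions, the user-site path resolves to the /github/home mount, while the conda environment under /tmp/miniforge has no backing mount.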

@jameslamb (Collaborator, Author) commented

I tried changing the version for lightgbm, to check if maybe that lightgbm==4.3.0.99 lingering around in the environment was left over from some previous development... and it was!

I know that because even after changing the version number to 414.3.0.9, I still saw a 4.3.0.99 in pip freeze.

Full pip freeze:
bokeh @ file:///home/conda/feedstock_root/build_artifacts/bokeh_1712901085037/work
Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1695989787169/work
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1707022139797/work/certifi
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1696001724357/work
click @ file:///home/conda/feedstock_root/build_artifacts/click_1692311806742/work
cloudpickle @ file:///home/conda/feedstock_root/build_artifacts/cloudpickle_1697464713350/work
colorama @ file:///home/conda/feedstock_root/build_artifacts/colorama_1666700638685/work
contourpy @ file:///home/conda/feedstock_root/build_artifacts/contourpy_1712429897138/work
cycler @ file:///home/conda/feedstock_root/build_artifacts/cycler_1696677705766/work
cytoolz @ file:///home/conda/feedstock_root/build_artifacts/cytoolz_1706897031595/work
dask @ file:///home/conda/feedstock_root/build_artifacts/dask-core_1715977602713/work
dask-expr @ file:///home/conda/feedstock_root/build_artifacts/dask-expr_1715997835546/work
distributed @ file:///home/conda/feedstock_root/build_artifacts/distributed_1715981651589/work
exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1704921103267/work
fonttools @ file:///home/conda/feedstock_root/build_artifacts/fonttools_1716599762128/work
fsspec @ file:///home/conda/feedstock_root/build_artifacts/fsspec_1715865800631/work
graphviz @ file:///home/conda/feedstock_root/build_artifacts/python-graphviz_1711016462626/work
importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1710971335535/work
iniconfig @ file:///home/conda/feedstock_root/build_artifacts/iniconfig_1673103042956/work
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1715127149914/work
joblib @ file:///home/conda/feedstock_root/build_artifacts/joblib_1714665484399/work
kiwisolver @ file:///home/conda/feedstock_root/build_artifacts/kiwisolver_1695379920604/work
lightgbm @ file:///__w/LightGBM/LightGBM/dist/lightgbm-4.3.0.99.tar.gz#sha256=f8c1ae259645bdc4b0e561ef34520a6025563360990ef2bca73cab911649a358
locket @ file:///home/conda/feedstock_root/build_artifacts/locket_1650660393415/work
lz4 @ file:///home/conda/feedstock_root/build_artifacts/lz4_1704831090180/work
MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1706899926732/work
matplotlib @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-suite_1715976244352/work
msgpack @ file:///home/conda/feedstock_root/build_artifacts/msgpack-python_1715670639536/work
munkres==1.1.4
numpy @ file:///home/conda/feedstock_root/build_artifacts/numpy_1707225376651/work/dist/numpy-1.26.4-cp311-cp311-linux_x86_64.whl#sha256=d08e1c9e5833ae7780563812aa73e2497db1ee3bd5510d3becb8aa18aa2d0c7c
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1710075952259/work
pandas @ file:///home/conda/feedstock_root/build_artifacts/pandas_1715897625506/work
partd @ file:///home/conda/feedstock_root/build_artifacts/partd_1715026491486/work
pillow @ file:///home/conda/feedstock_root/build_artifacts/pillow_1712154447422/work
pluggy @ file:///home/conda/feedstock_root/build_artifacts/pluggy_1713667077545/work
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1705722403006/work
pyarrow==16.1.0
pyarrow-hotfix @ file:///home/conda/feedstock_root/build_artifacts/pyarrow-hotfix_1700596371886/work
pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1711811537435/work
pyparsing @ file:///home/conda/feedstock_root/build_artifacts/pyparsing_1709721012883/work
PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work
pytest @ file:///home/conda/feedstock_root/build_artifacts/pytest_1716221322529/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1709299778482/work
pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1706886791323/work
PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1695373611984/work
scikit-learn @ file:///home/conda/feedstock_root/build_artifacts/scikit-learn_1716489734259/work/dist/scikit_learn-1.5.0-cp311-cp311-linux_x86_64.whl#sha256=4e4d00ec549661aa22a03bd61e339cfd7abf80e247c75a0b3a95cc04b3c04b2d
scipy @ file:///home/conda/feedstock_root/build_artifacts/scipy-split_1716470220807/work/dist/scipy-1.13.1-cp311-cp311-linux_x86_64.whl#sha256=497f9d1a91aa2301c4dbd6d0678fb3053900d59dde1ee936267e18aee722ba8f
six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work
sortedcontainers @ file:///home/conda/feedstock_root/build_artifacts/sortedcontainers_1621217038088/work
tblib @ file:///home/conda/feedstock_root/build_artifacts/tblib_1702066284995/work
threadpoolctl @ file:///home/conda/feedstock_root/build_artifacts/threadpoolctl_1714400101435/work
tomli @ file:///home/conda/feedstock_root/build_artifacts/tomli_1644342247877/work
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1706112571092/work
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1708363099148/work
tzdata @ file:///home/conda/feedstock_root/build_artifacts/python-tzdata_1707747584337/work
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1708239446578/work
xyzservices @ file:///home/conda/feedstock_root/build_artifacts/xyzservices_1712209912887/work
zict @ file:///home/conda/feedstock_root/build_artifacts/zict_1681770155528/work
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1695255097490/work

([build link](https://github.com/microsoft/LightGBM/actions/runs/9248418329/job/25438691261?pr=6465))

Still not 100% sure what happened, so I pushed commits running `pip uninstall --yes` to try to clear out that cache. It *must* be getting mounted in from somewhere, somehow.

@jameslamb changed the title from "WIP: [ci] fix CUDA 11.8 builds" to "WIP: [ci] fix CUDA 11.8 builds (fixes #6466)" on May 27, 2024
@jameslamb changed the title from "WIP: [ci] fix CUDA 11.8 builds (fixes #6466)" to "[ci] fix CUDA 11.8 builds (fixes #6466)" on May 27, 2024
@jameslamb marked this pull request as ready for review on May 27, 2024 05:27
@jameslamb (Collaborator, Author) commented

Alright I think this is ready for review! Whenever this is merged, CI will be working again.

@borchero (Collaborator) left a comment

Nice, thanks for the extensive description/debugging log!

@jameslamb (Collaborator, Author) commented

Thanks for the quick review!

@jameslamb merged commit 7d15298 into master on May 27, 2024
39 checks passed
@jameslamb deleted the ci/fix-cuda branch on May 27, 2024 15:03

Successfully merging this pull request may close these issues.

[ci] CUDA 11.8 wheel (gcc) CI jobs failing: 'libomp.so.5: no such file or directory'