
tensorflow v2.18.0 #408

Open · wants to merge 23 commits into base: main

Conversation

regro-cf-autotick-bot
Contributor

It is very likely that the current package version for this feedstock is out of date.

Checklist before merging this PR:

  • Dependencies have been updated if changed: see upstream
  • Tests have passed
  • Updated license if changed and license_file is packaged

Information about this PR:

  1. Feel free to push to the bot's branch to update this PR if needed.
  2. The bot will almost always only open one PR per version.
  3. The bot will stop issuing PRs if more than 3 version bump PRs generated by the bot are open. If you don't want to package a particular version please close the PR.
  4. If you want these PRs to be merged automatically, make an issue with @conda-forge-admin, please add bot automerge in the title and merge the resulting PR. This command will add our bot automerge feature to your feedstock.
  5. If this PR was opened in error or needs to be updated please add the bot-rerun label to this PR. The bot will close this PR and schedule another one. If you do not have permissions to add this label, you can use the phrase @conda-forge-admin, please rerun bot in a PR comment to have the conda-forge-admin add it for you.

Pending Dependency Version Updates

Here is a list of all the pending dependency version updates for this repo. Please double check all dependencies before merging.

Name            Upstream Version   Current Version
bazel           7.4.0              (Anaconda badge)
cudnn           9.4.0.58           (Anaconda badge)
icu             2023-10-04         (Anaconda badge)
libjpeg-turbo   9e                 (Anaconda badge)
protobuf        28.3               (Anaconda badge)
tensorflow      2.18.0             (Anaconda badge)

This PR was created by the regro-cf-autotick-bot. The regro-cf-autotick-bot is a service to automatically track the dependency graph, migrate packages, and propose package version updates for conda-forge. Feel free to drop us a line if there are any issues! This PR was generated by https://github.com/regro/cf-scripts/actions/runs/11511568857 - please use this URL for debugging.

@conda-forge-admin
Contributor

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe/meta.yaml) and found it was in an excellent condition.

@jdblischak
Member

TensorFlow 2.18.0 supports numpy 2. Can we combine this with the numpy 2 migration in #389?

@hmaarrfk
Contributor

Yes, but somebody has to do the hard work of getting the patches updated.

@h-vetinari
Member

Yes, but somebody has to do the hard work of getting the patches updated.

@xhochy has already done that in #405 (it's also not that hard, I just did it before seeing that the patches are updated in that other PR already).

@njzjz
Member

njzjz commented Nov 19, 2024

I cancelled the running CI. As pointed out in #405 (comment), the CI hangs at configure.py due to the following change:

tensorflow/tensorflow@9b5fa66#diff-4d5f3192809ec1b9add6b33007e0c50031ad9a0a2f3f55a481b506468824db2c

@xhochy
Member

xhochy commented Nov 19, 2024

Thanks for the comment! I was away for a bit (longer than expected) and did not remember what my local changes were. They are related to the new hermetic CUDA. Someone should port the stuff I did in jaxlib over here.

@xhochy
Member

xhochy commented Dec 13, 2024

@traversaro @njzjz It would be nice if you could have a look here, too. I have no deep CUDA knowledge and applying the .../../.. patch from jaxlib didn't solve the build issues. Even hardcoding $PREFIX in there didn't help.

@traversaro

@traversaro @njzjz It would be nice if you could have a look here, too. I have no deep CUDA knowledge and applying the .../../.. patch from jaxlib didn't solve the build issues. Even hardcoding $PREFIX in there didn't help.

Sure. The error is:

2024-12-12T19:21:52.0578690Z [21,354 / 28,059] Compiling mlir/lib/Dialect/SparseTensor/IR/SparseTensorDialect.cpp; 32s local ... (4 actions, 3 running)
2024-12-12T19:21:52.1369696Z ERROR: /home/conda/feedstock_root/build_artifacts/tensorflow-split_1734015818369/work/tensorflow/python/user_ops/BUILD:17:29: Action tensorflow/python/user_ops/gen_user_ops_reg_offsets.pb failed: (Exit 127): offset_counter failed: error executing command (from target //tensorflow/python/user_ops:user_ops_reg_offsets) 
2024-12-12T19:21:52.1371752Z   (cd /home/conda/feedstock_root/build_artifacts/tensorflow-split_1734015818369/_build_env/share/bazel/0ac846fcb1a23bd14b850f0071a6803a/execroot/org_tensorflow && \
2024-12-12T19:21:52.1372593Z   exec env - \
2024-12-12T19:21:52.1373425Z   bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/python/framework/offset_counter '--out_path=bazel-out/k8-opt/bin/tensorflow/python/user_ops/gen_user_ops_reg_offsets.pb')
2024-12-12T19:21:52.1374590Z # Configuration: d9ec8b07e01ac9bc5403e30a303561e3a71dda829be23a88a76369079d1e3abc
2024-12-12T19:21:52.1375197Z # Execution platform: @local_execution_config_platform//:platform
2024-12-12T19:21:52.1376260Z bazel-out/k8-opt-exec-50AE0418/bin/tensorflow/python/framework/offset_counter: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
2024-12-12T19:21:53.0665965Z [21,358 / 28,059] checking cached actions
2024-12-12T19:21:53.2765496Z INFO: Elapsed time: 15030.848s, Critical Path: 591.18s
2024-12-12T19:21:53.2765970Z INFO: 21358 processes: 4921 internal, 16437 local.
2024-12-12T19:21:53.2768015Z FAILED: Build did NOT complete successfully
2024-12-12T19:21:54.7894692Z Traceback (most recent call last):
2024-12-12T19:21:54.7899329Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 2557, in build
2024-12-12T19:21:54.7907899Z     utils.check_call_env(
2024-12-12T19:21:54.7945907Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/utils.py", line 406, in check_call_env
2024-12-12T19:21:54.7947521Z     return _func_defaulting_env_to_os_environ("call", *popenargs, **kwargs)
2024-12-12T19:21:54.7948543Z            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-12T19:21:54.7950024Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/utils.py", line 382, in _func_defaulting_env_to_os_environ
2024-12-12T19:21:54.7951687Z     raise subprocess.CalledProcessError(proc.returncode, _args)
2024-12-12T19:21:54.7954100Z subprocess.CalledProcessError: Command '['/bin/bash', '-o', 'errexit', '/home/conda/feedstock_root/build_artifacts/tensorflow-split_1734015818369/work/conda_build.sh']' returned non-zero exit status 1.
2024-12-12T19:21:54.7956456Z 
2024-12-12T19:21:54.7956944Z The above exception was the direct cause of the following exception:
2024-12-12T19:21:54.7957693Z 
2024-12-12T19:21:54.7957950Z Traceback (most recent call last):
2024-12-12T19:21:54.7959159Z   File "/opt/conda/bin/conda-build", line 11, in <module>
2024-12-12T19:21:54.7959979Z     sys.exit(execute())
2024-12-12T19:21:54.7960533Z              ^^^^^^^^^
2024-12-12T19:21:54.7961647Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/cli/main_build.py", line 648, in execute
2024-12-12T19:21:54.7967527Z     api.build(
2024-12-12T19:21:54.7968701Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/api.py", line 211, in build
2024-12-12T19:21:54.7969940Z     return build_tree(
2024-12-12T19:21:54.7970612Z            ^^^^^^^^^^^
2024-12-12T19:21:54.7971782Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 3653, in build_tree
2024-12-12T19:21:54.7987778Z     packages_from_this = build(
2024-12-12T19:21:54.7993074Z                          ^^^^^^
2024-12-12T19:21:54.7994150Z   File "/opt/conda/lib/python3.12/site-packages/conda_build/build.py", line 2565, in build
2024-12-12T19:21:54.8002343Z     raise BuildScriptException(str(exc), caused_by=exc) from exc
2024-12-12T19:21:54.8010830Z conda_build.exceptions.BuildScriptException: Command '['/bin/bash', '-o', 'errexit', '/home/conda/feedstock_root/build_artifacts/tensorflow-split_1734015818369/work/conda_build.sh']' returned non-zero exit status 1.
2024-12-12T19:22:02.9546309Z ##[error]Process completed with exit code 1.

right? At first glance, some library directly links libcuda, which is a bit different from jaxlib, which only dlopen-s the CUDA libraries at runtime (and that is where the patch used in jaxlib/xla/tsl helps).

@xhochy
Member

xhochy commented Dec 13, 2024

right?

Yes.

@njzjz
Member

njzjz commented Dec 13, 2024

error while loading shared libraries: libcuda.so.1:

I reproduced the error locally. There is only a libcuda.so in $BUILD_PREFIX/targets/x86_64-linux/lib/stubs and no libcuda.so.1.

@njzjz
Member

njzjz commented Dec 16, 2024

Any idea why libcuda.so.1 is linked?

2024-12-16T10:02:20.1373520Z Traceback (most recent call last):
2024-12-16T10:02:20.1376023Z   File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1734305778210/_build_env/share/bazel/a79decdc7867a1df54e7213456fc7491/sandbox/processwrapper-sandbox/3/execroot/org_tensorflow_estimator/bazel-out/k8-opt-exec-2B5CBBC6/bin/tensorflow_estimator/python/estimator/api/extractor_wrapper.runfiles/org_tensorflow_estimator/tensorflow_estimator/python/estimator/api/extractor_wrapper.py", line 18, in <module>
2024-12-16T10:02:20.1378536Z     from tensorflow.python.tools.api.generator2.extractor import extractor
2024-12-16T10:02:20.1380390Z   File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1734305778210/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/python3.11/site-packages/tensorflow/__init__.py", line 40, in <module>
2024-12-16T10:02:20.1382481Z     from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow  # pylint: disable=unused-import
2024-12-16T10:02:20.1383103Z     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2024-12-16T10:02:20.1384876Z   File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1734305778210/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/python3.11/site-packages/tensorflow/python/pywrap_tensorflow.py", line 34, in <module>
2024-12-16T10:02:20.1386658Z     self_check.preload_check()
2024-12-16T10:02:20.1388549Z   File "/home/conda/feedstock_root/build_artifacts/tensorflow-split_1734305778210/_h_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_plac/lib/python3.11/site-packages/tensorflow/python/platform/self_check.py", line 63, in preload_check
2024-12-16T10:02:20.1390502Z     from tensorflow.python.platform import _pywrap_cpu_feature_guard
2024-12-16T10:02:20.1391149Z ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
2024-12-16T10:02:20.1871095Z Target //tensorflow_estimator/tools/pip_package:build_pip_package failed to build

@njzjz
Member

njzjz commented Dec 17, 2024

I found that this error disappears on a GPU machine (setting CONDA_FORGE_DOCKER_RUN_ARGS="--gpus all"). We currently use CPU machines, but we could use GPU machines like pytorch does.
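For reference, a minimal sketch of that variable override for a local build (the variable name and value are taken from the comment above; the commented-out build invocation is the usual feedstock helper and is shown only as an assumption about the workflow):

```shell
# Forward the host GPUs into the build container
# (value per the comment above).
export CONDA_FORGE_DOCKER_RUN_ARGS="--gpus all"
echo "$CONDA_FORGE_DOCKER_RUN_ARGS"
# python3 build-locally.py   # then run the usual local build
```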

@jaimergp
Member

This file is in cuda-compat-impl, according to the file search: https://conda-metadata-app.streamlit.app/Search_by_file_path (type libcuda.so.1 there). Linking to the system libcuda via the GPU CI server doesn't sound like the best solution to me 🤔

@njzjz
Member

njzjz commented Dec 18, 2024

This file is in cuda-compat-impl, according to the file search: https://conda-metadata-app.streamlit.app/Search_by_file_path (type libcuda.so.1 there). Linking to the system libcuda via the GPU CI server doesn't sound like the best solution to me 🤔

libcuda.so.1 in cuda_compat is located in the $PREFIX/cuda_compat directory, which cannot be used automatically. To make the build pass with cuda_compat, there are two options: (1) in all build scripts and test sections, copy libcuda.so.1 to $PREFIX/lib or add $PREFIX/cuda_compat to LD_LIBRARY_PATH; (2) use patchelf to add the rpath $PREFIX/cuda_compat to all .so libraries. I feel both are a pain, and I'd like to know your thoughts.
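To make the two options concrete, here is a hedged sketch (the cuda_compat path is per the comment above; the library glob is an assumption, and `--add-rpath` needs patchelf >= 0.14, with `--set-rpath` as the older fallback):

```shell
# Demo against a throwaway prefix so the sketch is safe to run anywhere;
# in a real build script, PREFIX is provided by conda-build.
PREFIX="${PREFIX:-$(mktemp -d)}"

# Option 1: make $PREFIX/cuda_compat visible to the dynamic loader.
export LD_LIBRARY_PATH="${PREFIX}/cuda_compat${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

# Option 2: bake the rpath into each shared library instead.
for lib in "${PREFIX}"/lib/*.so*; do
    [ -e "$lib" ] || continue   # glob matched nothing; skip
    patchelf --add-rpath "${PREFIX}/cuda_compat" "$lib"
done
echo "cuda_compat wiring done"
```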

@jakirkham
Member

I found that this error disappears on a GPU machine (setting CONDA_FORGE_DOCKER_RUN_ARGS="--gpus all"). We currently use CPU machines, but we could use GPU machines like pytorch does.

Right, adding the args --runtime=nvidia --gpus all is what should be done for proper GPU testing. This ensures the GPUs, driver library, etc. are available for use in the container. More details are in these docs.


The alternative, if we don't actually want to test the package and merely want to check its existence, would be to use importlib.util.find_spec to simply check that tensorflow is there.
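As a sketch, such an existence-only check in the recipe's test section could look like the one-liner below (how it gets wired into meta.yaml is an assumption; the point is that find_spec locates the package without running its import-time code, so TensorFlow's CUDA preload checks never fire):

```shell
# Prints True when the package can be located, False otherwise,
# without importing it.
python3 -c "import importlib.util; print(importlib.util.find_spec('tensorflow') is not None)"
```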

@jaimergp
Member

I feel both are a pain, and I'd like to know your thoughts.

(1) doesn't sound so bad, but I do want to know (because I'm not familiar with this part of the stack) whether linking to the driver's libcuda is acceptable in terms of ABI compatibility, or whether we should make an effort to link to the cuda-compat stub (via a symlink, one more search path for the linker, or other workarounds).

@hmaarrfk
Contributor

Would we retain the ability to use tensorflow compiled with CUDA support to run on a CPU-only machine?

@jakirkham
Member

The default build approach is to include a libcuda.so stub library as part of cuda-driver-dev_{{ target_platform }}, which cuda-nvcc_{{ target_platform }} then pulls in. Since the compilers know to search this path, this ensures that the library is found at build time and symbols from the driver library are resolved.

At runtime, libcuda.so should be provided by the system's driver. If that condition is met, the package can be imported and can leverage the CUDA functionality from other libraries to run meaningful tests (beyond existence checks).

However, what seems to be happening in this build is that tensorflow is trying to load libcuda.so without GPUs or the CUDA driver present. The first question we should ask is...

Why does tensorflow want to load libcuda.so during the build?

Some hypotheses worth testing...

  1. Building now requires a GPU
  2. Some library loading happens during the build process leading to a search for libcuda.so
  3. TensorFlow has added a check for a GPU during the build

What solution we pick should depend on what the TensorFlow build system is trying to accomplish.

@jakirkham
Member

jakirkham commented Dec 19, 2024

Side note: It is possible to dlopen the stub library

Here is a simple example of this behavior from Linux ARM

conda create -n tst_cuda_stub python=3.12 ipython cuda-nvcc
conda activate tst_cuda_stub
In [1]: import ctypes

In [2]: ctypes.cdll.LoadLibrary(
   ...:     "/opt/conda/envs/tst/targets/sbsa-linux/lib/stubs/libcuda.so"
   ...: )
Out[2]: <CDLL '/opt/conda/envs/tst/targets/sbsa-linux/lib/stubs/libcuda.so', handle aaaacafa37c0 at 0xffff85e2e720>

I'm just not clear on why TensorFlow wants to load libcuda.so, so I'm not sure whether we should be handling it this way or not (given this is not really the same as loading the driver library).

@h-vetinari
Member

Would we retain the ability to use tensorflow compiled with CUDA support to run on a CPU-only machine?

We have an explicit requirement on __cuda for the CUDA-variant.

# avoid that people without GPUs needlessly download ~0.5-1GB
- __cuda # [cuda_compiler_version != "None"]

So we already don't support that (well, unless someone uses CONDA_CUDA_OVERRIDE). I'm not saying that losing this ability would be desirable, just trying to figure out why it's a concern in the first place.

@hmaarrfk
Contributor

So we already don't support that (well, unless someone uses CONDA_CUDA_OVERRIDE). I'm not saying that losing this ability would be desirable, just trying to figure out why it's a concern in the first place.

I think this is a fair question. The reason that CONDA_CUDA_OVERRIDE exists is to allow advanced users to explicitly request CUDA packages when the "installation system" doesn't really support it.

  1. A login node to a supercomputer might have this. These are typically "interactive nodes" without much "compute". It would be good not to have to duplicate your environments for this.
  2. Creating a system "test" image with one's software.
    • Should be able to test CUDA stuff when a GPU is installed on new hardware.
    • Should be able to test CPU stuff when no GPU is detected (and not fail at import time).

I am personally in camp 2, though, years ago, I was in camp 1.

If I recall correctly, one of the (many) reasons we added __cuda was to avoid CPU-only users filling their disk space (often on a storage-limited laptop), and to save on their installation time.

@h-vetinari
Member

Creating a system "test" image with one's software.

I mean, how well can you test your setup if you're on a system that will end up taking completely different code paths (CPU vs. GPU) compared to the target environment?

In any case, I'm in favour of keeping the ability to run without a GPU driver, but at the same time, I don't think it's worth an extreme maintenance investment if upstream tensorflow now indeed requires that.

@jaimergp
Member

jaimergp commented Dec 19, 2024

According to their docs, no GPU should be needed at build time 🤔

[screenshot of the TensorFlow build documentation]

Or does that mean that building with GPU support does require the actual drivers (and hence a GPU), but CPU-only wheels can be built without them? Nah, it does seem like the former. See this commit: tensorflow/docs@7d4187e

@jakirkham
Member

Ok we could add the stub library to the library search path at build time
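A minimal sketch of that idea, assuming the stub location reported earlier in the thread ($BUILD_PREFIX/targets/x86_64-linux/lib/stubs), and assuming that exposing the stub under the libcuda.so.1 name is acceptable for build-time resolution only:

```shell
# A throwaway dir stands in for BUILD_PREFIX so the sketch runs anywhere;
# in a real build, conda-build sets BUILD_PREFIX.
BUILD_PREFIX="${BUILD_PREFIX:-$(mktemp -d)}"
STUBS="${BUILD_PREFIX}/targets/x86_64-linux/lib/stubs"
mkdir -p "$STUBS"
[ -e "${STUBS}/libcuda.so" ] || touch "${STUBS}/libcuda.so"  # placeholder stub

# The stub package ships only libcuda.so; expose it under the SONAME the
# loader asks for, then put the stub dir on the build-time search path.
ln -sf libcuda.so "${STUBS}/libcuda.so.1"
export LD_LIBRARY_PATH="${STUBS}${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "stub search path: ${STUBS}"
```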

@njzjz
Member

njzjz commented Dec 19, 2024

According to their docs, no GPU should be needed at build time 🤔

[screenshot of the TensorFlow build documentation]

Or does that mean that building with GPU support does require the actual drivers (and hence a GPU), but CPU-only wheels can be built without them? Nah, it does seem like the former. See this commit: tensorflow/docs@7d4187e

Although Bazel is able to download CUDA, including drivers, in conda-forge we set the environment variable LOCAL_CUDA_PATH to use the local CUDA provided by conda-forge, just like other dependencies.

Related documentation can be found here: https://github.com/openxla/xla/blob/main/docs/hermetic_cuda.md

"When CUDA forward compatibility mode is disabled, Bazel targets will use User Mode and Kernel Mode Drivers pre-installed on the system."
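A hedged sketch of that wiring (the LOCAL_CUDA_PATH and LOCAL_CUDNN_PATH variable names are per the hermetic CUDA docs linked above; the concrete directory layout is an assumption about a typical conda-forge build environment):

```shell
# Point Bazel's hermetic CUDA rules at the conda-provided toolkit instead
# of letting them download one. A throwaway dir keeps the sketch runnable.
BUILD_PREFIX="${BUILD_PREFIX:-$(mktemp -d)}"
export LOCAL_CUDA_PATH="${BUILD_PREFIX}/targets/x86_64-linux"
export LOCAL_CUDNN_PATH="${BUILD_PREFIX}"
echo "LOCAL_CUDA_PATH=${LOCAL_CUDA_PATH}"
```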
