
New optimizers fail to load CUDA installed through conda #62

Open
drasmuss opened this issue Jan 13, 2023 · 14 comments
@drasmuss

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): no
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 (WSL)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.11
  • Python version: 3.9
  • Bazel version (if compiling from source): N/A
  • GPU model and memory: RTX 2080 Ti
  • Exact command to reproduce:
  1. Create a new environment, following the official installation instructions from here https://www.tensorflow.org/install/pip#linux:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install tensorflow
  2. Run the beginner MNIST tutorial (or any other tutorial that calls fit) from here: https://keras.io/examples/vision/mnist_convnet/

Describe the problem.

An error is raised:

libdevice not found at ./libdevice.10.bc

Note that if you switch to the legacy optimizers, by changing this line

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

to this

model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.legacy.Adam(), metrics=["accuracy"])

then the example runs successfully.
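
For reference, a self-contained sketch of the same contrast (a hypothetical minimal model standing in for the MNIST tutorial; any model whose fit() call reaches the optimizer update step will do):

import tensorflow as tf
from tensorflow import keras

# Minimal stand-in for the MNIST tutorial: any fit() call that reaches the
# optimizer update step triggers the error on the broken setup.
model = keras.Sequential([keras.layers.Dense(10, input_shape=(10,))])

# Fails with "libdevice not found at ./libdevice.10.bc": the new optimizer
# compiles its update step with XLA (see _update_step_xla in the trace below).
model.compile(loss="mse", optimizer=keras.optimizers.Adam())

# Works: the legacy optimizer skips the XLA-compiled update step.
# model.compile(loss="mse", optimizer=keras.optimizers.legacy.Adam())

model.fit(tf.ones((32, 10)), tf.ones((32, 10)))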

Describe the current behavior.

An error occurs when running the example.

Describe the expected behavior.

The example should run without error, as it does when using the legacy optimizers.

  • Do you want to contribute a PR? (yes/no): no

Standalone code to reproduce the issue.

https://keras.io/examples/vision/mnist_convnet/

Source code / logs.

Full stack trace of the error:

    File ".../tmp.py", line 47, in <module>
      model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
      return step_function(self, iterator)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
      outputs = model.train_step(data)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
      self.apply_gradients(grads_and_vars)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_4'
libdevice not found at ./libdevice.10.bc
         [[{{node StatefulPartitionedCall_4}}]] [Op:__inference_train_function_1026]

Likely related to:

@tilakrayal
Collaborator

@gowthamkpr,
I tried to execute the mentioned code in two different ways, as below, but could not reproduce the issue. Kindly find the gist of it here.
model.compile(loss=tf.losses.mse, optimizer=tf.keras.optimizers.SGD())

and

model.compile(loss=tf.losses.mse, optimizer=tf.keras.optimizers.legacy.SGD())

@tilakrayal tilakrayal assigned gowthamkpr and unassigned tilakrayal Jan 17, 2023
@drasmuss
Author

It doesn't look like your gist is following step 1 of the reproduction instructions above (i.e., create a new environment and install CUDA through conda).

@kevint0

kevint0 commented Jan 18, 2023

I have been encountering the same issue as @drasmuss with the non-legacy optimisers "adam" and "rmsprop". No errors with the SGD optimiser, though. Below is the error from trying to run my script with the "rmsprop" optimiser.

Node: 'StatefulPartitionedCall_8'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_8}}]] [Op:__inference_train_function_1102]
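
A possible explanation for why SGD is unaffected (a guess based on the _update_step_xla frame in the stack trace above): the new optimizers compile their update step with XLA, and the Adam/RMSprop updates use math routines that XLA pulls from libdevice, while a plain SGD update may not need any. If that is right, disabling JIT compilation on the optimizer should also sidestep the error. A sketch, assuming the jit_compile argument accepted by the TF 2.11 optimizers:

import tensorflow as tf

model = tf.keras.models.Sequential([tf.keras.layers.Dense(10, input_shape=(10,))])

# jit_compile=False keeps the optimizer update step out of XLA, so libdevice
# should never be needed. (Assumption: this relies on the jit_compile argument
# of the TF 2.11 optimizers, at the cost of the XLA-compiled update step.)
model.compile(loss="mse", optimizer=tf.keras.optimizers.Adam(jit_compile=False))
model.fit(tf.ones((32, 10)), tf.ones((32, 10)))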

@mhaas

mhaas commented Feb 24, 2023

Hi,

adding a "me too" here - hoping it adds value and not just noise :)

I'm also seeing this issue in the following setup:

  • CUDA 11.7 installed on SLES from RPM packages (via the official Nvidia repo)
  • cuDNN 8.5.0 installed from cudnn-linux-x86_64-8.5.0.96_cuda11-archive.tar.xz
  • Tensorflow 2.11 installed via pip

This was not an issue with Tensorflow 2.10. With 2.11, I now get:

libdevice not found at ./libdevice.10.bc

@SuryanarayanaY SuryanarayanaY self-assigned this Apr 26, 2023
@SuryanarayanaY
Contributor

@drasmuss,

I believe this is no longer an issue. I have cross-checked with the legacy optimizer and execution succeeds. Please refer to the attached logs below, and confirm whether this is still an issue for you.

17422_logs.txt

@drasmuss
Author

Just checked, and it produces the same error as before. Here are the reproduction steps (I updated the installation instructions to match the changes for TF 2.12, per https://www.tensorflow.org/install/pip#linux):

# these are the standard TF installation steps, copied here for clarity
conda install -c conda-forge cudatoolkit=11.8.0
python3 -m pip install nvidia-cudnn-cu11==8.6.0.163 tensorflow==2.12.*
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'CUDNN_PATH=$(dirname $(python -c "import nvidia.cudnn;print(nvidia.cudnn.__file__)"))' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/:$CUDNN_PATH/lib' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

# calling model.fit triggers the same error as before
python -c "import tensorflow as tf; model = tf.keras.models.Sequential([tf.keras.layers.Dense(10, input_shape=(10,))]); model.compile(loss=tf.losses.mse); model.fit(tf.ones((32, 10)), tf.ones((32, 10)))"

Here is the full error log:

2023-04-26 12:33:40.887265: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-04-26 12:33:40.912502: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-26 12:33:41.303191: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2023-04-26 12:33:41.877110: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:41.892418: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:41.892768: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:41.894668: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:41.894937: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:41.895171: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:42.496930: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:42.497198: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:42.497225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1722] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2023-04-26 12:33:42.497474: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:982] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2023-04-26 12:33:42.497531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 8859 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5
2023-04-26 12:33:43.559322: I tensorflow/compiler/xla/service/service.cc:169] XLA service 0x7fad47d31b00 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-04-26 12:33:43.559369: I tensorflow/compiler/xla/service/service.cc:177]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
2023-04-26 12:33:43.562463: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-04-26 12:33:43.668017: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:424] Loaded cuDNN version 8600
2023-04-26 12:33:43.673970: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:530] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice.
Searched for CUDA in the following directories:
  ./cuda_sdk_lib
  /usr/local/cuda-11.8
  /usr/local/cuda
  .
You can choose the search directory by setting xla_gpu_cuda_data_dir in HloModule's DebugOptions.  For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
2023-04-26 12:33:43.674114: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-26 12:33:43.674287: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
2023-04-26 12:33:43.674323: I tensorflow/core/common_runtime/executor.cc:1197] [/job:localhost/replica:0/task:0/device:GPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INTERNAL: libdevice not found at ./libdevice.10.bc
         [[{{node StatefulPartitionedCall_1}}]]
2023-04-26 12:33:43.682794: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:274] libdevice is required by this HLO module but was not found at ./libdevice.10.bc
2023-04-26 12:33:43.682947: W tensorflow/core/framework/op_kernel.cc:1830] OP_REQUIRES failed at xla_ops.cc:362 : INTERNAL: libdevice not found at ./libdevice.10.bc
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File ".../lib/python3.9/site-packages/tensorflow/python/eager/execute.py", line 52, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node 'StatefulPartitionedCall_1' defined at (most recent call last):
    File "<string>", line 1, in <module>
    File ".../lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File ".../lib/python3.9/site-packages/keras/engine/training.py", line 1685, in fit
      tmp_logs = self.train_function(iterator)
    File ".../lib/python3.9/site-packages/keras/engine/training.py", line 1284, in train_function
      return step_function(self, iterator)
    File ".../lib/python3.9/site-packages/keras/engine/training.py", line 1268, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File ".../lib/python3.9/site-packages/keras/engine/training.py", line 1249, in run_step
      outputs = model.train_step(data)
    File ".../lib/python3.9/site-packages/keras/engine/training.py", line 1054, in train_step
      self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
    File ".../lib/python3.9/site-packages/keras/optimizers/optimizer.py", line 543, in minimize
      self.apply_gradients(grads_and_vars)
    File ".../lib/python3.9/site-packages/keras/optimizers/optimizer.py", line 1174, in apply_gradients
      return super().apply_gradients(grads_and_vars, name=name)
    File ".../lib/python3.9/site-packages/keras/optimizers/optimizer.py", line 650, in apply_gradients
      iteration = self._internal_apply_gradients(grads_and_vars)
    File ".../lib/python3.9/site-packages/keras/optimizers/optimizer.py", line 1200, in _internal_apply_gradients
      return tf.__internal__.distribute.interim.maybe_merge_call(
    File ".../lib/python3.9/site-packages/keras/optimizers/optimizer.py", line 1250, in _distributed_apply_gradients_fn
      distribution.extended.update(
    File ".../lib/python3.9/site-packages/keras/optimizers/optimizer.py", line 1245, in apply_grad_to_update_var
      return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_1'
libdevice not found at ./libdevice.10.bc
         [[{{node StatefulPartitionedCall_1}}]] [Op:__inference_train_function_401]

It's possible that you have CUDA installed elsewhere on your system (not through conda), and tensorflow is finding libdevice in that installation.
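
One quick way to check that is to list the places the warning above says XLA searches, alongside the conda environment itself (a diagnostic sketch; the patterns simply mirror the search paths printed in the log):

import glob
import os

# Paths from the "Searched for CUDA in the following directories" warning,
# plus the conda env, where conda-forge's cudatoolkit puts libdevice.10.bc.
patterns = [
    "./cuda_sdk_lib/nvvm/libdevice/libdevice*.bc",
    "/usr/local/cuda*/nvvm/libdevice/libdevice*.bc",
    os.path.join(os.environ.get("CONDA_PREFIX", ""), "lib", "libdevice*.bc"),
]
for pattern in patterns:
    print(pattern, "->", glob.glob(pattern) or "not found")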

@SuryanarayanaY
Contributor

Hi @drasmuss,

Could you please try the following commands and let us know whether they fix the error:

# Install NVCC
conda install -c nvidia cuda-nvcc=11.3.58
# Configure the XLA cuda directory
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# Copy libdevice file to the required path
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

Thanks!
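
This should work because XLA resolves libdevice relative to the directory passed in --xla_gpu_cuda_data_dir, expecting it at <dir>/nvvm/libdevice/libdevice.10.bc (per the "Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice" warning in the log above), which is why the copy step is needed. A quick sanity check after running the commands (a sketch, assuming the conda environment has been re-activated):

import os

# After the workaround, XLA should find libdevice at
# $CONDA_PREFIX/lib/nvvm/libdevice/libdevice.10.bc.
prefix = os.environ["CONDA_PREFIX"]
libdevice = os.path.join(prefix, "lib", "nvvm", "libdevice", "libdevice.10.bc")
print(libdevice, "exists:", os.path.exists(libdevice))
print("XLA_FLAGS =", os.environ.get("XLA_FLAGS"))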

@drasmuss
Author

drasmuss commented May 9, 2023

Yes, that makes the problem go away, although I would hesitate to call it a solution as that's quite a cumbersome process to repeat every time we create a new environment, and a definite downgrade in user experience compared to TF <= 2.10.
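
One way to trim the per-environment ceremony is to set the flag from the training script itself, before TensorFlow is imported (a sketch; it still assumes libdevice.10.bc has been copied under $CONDA_PREFIX/lib/nvvm/libdevice as above, and that XLA reads XLA_FLAGS when it first initializes rather than at shell startup):

import os

# Must run before TensorFlow is imported so XLA sees the flag when it
# initializes. Assumes libdevice.10.bc was already copied to
# $CONDA_PREFIX/lib/nvvm/libdevice/ as in the workaround above.
os.environ["XLA_FLAGS"] = (
    "--xla_gpu_cuda_data_dir=" + os.path.join(os.environ["CONDA_PREFIX"], "lib")
)

import tensorflow as tf  # imported after the env var is set, on purpose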

@rchao
Contributor

rchao commented May 18, 2023

@chenmoneygithub do you know if this is a known issue?

@danieljwiest

I ran into this same issue with WSL2, and the proposed fix did not initially work for me; however, it did eventually work once I rebooted my computer after updating the $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh file.

I had similar issues with the published pip install instructions for TensorFlow (https://www.tensorflow.org/install/pip#step-by-step_instructions) and had to reboot my system between several of the steps.

Hopefully this helps anyone running into this issue on WSL2.

@sachinprasadhs sachinprasadhs transferred this issue from keras-team/keras Sep 22, 2023
@joaomamede

I had this problem as well and the fix above worked.

@Datagniel

Datagniel commented Oct 26, 2023

I ran into the same problem on Linux Mint victoria 21.2 x86_64 after creating a new environment with conda and installing tensorflow-gpu version 2.12.1 from the conda-forge channel.
As suggested by @SuryanarayanaY, I used his approach but without specifying the cuda-nvcc version (it installed 12.3.52), and it worked. Thank you very much again for the solution, @SuryanarayanaY!

conda install -c nvidia cuda-nvcc
# Configure the XLA cuda directory
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
printf 'export XLA_FLAGS=--xla_gpu_cuda_data_dir=$CONDA_PREFIX/lib/\n' >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
# Copy libdevice file to the required path
mkdir -p $CONDA_PREFIX/lib/nvvm/libdevice
cp $CONDA_PREFIX/lib/libdevice.10.bc $CONDA_PREFIX/lib/nvvm/libdevice/

@LogExE

LogExE commented Nov 12, 2023

Same issue on Fedora 39 with a fresh tensorflow install from conda-forge; @SuryanarayanaY's fix works.

@makra89

makra89 commented Jan 30, 2024

Same issue, any official fix yet?
