Specify additional steps to utilize GPU for Linux users #2299

Merged (15 commits) on Sep 5, 2024

sgkouzias
Contributor

Specify additional steps to utilize GPU for Linux users

Specify additional steps to utilize GPU for Linux users
Advice to skip additional step 6 if using CPU.
@8bitmp3
Contributor

8bitmp3 commented Apr 9, 2024

@MarkDaoust @markmcd

Added second option to create virtual env via Python's built-in venv module for Linux users with CUDA-enabled GPUs
Added virtual envs activation/deactivation commands and changed wording for editing the deactivate block in the activate script of the venv virtual env.
Added instructions to resolve the ptxas issue.
Revised CUDNN_DIR definition
Corrected LD_LIBRARY_PATH definition in conda environment instructions
Rename environment variable to PTXAS_DIR and package manager options.
Added note to use pip instead of conda to install TensorFlow.
Contributor Author

@sgkouzias sgkouzias left a comment

Added steps to install TensorFlow by running pip install tensorflow[and-cuda] inside a virtual environment (option 1: conda, option 2: venv), and to set the environment variables that let TensorFlow locate the compatible NVIDIA libs installed alongside it, so that GPUs are actually utilized. The solution has been tested successfully.

Reference: tensorflow/tensorflow#63362

@sgkouzias
Contributor Author

sgkouzias commented May 10, 2024

@haifeng-jin , @MarkDaoust, @8bitmp3 I await any suggestions or revisions if needed. Do we have any updates?

@sgkouzias sgkouzias marked this pull request as draft May 16, 2024 13:23
@sgkouzias sgkouzias marked this pull request as ready for review May 16, 2024 13:28
@haifeng-jin
Collaborator

As I recall, the currently recommended way to install TF is to use pip. I do not have further info on this. @MarkDaoust may comment on this.

@sgkouzias
Contributor Author

sgkouzias commented May 20, 2024

> As I recall, the currently recommended way to install TF is to use pip. I do not have further info on this. @MarkDaoust may comment on this.

@haifeng-jin it seems practically impossible for someone with a CUDA-enabled GPU to run deep learning experiments locally with TensorFlow version 2.16.1 and actually utilize the GPU without manually performing some extra steps, at least as a temporary fix. Those steps are not included (until today) in the standard installation procedure for Linux users with GPUs in the official TensorFlow documentation!

It turns out that when you pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to manually configure the environment variables as appropriate so that TensorFlow can find them and run on the GPU.
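As an illustration of what configuring the environment variables could look like, here is a minimal sketch. It assumes the pip-installed NVIDIA wheels place their shared libraries under `site-packages/nvidia/<component>/lib`; the helper name `nvidia_lib_path` is hypothetical and is not part of TensorFlow or of this pull request.

```bash
# Sketch (assumption: NVIDIA wheels live in <site-packages>/nvidia/<component>/lib).
# nvidia_lib_path is a hypothetical helper that prints a colon-separated list
# of those lib directories for a given site-packages directory.
nvidia_lib_path() {
    site_packages=$1
    result=""
    for dir in "$site_packages"/nvidia/*/lib; do
        [ -d "$dir" ] || continue            # skip the unexpanded glob when nothing matches
        result="${result:+$result:}$dir"     # append with a ':' separator
    done
    printf '%s\n' "$result"
}

# Intended use inside an activated environment (assumes the wheels are installed):
#   site_packages=$(python3 -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
#   export LD_LIBRARY_PATH="$(nvidia_lib_path "$site_packages")${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Passing the site-packages directory as a parameter keeps the helper independent of any particular environment layout, so it can be tried against a scratch directory before touching a real installation.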

Contributor

@mihaimaruseac mihaimaruseac left a comment

Please don't use "add file"/"update file"/"fix file"/etc. commit messages. These are hard to reason about when looking at the history of the file/repository. Instead, please write explanatory git commit messages.

The commit message is also the title of the PR if the PR has only one commit. It is thus doubly important to have relevant commit messages, as they make PRs easier to understand and easier to analyze in search results.

For how to write good quality git commit messages, please consult https://cbea.ms/git-commit/

@mihaimaruseac
Contributor

> It turns out that when you pip install tensorflow[and-cuda], all required NVIDIA libraries are installed as well. You just need to manually configure the environment variables as appropriate so that TensorFlow can find them and run on the GPU.

Can we instead add these to the install guide?

@sgkouzias sgkouzias changed the title Update pip.md Specify additional steps to utilize GPU for Linux users May 24, 2024
@sgkouzias
Contributor Author

> configure manually the environment variables as appropriate

@mihaimaruseac shouldn't we explain how to manually configure the environment variables as appropriate?

@sgkouzias sgkouzias requested a review from mihaimaruseac May 24, 2024 13:12
Contributor

@mihaimaruseac mihaimaruseac left a comment

I read the update and it seems reasonable to me. Thank you

Removed option to install within conda virtual environment. Recommendation to install in venv environment.
@sgkouzias
Contributor Author

@t-kalinowski thank you very much for your valuable advice. I revised the PR accordingly.

@t-kalinowski

@sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without requiring users to modify the default activate and deactivate scripts.

Replaced instructions to modify default activate/deactivate scripts with instructions to create symlinks to NVIDIA shared libraries and ptxas.
@sgkouzias
Contributor Author

> @sgkouzias if you also create a symlink at my-venv/bin/ptxas -> my-venv/lib/python.../site-packages/.../bin/ptxas, then you could probably get away without requiring users to modify the default activate and deactivate scripts.

@t-kalinowski thank you so much for your advice. Instructions have been totally revised as per your comments. Modifications to default activate and deactivate scripts are not required from users. Instructions should resemble more or less what you do in the R interface.
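The symlink approach for the NVIDIA shared libraries could be sketched roughly as follows. The thread does not quote the final instructions, so the target directory is left as a parameter; the helper name `link_nvidia_libs` and the `nvidia/<component>/lib` layout are assumptions made for illustration.

```bash
# Sketch (assumption: shared libraries sit in <site-packages>/nvidia/<component>/lib).
# link_nvidia_libs is a hypothetical helper that symlinks every matching
# shared library into a caller-chosen target directory.
link_nvidia_libs() {
    site_packages=$1
    target_dir=$2
    for lib in "$site_packages"/nvidia/*/lib/*.so*; do
        [ -e "$lib" ] || continue        # glob matched nothing
        ln -sf "$lib" "$target_dir/"     # -f replaces a stale link with the same name
    done
}

# Intended use (both paths are placeholders to be chosen by the reader):
#   link_nvidia_libs "/path/to/site-packages" "/path/to/target"
```

Because the links are created once inside the environment, nothing in the activate or deactivate scripts has to change, which is the point made in the comment above.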

@sgkouzias
Contributor Author

sgkouzias commented Jun 19, 2024

@8bitmp3, @haifeng-jin, @MarkDaoust even TensorFlow version 2.17.0.rc0 still requires additional steps for Linux users to utilize the GPU. The instructions suggested in this pull request offer a tested solution. I await your comments.


```bash
source tf/bin/activate
deactivate
```

Can you remove deactivate?

Contributor Author

> Can you remove deactivate?

@learning-to-play removed deactivate as advised. Furthermore, I could remove the instruction to create symlink to ptxas since it is ultimately not needed for TensorFlow version 2.17.0.rc0 but only for TensorFlow version 2.16.1. Awaiting your comments.

I want to make sure that I understand the situation correctly. Which of the following two situations is correct?

Contributor Author

@sgkouzias sgkouzias Jun 19, 2024

@learning-to-play the only difference is that on version 2.17.0.rc0 you need to create symlinks to the NVIDIA libs in order to utilize GPUs, while on version 2.16.1 you must also create a symlink to ptxas. Consequently, the command pip install tensorflow[and-cuda] alone fails to work with GPUs on both versions.

@sgkouzias
Contributor Author

sgkouzias commented Jul 1, 2024

@learning-to-play, @SeeForTwo, @8bitmp3, @haifeng-jin, @MarkDaoust, @markmcd

Unfortunately, the latest release, TensorFlow 2.16.2, does not fix the ptxas bug. When running a training script I get the error:

ptxas returned an error during compilation of ptx to sass: 'INTERNAL: ptxas 12.3.103 has a bug that we think can affect XLA. Please use a different version.' If the error message indicates that a file could not be written, please verify that sufficient filesystem space is provided.
Aborted (core dumped)

So it seems that TensorFlow 2.16.2 fails to work with GPUs as well!

Notes:

  1. Successful installation was verified by running:
    python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
  2. The solution included in the submitted pull request (pending review) got rid of the ptxas bug and allowed TensorFlow 2.16.2 to work with my GPU:
    ln -sf $(find $(dirname $(dirname $(python -c "import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)"))/*/bin/) -name ptxas -print -quit) $VIRTUAL_ENV/bin/ptxas
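The nested one-liner above can be hard to read, so here is a sketch that splits it into named steps. The helper name `link_ptxas` is hypothetical; like the original command, it relies on the `-quit` action of GNU find to stop at the first match.

```bash
# Sketch of the one-liner above, split into steps.
# link_ptxas is a hypothetical helper. nvidia_dir is the directory that
# contains the cuda_nvcc package; venv_bin is the environment's bin directory.
link_ptxas() {
    nvidia_dir=$1
    venv_bin=$2
    # Search each <component>/bin directory and stop at the first ptxas found.
    ptxas_path=$(find "$nvidia_dir"/*/bin -name ptxas -print -quit 2>/dev/null)
    if [ -z "$ptxas_path" ]; then
        echo "ptxas not found under $nvidia_dir" >&2
        return 1
    fi
    # Expose it on the environment's PATH under the name ptxas.
    ln -sf "$ptxas_path" "$venv_bin/ptxas"
}

# Intended use inside an activated venv (mirrors the command above):
#   nvidia_dir=$(dirname "$(dirname "$(python -c 'import nvidia.cuda_nvcc; print(nvidia.cuda_nvcc.__file__)')")")
#   link_ptxas "$nvidia_dir" "$VIRTUAL_ENV/bin"
```

Unlike the one-liner, this version reports an error instead of calling `ln` with a missing source argument when no ptxas binary is found.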

@belitskiy
Member

Thank you for the contribution, @sgkouzias :)
Given that the [and-cuda] installation now detects pip-installed CUDA components again, please add a disclaimer specifying that the symbolic links are only necessary in case the intended way doesn't work, i.e. the components aren't being detected and/or conflict with the existing system CUDA installation (like ptxas for you).
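One way to read this request is "test first, apply the workaround only if needed". A minimal sketch of that check follows; the helper name `gpu_listed` is hypothetical, and it only inspects the text printed by the verification command used earlier in this thread.

```bash
# Sketch: decide whether the symlink workaround is worth trying, based on the
# text printed by tf.config.list_physical_devices('GPU').
# gpu_listed is a hypothetical helper that reads that text on standard input
# and succeeds when at least one GPU device appears in it.
gpu_listed() {
    grep -q "device_type='GPU'"
}

# Intended use (assumes TensorFlow is installed in the active environment):
#   if python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" | gpu_listed; then
#       echo "GPU detected; symbolic links are not needed"
#   else
#       echo "No GPU detected; consider the symbolic-link workaround"
#   fi
```

This checks detection only. A ptxas failure like the one reported above shows up when a model is compiled, so a listed GPU does not rule out the need for the ptxas link.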

Revised the step with instructions to configure the virtual environment variables for GPU users by adding a disclaimer.
@sgkouzias
Contributor Author

> Thank you for the contribution, @sgkouzias :) Given that the [and-cuda] installation now detects pip-installed CUDA components again, please add a disclaimer specifying that the symbolic links are only necessary in case the intended way doesn't work, i.e. the components aren't being detected and/or conflict with the existing system CUDA installation (like ptxas for you).

@belitskiy, @learning-to-play I revised instructions as advised and will be awaiting your feedback. It is my honor to contribute to the TensorFlow community.

Deleted asterisk emoji and placed disclaimer note before respective instructions.
8bitmp3 previously approved these changes Sep 4, 2024
Contributor

@8bitmp3 8bitmp3 left a comment

site/en/install/pip.md
@MarkDaoust
Member

Thanks for all your work everyone (especially @sgkouzias)!

I just tweaked the order so that this new GPU debugging step is after the step where you test the GPU.

I think this is still right so I'm merging it. But LMK if I misunderstood anything.

@sgkouzias
Contributor Author

> Thanks for all your work everyone (especially @sgkouzias)!
>
> I just tweaked the order so that this new GPU debugging step is after the step where you test the GPU.
>
> I think this is still right so I'm merging it. But LMK if I misunderstood anything.

Thank you @MarkDaoust 🙏 it is my honour.
I noticed you mentioned merging, but it seems the pull request still needs a formal review due to branch protection rules. Could you please take a quick look and approve it when you have a chance?
Many thanks again!

@sgkouzias sgkouzias requested a review from MarkDaoust September 5, 2024 16:44
@MarkDaoust
Member

Really, it has everything it needs; we're just waiting for the internal merge. It should be through soon.

@copybara-service copybara-service bot merged commit 27ba8a4 into tensorflow:master Sep 5, 2024
5 checks passed

10 participants