[BE]: Update cudnn to 8.9.7.29 #120642

Closed

Conversation

Skylion007
Collaborator

@Skylion007 Skylion007 commented Feb 26, 2024

Update cuDNN to 8.9.7.29. We just updated the cuDNN frontend, so we might as well. This release mostly has improvements for the cuDNN flash attention implementation, which we are interested in exploring, such as in #115663.


pytorch-bot bot commented Feb 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/120642

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Unrelated Failures

As of commit ece0cce with merge base 8bf9e99:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Feb 26, 2024
@Skylion007 Skylion007 force-pushed the skylion007/update-cudnn-8-9-7-29 branch from 809a556 to 1b638fb Compare February 26, 2024 21:04
@Skylion007 Skylion007 changed the title [BE]: Update cudnn to 8-9-7-29 [BE]: Update cudnn to 8.9.7.29 Feb 26, 2024
@Skylion007 Skylion007 added the better-engineering Relatively self-contained tasks for better engineering contributors label Feb 26, 2024
@eqy eqy added ciflow/trunk Trigger trunk jobs on your pull request ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR labels Feb 26, 2024
@Skylion007
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • OSS CI (alband, dagitses, pytorch/pytorch-dev-infra)
  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team Raised by workflow job

Failing merge rule: Core Maintainers

@Skylion007
Collaborator Author

@pytorchbot merge

@malfet
Contributor

malfet commented Feb 26, 2024

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here

@pytorchmergebot
Collaborator

Merge failed

Reason: 3 jobs have failed, first few of them are: trunk, linux-binary-manywheel, linux-binary-libtorch-cxx11-abi

Details for Dev Infra team Raised by workflow job

@albanD albanD removed their request for review February 26, 2024 22:38
@Skylion007
Collaborator Author

@pytorchbot merge

@malfet
Contributor

malfet commented Mar 1, 2024

OK, the sad thing is that there is no nvidia-cudnn-cu11==8.9.7.29 package on PyPI, only 8.9.6 at the time of writing.

@eqy
Collaborator

eqy commented Mar 1, 2024

Let me ping the cuDNN team about it, as 8.9.7 is among the more well-tested releases and should be hosted at this point.

@ptrblck
Collaborator

ptrblck commented Mar 1, 2024

Yes, @malfet is correct and nvidia-cudnn-cu11==8.9.7.29 is not available as cuDNN is running into the project size limitations.
RFE to increase the size limit for nvidia-cudnn-cu11: pypi/support#3402
RFE to increase the size limit for nvidia-cudnn-cu12: pypi/support#3408

Both were opened in ~Dec. 2023 and we haven't received any updates yet. Unsure if @rgommers would know more
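
A quick way to confirm whether a given version has actually been uploaded is to look at the project's metadata from PyPI's JSON API. The following is a minimal sketch, not something used in this PR: the helper names `has_release` and `fetch_metadata` are hypothetical, the sample metadata is hand-written placeholder data rather than the real listing, and it assumes the project-level JSON response exposes a `releases` mapping from version string to a list of uploaded files.

```python
import json
from urllib.request import urlopen


def has_release(metadata: dict, version: str) -> bool:
    """Return True if `version` has at least one uploaded file in the metadata.

    Assumes `metadata` follows the shape of PyPI's project-level JSON response,
    where "releases" maps a version string to a list of file entries.
    """
    files = metadata.get("releases", {}).get(version, [])
    return len(files) > 0


def fetch_metadata(project: str) -> dict:
    """Fetch project metadata from PyPI's JSON API (needs network access)."""
    with urlopen(f"https://pypi.org/pypi/{project}/json", timeout=30) as resp:
        return json.load(resp)


# Offline illustration with hypothetical placeholder data (not the real listing).
sample = {
    "releases": {
        "1.0.0": [{"filename": "example-1.0.0-py3-none-any.whl"}],
        "1.1.0": [],  # a version key with no files is treated as unavailable
    }
}
print(has_release(sample, "1.0.0"))
print(has_release(sample, "1.1.0"))
```

For a live check, something like `has_release(fetch_metadata("nvidia-cudnn-cu11"), "8.9.7.29")` could be run. The result depends on what is hosted at the time, so no output is shown here.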

@rgommers
Collaborator

rgommers commented Mar 4, 2024

> Both were opened in ~Dec. 2023 and we haven't received any updates yet. Unsure if @rgommers would know more

Not much more. There is an active discussion on Discourse about issues with responses to PyPI support requests in general. I commented on that thread a few days ago about issues with size limit requests; there has been no response or activity yet.

I'll note that the situation for JAX that I linked to in that comment seems even worse; they have been deleting their old releases for a while now to make space for new releases (which of course broke some users who pinned those old versions).

For PyTorch itself I think we've avoided hard release blockers so far (correct me if I'm wrong @malfet), but it really isn't a good situation that important projects like cuDNN cannot upload releases.

The canonical discussion on this is "What to do about GPUs? (and the built distributions that support them)"; that never really got resolved.

I am considering writing a larger new post about it later this week, depending on the urgency, but I really don't know if it'll help given how politicized that issue has become. There aren't many PyPI admins, and most are volunteers. And the one active PSF staff member doesn't have much time allocated to this; it looks like they're trying to get a new paid support engineer funded by the PSF, but that may take a while to materialize.

@Skylion007
Collaborator Author

It would be nice to update to the latest cuDNN that we can support for 2.3. What's the latest version that is feasible to support?

@malfet
Contributor

malfet commented Mar 8, 2024

I can update our index to 8.9.7.29 from NVIDIA's PyPI index; that should unblock both this change and the release.

@@ -5,11 +5,11 @@ if [[ ${CUDNN_VERSION} == 8 ]]; then
     mkdir tmp_cudnn
     pushd tmp_cudnn
     if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
-        CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
+        CUDNN_NAME="cudnn-linux-x86_64-8.9.7.29_cuda12-archive"
Contributor

Please update only the cuDNN version for CUDA 11.8. At this point we want to upgrade cuDNN only for CUDA 11.8.

Collaborator

@atalman why do we not want to update cuDNN for the CUDA 12 builds? This would cause a divergence.

Contributor

@atalman atalman Mar 12, 2024

@ptrblck Sorry, yes: the issue exists for the cu11 cuDNN package but not for cu12, my bad. I do see cuDNN 8.9.7.29 available at https://pypi.org/project/nvidia-cudnn-cu12/#files.

nWEIdia added a commit to nWEIdia/builder that referenced this pull request Mar 12, 2024
@Skylion007 Skylion007 force-pushed the skylion007/update-cudnn-8-9-7-29 branch from 878e885 to b132d79 Compare March 13, 2024 20:01
@atalman
Contributor

atalman commented Mar 14, 2024

@Skylion007 @malfet Here is the additional work required for cuDNN: https://github.com/pytorch/builder/blob/main/CUDA_UPGRADE_GUIDE.MD#upgrade-cudnn-version-only
For Windows we would need to rebuild the AMI.

atalman pushed a commit to pytorch/builder that referenced this pull request Apr 8, 2024
* Windows CUDA 12.4 changes
Reference: #1376

* Update cudnn to 8.9.7.29 to align with pytorch/pytorch#120642
@atalman
Contributor

atalman commented Apr 11, 2024

@pytorchmergebot rebase -b main

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/main. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased skylion007/update-cudnn-8-9-7-29 onto refs/remotes/origin/main, please pull locally before adding more changes (for example, via git checkout skylion007/update-cudnn-8-9-7-29 && git pull --rebase)

@Skylion007
Collaborator Author

Closing in favor of #123475

@Skylion007 Skylion007 closed this May 12, 2024
@Skylion007 Skylion007 reopened this May 14, 2024
@Skylion007
Collaborator Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased skylion007/update-cudnn-8-9-7-29 onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout skylion007/update-cudnn-8-9-7-29 && git pull --rebase)

@pytorchmergebot pytorchmergebot force-pushed the skylion007/update-cudnn-8-9-7-29 branch from dfae8ee to ece0cce Compare May 14, 2024 14:51
@nWEIdia
Collaborator

nWEIdia commented Jun 10, 2024

Closing as v9 is in :)

@nWEIdia nWEIdia closed this Jun 10, 2024
Labels
better-engineering Relatively self-contained tasks for better engineering contributors ciflow/periodic Trigger jobs ran periodically on master (periodic.yml) on the PR ciflow/trunk Trigger trunk jobs on your pull request open source topic: not user facing topic category