Update CUDA 11.8, cuDNN and PyCUDA #8295

fwyzard
Contributor

@fwyzard fwyzard commented Feb 6, 2023

Update to CUDA 11.8 and related software:

  • update CUDA to version 11.8.0;
  • update the compatibility NVIDIA drivers to version 520.61.05;
  • update cuDNN to version 8.7.0.84 for CUDA 11.8;
  • update PyCUDA to 2022.2.2.

fwyzard and others added 3 commits February 6, 2023 23:27
The main change since CUDA 11.5.x is the support for the Lovelace (sm_89) and
Hopper (sm_90) architectures.

See https://docs.nvidia.com/cuda/archive/11.8.0/cuda-toolkit-release-notes/index.html
for the full CUDA 11.8.0 release notes and change log.
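
For illustration, the two architectures named above correspond to compute capabilities 8.9 (Ada Lovelace) and 9.0 (Hopper). The sketch below builds the matching `nvcc` `-gencode` options; the helper name `gencode_flags` and the dictionary are hypothetical, and the flags actually used by the cmsdist spec files are not shown in this thread.

```python
# Minimal sketch: build nvcc -gencode options for a list of compute capabilities.
# The helper name and the mapping below are illustrative, not taken from cmsdist.

NEW_IN_CUDA_11_8 = {"Ada Lovelace": "8.9", "Hopper": "9.0"}


def gencode_flags(compute_capabilities):
    """Return one '-gencode arch=compute_XY,code=sm_XY' option per capability."""
    flags = []
    for cc in compute_capabilities:
        major, minor = cc.split(".")
        sm = f"{major}{minor}"
        flags.append(f"-gencode arch=compute_{sm},code=sm_{sm}")
    return flags


if __name__ == "__main__":
    for name, cc in NEW_IN_CUDA_11_8.items():
        print(name, gencode_flags([cc])[0])
```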
@fwyzard
Contributor Author

fwyzard commented Feb 6, 2023

enable gpu

@fwyzard
Contributor Author

fwyzard commented Feb 6, 2023

please test

@cmsbuild
Contributor

cmsbuild commented Feb 6, 2023

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_13_0_X/master.

@smuzaffar, @aandvalenzuela, @iarspider can you please review it and eventually sign? Thanks.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.
cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Feb 6, 2023

please test for el8_ppc64le_gcc11

@fwyzard
Contributor Author

fwyzard commented Feb 6, 2023

please test for el8_aarch64_gcc11

@fwyzard
Contributor Author

fwyzard commented Feb 6, 2023

hold

@cmsbuild
Contributor

cmsbuild commented Feb 6, 2023

Pull request has been put on hold by @fwyzard
They need to issue an unhold command to remove the hold state or L1 can unhold it for all

@cmsbuild cmsbuild added the hold label Feb 6, 2023
@fwyzard
Contributor Author

fwyzard commented Feb 6, 2023

In stand-alone tests we have observed a small drop in performance with all CUDA versions after 11.5, so we should check the impact on the HLT performance before merging this.

@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30438/summary.html
COMMIT: 565921d
CMSSW: CMSSW_13_0_X_2023-02-05-2300/el8_aarch64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8295/30438/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

Target //tensorflow/tools/pip_package:build_pip_package failed to build
INFO: Elapsed time: 1138.545s, Critical Path: 175.74s
INFO: 3613 processes: 778 internal, 2835 local.
FAILED: Build did NOT complete successfully
FAILED: Build did NOT complete successfully
error: Bad exit status from /data/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.8DrGJF (%build)


RPM build errors:
line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+tensorflow-sources+2.6.4-166022720595f2d89e239d15b7dfe73d
Bad exit status from /data/cmsbuild/jenkins_a/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.8DrGJF (%build)


@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30437/summary.html
COMMIT: 565921d
CMSSW: CMSSW_13_0_X_2023-02-05-2300/el8_ppc64le_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8295/30437/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30437/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30437/git-merge-result

@cmsbuild
Contributor

cmsbuild commented Feb 7, 2023

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30435/summary.html
COMMIT: 565921d
CMSSW: CMSSW_13_0_X_2023-02-06-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8295/30435/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30435/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30435/git-merge-result

Unit Tests

I found errors in the following unit tests:

---> test testTriggerMonitors had ERRORS

Comparison Summary

Summary:

  • You potentially added 26 lines to the logs
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555495
  • DQMHistoTests: Total failures: 225
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555248
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 2
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19860
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found
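
As a quick sanity check on the main comparison summary above, the sketch below parses the `DQMHistoTests` lines and verifies that failures, nulls, successes and skipped histograms add up to the total compared. That relation is inferred from the numbers reported in this thread, not from the cms-bot documentation, and the function names are hypothetical.

```python
import re

# Counts copied from the first (non-GPU) comparison summary in this thread.
SUMMARY = """\
DQMHistoTests: Total histograms compared: 3555495
DQMHistoTests: Total failures: 225
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 3555248
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
"""


def parse_counts(text):
    """Map each 'DQMHistoTests: Total <name>: <n>' line to {name: n}."""
    pattern = re.compile(r"DQMHistoTests: Total ([\w ]+): (\d+)")
    return {name.strip().lower(): int(value) for name, value in pattern.findall(text)}


def check(counts):
    """Return (totals_consistent, failure_fraction) under the assumed relation."""
    total = counts["histograms compared"]
    accounted = (counts["failures"] + counts["nulls"]
                 + counts["successes"] + counts["skipped"])
    return accounted == total, counts["failures"] / total


if __name__ == "__main__":
    consistent, fraction = check(parse_counts(SUMMARY))
    print(f"consistent={consistent} failure fraction={fraction:.2e}")
```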

@fwyzard
Contributor Author

fwyzard commented Feb 8, 2023

please test with #8258,cms-sw/cmssw#40645

@cmsbuild
Contributor

cmsbuild commented Feb 8, 2023

-1

Failed Tests: Build ClangBuild
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30498/summary.html
COMMIT: 565921d
CMSSW: CMSSW_13_0_X_2023-02-07-2300/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8295/30498/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30498/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30498/git-merge-result

Build

I found compilation error when building:

>> Leaving Package Utilities/StaticAnalyzers
>> Package Utilities/StaticAnalyzers built
Copying tmp/el8_amd64_gcc11/src/DataFormats/SoATemplate/test/testRocmSoALayoutAndView_t/libtestRocmSoALayoutAndView_t_rocm.a to productstore area:
cp: cannot stat 'tmp/el8_amd64_gcc11/src/DataFormats/SoATemplate/test/testRocmSoALayoutAndView_t/libtestRocmSoALayoutAndView_t_rocm.a': No such file or directory
>> Deleted: tmp/el8_amd64_gcc11/src/DataFormats/SoATemplate/test/testRocmSoALayoutAndView_t/libtestRocmSoALayoutAndView_t_rocm.a
gmake: *** [config/SCRAM/GMake/Makefile.rules:1740: tmp/el8_amd64_gcc11/src/DataFormats/SoATemplate/test/testRocmSoALayoutAndView_t/libtestRocmSoALayoutAndView_t_rocm.a] Error 1
>> Entering Package Configuration/DataProcessing
>> Leaving Package Configuration/DataProcessing
>> Package Configuration/DataProcessing built
>> Compiling  /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-07-2300/src/Configuration/DataProcessing/test/TestCfg.cpp
>> Building binary TestConfigDP


Clang Build

I found compilation error while trying to compile with clang. Command used:

USER_CUDA_FLAGS='--expt-relaxed-constexpr' USER_CXXFLAGS='-Wno-register -fsyntax-only' scram build -k -j 32 COMPILER='llvm compile'

>> Entering Package Validation/TrackerDigis
>> Entering Package Validation/TrackerHits
>> Entering Package Validation/TrackerRecHits
>> Entering Package Validation/TrackingMCTruth
>> Compile sequence completed for CMSSW CMSSW_13_0_X_2023-02-07-2300
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 1
+ eval scram build outputlog '&&' '(python3' /data/cmsbld/jenkins/workspace/ib-run-pr-tests/cms-bot/buildLogAnalyzer.py --logDir /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-07-2300/tmp/el8_amd64_gcc11/cache/log/src '||' 'true)'
++ scram build outputlog
>> Entering Package Alignment/OfflineValidation
>> Compiling  /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-07-2300/src/Alignment/OfflineValidation/bin/DMRmerge.cc
>> Compiling  /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_13_0_X_2023-02-07-2300/src/Alignment/OfflineValidation/bin/Options.cc


@fwyzard
Contributor Author

fwyzard commented Feb 9, 2023

@smuzaffar it seems that making a local installation out of the bot's build does not work any more?

I've done

/cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8295/30498/install.sh
cd CMSSW_13_0_X_2023-02-07-2300/src
cmsenv
cd DataFormats/SoATemplate/test
scram b

and I get an error like

>> Local Products Rules ..... started
>> Local Products Rules ..... done
gmake: *** No rule to make target 'tmp/el8_amd64_gcc11/src/DataFormats/SoATemplate/test/testCudaSoALayoutAndView_t/SoALayoutAndView_t.cu.o', needed by 'tmp/el8_amd64_gcc11/src/DataFormats/SoATemplate/test/testCudaSoALayoutAndView_t/compile'.  Stop.
gmake: *** Waiting for unfinished jobs....
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

I get the same in other packages and from a different cmsdist PR.

@fwyzard
Contributor Author

fwyzard commented Feb 9, 2023

Also, the error seems unrelated to CUDA.

@fwyzard
Contributor Author

fwyzard commented Feb 9, 2023

please test

@fwyzard
Contributor Author

fwyzard commented Feb 9, 2023

> @smuzaffar it seems that making a local installation out of the bot's build does not work any more?

Ah, maybe it was because I had a # in the directory name. A cleaner directory name seems to work.
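
A small pre-flight check along these lines could flag such directory names before running `install.sh`. This is only a sketch: the character list is a short, non-exhaustive selection of characters that GNU make treats specially (`#` starts a comment, `$` introduces a variable, `:` separates targets from prerequisites, `%` is the pattern wildcard, whitespace separates words), and the function name and example paths are hypothetical.

```python
# Characters that commonly confuse GNU make when they appear in paths.
# Illustrative and not exhaustive.
MAKE_SPECIAL = set("#$:% \t")


def risky_characters(path):
    """Return the sorted list of make-special characters found in 'path'."""
    return sorted(set(path) & MAKE_SPECIAL)


if __name__ == "__main__":
    # Hypothetical directory names, for illustration only.
    for candidate in ("/tmp/cmsdist#8295/src", "/tmp/cmsdist_8295/src"):
        found = risky_characters(candidate)
        status = "avoid" if found else "ok"
        print(f"{candidate}: {status} {found}")
```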

@cmsbuild
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30540/summary.html
COMMIT: 565921d
CMSSW: CMSSW_13_0_X_2023-02-09-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmsdist/8295/30540/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test testTriggerMonitors had ERRORS

Comparison Summary

Summary:

  • You potentially removed 10 lines from the logs
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3555852
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3555827
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 213 log files, 164 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19862
  • DQMHistoTests: Total failures: 78
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19784
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

@smuzaffar smuzaffar changed the base branch from IB/CMSSW_13_0_X/master to IB/CMSSW_13_1_X/master February 11, 2023 11:52
@fwyzard
Contributor Author

fwyzard commented Mar 2, 2023

please test

@cmsbuild
Contributor

cmsbuild commented Mar 2, 2023

-1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-522667/30989/summary.html
COMMIT: 565921d
CMSSW: CMSSW_13_1_X_2023-03-02-1100/el8_amd64_gcc11
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/8295/30989/install.sh to create a dev area with all the needed externals and cmssw changes.

External Build

I found compilation error when building:

INFO: Found applicable config definition build:cuda in file /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/BUILD/el8_amd64_gcc11/external/tensorflow-sources/2.6.4-45e5831316b52b33d56b259a07cc3cdd/tensorflow-2.6.4/.bazelrc: --repo_env TF_NEED_CUDA=1 --crosstool_top=@local_config_cuda//crosstool:toolchain --@local_config_cuda//:enable_cuda
WARNING: The following configs were expanded more than once: [cuda]. For repeatable flags, repeats are counted twice and may lead to unexpected behavior.
ERROR: @local_config_cuda//:enable_cuda :: Error loading option @local_config_cuda//:enable_cuda: Repository command failed
Expected even number of arguments

error: Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.H5fzGa (%build)


RPM build errors:
line 37: It's not recommended to have unversioned Obsoletes: Obsoletes: external+tensorflow-sources+2.6.4-45e5831316b52b33d56b259a07cc3cdd
Bad exit status from /data/cmsbld/jenkins/workspace/ib-run-pr-tests/testBuildDir/tmp/rpm-tmp.H5fzGa (%build)


@fwyzard fwyzard closed this Mar 7, 2023
@fwyzard fwyzard deleted the IB/CMSSW_13_0_X/master_cuda_11.8.0 branch March 7, 2023 21:41