Keep the PTX in CUDA binaries, so it can be JIT'ted for newer devices #6851

fwyzard · 2021-04-28T08:08:36Z

No description provided.

fwyzard · 2021-04-28T08:08:49Z

please test

cmsbuild · 2021-04-28T08:08:56Z

A new Pull Request was created by @fwyzard (Andrea Bocci) for branch IB/CMSSW_12_0_X/master.

@smuzaffar, @mrodozov can you please review it and eventually sign? Thanks.
cms-bot commands are listed here

fwyzard · 2021-04-28T08:09:17Z

enable gpu

fwyzard · 2021-04-28T08:09:21Z

please test

fwyzard · 2021-04-28T08:12:36Z

please test for slc7_aarch64_gcc9

fwyzard · 2021-04-28T08:14:03Z

please test for slc7_ppc64le_gcc9

cmsbuild · 2021-04-28T11:55:05Z

-1

Failed Tests: UnitTests RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fc616/14660/summary.html
COMMIT: a622ce3
CMSSW: CMSSW_12_0_X_2021-04-27-2300/slc7_ppc64le_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/6851/14660/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test import-yaml had ERRORS

RelVals

11634.911 11634.911_TTbar_14TeV+2021_DD4hep+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST+ALCA/step1_TTbar_14TeV+2021_DD4hep+TTbar_14TeV_TuneCP5_GenSim+Digi+Reco+HARVEST+ALCA.log

cmsbuild · 2021-04-28T12:07:37Z

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fc616/14659/summary.html
COMMIT: a622ce3
CMSSW: CMSSW_12_0_X_2021-04-27-2300/slc7_aarch64_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/6851/14659/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test import-yaml had ERRORS

fwyzard · 2021-04-28T12:10:10Z

===== Test "import-yaml" ====
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/cvmfs/cms-ib.cern.ch/jenkins-env/python/shared/yaml/__init__.py", line 2, in <module>
    from error import *
ModuleNotFoundError: No module named 'error'

---> test import-yaml had ERRORS
 
^^^^ End Test import-yaml ^^^^

Unrelated?

mrodozov · 2021-04-28T12:16:32Z

Unrelated. For some weird reason this test is failing on Arm and PPC.
error is a file inside the yaml package: if you start python3 and run import yaml it works,
but python3 -c 'import yaml' fails.
This started after we changed the test to use python3.

cmsbuild · 2021-04-28T13:03:52Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fc616/14657/summary.html
COMMIT: a622ce3
CMSSW: CMSSW_12_0_X_2021-04-27-2300/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/6851/14657/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 9559
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 9559
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 38
DQMHistoTests: Total histograms compared: 2877605
DQMHistoTests: Total failures: 5
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2877578
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
Checked 160 log files, 37 edm output root files, 38 DQM output files
TriggerResults: no differences found

makortel · 2021-04-28T13:37:42Z

Out of curiosity, how much does the PTX add to the (shared) object file size? (even on qualitative order-of-magnitude scale)

fwyzard · 2021-04-28T15:40:53Z

Here is a comparison of the libraries from a working area I had around.

before (no PTX)

1.4M    pluginEventFilterEcalRawToDigiPlugins.so
724K    pluginEventFilterHcalRawToDigiGPUPlugins.so
600K    pluginHeterogeneousCoreCUDATestPlugins.so
8.2M    pluginRecoLocalCaloEcalRecProducersPlugins.so
1.8M    pluginRecoLocalCaloHGCalRecProducersPlugins.so
6.7M    pluginRecoLocalCaloHcalRecProducers.so
1.5M    pluginRecoLocalTrackerSiPixelClusterizerPlugins.so
1.9M    pluginRecoLocalTrackerSiPixelRecHitsPlugins.so
13M     pluginRecoPixelVertexingPixelTripletsPlugins.so
2.1M    pluginRecoPixelVertexingPixelVertexFindingPlugins.so
38M     total

after (with PTX)

1.4M    pluginEventFilterEcalRawToDigiPlugins.so
748K    pluginEventFilterHcalRawToDigiGPUPlugins.so
616K    pluginHeterogeneousCoreCUDATestPlugins.so
8.6M    pluginRecoLocalCaloEcalRecProducersPlugins.so
1.9M    pluginRecoLocalCaloHGCalRecProducersPlugins.so
7.0M    pluginRecoLocalCaloHcalRecProducers.so
1.6M    pluginRecoLocalTrackerSiPixelClusterizerPlugins.so
2.0M    pluginRecoLocalTrackerSiPixelRecHitsPlugins.so
15M     pluginRecoPixelVertexingPixelTripletsPlugins.so
2.3M    pluginRecoPixelVertexingPixelVertexFindingPlugins.so
41M     total

So the PTX adds around 10% to the size of the CUDA-enabled plugins, on average?
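The estimate can be checked from the rounded sizes in the two listings. This is a minimal sketch, assuming the sizes are in the binary units printed by `du -h`; `parse_size` and `growth` are helper names made up for this illustration.

```python
# Units as printed by `du -h` (assumed binary: K = KiB, M = MiB).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '1.4M' or '724K' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]


def growth(before: str, after: str) -> float:
    """Relative size increase going from `before` to `after`."""
    return parse_size(after) / parse_size(before) - 1.0


# Totals and the largest plugin, copied from the listings above.
print(f"total:         {growth('38M', '41M'):.1%}")
print(f"PixelTriplets: {growth('13M', '15M'):.1%}")
```

With these rounded inputs the overall increase comes out at about 8%, and at about 15% for the largest plugin, which is consistent with the "around 10%" estimate.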

makortel · 2021-04-29T01:23:05Z

Thanks. So noticeable but not large. Do you think the JIT'ting would be intended only for cases where a new architecture appears after the build, or should we think of mainly JIT'ting the code?

fwyzard · 2021-04-29T06:38:49Z

For the moment I would consider it as a fallback option for new architectures (e.g. Ampere), and to possibly improve the performance on less used architectures (e.g. JIT'ting for sm_61 may have some benefit for a GeForce GTX 1080 instead of using the binary for sm_60).
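The fallback case can be illustrated with a simplified model of the compatibility rules in NVIDIA's documentation: a compiled binary (SASS) runs only on devices with the same major compute capability and an equal or higher minor one, while PTX can be JIT-compiled for any device at or above the capability it targets. This is only a sketch, not the CUDA driver's actual selection logic; `pick_code` and the list of built architectures are hypothetical, since the thread does not say which architectures the CMSSW build targets.

```python
from typing import Optional, Sequence, Tuple

CC = Tuple[int, int]  # compute capability as (major, minor)


def pick_code(device: CC, sass: Sequence[CC], ptx: Sequence[CC]) -> Optional[Tuple[str, CC]]:
    """Simplified model: prefer a compatible binary, else fall back to JIT from PTX."""
    # SASS is usable only within the same major version, up to the device's minor.
    usable_sass = [cc for cc in sass if cc[0] == device[0] and cc[1] <= device[1]]
    if usable_sass:
        return ("sass", max(usable_sass))
    # PTX can be JIT-compiled for any device at or above its target capability.
    usable_ptx = [cc for cc in ptx if cc <= device]
    if usable_ptx:
        return ("jit", max(usable_ptx))
    return None  # nothing in the binary can run on this device


# Hypothetical build: binaries for sm_60, sm_70, sm_75, plus PTX for compute_75.
built_sass = [(6, 0), (7, 0), (7, 5)]
built_ptx = [(7, 5)]

print(pick_code((6, 1), built_sass, built_ptx))  # GTX 1080-like device
print(pick_code((8, 0), built_sass, built_ptx))  # Ampere-like device
print(pick_code((8, 0), built_sass, []))         # same device, no PTX kept
```

In this model the Ampere-like device can only run if some PTX was kept, which is the fallback described above. The sm_61 device still picks the sm_60 binary, so the possible performance benefit of JIT'ting for sm_61 is not captured by this default-preference model.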

Somewhere in the NVIDIA documentation I've seen a suggestion that we should build with -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=[sm_86,compute_86] ... which would increase both compilation times and library sizes significantly.

I consider this PR a reasonable compromise.
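The pattern in the flags quoted above (one binary per architecture, with PTX embedded only for the newest one) can be generated programmatically. This is a sketch under assumptions: `gencode_flags` is a hypothetical helper, the architecture list 60, 70, 75 is only an example, and the thread does not show the exact flags this PR sets.

```python
from typing import List, Sequence


def gencode_flags(archs: Sequence[int], keep_ptx: bool = True) -> List[str]:
    """Build nvcc -gencode options: SASS for every arch, PTX only for the newest."""
    archs = sorted(archs)
    flags: List[str] = []
    for arch in archs:
        targets = f"sm_{arch}"
        if keep_ptx and arch == archs[-1]:
            # Also embed the PTX of the newest architecture, so it can be
            # JIT-compiled on devices released after the build.
            targets = f"[sm_{arch},compute_{arch}]"
        flags += ["-gencode", f"arch=compute_{arch},code={targets}"]
    return flags


# Example architecture list (an assumption, not taken from the PR).
print(" ".join(gencode_flags([60, 70, 75])))
```

Keeping PTX for a single architecture adds one extra image per kernel, whereas the full list quoted above adds a binary for every architecture.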

makortel · 2021-04-29T12:56:13Z

> I consider this PR a reasonable compromise.

Sure, thanks. I was thinking more the future, e.g. should we add sm_80 etc there at some point?

smuzaffar · 2021-04-29T13:19:44Z

+externals

cmsbuild · 2021-04-29T13:20:05Z

This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_12_0_X/master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

fwyzard · 2021-04-29T13:43:11Z

> Sure, thanks. I was thinking more the future, e.g. should we add sm_80 etc there at some point?

Sure - well, maybe once we actually have some Ampere card to test :-)

smuzaffar · 2021-04-29T20:28:20Z

test parameters:

addpkg = HeterogeneousCore

smuzaffar · 2021-04-29T20:28:28Z

please test

smuzaffar · 2021-04-29T20:40:57Z

abort

We need to wait for the next IB with SCRAMV3 for external PR testing.

smuzaffar · 2021-04-30T06:50:13Z

please test

cmsbuild · 2021-04-30T10:15:37Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-0fc616/14730/summary.html
COMMIT: a622ce3
CMSSW: CMSSW_12_0_X_2021-04-29-2300/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmsdist/6851/14730/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 9559
DQMHistoTests: Total failures: 0
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 9559
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 37
DQMHistoTests: Total histograms compared: 2662646
DQMHistoTests: Total failures: 1
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2662623
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 36 files compared)
Checked 155 log files, 37 edm output root files, 37 DQM output files
TriggerResults: no differences found

Keep the PTX in CUDA binaries, so it can be JIT'ted for newer devices

a622ce3

cmsbuild added externals-pending orp-pending pending-signatures tests-started labels Apr 28, 2021

fwyzard mentioned this pull request Apr 28, 2021

Keep the PTX in CUDA binaries, so it can be JIT'ted for newer devices (11.3.x) #6852

Merged

cmsbuild added tests-approved and removed tests-started labels Apr 28, 2021

fwyzard mentioned this pull request Apr 28, 2021

How to provide supported CUDA compute capabilities and runtime version where needed? cms-sw/cmssw#33542

Open

cmsbuild added externals-approved fully-signed and removed externals-pending pending-signatures labels Apr 29, 2021

cmsbuild added tests-started and removed tests-approved labels Apr 29, 2021

cmsbuild added tests-pending and removed tests-started labels Apr 29, 2021

cmsbuild added tests-started and removed tests-pending labels Apr 30, 2021

cmsbuild added tests-approved and removed tests-started labels Apr 30, 2021

smuzaffar merged commit bd26e28 into cms-sw:IB/CMSSW_12_0_X/master Apr 30, 2021

This was referenced Apr 30, 2021

[SCRAMV3] Fix for scram tool info #6860

Merged

Update TBB to 2021.2.0 and use cmake to build #6792

Merged

fwyzard deleted the IB/CMSSW_12_0_X/master_CUDA_keep_PTX branch May 10, 2021 21:17
