Add GPU support for ONNXRuntime #6776
A new Pull Request was created by @hqucms (Huilin Qu) for branch IB/CMSSW_11_3_X/master. @cmsbuild, @smuzaffar, @mrodozov can you please review it and eventually sign? Thanks.
cuda.spec (outdated)
@@ -34,6 +34,7 @@ mkdir -p %{i}/lib64/stubs
 # package only the runtime static library
 mv %_builddir/build/lib64/libcudadevrt.a %{i}/lib64/
+mv %_builddir/build/lib64/libcudart_static.a %{i}/lib64/
Any objections/concerns, @fwyzard?
Mhm... I don't think that mixing the shared and static versions of `libcudart` is a good idea. Can ONNX not use the dynamic version?
This `libcudart_static.a` is needed, otherwise `enable_language(CUDA)` in cmake crashes. I don't think onnxruntime really uses it.
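For context, the cmake step being discussed can be reproduced outside the onnxruntime build with a few lines (a sketch only: the scratch directory is arbitrary, nvcc must be on the PATH, and whether this step succeeds without libcudart_static.a is exactly the question here):

# Sketch: trigger cmake's CUDA-detection step in isolation.
mkdir -p /tmp/cuda-detect && cd /tmp/cuda-detect
cat > CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.13)
project(cuda_detect LANGUAGES CXX)
# enable_language(CUDA) makes cmake compile a test program with nvcc
# and parse the output to discover the CUDA toolkit setup.
enable_language(CUDA)
EOF
cmake -S . -B build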
Why do we need cmake? It looks like cuDNN is not built, it's simply unpacked.
Can you set the `CUDA_USE_STATIC_CUDA_RUNTIME` cmake option to `OFF`?
> Why do we need cmake? It looks like cuDNN is not built, it's simply unpacked.

We use cmake for ONNXRuntime. For cuDNN, indeed it's simply unpacked.

> Can you set the CUDA_USE_STATIC_CUDA_RUNTIME cmake option to OFF?

Sure, I can try that.
@fwyzard Unfortunately `CUDA_USE_STATIC_CUDA_RUNTIME` does not make a difference. It seems it's superseded by `CMAKE_CUDA_RUNTIME_LIBRARY` now?

In fact I managed to work around this with the following lines:

-DCMAKE_CUDA_FLAGS="-cudart shared" \
-DCMAKE_CUDA_RUNTIME_LIBRARY=Shared \
-DCMAKE_TRY_COMPILE_PLATFORM_VARIABLES="CMAKE_CUDA_RUNTIME_LIBRARY" \

Now we don't need `libcudart_static.a` anymore. In fact I think the really useful one is `-DCMAKE_CUDA_FLAGS="-cudart shared"`, which is sufficient to solve the problem with a newer cmake version, but somehow I need all three lines with the cmake version we use...

Also, the problem is purely in cmake: when calling `enable_language(CUDA)`, cmake tries to compile a test program with `nvcc` and then parses the output to set up various CUDA paths/flags, and there it links to `libcudart_static.a` since `-cudart` defaults to `static` in `nvcc`. After that stage, whether cudart is linked statically or dynamically can be controlled by `CMAKE_CUDA_RUNTIME_LIBRARY`, and it is set to `Shared` in onnxruntime.
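To double-check that the workaround actually takes effect, the linkage of the produced library can be inspected after the build (a sketch; the path to libonnxruntime.so is a placeholder for the actual external install area):

# Sketch: verify that libonnxruntime.so depends on the shared CUDA runtime.
LIB=/path/to/external/slc7_amd64_gcc900/lib/libonnxruntime.so

# With the shared runtime, a libcudart.so.* entry should appear here.
ldd "$LIB" | grep -i cudart

# cudart symbols may legitimately be undefined in the library itself, as long
# as the libcudart.so dependency above is present; if it is missing, the link
# would be incomplete (as in the build failure reported further below).
nm -D --undefined-only "$LIB" | grep cudart | head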
cudnn.spec (outdated)
### RPM external cudnn 8.1.1.33
## INITENV +PATH LD_LIBRARY_PATH %i/lib64

%define cudaver_maj 11.2
Probably just nitpicking, but this is not the "major" CUDA version. Can you just use `cudaver`? Also, @smuzaffar, is there a way to get this directly from the CUDA spec file?
Not really. This is parsed and used by cmsBuild (even before installing dependencies), so at that time cmsBuild does not know the actual value. We could use a common file (just like https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_3_X/master/cuda-flags.file) where one defines this version and then includes it in both cuda and cudnn, but this looks very much like overkill for this purpose. I would suggest that in the `%prep` section you just check that `$CUDA_VERSION` and `%{cudaver_maj}` are the same (some sed/cut/grep is needed).
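As a concrete illustration of that suggestion, such a check could look roughly like the following (a sketch only; it assumes the cuda external exports $CUDA_VERSION into the build environment as implied above, and the check actually added to cudnn.spec may differ):

# Sketch of a %prep consistency check between the CUDA version cudnn was
# packaged for (%{cudaver_maj}) and the CUDA external actually installed
# ($CUDA_VERSION, e.g. "11.2.1").
cuda_majmin=$(echo "$CUDA_VERSION" | cut -d. -f1,2)
if [ "$cuda_majmin" != "%{cudaver_maj}" ]; then
  echo "ERROR: cudnn expects CUDA %{cudaver_maj} but CUDA $CUDA_VERSION is installed" >&2
  exit 1
fi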
OK, if it cannot be extracted from the CUDA spec file, better to leave it hard-coded here then. This way we don't need to rebuild cuDNN for minor updates to CUDA (e.g. 11.2.0 -> 11.2.1 -> 11.2.2). I assume we'll notice soon enough if we update CUDA and fail to update cuDNN.
OK, I renamed `cudaver_maj` to `cudaver` and added a check in the `%prep` section now.
cuda.spec (outdated)
@@ -101,6 +102,7 @@ ln -sf libnvidia-ptxjitcompiler.so.1 %{i}
 sed \
 -e"/^TOP *=/s|= .*|= $CMS_INSTALL_PREFIX/%{pkgrel}|" \
 -e's|$(_HERE_)|$(TOP)/bin|g' \
+-e's|$(TOP)/lib|$(TOP)/lib64|g' \
This looks correct, but in fact it should not be needed after scram has set up the environment?
Indeed it's not really needed. I reverted all the changes on cuda.spec now.
Branch updated from 6dc4f7a to ee8a694.
Pull request #6776 was updated.
nice!
please test
please test for slc7_aarch64_gcc9
please test for slc7_ppc64le_gcc9
-1
Failed Tests: Build

Build: I found a compilation error when building:

/cvmfs/cms-ib.cern.ch/nweek-02674/slc7_amd64_gcc900/external/gcc/9.3.0/bin/../lib/gcc/x86_64-unknown-linux-gnu/9.3.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_11_3_X_2021-03-31-2300/external/slc7_amd64_gcc900/lib/libonnxruntime.so: undefined reference to `[email protected]'
(the same undefined reference is reported four more times)
collect2: error: ld returned 1 exit status
>> Deleted: tmp/slc7_amd64_gcc900/src/PhysicsTools/ONNXRuntime/test/testONNXRuntime/testONNXRuntime
gmake: *** [tmp/slc7_amd64_gcc900/src/PhysicsTools/ONNXRuntime/test/testONNXRuntime/testONNXRuntime] Error 1
>> Leaving Package PhysicsTools/ONNXRuntime
>> Package PhysicsTools/ONNXRuntime built
>> Subsystem PhysicsTools built
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e46e0/13927/summary.html

External Build: I found a compilation error when building:

File "./pkgtools/cmsBuild", line 3487, in installPackage
    installRpm(pkg, pkg.options.bootstrap)
File "./pkgtools/cmsBuild", line 3235, in installRpm
    raise RpmInstallFailed(pkg, output)
RpmInstallFailed: Failed to install package cudnn. Reason:
error: Failed dependencies:
    libm.so.6(GLIBC_2.27)(64bit) is needed by external+cudnn+8.1.1.33-1fe5d615f0e3571e760119e066121081-1-1.aarch64
* The action "build-external+python_tools+2.0-8e466f93932071702fa843dee44853e2" was not completed successfully because The following dependencies could not complete: install-external+onnxruntime+1.6.0-b578716d6932c1ae9ed96cefdf913fea
* The action "final-job" was not completed successfully because The following dependencies could not complete:
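The aarch64 failure above is not a compilation error but an RPM dependency error: the prebuilt cuDNN binaries require glibc 2.27 symbols, which the slc7 aarch64 platform does not provide. A quick way to inspect such requirements (a sketch; the library path is a placeholder and the RPM file name pattern is taken from the log above):

# Sketch: list the GLIBC symbol versions required by the prebuilt cuDNN library
# (the path is a placeholder for wherever the cudnn tarball was unpacked).
objdump -T /path/to/cudnn/lib64/libcudnn.so | grep -o 'GLIBC_[0-9.]*' | sort -u -V

# The same requirement shows up at the RPM level (adjust to the actual file).
rpm -qp --requires external+cudnn+8.1.1.33-*.aarch64.rpm | grep GLIBC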
Pull request #6776 was updated.
please test
@hqucms with these changes
@smuzaffar The unit test failures look unrelated to this PR?
test parameters:
please test
Thank you for confirming, @smuzaffar!
Yes, ONNXRuntime-on-GPU unit tests should be added to make sure the functionality is working.
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e46e0/14106/summary.html
GPU Comparison Summary:
+externals
This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_11_3_X/master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
This PR adds GPU support for ONNXRuntime. The built library still runs on CPU by default (thus current applications in CMSSW are unaffected), while GPU inference can be enabled if needed (see example).

A few changes are needed:

- A change to `cuda` is needed (namely, keeping `libcudart_static.a`) for cmake to detect `nvcc` correctly.
- `cudnn` is added as a dependency. Note that downloading `cudnn` generally requires NVIDIA Developer Program membership, though a direct download link without authentication exists (and is used here). Experts should probably double-check whether the way we distribute it complies with its SLA.

Also, I am not sure if this will compile on `ppc64le` or `aarch64` (though `cudnn` exists for them).

FYI @riga @mialiu149