Add GPU support for ONNXRuntime #6776
A new Pull Request was created by @hqucms (Huilin Qu) for branch IB/CMSSW_11_3_X/master. @cmsbuild, @smuzaffar, @mrodozov can you please review it and eventually sign? Thanks.
cuda.spec (outdated)
@@ -34,6 +34,7 @@ mkdir -p %{i}/lib64/stubs
 # package only the runtime static library
 mv %_builddir/build/lib64/libcudadevrt.a %{i}/lib64/
+mv %_builddir/build/lib64/libcudart_static.a %{i}/lib64/
Any objections/concerns, @fwyzard?
Mhm... I don't think that mixing the shared and static versions of `libcudart` is a good idea. Can ONNX not use the dynamic version?
This `libcudart_static.a` is needed, otherwise `enable_language(CUDA)` in cmake crashes. I don't think onnxruntime really uses it.
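For context, the cmake step being discussed can be reproduced outside the onnxruntime build with a few lines (a sketch only: the scratch directory is arbitrary, nvcc must be on the PATH, and whether this step succeeds without libcudart_static.a is exactly the question here):

# Sketch: trigger cmake's CUDA-detection step in isolation.
mkdir -p /tmp/cuda-detect && cd /tmp/cuda-detect
cat > CMakeLists.txt <<'EOF'
cmake_minimum_required(VERSION 3.13)
project(cuda_detect LANGUAGES CXX)
# enable_language(CUDA) makes cmake compile a test program with nvcc
# and parse the output to discover the CUDA toolkit setup.
enable_language(CUDA)
EOF
cmake -S . -B build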
Why do we need cmake? It looks like cuDNN is not built, it's simply unpacked.
Can you set the `CUDA_USE_STATIC_CUDA_RUNTIME` cmake option to `OFF`?
> Why do we need cmake? It looks like cuDNN is not built, it's simply unpacked.

We use cmake for ONNXRuntime. For cuDNN, indeed it's simply unpacked.

> Can you set the CUDA_USE_STATIC_CUDA_RUNTIME cmake option to OFF?

Sure, I can try that.
@fwyzard Unfortunately `CUDA_USE_STATIC_CUDA_RUNTIME` does not make a difference. It seems it's superseded by `CMAKE_CUDA_RUNTIME_LIBRARY` now?

In fact I managed to work around this with the following lines:

-DCMAKE_CUDA_FLAGS="-cudart shared" \
-DCMAKE_CUDA_RUNTIME_LIBRARY=Shared \
-DCMAKE_TRY_COMPILE_PLATFORM_VARIABLES="CMAKE_CUDA_RUNTIME_LIBRARY" \

Now we don't need `libcudart_static.a` anymore. In fact I think the really useful one is `-DCMAKE_CUDA_FLAGS="-cudart shared"`, which is sufficient to solve the problem with a newer cmake version, but somehow I need all three lines with the cmake version we use...

Also, the problem is purely in cmake: when calling `enable_language(CUDA)`, cmake tries to compile a test program with `nvcc` and then parses the output to set up various CUDA paths/flags, and there it links to `libcudart_static.a` since `-cudart` defaults to `static` in `nvcc`. After that stage, whether cudart is linked statically or dynamically can be controlled by `CMAKE_CUDA_RUNTIME_LIBRARY`, and it is set to `Shared` in onnxruntime.
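To double-check that the workaround actually takes effect, the linkage of the produced library can be inspected after the build (a sketch; the path to libonnxruntime.so is a placeholder for the actual external install area):

# Sketch: verify that libonnxruntime.so depends on the shared CUDA runtime.
LIB=/path/to/external/slc7_amd64_gcc900/lib/libonnxruntime.so

# With the shared runtime, a libcudart.so.* entry should appear here.
ldd "$LIB" | grep -i cudart

# cudart symbols may legitimately be undefined in the library itself, as long
# as the libcudart.so dependency above is present; if it is missing, the link
# would be incomplete (as in the build failure reported further below).
nm -D --undefined-only "$LIB" | grep cudart | head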
cudnn.spec (outdated)
### RPM external cudnn 8.1.1.33
## INITENV +PATH LD_LIBRARY_PATH %i/lib64

%define cudaver_maj 11.2
Probably just nitpicking, but this is not the "major" CUDA version. Can you just use `cudaver`? Also, @smuzaffar, is there a way to get this directly from the CUDA spec file?
Not really. This is parsed and used by cmsBuild (even before installing dependencies), so at that time cmsBuild does not know the actual value. We could use a common file (just like https://github.com/cms-sw/cmsdist/blob/IB/CMSSW_11_3_X/master/cuda-flags.file) where one defines this version and then includes it in both cuda and cudnn, but this looks very much like overkill for this purpose. I would suggest that in the `%prep` section you just check that `$CUDA_VERSION` and `%{cudaver_maj}` are the same (some sed/cut/grep is needed).
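As a concrete illustration of that suggestion, such a check could look roughly like the following (a sketch only; it assumes the cuda external exports $CUDA_VERSION into the build environment as implied above, and the check actually added to cudnn.spec may differ):

# Sketch of a %prep consistency check between the CUDA version cudnn was
# packaged for (%{cudaver_maj}) and the CUDA external actually installed
# ($CUDA_VERSION, e.g. "11.2.1").
cuda_majmin=$(echo "$CUDA_VERSION" | cut -d. -f1,2)
if [ "$cuda_majmin" != "%{cudaver_maj}" ]; then
  echo "ERROR: cudnn expects CUDA %{cudaver_maj} but CUDA $CUDA_VERSION is installed" >&2
  exit 1
fi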
OK, if it cannot be extracted from the CUDA spec file, better to leave it hard-coded here then. This way we don't need to rebuild cuDNN for minor updates to CUDA (e.g. 11.2.0 -> 11.2.1 -> 11.2.2). I assume we'll notice soon enough if we update CUDA and fail to update cuDNN.
OK, I renamed `cudaver_maj` to `cudaver` and added a check in the `%prep` section now.
cuda.spec (outdated)
@@ -101,6 +102,7 @@ ln -sf libnvidia-ptxjitcompiler.so.1 %{i}
 sed \
 -e"/^TOP *=/s|= .*|= $CMS_INSTALL_PREFIX/%{pkgrel}|" \
 -e's|$(_HERE_)|$(TOP)/bin|g' \
+-e's|$(TOP)/lib|$(TOP)/lib64|g' \
This looks correct, but in fact it should not be needed after scram has set up the environment?
Indeed it's not really needed. I reverted all the changes on cuda.spec now.
Branch updated from 6dc4f7a to ee8a694.
Pull request #6776 was updated.
nice!
please test
please test for slc7_aarch64_gcc9
please test for slc7_ppc64le_gcc9
-1
Failed Tests: Build

Build: I found a compilation error when building:

/cvmfs/cms-ib.cern.ch/nweek-02674/slc7_amd64_gcc900/external/gcc/9.3.0/bin/../lib/gcc/x86_64-unknown-linux-gnu/9.3.0/../../../../x86_64-unknown-linux-gnu/bin/ld: /data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_11_3_X_2021-03-31-2300/external/slc7_amd64_gcc900/lib/libonnxruntime.so: undefined reference to `[email protected]'
(the same undefined reference is reported four more times)
collect2: error: ld returned 1 exit status
>> Deleted: tmp/slc7_amd64_gcc900/src/PhysicsTools/ONNXRuntime/test/testONNXRuntime/testONNXRuntime
gmake: *** [tmp/slc7_amd64_gcc900/src/PhysicsTools/ONNXRuntime/test/testONNXRuntime/testONNXRuntime] Error 1
>> Leaving Package PhysicsTools/ONNXRuntime
>> Package PhysicsTools/ONNXRuntime built
>> Subsystem PhysicsTools built
-1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e46e0/13927/summary.html

External Build: I found a compilation error when building:

File "./pkgtools/cmsBuild", line 3487, in installPackage
    installRpm(pkg, pkg.options.bootstrap)
File "./pkgtools/cmsBuild", line 3235, in installRpm
    raise RpmInstallFailed(pkg, output)
RpmInstallFailed: Failed to install package cudnn. Reason:
error: Failed dependencies:
    libm.so.6(GLIBC_2.27)(64bit) is needed by external+cudnn+8.1.1.33-1fe5d615f0e3571e760119e066121081-1-1.aarch64
* The action "build-external+python_tools+2.0-8e466f93932071702fa843dee44853e2" was not completed successfully because The following dependencies could not complete: install-external+onnxruntime+1.6.0-b578716d6932c1ae9ed96cefdf913fea
* The action "final-job" was not completed successfully because The following dependencies could not complete:
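The aarch64 failure above is not a compilation error but an RPM dependency error: the prebuilt cuDNN binaries require glibc 2.27 symbols, which the slc7 aarch64 platform does not provide. A quick way to inspect such requirements (a sketch; the library path is a placeholder and the RPM file name pattern is taken from the log above):

# Sketch: list the GLIBC symbol versions required by the prebuilt cuDNN library
# (the path is a placeholder for wherever the cudnn tarball was unpacked).
objdump -T /path/to/cudnn/lib64/libcudnn.so | grep -o 'GLIBC_[0-9.]*' | sort -u -V

# The same requirement shows up at the RPM level (adjust to the actual file).
rpm -qp --requires external+cudnn+8.1.1.33-*.aarch64.rpm | grep GLIBC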
Pull request #6776 was updated.
please test
@hqucms with these changes
@smuzaffar The unit test failures look unrelated to this PR?
test parameters:
please test
Thank you for confirming, @smuzaffar!
Yes, ONNXRuntime-on-GPU unit tests should be added to make sure the functionality is working.
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-8e46e0/14106/summary.html
GPU Comparison Summary:
+externals
This pull request is fully signed and it will be integrated in one of the next IB/CMSSW_11_3_X/master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
This PR adds GPU support for ONNXRuntime. The built library still runs on CPU by default (thus current applications in CMSSW are unaffected), while GPU inference can be enabled if needed (see example).

A few changes are needed:

- A change to `cuda` is needed (namely, keeping `libcudart_static.a`) for cmake to detect `nvcc` correctly.
- `cudnn` is added as a dependency. Note that downloading `cudnn` generally requires NVIDIA Developer Program membership, though a direct download link without authentication exists (and is used here). Experts should probably double-check whether the way we distribute it complies with its SLA.

Also, I am not sure if this will compile on `ppc64le` or `aarch64` (though `cudnn` exists for them).

FYI @riga @mialiu149