-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Triton test fixes #35328
Triton test fixes #35328
Conversation
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35328/25373
|
A new Pull Request was created by @kpedro88 (Kevin Pedro) for master. It involves the following packages:
@makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
test parameters:
|
@cmsbuild, please test |
@makortel the bot gave your test settings a thumbs down... I think that means there's a syntax error |
test parameters:
|
@kpedro Thanks, let's see if I was more successful now... |
@cmsbuild, please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18725/summary.html |
echo "has NVIDIA driver" | ||
else | ||
echo "missing (or too old) NVIDIA driver" | ||
exit 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than simply bailing out, you could try using the CUDA compatibility drivers that are shipped with CMSSW.
You can see how it is implemented for the CMSSW environment set up by SCRAM in https://github.com/cms-sw/cmssw-config/blob/scramv3/SCRAM/hooks/runtime/00-nvidia-drivers .
For more (somewhat confusing) information, see https://docs.nvidia.com/deploy/cuda-compatibility/ .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a definite improvement. The PR is updated with this now. I have a few questions and comments that I'll post in the main PR thread.
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18726/summary.html Comparison SummaryThe workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons Summary:
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35328/25435
|
Compatibility drivers are now working. This actually led to a bug report for Triton, triton-inference-server/server#3382, though it's possible the actual bug should be fixed in nvidia-docker or something lower-level. For now, I've implemented a workaround in A few questions (for @fwyzard or anyone else who knows):
|
Yes, it's expected.
I'm not aware of any explicit way to test for it. For the SCRAM check, I resorted to this logic:
|
I also find the documentation rather confusing. From what I can tell with the Triton server, it's a bit pickier than CMSSW: it cares about driver-driver compatibility, not driver-runtime compatibility. I think the check I've implemented handles this properly. As far as non-datacenter GPUs, I guess it's fine if the test just fails on those machines for now, since it's unlikely we'll be running IB tests on them frequently. (I actually have such a GPU, but I keep its drivers up to date in order to continue Triton-related development.) |
please test |
+1 Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18808/summary.html |
Ah... then maybe Triton (or, possibly, the TensorRT backend) is implemented using the CUDA driver API, rather than the runtime API. |
Looks ok to me |
The changes to the main part look good. Skipping the unit test on non- |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2) |
+1
|
PR description:
Resolves #34547 (GPU IB tests)
Resolves #35206 (PPC IB tests)
PR validation:
Ran tests on machines with appropriate architectures/hardware.