Triton test fixes #35328

Merged
merged 8 commits into from
Sep 24, 2021

Conversation

kpedro88
Contributor

PR description:

Resolves #34547 (GPU IB tests)
Resolves #35206 (PPC IB tests)

PR validation:

Ran tests on machines with appropriate architectures/hardware.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35328/25373

  • This PR adds an extra 20KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @kpedro88 (Kevin Pedro) for master.

It involves the following packages:

  • HeterogeneousCore/SonicTriton (heterogeneous)

@makortel, @cmsbuild, @fwyzard can you please review it and eventually sign? Thanks.
@makortel, @riga, @rovere this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor

test parameters:

  • enable tests = gpu
  • release = slc7_ppc64le_gcc9

@makortel
Contributor

@cmsbuild, please test

@kpedro88
Contributor Author

@makortel the bot gave your test settings a thumbs down... I think that means there's a syntax error

@makortel
Contributor

test parameters:

  • enable_tests = gpu
  • release = slc7_ppc64le_gcc9

@makortel
Contributor

@kpedro88 Thanks, let's see if I was more successful now...

@makortel
Contributor

@cmsbuild, please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18725/summary.html
COMMIT: 1815dec
CMSSW: CMSSW_12_1_X_2021-09-16-2300/slc7_ppc64le_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35328/18725/install.sh to create a dev area with all the needed externals and cmssw changes.

    echo "has NVIDIA driver"
else
    echo "missing (or too old) NVIDIA driver"
    exit 0

Contributor

Rather than simply bailing out, you could try using the CUDA compatibility drivers that are shipped with CMSSW.

You can see how it is implemented for the CMSSW environment set up by SCRAM in https://github.com/cms-sw/cmssw-config/blob/scramv3/SCRAM/hooks/runtime/00-nvidia-drivers .

For more (somewhat confusing) information, see https://docs.nvidia.com/deploy/cuda-compatibility/ .

Contributor Author

Thanks, this is a definite improvement. The PR is updated with this now. I have a few questions and comments that I'll post in the main PR thread.

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18726/summary.html
COMMIT: 1815dec
CMSSW: CMSSW_12_1_X_2021-09-17-1100/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/35328/18726/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1299 differences found in the comparisons
  • DQMHistoTests: Total files compared: 39
  • DQMHistoTests: Total histograms compared: 3000833
  • DQMHistoTests: Total failures: 3671
  • DQMHistoTests: Total nulls: 19
  • DQMHistoTests: Total successes: 2997121
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 45.703 KiB( 38 files compared)
  • DQMHistoSizes: changed ( 140.53 ): 44.531 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed ( 140.53 ): 1.172 KiB RPC/DCSInfo
  • Checked 165 log files, 37 edm output root files, 39 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-35328/25435

  • This PR adds an extra 28KB to repository

@cmsbuild
Contributor

Pull request #35328 was updated. @makortel, @cmsbuild, @fwyzard can you please check and sign again.

@kpedro88
Contributor Author

Compatibility drivers are now working.

This actually led to a bug report for Triton, triton-inference-server/server#3382, though it's possible the actual bug should be fixed in nvidia-docker or something lower-level. For now, I've implemented a workaround in cmsTriton.

A few questions (for @fwyzard or anyone else who knows):

  1. When I run cuda-compatible-runtime -v on a (datacenter) GPU with older drivers, it reports:
    CUDA driver version 11.2
    CUDA runtime version 11.4
    
    and succeeds even without putting the compatibility drivers in LD_LIBRARY_PATH (as 00-nvidia-drivers does if the test fails without the compatibility drivers). Is this expected?
  2. What is the right way to test for non-datacenter GPUs, for which the compatibility drivers won't work (if I understand correctly)?
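
As an illustration only, the two versions in a report like the one in question 1 could be pulled out with a few lines of shell. This sketch assumes the output has exactly the shape quoted above (the real `cuda-compatible-runtime -v` may print more). The helper name `parse_cuda_version` and the `sample_output` variable are hypothetical and not part of cmsTriton.

```shell
# Extract the version that follows "CUDA <kind> version" from a block of text.
#   $1: "driver" or "runtime"
#   $2: text to search (hypothetical sample below, copied from the report above)
parse_cuda_version() {
  printf '%s\n' "$2" |
    awk -v kind="$1" '$1 == "CUDA" && $2 == kind && $3 == "version" { print $4; exit }'
}

sample_output='CUDA driver version 11.2
CUDA runtime version 11.4'

driver_version="$(parse_cuda_version driver "$sample_output")"
runtime_version="$(parse_cuda_version runtime "$sample_output")"
echo "driver=${driver_version} runtime=${runtime_version}"
```

With the sample text this prints `driver=11.2 runtime=11.4`, which shows the runtime is newer than the driver while sharing the same major version.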

@fwyzard
Contributor

fwyzard commented Sep 21, 2021

> A few questions (for @fwyzard or anyone else who knows):
>
> 1. When I run `cuda-compatible-runtime -v` on a (datacenter) GPU with older drivers, it reports:
>    ```
>    CUDA driver version 11.2
>    CUDA runtime version 11.4
>    ```
>    and succeeds even without putting the compatibility drivers in `LD_LIBRARY_PATH` (as [00-nvidia-drivers](https://github.com/cms-sw/cmssw-config/blob/scramv3/SCRAM/hooks/runtime/00-nvidia-drivers) does if the test fails without the compatibility drivers). Is this expected?

Yes, it's expected.
I find the documentation confusing, but the gist seems to be that any CUDA 11.x runtime should work with any CUDA 11.y driver, as long as the underlying NVIDIA driver is >= 450.80.02.
This is a welcome change with respect to the previous versions of CUDA.
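
The rule of thumb above can be written down as a small check. This is a sketch under the assumptions stated in this comment (same CUDA 11 major version on both sides, NVIDIA driver at least 450.80.02), not a statement of NVIDIA's full compatibility matrix. The function names `version_ge` and `cuda_minor_compatible` are hypothetical, and the driver version `460.32.03` is only an example input.

```shell
# Return 0 if dotted version $1 is >= dotted version $2, comparing field by field.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      # Missing fields count as 0; adding 0 forces a numeric comparison ("02" -> 2).
      xi = (i <= na) ? x[i] + 0 : 0
      yi = (i <= nb) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Minor-version compatibility as summarised above.
#   $1: CUDA runtime version, $2: CUDA version reported for the driver,
#   $3: underlying NVIDIA driver version
cuda_minor_compatible() {
  [ "${1%%.*}" = "11" ] && [ "${2%%.*}" = "11" ] && version_ge "$3" "450.80.02"
}

if cuda_minor_compatible 11.4 11.2 460.32.03; then
  echo "compatible"
else
  echo "not compatible"
fi
```

For the example inputs this prints `compatible`, matching the observation that a CUDA 11.4 runtime ran against a CUDA 11.2 driver.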

> 2. What is the right way to test for non-datacenter GPUs, for which the compatibility drivers won't work (if I understand correctly)?

I'm not aware of any explicit way to test for it.

For the SCRAM check, I resorted to this logic:

  • first, check whether the CUDA runtime bundled with CMSSW is compatible with the system libraries; if it is (because the system driver is the same or a newer version, or because of minor-version compatibility), nothing else is needed;
  • otherwise, check whether the CUDA runtime bundled with CMSSW can be used with the compatibility drivers; if the previous check failed and this one works, we have a datacenter GPU, so add the compatibility drivers to the environment;
  • otherwise, add the stub libraries to the environment; CUDA applications will fail, but at least we can compile and link.
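
The three-step fallback above could be sketched roughly as follows. This is not the actual `00-nvidia-drivers` hook: the function name `select_cuda_driver_mode`, the stand-in check functions, and the `/fake/compat` directory are all hypothetical, chosen so the sketch runs on a machine without a GPU. It only reports which case applies and leaves the caller's environment untouched.

```shell
# Report which set of CUDA driver libraries would be used.
#   $1: command that exits 0 when the bundled CUDA runtime works
#   $2: directory holding the compatibility driver libraries
# Prints one of: system, compat, stubs.
select_cuda_driver_mode() {
  # Step 1: the system driver libraries already work.
  if "$1" > /dev/null 2>&1; then
    echo system
    return 0
  fi
  # Step 2: retry in a subshell with the compatibility libraries first on the search path.
  if (
    LD_LIBRARY_PATH="$2${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
    "$1"
  ) > /dev/null 2>&1; then
    echo compat
    return 0
  fi
  # Step 3: neither worked, so only the stub libraries remain (compile and link only).
  echo stubs
}

# Stand-in checks (hypothetical) that mimic the three situations.
always_ok() { return 0; }
never_ok() { return 1; }
needs_compat() {
  case ":${LD_LIBRARY_PATH:-}:" in
    *":/fake/compat:"*) return 0 ;;
    *) return 1 ;;
  esac
}

for check in always_ok needs_compat never_ok; do
  echo "$check -> $(select_cuda_driver_mode "$check" /fake/compat)"
done
```

In a real setup the first argument would presumably be a check such as `cuda-compatible-runtime`, and the second the directory where CMSSW ships the compatibility drivers. Running the retry in a subshell keeps the modified `LD_LIBRARY_PATH` from leaking into the caller until the decision is made.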

@kpedro88
Contributor Author

I also find the documentation rather confusing.

From what I can tell with the Triton server, it's a bit pickier than CMSSW: it cares about driver-driver compatibility, not driver-runtime compatibility. I think the check I've implemented handles this properly.

As for non-datacenter GPUs, I guess it's fine if the test just fails on those machines for now, since it's unlikely we'll be running IB tests on them frequently. (I actually have such a GPU, but I keep its drivers up to date in order to continue Triton-related development.)

@kpedro88
Contributor Author

please test

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-a6d7ce/18808/summary.html
COMMIT: b96778b
CMSSW: CMSSW_12_1_X_2021-09-20-2300/slc7_ppc64le_gcc9
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/35328/18808/install.sh to create a dev area with all the needed externals and cmssw changes.

@fwyzard
Contributor

fwyzard commented Sep 23, 2021

> From what I can tell with the Triton server, it's a bit pickier than CMSSW: it cares about driver-driver compatibility, not driver-runtime compatibility. I think the check I've implemented handles this properly.

Ah... then maybe Triton (or, possibly, the TensorRT backend) is implemented using the CUDA driver API, rather than the runtime API.

@kpedro88
Contributor Author

@fwyzard @makortel any further comments or tests?

@makortel
Contributor

Looks ok to me

@fwyzard
Contributor

fwyzard commented Sep 24, 2021

The changes to the main part look good.

Skipping the unit test on non-amd64 architectures doesn't seem like the intended behaviour for the unit tests - but I can see the point, since we don't have an "expected to fail" category for the tests.
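
For reference, an architecture-based skip of this kind might look like the sketch below. How the PR actually detects the architecture is not shown in this thread, so using `uname -m` and the helper name `should_skip_on_arch` are assumptions. A real test script would `exit 0` in the skip branch, the same way the driver check above exits 0, which is why a skip is reported as a pass.

```shell
# Return 0 (meaning "skip") when the architecture is not amd64/x86_64.
#   $1: architecture name; defaults to the current machine's `uname -m`
should_skip_on_arch() {
  [ "${1:-$(uname -m)}" != "x86_64" ]
}

# Show the decision for a few architecture names without exiting.
for arch in x86_64 ppc64le aarch64; do
  if should_skip_on_arch "$arch"; then
    echo "$arch: skip"
  else
    echo "$arch: run"
  fi
done
```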

@fwyzard
Contributor

fwyzard commented Sep 24, 2021

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1

  • Technical
  • Tested by the experts

Successfully merging this pull request may close these issues.

  • HeterogeneousCore/SonicTriton unit test failing for PPC IBs
  • SonicTriton test failures in GPU IBs
5 participants