Improve handling of TF CUDA tests for 14_1_X #44376
Conversation
cms-bot internal usage
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39431
A new Pull Request was created by @valsdav for master. It involves the following packages:
@cmsbuild, @valsdav, @wpmccormack can you please review it and eventually sign? Thanks. cms-bot commands are listed here
@valsdav, any idea why only
@valsdav, I was thinking to add
Hi @smuzaffar, I realized this morning while testing locally that only that test is failing, because it is the only one explicitly checking the list of devices visible to TF. The other tests that run when a CUDA device is visible to the cmssw framework are silently falling back to CPU execution in TF. I can add the device-list check to those tests as well, so that they verify CUDA usage explicitly.
> all tests which require TF to be built with GPU support

That's a good idea! Thanks!
<!-- <iftool name="cuda"> -->
<!-- <bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc"> -->
<!-- <use name="boost_filesystem"/> -->
<!-- <use name="catch2"/> -->
<!-- <use name="cppunit"/> -->
<!-- <use name="cuda"/> -->
<!-- <use name="tensorflow-cc"/> -->
<!-- <use name="FWCore/ParameterSet"/> -->
<!-- <use name="FWCore/ParameterSetReader"/> -->
<!-- <use name="FWCore/PluginManager"/> -->
<!-- <use name="FWCore/ServiceRegistry"/> -->
<!-- <use name="FWCore/Utilities"/> -->
<!-- <use name="HeterogeneousCore/CUDAServices"/> -->
<!-- <use name="HeterogeneousCore/CUDAUtilities"/> -->
<!-- <use name="PhysicsTools/TensorFlow"/> -->
<!-- </bin> -->
<!-- </iftool> -->
if you want to comment out a whole test, you can just do
<!--
<iftool name="cuda">
<bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc">
<use name="boost_filesystem"/>
<use name="catch2"/>
<use name="cppunit"/>
<use name="cuda"/>
<use name="tensorflow-cc"/>
<use name="FWCore/ParameterSet"/>
<use name="FWCore/ParameterSetReader"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDAServices"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
<use name="PhysicsTools/TensorFlow"/>
</bin>
</iftool>
-->
I think this part should be fixed: if we want the tests to run with CUDA, they should not fall back to CPU. |
In addition to checking that a CUDA device is available, can we actually force TF to run on the GPUs?
hold
Pull request has been put on hold by @antoniovilela
I will add |
I have a solution for this, @smuzaffar. Do you want me to push it here so that it can be included in the new PR? Thanks
@valsdav, I have opened cms-sw/cmsdist#9066 which adds the suggested new tool
Force-pushed from b59bfdd to 52f14fb
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38108/summary.html
Comparison Summary:
GPU Comparison Summary:
@cms-sw/ml-l2 , can you please review this? |
+1
@valsdav, can you please backport it to 14.0.X?
unhold |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new `tf_cuda_support` tool from scram, which checks whether GPU support is enabled in the TensorFlow build. The PR also makes the TF CUDA tests stricter by checking explicitly whether a CUDA device is visible to TF, not only to cmssw.
The test `testTFVisibleDevicesCUDA` is in fact run by the framework once a CUDA device is registered, but TF then does not recognize the device and the test fails. The other `testTF*CUDA` tests were silently passing by running on the CPU. After this PR, any TF session using `tf::backend::cuda` that does not find a GPU will fail explicitly.
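A plausible sketch of how the new tool could guard these tests in a BuildFile. The tool name `tf_cuda_support` comes from the description above, but the exact `<iftool>` usage is an assumption modeled on the existing `<iftool name="cuda">` block quoted earlier in this thread, and may differ from the final implementation:

```
<iftool name="tf_cuda_support">
  <bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc">
    <use name="cuda"/>
    <use name="tensorflow-cc"/>
    <use name="PhysicsTools/TensorFlow"/>
  </bin>
</iftool>
```

Guarding on the tool rather than on `cuda` alone means the binaries are only built when TensorFlow itself was compiled with GPU support, which is the condition the failing tests actually depend on.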