Improve handling of TF CUDA tests for 14_1_X #44376
Conversation
cms-bot internal usage
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44376/39431
A new Pull Request was created by @valsdav for master. It involves the following packages:
@cmsbuild, @valsdav, @wpmccormack can you please review it and eventually sign? Thanks. cms-bot commands are listed here
@valsdav, any idea why only
@valsdav, I was thinking to add
Hi @smuzaffar, I realized this morning while testing locally that only that test is failing, because it is the only one explicitly checking the list of devices visible to TF. The other tests that run when a CUDA device is visible to the cmssw framework are silently falling back to CPU execution in TF. I can add the device-list check to those tests as well, so that they verify CUDA usage explicitly.
> all tests which require TF to be built with GPU support

That's a good idea! Thanks!
<!-- <iftool name="cuda"> -->
<!-- <bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc"> -->
<!-- <use name="boost_filesystem"/> -->
<!-- <use name="catch2"/> -->
<!-- <use name="cppunit"/> -->
<!-- <use name="cuda"/> -->
<!-- <use name="tensorflow-cc"/> -->
<!-- <use name="FWCore/ParameterSet"/> -->
<!-- <use name="FWCore/ParameterSetReader"/> -->
<!-- <use name="FWCore/PluginManager"/> -->
<!-- <use name="FWCore/ServiceRegistry"/> -->
<!-- <use name="FWCore/Utilities"/> -->
<!-- <use name="HeterogeneousCore/CUDAServices"/> -->
<!-- <use name="HeterogeneousCore/CUDAUtilities"/> -->
<!-- <use name="PhysicsTools/TensorFlow"/> -->
<!-- </bin> -->
<!-- </iftool> -->
if you want to comment out a whole test, you can just do
<!--
<iftool name="cuda">
<bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc">
<use name="boost_filesystem"/>
<use name="catch2"/>
<use name="cppunit"/>
<use name="cuda"/>
<use name="tensorflow-cc"/>
<use name="FWCore/ParameterSet"/>
<use name="FWCore/ParameterSetReader"/>
<use name="FWCore/PluginManager"/>
<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/Utilities"/>
<use name="HeterogeneousCore/CUDAServices"/>
<use name="HeterogeneousCore/CUDAUtilities"/>
<use name="PhysicsTools/TensorFlow"/>
</bin>
</iftool>
-->
I think this part should be fixed: if we want the tests to run with CUDA, they should not fall back to CPU. |
In addition to checking that a CUDA device is available, can we actually force TF to run on the GPUs?
hold
Pull request has been put on hold by @antoniovilela
I will add |
I have a solution for this, @smuzaffar. Do you want me to push it here so that it can be included in the new PR? Thanks
@valsdav, I have opened cms-sw/cmsdist#9066 which adds the suggested new tool
Force-pushed from b59bfdd to 52f14fb
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-027130/38108/summary.html
Comparison Summary:
GPU Comparison Summary:
@cms-sw/ml-l2 , can you please review this? |
+1
@valsdav, can you please backport it to 14.0.X?
unhold |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
This PR improves the handling of CUDA unit tests for the TensorFlow package, using a new `tf_cuda_support` tool from scram, which checks whether GPU support is enabled in the TensorFlow build. The PR also makes the TF CUDA tests stricter by checking explicitly whether a CUDA device is visible to TF, not only to cmssw.
The test `testTFVisibleDevicesCUDA` is in fact run by the framework once a CUDA device is registered, but TF then does not recognize the device and the test fails. The other `testTF*CUDA` tests were silently passing by running on the CPU. After this PR, any TF session using `tf::backend::cuda` that does not find a GPU will fail explicitly.
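A plausible sketch of how the new tool could guard these tests in a BuildFile. The tool name `tf_cuda_support` comes from the description above, but the exact `<iftool>` usage is an assumption modeled on the existing `<iftool name="cuda">` block quoted earlier in this thread, and may differ from the final implementation:

```
<iftool name="tf_cuda_support">
  <bin name="testTFHelloWorldCUDA" file="testRunner.cpp,testHelloWorldCUDA.cc">
    <use name="cuda"/>
    <use name="tensorflow-cc"/>
    <use name="PhysicsTools/TensorFlow"/>
  </bin>
</iftool>
```

Guarding on the tool rather than on `cuda` alone means the binaries are only built when TensorFlow itself was compiled with GPU support, which is the condition the failing tests actually depend on.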