[TF] Update TF v2.16.1 (without libfft) #9388
Conversation
A new Pull Request was created by @iarspider for branch IB/CMSSW_14_2_X/tf. @aandvalenzuela, @cmsbuild, @iarspider, @smuzaffar can you please review it and eventually sign? Thanks.
cms-bot internal usage
@cmsbuild please test for CMSSW_14_2_TF_X/el8_aarch64_gcc12
@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12
@smuzaffar since the tests depend on TensorFlow (not Keras), the environment variable was not set. Should I add an explicit dependency on keras to PhysicsTools/TensorFlow? I don't think we can handle a circular dependency.
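For context, a hypothetical sketch of what such an explicit dependency could look like in the package's BuildFile.xml, assuming a keras tool were defined in cmsdist (the tool name here is an assumption, not something this PR adds):

<!-- Hypothetical: explicit keras tool dependency in PhysicsTools/TensorFlow/BuildFile.xml -->
<use name="tensorflow"/>
<use name="keras"/>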
@iarspider, can you try running the unit tests locally after setting the KERAS_BACKEND=tensorflow env?
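For reference, a minimal sketch of such a local run, assuming an already set-up CMSSW developer area (the paths and scram invocation are illustrative):

cd $CMSSW_BASE/src && cmsenv         # enter the developer area environment
export KERAS_BACKEND=tensorflow      # the variable under discussion in this PR
scram b runtests                     # run the unit tests of the checked-out packages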
Pull request #9388 was updated.
@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12
@@ -3,6 +3,7 @@
   <environment name="TENSORFLOW_BASE" default="@TOOL_ROOT@"/>
   <environment name="LIBDIR" default="$TENSORFLOW_BASE/lib"/>
   <environment name="INCLUDE" default="$TENSORFLOW_BASE/include"/>
+  <environment name="KERAS_BACKEND" default="tensorflow"/>
It should be a <runtime .../> type variable (see ROOTSYS as an example). Did you run the tests locally to see if the GPU unit tests pass after setting this?
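A sketch of the suggested toolfile change, modeled on how ROOTSYS is exported; the exact attributes are an assumption based on that pattern:

<!-- Hypothetical runtime-type variable, following the ROOTSYS pattern -->
<runtime name="KERAS_BACKEND" value="tensorflow"/>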
I tested it by setting the environment variable manually (not via the toolfile).
For me all the unit tests still fail with this error:

##Failure Location unknown## : Error
Test name: testHelloWorldCUDA::test
uncaught exception of type std::exception (or derived).
- An exception of category 'UnavailableAccelerator' occurred while
  [0] Calling tensorflow::setBackend()
Exception Message:
Cuda backend requested, NVIDIA GPU visible to cmssw, but not visible to TensorFlow in the job
Some tests that were failing previously worked after setting KERAS_BACKEND. Yes, I saw these failures as well; I thought I had missed some setup step to make them work (in a container started with the --nv flag).
E.g. testTFConstSession was failing with ValueError: Unable to import backend : theano, but after setting KERAS_BACKEND it passed.
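A quick way to confirm which backend Keras 3 picks up, since it reads KERAS_BACKEND at import time (a sketch):

KERAS_BACKEND=tensorflow python3 -c "import keras; print(keras.backend.backend())"
# expected to print: tensorflow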
Could the failure be due to 12.4 not being an officially tested CUDA version for TF 2.16.1 (or even 2.17)? The link lists 12.3 as the officially tested version.
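One way to cross-check which CUDA version a given TF build was compiled against is tf.sysconfig.get_build_info(), e.g.:

python3 -c "import tensorflow as tf; print(tf.sysconfig.get_build_info().get('cuda_version'))"
# compare against the CUDA runtime installed locally (12.4 here)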
Running python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" prints this message:

successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355

and returns an empty list []. I googled this message, and there are basically three solutions:
- Run TensorFlow using the official docker image
- Install TensorFlow using conda and prebuilt wheels
- Force-connect the NUMA node (as suggested in the document that TensorFlow prints out), namely run sudo echo 0 | sudo tee -a /sys/bus/pci/devices/0000\:06\:10.0/numa_node after each reboot (spelled out below). But that requires sudo rights (and, I would imagine, not in the container, but on the host).
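Spelled out, that third workaround might look like this on the host (the PCI address 0000:06:10.0 is taken from the example above and will differ per machine):

cat /sys/bus/pci/devices/0000:06:10.0/numa_node     # prints -1 on affected machines
echo 0 | sudo tee /sys/bus/pci/devices/0000:06:10.0/numa_node   # repeat after each reboot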
Are you sure you started cmssw-el8 with the --nv option? For me the following command
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
runs fine (both for this PR and the TF_X IBs) and returns
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Now it works for me as well, weird.
please abort
Pull request #9388 was updated.
@cmsbuild please test for CMSSW_14_2_TF_X/el8_amd64_gcc12
-1
Failed Tests: UnitTests GpuUnitTests
The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:
You can see more details here:
Unit Tests: I found 1 error in the following unit tests:
---> test testSiStripPayloadInspector had ERRORS
GPU Unit Tests: I found 9 errors in the following unit tests:
---> test testBrokenLineFitGPU_t had ERRORS
---> test testFitsGPU_t had ERRORS
---> test testTFGraphLoadingCUDA had ERRORS
and more ...
Comparison Summary:
Summary:
ChatGPT suggests using:
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/protobuf/config.pb.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/common_runtime/device_factory.h"
#include <iostream>
#include <vector>  // needed for std::vector<tensorflow::DeviceAttributes> below
int main() {
// Initialize a session
tensorflow::Session* session;
tensorflow::SessionOptions options;
// Try to create a new session
tensorflow::Status status = tensorflow::NewSession(options, &session);
if (!status.ok()) {
std::cerr << "Error creating session: " << status.ToString() << std::endl;
return -1;
}
// Retrieve the list of available devices
std::vector<tensorflow::DeviceAttributes> devices;
status = session->ListDevices(&devices);
if (!status.ok()) {
std::cerr << "Error listing devices: " << status.ToString() << std::endl;
return -1;
}
// Check if any GPU devices are available
bool gpu_available = false;
for (const auto& device : devices) {
std::cout << "Device name: " << device.name() << ", type: " << device.device_type() << std::endl;
if (device.device_type() == "GPU") {
gpu_available = true;
}
}
if (gpu_available) {
std::cout << "GPU is available and can be used." << std::endl;
} else {
std::cout << "No GPU devices are available." << std::endl;
}
// Clean up
session->Close();
delete session;
return 0;
}
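For completeness, a hypothetical way to build and run this check inside the environment set up by the toolfile above, using its TENSORFLOW_BASE variable (the source file name and the exact library names are assumptions and may differ per build):

g++ -std=c++17 check_gpu.cc -I"$TENSORFLOW_BASE/include" \
    -L"$TENSORFLOW_BASE/lib" -ltensorflow_cc -ltensorflow_framework -o check_gpu
./check_gpu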
Alternative version of #9241