Remove tensorflow::setLogging() as thread-unsafe #46065

Merged 1 commit on Sep 24, 2024

@makortel
Contributor

makortel commented Sep 19, 2024

PR description:

tensorflow::setLogging() calls setenv(), which is not required to be thread-safe and, in glibc specifically, leads to a race condition with any concurrent getenv() calls. For more information see #46002 (comment). There is circumstantial evidence that these specific setenv() calls could be causing the rare crash reported in #44659.
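
For illustration, here is a minimal sketch of the race-free pattern: any environment modification happens once, while the process is still single-threaded, and the parallel section only reads. The helper names (`setDefaultEnvBeforeThreads`, `readEnv`, `countReadersSeeing`) are hypothetical and not part of CMSSW or TensorFlow; only POSIX `setenv()` and `std::getenv()` are real calls.

```cpp
#include <cstdlib>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: set a default for an environment variable while the
// process is still single-threaded. overwrite=0 asks POSIX setenv() to keep
// a value that is already present. Returns true on success.
inline bool setDefaultEnvBeforeThreads(const char* name, const char* value) {
  return ::setenv(name, value, /*overwrite=*/0) == 0;
}

// Hypothetical helper: read-only lookup. Concurrent getenv() calls do not
// race with each other as long as nothing modifies the environment.
inline std::string readEnv(const char* name) {
  const char* v = std::getenv(name);
  return v ? std::string(v) : std::string();
}

// Spawns nThreads reader threads and returns how many of them saw `expected`.
inline int countReadersSeeing(const char* name, const std::string& expected, int nThreads) {
  std::vector<int> seen(nThreads, 0);  // one slot per thread, no sharing
  std::vector<std::thread> threads;
  for (int i = 0; i < nThreads; ++i) {
    threads.emplace_back([&seen, i, name, &expected] {
      seen[i] = (readEnv(name) == expected) ? 1 : 0;
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  int n = 0;
  for (int s : seen) {
    n += s;
  }
  return n;
}
```

The point of the sketch is only the ordering: the single `setenv()` call completes before any thread exists, so no `getenv()` can overlap with it.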

This PR should probably be accompanied by a PR to cmsdist setting TF_CPP_MIN_LOG_LEVEL=3 in the TensorFlow toolfile.

Resolves cms-sw/framework-team#1030

PR validation:

The code compiles. Tracing setenv() calls with gdb in workflow 12861.0 step2 no longer shows any setenv() calls made in the framework's parallel section.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

Whether this should be backported is a good question. The race condition exists in earlier releases, but we haven't seen crash reports from production. Maybe backports to 14_1_X and 14_0_X would still be useful?

@cmsbuild
Contributor

A new Pull Request was created by @makortel for master.

It involves the following packages:

  • DQM/DTMonitorClient (dqm)
  • L1Trigger/L1CaloTrigger (l1, upgrade)
  • L1Trigger/L1THGCal (l1, upgrade)
  • L1Trigger/Phase2L1ParticleFlow (l1, upgrade)
  • PhysicsTools/TensorFlow (ml)
  • RecoEcal/EgammaCoreTools (reconstruction)
  • RecoMuon/TrackerSeedGenerator (reconstruction)
  • RecoTauTag/HLTProducers (hlt)

@Martin-Grunewald, @aloeliger, @antoniovagnerini, @cmsbuild, @epalencia, @jfernan2, @mandrenguyen, @mmusich, @nothingface0, @rvenditti, @srimanob, @subirsarkar, @syuvivida, @tjavaid, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@CeliaFernandez, @Fedespring, @HuguesBrun, @Martin-Grunewald, @Prasant1993, @ReyerBand, @Sam-Harper, @a-kapoor, @abbiendi, @afiqaize, @amarini, @andrea21z, @argiro, @azotz, @battibass, @bellan, @cericeci, @jainshilpi, @jbsauvan, @jhgoh, @lgray, @mbluj, @missirol, @mmusich, @ram1123, @rchatter, @riga, @rociovilar, @sameasy, @silviodonato, @sobhatta, @thomreis, @trocino, @valsdav, @varuns23, @wang0jin this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@makortel
Contributor Author

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 84KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-407fec/41643/summary.html
COMMIT: bafd04e
CMSSW: CMSSW_14_2_X_2024-09-19-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46065/41643/install.sh to create a dev area with all the needed externals and cmssw changes.


@mmusich
Contributor

mmusich commented Sep 20, 2024

> This PR should probably be accompanied by a PR to cmsdist setting TF_CPP_MIN_LOG_LEVEL=3 in the TensorFlow toolfile.

Has this already happened?

@mmusich
Contributor

mmusich commented Sep 20, 2024

@kandrosov FYI

@mmusich
Contributor

mmusich commented Sep 20, 2024

+hlt

@valsdav
Contributor

valsdav commented Sep 20, 2024

+ml

Thanks for the fix.

@makortel
Contributor Author

> This PR should probably be accompanied by a PR to cmsdist setting TF_CPP_MIN_LOG_LEVEL=3 in the TensorFlow toolfile.
>
> Has this already happened?

Now in cms-sw/cmsdist#9418

@makortel
Contributor Author

@cmsbuild, please test with cms-sw/cmsdist#9418

@makortel
Contributor Author

> could you also maybe initiate the backports to 14_1_X and 14_0_X? This way we can issue a test of the PR on the current playback release 14_0_X.

I can make the backports after the next round of tests succeed.

@smuzaffar
Contributor

@cmsbuild, please test with cms-sw/cmsdist#9418

@makortel, I have updated cms-sw/cmsdist#9418 to make TF_CPP_MIN_LOG_LEVEL a runtime variable.
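
As a rough sketch of what this means for code inside the process: with the variable already set in the runtime environment when the job starts, code only ever needs to read it, never to call setenv(). The helper names below are hypothetical, and this is not how TensorFlow itself parses the variable; the accepted range 0 to 3 is an assumption based on the commonly documented values of TF_CPP_MIN_LOG_LEVEL.

```cpp
#include <cstdlib>

// Hypothetical helper: parse a log-level string. Returns `fallback` when the
// text is missing, empty, not a plain integer, or outside the assumed 0..3 range.
inline int parseMinLogLevel(const char* text, int fallback) {
  if (text == nullptr || *text == '\0') {
    return fallback;
  }
  char* end = nullptr;
  const long value = std::strtol(text, &end, 10);
  if (*end != '\0' || value < 0 || value > 3) {
    return fallback;
  }
  return static_cast<int>(value);
}

// Hypothetical helper: read-only lookup of the level from the environment.
// It never modifies the environment, so it is safe to call from the
// parallel section as long as nothing else calls setenv() concurrently.
inline int minLogLevelFromEnv(int fallback = 0) {
  return parseMinLogLevel(std::getenv("TF_CPP_MIN_LOG_LEVEL"), fallback);
}
```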

smuzaffar added a commit to cms-sw/cms-common that referenced this pull request Sep 20, 2024
@cmsbuild
Contributor

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-407fec/41670/summary.html
COMMIT: bafd04e
CMSSW: CMSSW_14_2_X_2024-09-20-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46065/41670/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-407fec/41670/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-407fec/41670/git-merge-result


@makortel
Contributor Author

This PR + cms-sw/cmsdist#9418 seem to remove these printouts:

> To enable the following instructions: AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.

@antoniovagnerini

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @antoniovilela, @rappoccio, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

@makortel
Contributor Author

> could you also maybe initiate the backports to 14_1_X and 14_0_X? This way we can issue a test of the PR on the current playback release 14_0_X.
>
> I can make the backports after the next round of tests succeed.

The backports are in

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit c1a8e80 into cms-sw:master Sep 24, 2024
11 checks passed
@makortel makortel deleted the tensorflowRemoveLogging branch September 24, 2024 13:32