Deadlock in Tensorflow in ARM after exception in job #40996
assign l1, core
New categories assigned: core,l1 @epalencia,@Dr15Jones,@smuzaffar,@aloeliger,@makortel,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @Dr15Jones Chris Jones. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
Umm, why is
So the call chain from
cmssw/L1Trigger/L1TMuonEndCap/src/TrackFinder.cc Lines 97 to 105 in e7caf24
cmssw/L1Trigger/L1TMuonEndCap/src/SectorProcessor.cc Lines 156 to 171 in e7caf24
cmssw/L1Trigger/L1TMuonEndCap/src/PtAssignmentEngineDxy.cc Lines 18 to 37 in e7caf24
In the end it looks like only a "lazy initialization", but it would nevertheless be better to read the TF graph and create the session around module construction time (given that the model file seems to be specified in the configuration, and the
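For illustration, here is a minimal sketch (not the actual PtAssignmentEngineDxy code) of what construction-time initialization could look like, assuming the PhysicsTools/TensorFlow helpers loadGraphDef, createSession, run, and closeSession; the class name, input/output node names, and constructor argument are placeholders:

```cpp
#include <memory>
#include <string>
#include <vector>

#include "PhysicsTools/TensorFlow/interface/TensorFlow.h"

class PtAssignmentSketch {
public:
  // Read the graph and create the session up front, so any failure surfaces
  // at module construction rather than in the middle of event processing.
  explicit PtAssignmentSketch(const std::string& modelPath)
      : graphDef_(tensorflow::loadGraphDef(modelPath)),
        session_(tensorflow::createSession(graphDef_.get())) {}

  ~PtAssignmentSketch() {
    if (session_ != nullptr) {
      tensorflow::closeSession(session_);
    }
  }

  // Per-event use only runs the already-created session.
  std::vector<tensorflow::Tensor> evaluate(const tensorflow::Tensor& input) const {
    std::vector<tensorflow::Tensor> outputs;
    // "input" and "Identity" are placeholder node names, not the real model's.
    tensorflow::run(session_, {{"input", input}}, {"Identity"}, &outputs);
    return outputs;
  }

private:
  std::unique_ptr<tensorflow::GraphDef> graphDef_;  // must outlive the session
  tensorflow::Session* session_ = nullptr;
};
```

With this layout the per-event code path never creates a session, so a first-event exception cannot interrupt session creation inside event processing.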
@eyigitba I may be mistaken, but is this the model you were discussing/working on/with?
I found that we've seen unreliable behavior of TensorFlow's
Hi @aloeliger, sorry for replying late, but yes, this is the NN module that I worked on. Please let me know if there is something that needs to be done on my end.
So the job
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc11/CMSSW_13_1_X_2023-03-07-2300/pyRelValMatrixLogs/run/136.771_RunDoubleMuon2016H/step3_RunDoubleMuon2016H.log#/
threw an 'out of memory' exception
but then the job was killed for taking too long. The log file shows that it wrote to the log only during the first 5 minutes of the job; nearly 4 hours later it was killed for exceeding its time limit. The stack traces are
One can see that threads 1 and 4 are both waiting to acquire a lock within TensorFlow, but no thread's stack shows it holding that lock. It looks like the TensorFlow lock was never released during the stack unwinding triggered by the exception.
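As a hypothetical illustration of that failure mode (not TensorFlow's actual locking code), compare a manually locked mutex with an RAII guard when an exception unwinds the stack:

```cpp
#include <mutex>
#include <new>  // std::bad_alloc

std::mutex tfLock;  // stands in for the internal TensorFlow lock

void mayThrow() {
  throw std::bad_alloc();  // stands in for the 'out of memory' condition seen in the job
}

void manualLocking() {
  tfLock.lock();
  mayThrow();       // exception propagates out with the mutex still held
  tfLock.unlock();  // never reached: every later lock attempt now blocks forever
}

void raiiLocking() {
  std::lock_guard<std::mutex> guard(tfLock);
  mayThrow();  // guard's destructor releases the mutex during stack unwinding
}
```

In the manual variant the mutex stays locked after the exception propagates, so any thread that later tries to acquire it blocks with no owner in sight, which matches the picture of threads 1 and 4 waiting on a lock that no stack appears to hold.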