[ROOT6, ROOT628] Fatal system signal has occurred during exit #40347
assign core |
New categories assigned: core @Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
I've been running WF 4.44 in that IB using 4 threads and so far (after 3 runs) I have yet to be able to reproduce the failure. Probably a threading issue. |
So I ran valgrind and got the following four errors at the end of the job
@pcanal FYI |
So after a couple more attempts (since the other tries didn't show the failure) I was able to get a failure with a longer stack trace
|
So while creating a module we register the data products it will create, and one step in that is to call |
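For illustration, the registration step referred to above normally happens in a producer's constructor via `produces<T>()`; a minimal sketch with a hypothetical module name, using a BTauReco product type purely as an example (not because that is the type implicated here):

```cpp
// Hypothetical sketch of product registration in a CMSSW stream producer.
// Module name and product type are illustrative, not taken from this issue.
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "DataFormats/BTauReco/interface/JetTag.h"

class ExampleTagProducer : public edm::stream::EDProducer<> {
public:
  explicit ExampleTagProducer(edm::ParameterSet const&) {
    // Declaring the product; the framework then needs ROOT type information
    // (the dictionary) for the declared collection type.
    produces<reco::JetTagCollection>();
  }
  void produce(edm::Event&, edm::EventSetup const&) override {}
};
```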
So we should figure out which product type(s) are causing the JIT? |
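One possible way to narrow that down (a sketch, not necessarily what was done here) is to ask ROOT directly whether a compiled dictionary exists for a candidate type; a class without one would have to be handled by the interpreter/JIT:

```cpp
// Illustrative ROOT snippet: check whether a compiled dictionary exists for
// a class name. The type name passed in is a placeholder.
#include <iostream>
#include "TClass.h"

void checkDictionary(const char* typeName) {
  TClass* cl = TClass::GetClass(typeName);
  if (cl != nullptr && cl->HasDictionary()) {
    std::cout << typeName << ": compiled dictionary available\n";
  } else {
    std::cout << typeName << ": no compiled dictionary (interpreter/JIT would be needed)\n";
  }
}
```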
So the library that valgrind reports as having the cmsRun module is pluginRecoBTagCombinedPlugins.so which has the following modules
I'm running step3 of workflow 4.44. In the configuration for that workflow, only one module type is not used. After pruning, no additional module types were cut. The following modules (label, type) are used in the job
|
Going through
they produce either
or
whose dictionary is declared in
The dictionary for that type is declared in cmssw/DataFormats/BTauReco/src/classes_def.xml, lines 76 to 79 (at 7c0300b)
Hmm, are we missing a dictionary for |
Turning on verbose info from ROOT, it looks like we already defined the appropriate dictionary
|
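For reference, one way to obtain that kind of verbose class/dictionary information from ROOT (a sketch; the exact mechanism used above is not shown in this thread) is to raise ROOT's global debug level, assuming `gDebug` is the relevant knob:

```cpp
// Illustrative only: raising ROOT's global debug level makes the interpreter
// print additional information about class and dictionary loading.
// Could be placed e.g. in a rootlogon macro.
#include "TError.h"  // declares the global gDebug

void enableVerboseRoot() {
  gDebug = 2;  // higher values print progressively more detail
}
```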
Hello all,
Thu Mar 30 19:32:03 CEST 2023 : checking O2O
Begin processing the 1st record. Run 399700, Event 1, LumiSection 1 on stream 0 at 30-Mar-2023 19:32:06.688 CEST
-Fatal system signal has occurred during exit
Current TSC key = l1_trg_cosmics2023/v12:l1_trg_rs_cosmics2023/v14
L1TCaloParamsO2ORcd@CaloParams l1_trg_cosmics2023/v12:l1_trg_rs_cosmics2023/v14
L1-O2O-INFO: IOV OK
Above is a part of the log; the entire log is linked in [1] (conddb site). More specifically, the only module that seems to result in this error is the one in [2], while @francescobrivio also tested the rest of the O2O sequences and didn't see similar crash messages. As far as I read (briefly) here, this seems to be an issue with freeing memory at the end, and it doesn't affect the job itself. Is there any news on this topic? It seems a bit stale, and we want to understand the situation because AlCa is thinking of moving the machines to production some time soon. [1] https://cms-conddb.cern.ch/cmsDbBrowser/logs/show_O2O_log/Prep/L1TSubs/2023-03-30%2017:31:57.834130 |
In general
Without further information it is very hard to say whether your problem has a similar origin or not (e.g. we have had a rare |
Hi @makortel, thx for the suggestion. |
More occurrences in CMSSW_13_2_ROOT6_X_2023-06-28-2300 on el8_amd64_gcc11
|
Occurred in CMSSW_13_3_ROOT628_X_2023-08-08-2300 on el8_amd64_gcc11
|
Occurred again today in CMSSW_13_3_ROOT628_X_2023-08-09-2300 on el8_amd64_gcc11:
|
Also in CMSSW_13_3_ROOT6_X_2023-08-09-2300 on el8_amd64_gcc11:
|
Feels like the TCMalloc made this failure mode more frequent. |
Occurred in CMSSW_13_3_ROOT6_X_2023-08-13-2300
in CMSSW_13_3_ROOT628_X_2023-08-13-2300
in CMSSW_13_3_ROOT6_X_2023-08-11-2300
in CMSSW_13_3_ROOT6_X_2023-08-10-2300
in CMSSW_13_3_ROOT628_X_2023-08-10-2300
|
See also #42468, the stack traces do look different with tcmalloc than we were seeing with jemalloc, so it's not obvious that it's the same problem. |
Occurred in CMSSW_13_3_ROOT6_X_2023-08-14-2300
Occurred in CMSSW_13_3_ROOT628_X_2023-08-14-2300
|
Occurred in CMSSW_13_3_ROOT628_X_2023-08-16-2300
|
Occurred in CMSSW_13_3_ROOT6_X_2023-08-17-2300
Occurred in CMSSW_13_3_ROOT628_X_2023-08-17-2300
|
Does root-project/root#13463 improve the behavior? |
Occurred in CMSSW_13_3_ROOT6_X_2023-08-20-2300
Occurred in CMSSW_13_3_ROOT628_X_2023-08-20-2300
|
On a valgrind output on the same test, I see weirdness:
and we have:
vs
I.e. |
I came across gperftools/gperftools#792 that suggests the
|
I looked at a bunch of the "mismatched free" warnings and concluded that they are false positives. When I filtered those out it left a handful of invalid frees that do appear to be genuine double-frees in |
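For context, an illustrative sketch (not code from this job) of the kind of pattern valgrind reports as a mismatched free, i.e. memory released with a different deallocation routine than the one that allocated it; with TCMalloc interposing operator new/delete on top of its own malloc, valgrind can presumably see the allocation and deallocation as coming from different families even when the code is correct, which would explain the false positives:

```cpp
// Deliberately wrong code, only to illustrate what "Mismatched free() /
// delete / delete []" means in valgrind output. Not from CMSSW or ROOT.
#include <cstdlib>

int main() {
  int* a = new int[4];
  delete a;            // should be delete[]  -> reported as a mismatch

  void* p = std::malloc(16);
  operator delete(p);  // malloc'ed memory freed with operator delete -> mismatch
  return 0;
}
```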
Occurred in CMSSW_13_3_ROOT6_X_2023-08-22-2300
Occurred in CMSSW_13_3_ROOT628_X_2023-08-22-2300
|
Occurred in CMSSW_13_3_ROOT6_X_2023-08-24-2300
|
Occurred in CMSSW_13_3_ROOT628_X_2023-08-25-2300
Occurred in CMSSW_13_3_ROOT6_X_2023-08-27-2300
Occurred in CMSSW_13_3_ROOT628_X_2023-08-27-2300
|
Occurred in CMSSW_13_3_ROOT6_X_2023-08-28-2300
Occurred in CMSSW_13_3_ROOT628_X_2023-08-28-2300
|
The TCMalloc update cms-sw/cmsdist#8635 was merged around 08-24 (I wasn't able to identify the exact IB; for some reason the cmsdist PR does not show up in the per-IB "CMS Dist" PR list). This seems to correspond to the appearance of the
The default was changed back to jemalloc in CMSSW_13_3_X_2023-08-29-1100 |
@makortel, cms-sw/cmsdist#8635 was first used in the 08-24-1100 IB (you can see it by using the "CMS Dist" -> "architecture" link of the IB). |
click on the |
I will check if we can easily add the externals/cmsdist PRs which were merged for the "Default" branch for the production arch on the page |
Whoa, didn't know that. Thanks! |
CMSSW_13_3_X_2023-08-29-2300 IB did not show any failures. Wrt. the previous ROOT IBs, in addition to changing the default allocator back to jemalloc, ROOT was updated with cms-sw/cmsdist#8668 (master) and cms-sw/cmsdist#8669 (6.28) |
Also CMSSW_13_3_X_2023-08-30-2300 didn't show any failures. |
No new failures in 08-31-2300, 09-01-2300, 09-03-2300. Given the coincidence with changing back to jemalloc, and observing a much lower failure rate with jemalloc than with TCMalloc, we discussed the option of adding a new IB for ROOT master with TCMalloc to build more confidence that this problem has been fixed (or worked around). |
We are seeing occasional, random job failures because of
in ROOT6 IBs, at least since CMSSW_13_0_ROOT6_X_2022-12-09-2300 (likely earlier). I'm starting to collect now at least how frequent the job failure is