CMSSW tests failing again with Fatal Root Error: @SUB=Minuit2
#43577
Comments
A new Issue was created by @aandvalenzuela Andrea Valenzuela. @Dr15Jones, @antoniovilela, @smuzaffar, @sextonkennedy, @rappoccio, @makortel can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
type root |
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Did we see these errors the last time the ROOT 6.30 build was tested on these architectures? On the other hand, since the failures are not widespread, maybe that hints towards a random component in the cause? |
On el8_aarch64_gcc12 CMSSW_14_0_X_2023-12-13-2300, two workflows crashed in a way that looks possibly related:
11024.0 step 3
11025.0 step 3
|
Somehow this unit test failure seems to be specific to slc7. The test started to fail on CMSSW_14_0_X_2023-12-13-1100 (where we deployed ROOT 6.30), and has failed on every slc7 IB since then, but not in any other IB. |
If this happens so often, then maybe we should come back to @Dr15Jones's suggestion in #42979 (comment). Since ROOT 6.30, the default minimizer is Minuit2. Unlike the legacy Minuit, it reports fit failures as ROOT errors, which CMSSW turns into exceptions by default. Maybe it's not reasonable to expect that all fits in the DQM plots should succeed? |
I see @smuzaffar fixed this particular problem in #43588 by using the likelihood fit instead of chi-square (as was done earlier for |
@cms-sw/dqm-l2 Could you comment? |
Here is one on slc7_amd64_gcc12 CMSSW_14_0_X_2024-01-11-2300 workflow 25208.0 step 4
|
There is also some discussion here: |
el8_amd64_gcc12 CMSSW_14_0_X_2024-01-11-1100 11602.0 step4
|
So, should we continue changing fits from chi-square to likelihood as we find these issues? |
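For readers following the chi-square vs. likelihood question, here is a minimal pure-Python sketch of why the two objectives can behave differently on sparsely populated histograms. It is not CMSSW or ROOT code. The function names, the toy bin contents and the two models are made up for illustration, and it assumes the chi-square uses observed-count errors and skips empty bins.

```python
import math


def neyman_chi2(counts, expected):
    """Chi-square with observed-count errors (sigma^2 = n).

    Assumption for this sketch: bins with zero counts are skipped,
    because their error would be zero.
    """
    return sum((n - mu) ** 2 / n for n, mu in zip(counts, expected) if n > 0)


def poisson_nll(counts, expected):
    """Binned Poisson negative log-likelihood ratio.

    Empty bins still contribute a term equal to the expected yield mu.
    """
    total = 0.0
    for n, mu in zip(counts, expected):
        total += mu - n
        if n > 0:
            total += n * math.log(n / mu)
    return total


# Hypothetical sparse histogram: most bins are empty.
counts = [0, 1, 0, 3, 1, 0, 0]

# Two hypothetical models that agree on the filled bins
# but disagree strongly on the empty ones.
model_a = [5.0, 1.0, 5.0, 3.0, 1.0, 5.0, 5.0]
model_b = [0.1, 1.0, 0.1, 3.0, 1.0, 0.1, 0.1]

chi2_a = neyman_chi2(counts, model_a)
chi2_b = neyman_chi2(counts, model_b)
nll_a = poisson_nll(counts, model_a)
nll_b = poisson_nll(counts, model_b)

print("chi2:", chi2_a, chi2_b)
print("nll: ", nll_a, nll_b)
```

In this toy case the chi-square is identical for both models because the empty bins carry no weight, while the likelihood penalizes the model that predicts entries where none were observed. That does not by itself explain the Minuit2 failures discussed here; it only illustrates why a likelihood fit may be better constrained on low-statistics plots.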
@vgvassilev Do you have any thoughts about the stack traces that include cling above (#43577 (comment))? They continue appearing randomly on ARM. |
Can we run with valgrind to make sure there are no obvious memory errors? |
I can give a try on slc7 x86 |
Given that the exception message mentioned in the issue description was demoted to an error message in #43726, I would close this issue, as the test should no longer fail because of this exception, and follow up on the crashes on ARM in a separate issue. |
New issue is in #43802 |
+core |
@please close |
This issue is fully signed and ready to be closed. |
@cmsbuild, please close |
Hello,
Since we moved to ROOT 6.30, we are seeing again errors like the one reported in #42979 throwing the following exception:
In this case, we see the errors on multiple archs:
- el8_aarch64_gcc12: RelVals 25202.2 and 1365.0.
- el8_ppc64le_gcc12: RelVal 25202.15.
- slc7_amd64_gcc12: Unit test PrimaryVertex (module Alignment/OfflineValidation).

#43106 fixed this issue in the past by using likelihood fit instead of chi-square, but it seems to be back in ROOT 6.30.
Thanks!
FYI @guitargeek, @smuzaffar