-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ASAN problem in L1TMuonEndCapTrackProducer #29332
Comments
A new Issue was created by @Dr15Jones Chris Jones. @Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Adding an cmssw/L1Trigger/L1TMuonEndCap/src/bdt/Node.cc Line 327 in fe5b448
where the value of
while e->data has size 8
So there is a problem with the algorithm. |
assign l1 |
New categories assigned: l1 @benkrikler,@rekovic you have been requested to review this Pull request/Issue and eventually sign? Thanks |
The size of cmssw/L1Trigger/L1TMuonEndCap/src/PtAssignmentEngine2016.cc Lines 635 to 664 in fe5b448
So this seems to imply that the boosted decision tree being used is incompatible with the actual code calling it. |
We have 96 failing workflows in the ASAN IB and based on a random sampling of them it appears they all are from this problem. |
ping |
Is anyone looking at this? I think it is also causing crashes in the ROOT6 IBs. |
I'm sorry for the delay in getting back. Could you check if the failing workflows are all using Run2_2016 era? If so, I think it's related to #29237 . I believe the main problem is due to loading the wrong global tag, and thus the wrong BDT trees. I hope there is no new issue in |
@jiafulow you can actually look yourself by going to which was the most recent run of address sanitizer (ASAN). It looks like each of the failures labelled |
@jiafulow Any updates? The ASAN is still reporting the issue, e.g. here for workflow 1310.0 |
assign alca |
New categories assigned: alca @christopheralanwest,@tlampen,@pohsun,@tocheng you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@christopheralanwest,@tlampen,@pohsun,@tocheng
Could you please help to double-check if the conditions payload(s) related to |
@makortel using the provided link (updated to 2020-05-01-2300) , the following workflows that failed with error code 256 (incl 1310.0) are all using '--era Run2_2016':
(I didn't check beyond 1314.0. Note that 534.0 and 536.0 failures are not due to EMTF.) As I wrote above, I believe it's related to the issue #29237 which was due to GlobalTag confusion. Unfortunately I don't really know how to fix it. It would require experts who know how to fix the GTs and how to load the correct GTs for the Run2_2016 era in 'runTheMatrix'. |
Let me add that I find it disturbing that wrong conditions payload causes the code to misbehave this way. It would be much better that the code either always works according to the payload, or would be able check upfront if the payload is inconsistent with the code. |
Thanks for the suggestion, this is something we need to do better. In this particular case, the payload is a set of BDT trees. The number of trees and the number of variables, as well as the definitions of the variables, are different for different eras (2016 vs 2017/18). When there is a mismatch between the payload and the C++ codes, it crashes. |
Now it appears to not to crash but to produce undefined results. |
@jiafulow , We still have this error in ASAN IBs[a], I debugged it a bit an it happens when splitVariable >= 8 |
@smuzaffar , I explained the issue in #29237. The quick summary is that back in Feb 2020, someone from L1T has made changes to the global tag used in 10_6_X for era=Run2_2016. This revealed some issue in the software (specifically, the BDT loaded from global tag did not match the software after the change). We fixed that and put the fix in both 10_6_X and 11_1_X. But it turned out that the global tag used in 11_1_X for era=Run2_2016 has not been updated accordingly. This means that, after the fix, the loaded BDT did not match the software again, but now in 11_1_X. We are still waiting for #29237 to be resolved. In the meanwhile, perhaps I can try some incomplete fix? I'm not sure how to debug ASAN errors. If I modify the problematic lines at Node.cc#L327-L330 to do:
Would it make the ASAN errors go away? |
Thanks @jiafulow for the explaination. Sorry I did not read the #29237 You can reproduce the error in ASAN IBs by doing this
Yes your proposed fix should avoid the ASAN error but I would suggest the following change
|
@smuzaffar , thanks for the instructions. I tried your suggestion and it seems to work. Before the fix, I got:
After the fix, I got:
Would you implement the fix? Or should I submit a pull request? |
Go ahead @jiafulow and sumbit the PR. Thanks, |
Ok, I submitted a PR with your suggestion. |
ASAN IBs are good now, we do not see this isssue any more |
The address sanitizer has found the following problem
I got the stack trace by rebuilding the code with debug information on and then running workflow 1302.0 with just one thread.
The text was updated successfully, but these errors were encountered: