[UBSAN] Undefined behavior in Reco* and TrackingTools reco packages #35036
Comments
assign reconstruction |
A new Issue was created by @mrodozov Mircho Rodozov. @Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@vmariani @mtosi @mmusich a few are in PIX RecoLocalTracker @OzAmram RecoMTD is for @cms-sw/mtd-dpg-l2 RecoMuon is for @ArnabPurohit @trocino RecoHI @mandrenguyen EGM @afiqaize @SohamBhattacharya for RecoCaloTools/Navigation RecoEgamma/EgammaPhotonAlgos |
let's see if this works better or if we need to split this issue to 14 dependent issues |
kind ping for
you can see the summary of the current status at #35036 (comment) |
Concerning the error in RecoMuon/MuonIdentification/plugins/MuonIdProducer.cc, @CeliaFernandez and I isolated the bug that causes it. I add @khaosmos93 and @JanFSchulte for information, since the error occurs in HLT reconstruction in this case, but the bug is general, not HLT specific. Basically, here and here the This bug has been there for a very long time, so we don't understand why the error has never shown up before, nor why it only showed up in wf 11603.0 ( Anyway, we can provide an easy fix to prevent the error and reproduce the intended behavior. But we cannot fully test it if we cannot reproduce the error in the first place. Should we proceed anyway and just check for compilation errors? |
It is being reported in a special UBSAN IB, and therefore it does not show up in standard IBs. The latest UBSAN build is (it is also the very nature of undefined behavior that the problems can show up by themselves or stay hidden for a long time)
For running at CERN please try |
the call for the @makortel @Dr15Jones |
It indeed appears to be possible (in a quick test). I don't know, though, what the standard says. |
About |
about @OzAmram do you know how we get |
I looked into it a bit and I do not know how it would have happened. When I ran the affected workflow to reproduce the issue, it seemed there were many errors/warnings upstream, so I wondered if it could have been a memory issue. The log file is attached |
I'm curious if it's possible to get a stack trace. |
@slava77, UBSAN IBs are built with |
Yes it is a recent change, added 3 weeks ago after discussing it in Core SW meeting. The env variable is part of IB configuration so will be set automatically in local runs. |
There are a few more UBSAN errors like
This is generated when |
Let me clarify. In my opinion the issue is still in UBSAN, which flags as UB any copy of an uninitialized bool allocated in a "dirty area"
|
I think the case is that ASAN isn't 'happy' with other dirty areas either; it is just that a weird value in a bool is the only absolutely positive indicator ASAN has that a value is dirty, so it is able to safely report it. I then use that to look for other uninitialized variables in the same class and fill those in as well (as seen in this PR). Therefore the 'bad bool' value is like a canary in the coal mine and helps focus on areas which may have other problems. |
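To illustrate the 'canary' point above, here is a minimal sketch of the kind of fix being described: giving every member of a class a default member initializer, not only the bool that the sanitizer reported. The struct `HitParams` and its members are hypothetical names chosen for illustration and are not taken from the CMSSW code.

```cpp
#include <cassert>

// Hypothetical struct, not taken from CMSSW. Every member has a default
// member initializer, so a freshly constructed object holds valid values.
struct HitParams {
  bool isOnEdge = false;      // a bool must hold 0 or 1 whenever it is read
  bool hasBadPixels = false;  // second flag, also initialized
  float chargeWidth = 0.f;    // neighbours of the reported bool are filled in too
  int clusterSize = 0;
};

// Copying reads the members; with the initializers above this is well defined
// even if nothing has assigned to the object yet.
inline HitParams copyParams(const HitParams& in) {
  HitParams out = in;
  return out;
}

// Small self-check: a default-constructed object and its copy are "clean".
inline bool defaultsAreClean() {
  const HitParams fresh;
  const HitParams copy = copyParams(fresh);
  assert(!copy.isOnEdge && !copy.hasBadPixels);
  return copy.chargeWidth == 0.f && copy.clusterSize == 0;
}
```

Without the initializers, the bool members of a default-constructed `HitParams` would have indeterminate values, and a sanitizer build could report the load of a value that is neither 0 nor 1; the other members would be just as uninitialized but would not be reported.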
Looks like there's still UB in cmssw/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc Lines 375 to 379 in 4b3840a
so we get zero-length arrays here: cmssw/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc Lines 388 to 393 in 4b3840a
|
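As a hedged illustration of the zero-length array problem, the sketch below assumes the arrays in question are sized by a collection size that is only known at run time and can be zero. It is not the TrackListMerger code; `makeSelectionFlags` and `countSelected` are hypothetical helpers showing one way to avoid the issue by using `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not the TrackListMerger code.
// A variable-length array such as `bool selected[n];` is a compiler extension
// in C++, and UBSAN's vla-bound check reports it when n is not positive.
// std::vector handles n == 0 without special casing.
inline std::vector<char> makeSelectionFlags(std::size_t nTracks) {
  return std::vector<char>(nTracks, 1);  // every track starts out selected
}

// Counts the flags that are still set; returns 0 for an empty collection.
inline std::size_t countSelected(const std::vector<char>& flags) {
  std::size_t n = 0;
  for (char f : flags) {
    if (f != 0) ++n;
  }
  assert(n <= flags.size());
  return n;
}
```

`std::vector<char>` is used instead of `std::vector<bool>` to keep one addressable element per track; an early return when the collection is empty would be another option.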
@dan131riley which workflow is that, so we can confirm it's gone with #37384? |
I used 35434.0. |
Just for reference, these are still left
As far as I understand, these are not going to be fixed (solutions were proposed, but did not converge to a full fix)
TrackListMerger: #37384 Do we have a way to mark in the code "ignore in UBSAN"? https://cmssdt.cern.ch/SDT/jenkins-artifacts/ubsan_logs/CMSSW_12_5_UBSAN_X_2022-05-16-1100/ |
Hi @jpata
I think @ferencek planned to look at these. About
I plan to come back to it in the not too distant future. |
No. |
type tracking |
type trk |
the above tags are just to have easy access later. |
Some time ago @OzAmram had a look at PixelCPEBase but it was not clear what was causing the issue. Oz mentioned that in the same job where he reproduced the issue he saw a bunch of errors upstream from other modules before even reaching the PixelCPEBase code so he thought that maybe some other code was writing to memory out of bounds and overwriting some values in the CPE code. |
I am not sure how that's possible. In any case is it something you and @OzAmram plan to pursue? |
You can see the log where I looked into this previously here:
If you look at the log there is first an error of
and then afterwards many memory errors like:
before the error related to CPEBase. So I think that, to debug this, one would need to see what is causing the initial integer overflow, which is hopefully related to the pointer issue that I believe is causing the problem with CPEBase. I can try to take a look again later this week |
I am not an expert, but I don't think that's the way the UBSAN build is supposed to work. As far as I understand it flags in-point failures and the errors above in the stack are not related to the CPE code but to unrelated code paths executed earlier. @makortel please correct me if I misinterpreted. |
I would interpret the log in the same way, i.e. the non-CPE warnings are likely not related to the CPE warnings. |
So after some debugging it seems that the uninitialized value of the param was causing the error. The only time this variable is ever set (checked with lxr) is here, and I checked that all of the values being set there were valid. The error seems to come before this part of the code, where the values are actually initialized, so it seems to be based on whatever undefined value is present when the struct is created. It seems very strange to me that UBSAN reports this as an error if this undefined value is never used. But in any case, simply setting a default initialization value fixed the UBSAN error. PR #38400 |
@cms-sw/reconstruction-l2 I am a bit lost on what still needs to be fixed here. |
cms-bot internal usage |
All of these have been fixed; I do not see any reference to any of them in the latest UBSAN. |
The UBSAN IB reports undefined behavior in reco with example relval and step they appear in:
check the relval logs in here for the examples:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/