Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UBSAN] Undefined behavior in Reco* and TrackingTools reco packages #35036

Closed
mrodozov opened this issue Aug 27, 2021 · 41 comments
Closed

[UBSAN] Undefined behavior in Reco* and TrackingTools reco packages #35036

mrodozov opened this issue Aug 27, 2021 · 41 comments

Comments

@mrodozov
Copy link
Contributor

mrodozov commented Aug 27, 2021

The UBSAN IB reports undefined behavior in reco with example relval and step they appear in:


23434.9921 step4
TrackingTools/TrajectoryParametrization/interface/PerigeeTrajectoryError.h:14:7: runtime error: load of value 253, which is not a valid value for type 'bool'

134.706 step3 
TrackingTools/TrajectoryState/interface/BasicTrajectoryState.h:66:7: runtime error: load of value 131071, which is not a valid value for type 'SurfaceSide'

136.811 step3
TrackingTools/TrajectoryState/interface/TrajectoryStateClosestToPoint.h:18:7: runtime error: load of value 164, which is not a valid value for type 'bool'

136.751 step3
RecoCaloTools/Navigation/interface/CaloRectangle.h:80:87: runtime error: reference binding to null pointer of type 'const struct CaloSubdetectorTopology'

138.1 step2
RecoEgamma/EgammaPhotonAlgos/src/EnergyUncertaintyPhotonSpecific.cc:229:99: runtime error: index -1 out of bounds for type 'float [6]'

159.3 step3
RecoHI/HiJetAlgos/plugins/HiPuRhoProducer.cc:357:11: runtime error: load of value 16, which is not a valid value for type 'bool'

134.706 step3
RecoLocalTracker/SiPixelClusterizer/plugins/PixelThresholdClusterizer.cc:225:27: runtime error: variable length array bound evaluates to non-positive value 0

11603.0 step3
RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 138369145, which is not a valid value for type 'SubDetector'

34634.21 step3
RecoMTD/TrackExtender/plugins/TrackExtenderWithMTD.cc:889:29: runtime error: downcast of address 0x2af64df85f40 which does not point to an object of type 'MTDRingForwardDoubleLayer'

11603.0 step2
RecoMuon/MuonIdentification/plugins/MuonIdProducer.cc:894:40: runtime error: member call on null pointer of type 'struct MuonShowerDigiFiller'

11650.0 step2
RecoMuon/TrackerSeedGenerator/plugins/TSGFromL2Muon.cc:106:28: runtime error: reference binding to null pointer of type 'const struct RectangularEtaPhiTrackingRegion'

11834.24 step2
RecoTracker/DeDx/plugins/HLTDeDxFilter.cc:119:92: runtime error: variable length array bound evaluates to non-positive value 0

11603.0 step3
RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:441:20: runtime error: variable length array bound evaluates to non-positive value 0

11723.17 step3
RecoTracker/TkSeedGenerator/plugins/DeepCoreSeedGenerator.cc:340:97: runtime error: left shift of negative value -956

check the relval logs in here for the examples:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

@mrodozov
Copy link
Contributor Author

assign reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Copy link
Contributor

slava77 commented Aug 27, 2021

@vmariani @mtosi @mmusich
it looks like many issues are in the tracking code : TrackingTools and RecoTracker

a few are in PIX RecoLocalTracker @OzAmram

RecoMTD is for @cms-sw/mtd-dpg-l2

RecoMuon is for @ArnabPurohit @trocino

RecoHI @mandrenguyen

EGM @afiqaize @SohamBhattacharya for RecoCaloTools/Navigation RecoEgamma/EgammaPhotonAlgos

@slava77
Copy link
Contributor

slava77 commented Aug 27, 2021

let's see if this works better or if we need to split this issue to 14 dependent issues

@slava77
Copy link
Contributor

slava77 commented Sep 7, 2021

kind ping for

@vmariani @mmusich
it looks like many issues are in the tracking code : TrackingTools and RecoTracker

a few are in PIX RecoLocalTracker @OzAmram

RecoMuon is for @CeliaFernandez @trocino

RecoHI @mandrenguyen

EGM @afiqaize @SohamBhattacharya for RecoCaloTools/Navigation RecoEgamma/EgammaPhotonAlgos

you can see the summary of the current status at #35036 (comment)

@trocino
Copy link
Contributor

trocino commented Sep 7, 2021

Concerning the error in RecoMuon/MuonIdentification/plugins/MuonIdProducer.cc, @CeliaFernandez and I isolated the bug that causes it. I add @khaosmos93 and @JanFSchulte for information, since the error occurs in HLT reconstruction in this case — but the bug is general, not HLT specific.

Basically, here and here the theShowerDigiFiller_ pointer is used uninitialized to call function fillDefault(...). This happens whenever the fillShowerDigis_ flag is False (see here), which is always the case in HLT.

This bug has been there for a very long time, so we don't understand why the error has never showed up before, nor why it only showed up in wf 11603.0 (SingleElectronPt1000). Also, I cannot reproduce the error with runTheMatrix. I tried running both wf 11603.0 and others with more muons, but the error is never printed out. Note that I cannot use the exact same IB (it doesn't exist anymore) nor the exact same sample (it is not accessible at CERN_T2 anymore). But in principle the error should be raised in any release and anytime the HLT muon reconstruction is run. So I am probably missing something: should I use some specific configuration to print out this error?

Anyway, we can provide an easy fix to prevent the error and reproduce the intended behavior. But we cannot fully test it if we cannot reproduce the error in the first place. Should we proceed anyway and just check for compilation errors?

@makortel
Copy link
Contributor

makortel commented Sep 7, 2021

This bug has been there for a very long time, so we don't understand why the error has never showed up before, nor why it only showed up in wf 11603.0 (SingleElectronPt1000). Also, I cannot reproduce the error with runTheMatrix. I tried running both wf 11603.0 and others with more muons, but the error is never printed out

It is being reported in a special UBSAN IB, and therefore it does not show up in standard IBs. The latest UBSAN build is CMSSW_12_1_UBSAN_X_2021-09-03-2300, could you try that?

(it is also the very nature of undefined behavior that the problems can show by themselves or stay hidden for long time)

Note that I cannot use the exact same IB (it doesn't exist anymore) nor the exact same sample (it is not accessible at CERN_T2 anymore).

For running at CERN please try runTheMatrix.py ... --ibeos, then it will use the same cache for input data as IBs do.

@slava77
Copy link
Contributor

slava77 commented Sep 16, 2021

Basically, here and here the theShowerDigiFiller_ pointer is used uninitialized to call function fillDefault(...). This happens whenever the fillShowerDigis_ flag is False (see here), which is always the case in HLT.

This bug has been there for a very long time, so we don't understand why the error has never showed up before, nor why it only showed up in wf 11603.0 (SingleElectronPt1000).

the call for the False case is (out of line/.cc implementation) void MuonShowerDigiFiller::fillDefault(reco::MuonChamberMatch& muChMatch) const { muChMatch.nDigisInRange = 0; }
and the type is apparently not virtual.
So, the call does not depend on the algorithm instance.

@makortel @Dr15Jones
is it possible that the compiler optimizes away somehow the unique_ptr dereferencing with the direct call in the method?

@makortel
Copy link
Contributor

is it possible that the compiler optimizes away somehow the unique_ptr dereferencing with the direct call in the method?

It indeed appears to be possible (in a quick test). I'm don't know though what the standard says.

@smuzaffar
Copy link
Contributor

About RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:441:20: runtime error: variable length array bound evaluates to non-positive value 0 , this happens when trackProducers_ has size of zero ( https://github.com/cms-sw/cmssw/blob/master/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc#L327 ) resulting in empty trackColls (https://github.com/cms-sw/cmssw/blob/master/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc#L325 ). Can we just protect https://github.com/cms-sw/cmssw/blob/master/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc#L341-L347 and run it if collsSize>0 ?

@slava77
Copy link
Contributor

slava77 commented Oct 27, 2021

about 11603.0 step3 RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 138369145, which is not a valid value for type 'SubDetector'

@OzAmram do you know how we get DetParam constructed without assigning GeomDetType::SubDetector thePart;?

@OzAmram
Copy link
Contributor

OzAmram commented Oct 28, 2021

I looked into a bit and I do not know how it would have happened. When I ran the effected workflow to reproduce the issue it seemed there were many errors/warnings upstream so I wondered if it could have been a memory issue. The log file is attached
step3_SingleElectronPt1000+2021+SingleElectronPt1000_pythia8_GenSim+Digi+Reco+HARVEST+ALCA.log

@slava77
Copy link
Contributor

slava77 commented Oct 28, 2021

I looked into a bit and I do not know how it would have happened. When I ran the effected workflow to reproduce the issue it seemed there were many errors/warnings upstream so I wondered if it could have been a memory issue. The log file is attached step3_SingleElectronPt1000+2021+SingleElectronPt1000_pythia8_GenSim+Digi+Reco+HARVEST+ALCA.log

I'm curious if it's possible to get a stack trace.
Looking at https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html#stack-traces-and-report-symbolization
there is apparently a way to get some more data.
@smuzaffar is the code in UBSAN already compiled with -g and -fno-omit-frame-pointer?
it may be that UBSAN_OPTIONS=print_stacktrace=1 is enough.

@slava77
Copy link
Contributor

slava77 commented Oct 29, 2021

@slava77 , UBSAN_OPTIONS=print_stacktrace=1 is already set for UBSAN

ah, nice. Is this recent?
I did not see it in some older logs and in the more recent log posted by @OzAmram .
Or is this an extra option that should be used to run locally?

@smuzaffar
Copy link
Contributor

smuzaffar commented Oct 29, 2021

Yes it is a recent change, added 3 weeks ago after discussing it in Core SW meeting. The env variable is part of IB configuration so will be set automatically in local runs.

@smuzaffar
Copy link
Contributor

There are few more UBSAN errors like
https://cmssdt.cern.ch/SDT/jenkins-artifacts/ubsan_logs/CMSSW_12_2_X_2021-11-26-2300/logs/07/07ac3c7a6f198589d18164a98a1108d6/log

123.1/step1:RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:341:40: runtime error: variable length array bound evaluates to non-positive value 0
123.2/step1:RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:341:40: runtime error: variable length array bound evaluates to non-positive value 0

This is generated when trackColls.size() is zero at https://github.com/cms-sw/cmssw/blob/master/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc#L339 . Is someone looking in to this?

@VinInn
Copy link
Contributor

VinInn commented Feb 18, 2022

Let me clarify.
The advice of algorithm experts is welcome.

Still in my opinion the issue is in UBSAN that flag as UB any copy of a uninitialized bool allocated on a "dirty area"
while is happy with any other type (see below)
So either we flag as error any copy of uninitialized values or I do not see why bool should be considered "special".

[innocent@patatrack02 ctest]$ cat boolUB.cc
#include<memory>
#include<iostream>
int main() {

  unsigned char a[4] = {123,244,123,244};

  void * p = a;

  auto * q1 = new(p)  bool;

  std::cout << "here it comes" << std::endl;
  bool b = *q1;

  auto * q2 = new(p)  float;

  std::cout << "here it's happy" << std::endl;
  float f = *q2;


  return b&(f!=0);

}
[innocent@patatrack02 ctest]$ c++ -O2 -Wall -fsanitize=undefined boolUB.cc
[innocent@patatrack02 ctest]$ ./a.out
here it comes
boolUB.cc:12:8: runtime error: load of value 123, which is not a valid value for type 'bool'
here it's happy
[innocent@patatrack02 ctest]$

@Dr15Jones
Copy link
Contributor

Still in my opinion the issue is in UBSAN that flag as UB any copy of a uninitialized bool allocated on a "dirty area"
while is happy with any other type (see below)

I think the case is ASAN isn't 'happy' with other dirty areas, it is just a weird value in a bool is the only absolutely positive indicator that ASAN has that a value is dirty, so it is able to safely report it.

I then use that to look for other uninitialized variables in the same class and fill those in as well (as seen in this PR).

Therefore the 'bad bool' value is like a canary in the coal mine and helps focus on areas which may have other problems.

@dan131riley
Copy link

Looks like there's still UB in TrackListMerger after merging #37308. Apparently it's possible for trackColls to be non-empty but have all of its elements empty, so the rSize calculated here is zero:

for (unsigned int i = 0; i != collsSize; i++) {
trackCollSizes[i] = trackColls[i]->size();
trackCollFirsts[i] = rSize;
rSize += trackCollSizes[i];
}

so we get zero-length arrays here:

int selected[rSize];
int indexG[rSize];
bool trkUpdated[rSize];
int trackCollNum[rSize];
int trackQuals[rSize];
float trackMVAs[rSize];

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/1bd3a624eec422435ea44a59d32d7356/opt/cmssw/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_UBSAN_X_2022-03-25-1100/src/RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:388:21: runtime error: variable length array bound evaluates to non-positive value 0

@jpata
Copy link
Contributor

jpata commented Mar 28, 2022

Looks like there's still UB in ...

@dan131riley which workflow is that, so we can confirm it's gone with #37384?

@jpata
Copy link
Contributor

jpata commented May 17, 2022

Just for reference, these are still left

456 RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 1060586858, which is not a valid value for type 'SubDetector'
2 RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforGPU.h:248:23: runtime error: signed integer overflow: 8 - -2147483648 cannot be represented in type 'int'
2 RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforGPU.h:249:23: runtime error: signed integer overflow: 104 - -2147483648 cannot be represented in type 'int'

As far as I understand, these are not going to be fixed (solutions were proposed, but did not converge to a full fix)

187 RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:451:45: runtime error: variable length array bound evaluates to non-positive value 0
187 RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:452:20: runtime error: variable length array bound evaluates to non-positive value 0
10 TrackingTools/TrajectoryParametrization/interface/PerigeeTrajectoryError.h:14:7: runtime error: load of value 29, which is not a valid value for type 'bool' 

TrackListMerger: #37384
PerigeeTrajectoryError: #36895

Do we have a way to mark in the code "ignore in UBSAN"?

https://cmssdt.cern.ch/SDT/jenkins-artifacts/ubsan_logs/CMSSW_12_5_UBSAN_X_2022-05-16-1100/

@mmusich
Copy link
Contributor

mmusich commented May 17, 2022

Hi @jpata

Just for reference, these are still left

456 RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 1060586858, which is not >a valid value for type 'SubDetector'
2 RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforGPU.h:248:23: runtime error: signed integer overflow: 8 - >-2147483648 cannot be represented in type 'int'
2 RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforGPU.h:249:23: runtime error: signed integer overflow: 104 - >-2147483648 cannot be represented in type 'int'

I think @ferencek planned to look at these.

About

As far as I understand, these are not going to be fixed (solutions were proposed, but did not converge to a full fix)

187 RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:451:45: runtime error: variable length array bound evaluates to non-positive value 0
187 RecoTracker/FinalTrackSelectors/plugins/TrackListMerger.cc:452:20: runtime error: variable length array boun

I plan to come back to it in the not to distant future.

@makortel
Copy link
Contributor

Do we have a way to mark in the code "ignore in UBSAN"?

No.

@mmusich
Copy link
Contributor

mmusich commented May 19, 2022

type tracking

@mmusich
Copy link
Contributor

mmusich commented May 19, 2022

type trk

@mmusich
Copy link
Contributor

mmusich commented May 19, 2022

the above tags are just to have easy access later.

@ferencek
Copy link
Contributor

Just for reference, these are still left

456 RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 1060586858, which is not >a valid value for type 'SubDetector'
2 RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforGPU.h:248:23: runtime error: signed integer overflow: 8 - >-2147483648 cannot be represented in type 'int'
2 RecoLocalTracker/SiPixelRecHits/interface/pixelCPEforGPU.h:249:23: runtime error: signed integer overflow: 104 - >-2147483648 cannot be represented in type 'int'

I think @ferencek planned to look at these.

Some time ago @OzAmram had a look at PixelCPEBase but it was not clear what was causing the issue. Oz mentioned that in the same job where he reproduced the issue he saw a bunch of errors upstream from other modules before even reaching the PixelCPEBase code so he thought that maybe some other code was writing to memory out of bounds and overwriting some values in the CPE code.

@mmusich
Copy link
Contributor

mmusich commented May 25, 2022

PixelCPEBase code so he thought that maybe some other code was writing to memory out of bounds and overwriting some values in the CPE code.

I am not sure how that's possible. In any case is it something you and @OzAmram plan to pursue?

@OzAmram
Copy link
Contributor

OzAmram commented May 30, 2022

You can see the log where I looked into this previously here:

I looked into a bit and I do not know how it would have happened. When I ran the effected workflow to reproduce the issue it seemed there were many errors/warnings upstream so I wondered if it could have been a memory issue. The log file is attached step3_SingleElectronPt1000+2021+SingleElectronPt1000_pythia8_GenSim+Digi+Reco+HARVEST+ALCA.log

If you look at the log there is first an error of

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/a1dd9deda35a37ab828b89609a57944f/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/FWCore/ParameterSet/src/types.cc:147:13: runtime error: signed integer overflow: 2001599834 * 10 cannot be represented in type 'int'

and then afterwards many memory errors like:

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/a1dd9deda35a37ab828b89609a57944f/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/Validation/EcalDigis/plugins/EcalSelectiveReadoutValidation.cc:99:27: runtime error: member call on address 0x7f4f73645ac0 which does not point to an object of type 'EDConsumerBase'
0x7f4f73645ac0: note: object has a possibly invalid vptr: abs(offset to top) too big
00 00 00 00 58 ef 3a 3a 4f 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^~~~~~~~~~~~~~~~~~~~~~~
possibly invalid vptr

before the error related to CPEBase. So I think to debug one would need to see what is causing the initial integer overflow which is hopefully related to the pointer issue which is causing the issue with CPEBase I believe. I can try to take a look again later this week

@mmusich
Copy link
Contributor

mmusich commented May 30, 2022

So I think to debug one would need to see what is causing the initial integer overflow which is hopefully related to the pointer issue which is causing the issue with CPEBase I believe.

I am not an expert, but I don't think that's the way the UBSAN build is supposed to work. As far as I understand it flags in-point failures and the errors above in the stack are not related to the CPE code but to unrelated code paths executed earlier. @makortel please correct me if I misinterpreted.

@makortel
Copy link
Contributor

So I think to debug one would need to see what is causing the initial integer overflow which is hopefully related to the pointer issue which is causing the issue with CPEBase I believe.

I am not an expert, but I don't think that's the way the UBSAN build is supposed to work. As far as I understand it flags in-point failures and the errors above in the stack are not related to the CPE code but to unrelated code paths executed earlier.

I would interpret the log in the same way, i.e. the non-CPE warnings are likely not related to the CPE warnings.

@OzAmram
Copy link
Contributor

OzAmram commented Jun 16, 2022

So after some debugging it seems like it is the uninitialized value of the param which was causing the error. The only time this variable is ever set (checked with lxr) is here and I checked that all of the values being set there were valid. The error seems to come before this part of the code when the values are actually initialized, so seems to be based on whatever undefined value it is present when the struct is created.

It seems very strange to me that UBSAN reports this as an error for this if this undefined value is never used. But in any case, simply setting a default initialization value fixed the UBSAN error. PR #38400

@mmusich
Copy link
Contributor

mmusich commented Jun 29, 2023

@cms-sw/reconstruction-l2 I am bit lost on what is still needing to be fixed here.
Would it be possible for you to list the remaining (or new) undefined behaviour problems in the tracking code stack?

@cmsbuild
Copy link
Contributor

cms-bot internal usage

@smuzaffar
Copy link
Contributor

All of these has been fixed, I do not see any ref to any of these in latest UBSAN . TrackingTools runtime errors are fixed by #46380 . I think we can close this issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests