Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UBSAN] Undefined behavior in DataFormats/* reco packages #35033

Closed
mrodozov opened this issue Aug 26, 2021 · 28 comments
Closed

[UBSAN] Undefined behavior in DataFormats/* reco packages #35033

mrodozov opened this issue Aug 26, 2021 · 28 comments

Comments

@mrodozov
Copy link
Contributor

The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:

1302.17 step3
DataFormats/BTauReco/interface/TemplatedSecondaryVertexTagInfo.h:38:21: runtime error: load of value 4, which is not a valid value for type 'Status'

159.3 step3
DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

136.885 step3
DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:17:7: runtime error: load of value 72, which is not a valid value for type 'bool'

38634.0 step3
DataFormats/TrackReco/src/HitPattern.cc:102:42: runtime error: left shift of negative value -1

138.2 step2
DataFormats/TrajectorySeed/interface/TrajectorySeed.h:44:3: runtime error: load of value 10967, which is not a valid value for type 'PropagationDirection'

check the relval logs in here for the examples:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

@mrodozov
Copy link
Contributor Author

assign reconstruction

@cmsbuild
Copy link
Contributor

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Copy link
Contributor

slava77 commented Aug 27, 2021

checklist

The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:

1302.17 step3
DataFormats/BTauReco/interface/TemplatedSecondaryVertexTagInfo.h:38:21: runtime error: load of value 4, which is not a valid value for type 'Status'

     enum Status { trackSelected = 0, trackUsedForVertexFit, trackAssociatedToVertex };

      inline bool associatedToVertex(unsigned int index) const {
        return (int)svStatus == (int)trackAssociatedToVertex + (int)index;//<== HERE
      }

      Status svStatus;

the warning points to a use case; I guess that the store may have happened here

trackData[index].second.svStatus =
(btag::TrackData::Status)((unsigned int)btag::TrackData::trackAssociatedToVertex + idx);

and this conflicts with the 3-value enum type
@emilbols

159.3 step3
DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate {
I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet)
@cms-sw/hcal-dpg-l2

136.885 step3
DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:17:7: runtime error: load of value 72, which is not a valid value for type 'bool'

this points to class CTPPSPixelRecHit {, which is not very helpful.

@cms-sw/ctpps-dpg-l2

38634.0 step3
DataFormats/TrackReco/src/HitPattern.cc:102:42: runtime error: left shift of negative value -1

this is apparently for

        case MuonSubdetId::GEM: {
          GEMDetId gemid(id.rawId());
          layer = ((gemid.station() - 1) << 2); // <== HERE
          layer |= abs(gemid.layer() - 1);

the workflow is D86 D86 = T24+C17+M10+I14+O8+F6 with an era Phase2C11I13M9
@jshlee @watson-ij please remind if the gemid.station() here going to be above 0? If it stays at zero, we need a fix for this new ME0 in HitPattern.cc

138.2 step2
DataFormats/TrajectorySeed/interface/TrajectorySeed.h:44:3: runtime error: load of value 10967, which is not a valid value for type 'PropagationDirection'

this points to TrajectorySeed(TrajectorySeed const& o) = default;
I guess the original is a default-constructed TrajectorySeed, which has TrajectorySeed() {}
In case we decide to use uninitialized , what is the way to make UBSAN to ignore this case?
@VinInn I recall that you were in favor of not initializing data in some cases; is it one of these cases?
@mtosi @vmariani @mmusich

check the relval logs in here for the examples:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

@mrodozov

  • I see that this is for CMSSW_12_1_UBSAN_X_2021-08-20-2300: how long will this IB stay in place? It may be hard to debug if this goes away this weekend.
  • is it possible to get a longer stack trace to the calling code?
  • a few cases seem to be related to random behavior, having .root files here may help to reduce at least the corresponding cmsRun input randomness.

@jshlee
Copy link
Contributor

jshlee commented Aug 27, 2021

@slava77 for ME0 station we would like to keep it as 0.

@mrodozov
Copy link
Contributor Author

@slava77 until saturday/sunday (don't remember in which day we publish the next week)
it's in this week:
/cvmfs/cms-ib.cern.ch/nweek-02694/slc7_amd64_gcc10/cms/cmssw
which will be substituted at the end of the current week by nweek-02696
We can install the same IB elsewhere more permanently, but the root files we don't keep, they are gone once the IB is built
Would installing the same IB on more permanent place be sufficient ?

@VinInn I recall that you were in favor of not initializing data in some cases; is it one of these cases

Here is one case on insisting on lazy init to keep a struct as а POD
#33742

@mseidel42
Copy link
Contributor

159.3 step3

DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate {
I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet)
@cms-sw/hcal-dpg-l2

@abdoulline Do you have an idea where this happens? Should we initialize this in the CaloTower constructor?

@fabferro
Copy link
Contributor

All this looks weird to me. As to CTPPSPixelRecHits, I've never seen this error before. Any reason why these errors are happening only in this IB?
BTW, I tried to reproduce the error running the matrix for 136.885 but I just get
OSError: [Errno 28] No space left on device: '/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-08-20-2300/python/PhysicsTools/PatAlgos/slimming/isolatedTracks_cfi.py'

@mrodozov
Copy link
Contributor Author

mrodozov commented Aug 27, 2021

This IB turns the Undefined behavior sanitizer (-fsanitize=undefined flag), which searches for such errors:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

OSError: [Errno 28] No space left on device:

could you share what command you are using to run the relval and on what machine ?

@fabferro
Copy link
Contributor

This IB turns the Undefined behavior sanitizer (-fsanitize=undefined flag), which searches for such errors:
https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

OSError: [Errno 28] No space left on device:

could you share what command you are using to run the relval and on what machine ?

lxplus. Just (naively) installing cmsrel CMSSW_12_1_UBSAN_X_2021-08-20-2300 and then running runTheMatrix.py -l 136.885

@mrodozov
Copy link
Contributor Author

mrodozov commented Aug 27, 2021

would you try from lxplus, log onto cmsdev* machine and retry

ssh cmsdev25
mkdir -p /build/$USER ; cd /build/$USER
export SCRAM_ARCH=slc7_amd64_gcc10;
scram p CMSSW_12_1_UBSAN_X_2021-08-20-2300
cd CMSSW_12_1_UBSAN_X_2021-08-20-2300 ; cmsenv ;
runTheMatrix.py -l 136.885 -i all --ibeos

@fabferro
Copy link
Contributor

It took ages but I could run 136.885, confirming the problem for CTPPSPixelRecHits. With @jan-kaspar we worked out a possible solution:
removing line https://github.com/cms-sw/cmssw/blob/master/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L19
and adding default values here
https://github.com/cms-sw/cmssw/blob/master/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L21-L22

@slava77 Shall I submit a PR?

@mrodozov
Copy link
Contributor Author

Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures

@slava77
Copy link
Contributor

slava77 commented Aug 27, 2021

@slava77 Shall I submit a PR?

@fabferro
yes, please.
I think that for PPS hit multiplicities there is no need to worry about the cost of initializing a hit data if it will not be used.

@slava77
Copy link
Contributor

slava77 commented Aug 27, 2021

Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures

That's expected for most cases in this and other reco-related UBSAN issues: the symptoms are always for uninitialized values (which are also apparently not used at least in a few cases, beyond a load). The uninit values are often random (and not repeatable).

@fabferro
Copy link
Contributor

fabferro commented Aug 27, 2021

Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures

No. I ran it once before changing the code and the error was there, I ran once after the changes and the error disappeared.
Actually, for some reason, it took hours to run 136.885, so it would have been impossible to repeate it several times...

@abdoulline
Copy link

abdoulline commented Aug 28, 2021

159.3 step3

DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate {
I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet)
@cms-sw/hcal-dpg-l2

@abdoulline Do you have an idea where this happens? Should we initialize this in the CaloTower constructor?

@mseidel42
I don't immediately see a reason of the issue, which might have been present there since > 10 years.
It requires some investigation (via calotowermaker), which is beyond of what I can do at the moment from vacation.
Hopefully I'll have a possibility to do it some time next week, still being on vacation.

NB: there seems to be yet another issue in CaloTower (along with the aforementioned 'HcalSubdetector' one):
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/159.3_ZMM_14_HI_2021+ZMM_14_HI_2021+DIGIHI2021PPRECO+RECOHI2021PPRECO+HARVESTHI2021PPRECO/step3_ZMM_14_HI_2021+ZMM_14_HI_2021+DIGIHI2021PPRECO+RECOHI2021PPRECO+HARVESTHI2021PPRECO.log

-> CaloTower.h:26:7: runtime error: load of value 32, which is not a valid value for type 'bool'

@mrodozov
Copy link
Contributor Author

mrodozov commented Aug 28, 2021

present there since > 10 years

yes this issues were existing for long time that's why cleaning them out will narrow consequent bugs that pop up for no obvious reasons. although for all bugs we know we have some idea non of which is related to this reports (afaik)
in short now that we know the bugs are there lets do something

@slava77
Copy link
Contributor

slava77 commented Sep 7, 2021

@slava77 for ME0 station we would like to keep it as 0.

@jshlee
please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1 is not used in this case.

@slava77
Copy link
Contributor

slava77 commented Sep 7, 2021

@emilbols
@vmariani @mmusich

this is a kind ping on the pending issues as detailed in #35033 (comment)

@jshlee
Copy link
Contributor

jshlee commented Sep 7, 2021

@slava77 for ME0 station we would like to keep it as 0.

@jshlee
please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1 is not used in this case.

hi @slava77, where is the code implemented?

@slava77
Copy link
Contributor

slava77 commented Sep 7, 2021

@slava77 for ME0 station we would like to keep it as 0.

@jshlee
please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1 is not used in this case.

hi @slava77, where is the code implemented?

case MuonSubdetId::GEM: {
GEMDetId gemid(id.rawId());
layer = ((gemid.station() - 1) << 2);
layer |= abs(gemid.layer() - 1);
} break;

@emilbols
Copy link
Contributor

emilbols commented Sep 8, 2021

@slava77 @mrodozov So if i understand correctly the error in 1302.17 appears because svStatus is set to a value beyond the enumeration range? Obviously this code was written a long time ago, but as far as i can see, this is the intended behavior. An svStatus=2 indicates a match to the vertex with index 0, svStatus=3 is a match with vertex with index 1, etc. . If we dont want to use enum objects like this, i guess the solution could be to change this object to an integer.

I tried to run this on cmsdev25 following the recipe above with workflow 1302.17 and switching to CMSSW_12_1_UBSAN_X_2021-09-03-2300, but i dont get the error. If i understand the error correctly, it is most likely because it only appears when there are more than 1 SV in the event which the track is being associated to, so there is probably some randomness in whether you get it

@slava77
Copy link
Contributor

slava77 commented Sep 8, 2021

@slava77 @mrodozov So if i understand correctly the error in 1302.17 appears because svStatus is set to a value beyond the enumeration range? Obviously this code was written a long time ago, but as far as i can see, this is the intended behavior. An svStatus=2 indicates a match to the vertex with index 0, svStatus=3 is a match with vertex with index 1, etc. . If we dont want to use enum objects like this, i guess the solution could be to change this object to an integer.

@emilbols
thank you for checking the code and clarifying its intents in using the variable. I think that the design span of values should match the type. So, if a change to an int is needed, let's update it. It was not obvious to me in this case, but perhaps this case also implies that a new variable is needed instead instead of an "easy" overload of what was originally designed to be an enum.

As for reproducibility, you may want to enable more threads (the reference job used 4 threads) and add more events.

@slava77
Copy link
Contributor

slava77 commented Oct 26, 2021

+reconstruction

the summary was updated at
#35033 (comment)

@smuzaffar
Copy link
Contributor

all of these ubsan runtime errors are fixed

@cmsbuild
Copy link
Contributor

cms-bot internal usage

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants