Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[UBSAN] Undefined behavior in DataFormats/* simulation packages #35034

Closed
mrodozov opened this issue Aug 26, 2021 · 17 comments
Closed

[UBSAN] Undefined behavior in DataFormats/* simulation packages #35034

mrodozov opened this issue Aug 26, 2021 · 17 comments

Comments

@mrodozov
Copy link
Contributor

The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:

11603.0 step2 
DataFormats/SiPixelDigi/interface/PixelDigi.h:75:61: runtime error: left shift of negative value -1

136.882 step3
DataFormats/RPCDigi/interface/RecordBX.h:21:22: runtime error: left shift of negative value -1

1302.17 step1
DataFormats/HcalDetId/src/HcalCastorDetId.cc:11:49: runtime error: left shift of negative value -1

136.821 step3
DataFormats/Math/interface/approx_exp.h:162:32: runtime error: signed integer overflow: 2147483647 + 127 cannot be represented in type 'int'

11603.0 step2
DataFormats/Math/interface/libminifloat.h:74:78: runtime error: shift exponent -1 is negative

check the relval logs in here for the examples:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

@mrodozov
Copy link
Contributor Author

assign simulation

@cmsbuild
Copy link
Contributor

New categories assigned: simulation

@civanch,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@ferencek
Copy link
Contributor

@mrodozov what are the exact steps to reproduce the issues. Just to run specified workflows inside an UBSAN IB? Here I am primarily interested in the issue with PixelDigi.h

@mrodozov
Copy link
Contributor Author

yes, just run the 11603.0 step 2 with the UBSAN IB, set SCRAM_ARCH=slc7_amd64_gcc10 , we use gcc10 for UBSAN
setup the UBSAN IB, you may checkout DataFormats/SiPixelDigi if you want and then run the
runTheMatrix -i all -l 11603.0 --ibeos -t4

@civanch
Copy link
Contributor

civanch commented Sep 13, 2021

@ferencek , have you any idea what happens?

@ferencek
Copy link
Contributor

@civanch, I plan to have a look this week.

@ferencek
Copy link
Contributor

ferencek commented Sep 15, 2021

@mrodozov, @civanch, I had a look at this but I am afraid this issue exceeds my level of expertise with C++. About two weeks ago I just had a quick look at the code and was quite confused because the problematic variable in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75, PixelChannelIdentifier::thePacking.column_width, should be properly initialized. This time I followed the following recipe

voms-proxy-init -rfc -voms cms

export SCRAM_ARCH=slc7_amd64_gcc10
cmsrel CMSSW_12_1_UBSAN_X_2021-09-13-1100
cd CMSSW_12_1_UBSAN_X_2021-09-13-1100/src
cmsenv
git cms-addpkg DataFormats/SiPixelDigi
git cms-addpkg DataFormats/SiPixelDetId
git cms-checkdeps -a -A
scram b -j12

runTheMatrix.py -i all -l 11603.0 --ibeos -t4 --maxSteps=2 -j 0

and ran step2 with /store/relval/CMSSW_12_0_0_pre4/RelValSingleElectronPt1000/GEN-SIM/120X_mcRun3_2021_realistic_v2-v1/00000/c3b64309-ec4a-4c5f-b7a6-c97ef7b031fa.root as input.

I was able to reproduce the problem and for some reason it occurs for the first time in the 4th event. So for the subsequent tests I modified the cfg file to skip three and process only one event. The error was still there. I added a cout statement to the pixelToChannel(int row, int col) method in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75 to debug the issue

  static int pixelToChannel(int row, int col) { 
      std::cout << ">> column width: " << PixelChannelIdentifier::thePacking.column_width << std::endl;
      return (row << PixelChannelIdentifier::thePacking.column_width) | col;
  }

but the values printed out were always 10 as they should be. Here is an interesting portion of the printout

>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/DataFormats/SiPixelDigi/interface/PixelDigi.h:78:19: runtime error: left shift of negative value -1
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10

I even modified the data members in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDetId/interface/PixelChannelIdentifier.h#L33-L35 from int to unsigned int but the error was still there.

Interestingly enough, when I set the cfg file to skip the first two events, I noticed the following error which otherwise does not appear

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/a1dd9deda35a37ab828b89609a57944f/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 2446543800, which is not a valid value for type 'SubDetector'

Any suggestions on what to check next would be very welcome.

@ferencek
Copy link
Contributor

Just a quick follow-up on the issue with PixelCPEBase.h I mentioned above. The first time I observed it was before I locally compiled all dependent packages (git cms-checkdeps -a -A). After compiling locally, I checked the issue two more times and each time the reported value was different as you can see here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905974791, which is not a valid value for type 'SubDetector'

and here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905975319, which is not a valid value for type 'SubDetector'

So there seems to be some randomness in the loaded value.

@smuzaffar
Copy link
Contributor

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/DataFormats/SiPixelDigi/interface/PixelDigi.h:78:19: runtime error: left shift of negative value -1

@ferencek , I think problem is row being -1 here https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75

@ferencek
Copy link
Contributor

@smuzaffar, good point, thanks. I was interpreting the error message incorrectly. OK, this will be helpful for further debugging.

@ferencek
Copy link
Contributor

OK, so here is the culprit https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/SimTracker/SiPixelDigitizer/plugins/SiPixelChargeReweightingAlgorithm.cc#L234 Now I need to figure out why the row coordinate gets evaluated to -1.

@ferencek
Copy link
Contributor

Just a quick follow-up on the issue with PixelCPEBase.h I mentioned above. The first time I observed it was before I locally compiled all dependent packages (git cms-checkdeps -a -A). After compiling locally, I checked the issue two more times and each time the reported value was different as you can see here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905974791, which is not a valid value for type 'SubDetector'

and here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905975319, which is not a valid value for type 'SubDetector'

So there seems to be some randomness in the loaded value.

The problem with the PixelCPEBase has already been reported in #35036

ferencek added a commit to CMSTrackerDPG/cmssw that referenced this issue Sep 18, 2021
Restricted the loop over the row and column coordinates to prevent it
form going out of physical bounds. This addresses the following issue

DataFormats/SiPixelDigi/interface/PixelDigi.h:75:61: runtime error: left shift of negative value -1

reported in cms-sw#35034
@civanch
Copy link
Contributor

civanch commented Dec 5, 2021

+1

the issue was fixed in #35337

@smuzaffar
Copy link
Contributor

closing this, some of these are fixed and we have newer issues open to track the rest

@cmsbuild
Copy link
Contributor

cms-bot internal usage

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants