[DBG_X] RelVal 140.58 Step 2: Corrupted data in SiStripNoises::getNoise #42162

Closed
iarspider opened this issue Jun 30, 2023 · 19 comments · Fixed by #42486

Comments

@iarspider
Contributor

The following exceptions are reported:

REPACK:DigiToApproxClusterRaw,ENDJOB
We have determined that this is simulation (if not, rerun cmsDriver.py with --data)
entry filelist:step1_dasquery.log
found files:  ['/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/0E2CC5D5-9D87-7348-9219-B00CD718C847.root', '/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/45001EBC-B4D4-9043-A276-8F3AF621C64A.root', '/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/7B3F72ED-E183-3F4B-9FE4-DAE6D911403E.root', '/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/853DBE29-53BA-9A44-9FDD-58E4E9064EB1.root']
Step: REPACK Spec: ['DigiToApproxClusterRaw']
Step: ENDJOB Spec: 
customising the process with customisePostEra_Run2_2018_pp_on_AA from Configuration/DataProcessing/RecoTLR
customising the process with customiseWithTimeMemorySummary from Validation/Performance/TimeMemorySummary
Starting python2 /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000  cmsRun -j JobReport2.xml  step2_REPACK.py
%MSG-i ThreadStreamSetup:  (NoModuleName) 30-Jun-2023 08:32:40 CEST pre-events
setting # threads 4
setting # streams 4
%MSG
30-Jun-2023 08:33:02 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/0E2CC5D5-9D87-7348-9219-B00CD718C847.root
30-Jun-2023 08:33:05 CEST  Successfully opened file root://eoscms.cern.ch//eos/cms/store/user/cmsbuild/store/hidata/HIRun2018A/HIHardProbes/RAW/v1/000/326/479/00000/0E2CC5D5-9D87-7348-9219-B00CD718C847.root
Begin processing the 1st record. Run 326479, Event 1394020, LumiSection 7 on stream 3 at 30-Jun-2023 08:33:13.279 CEST
Begin processing the 2nd record. Run 326479, Event 1579493, LumiSection 7 on stream 2 at 30-Jun-2023 08:33:13.279 CEST
Begin processing the 3rd record. Run 326479, Event 1402087, LumiSection 7 on stream 0 at 30-Jun-2023 08:33:13.279 CEST
Begin processing the 4th record. Run 326479, Event 1328354, LumiSection 7 on stream 1 at 30-Jun-2023 08:33:13.279 CEST
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisHLT  30-Jun-2023 08:33:13 CEST Run: 326479 Event: 1402087
NULL pointer to FEDRawData for FED: id 434
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
%MSG
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisHLT  30-Jun-2023 08:33:13 CEST Run: 326479 Event: 1579493
NULL pointer to FEDRawData for FED: id 434
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
%MSG
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisHLT  30-Jun-2023 08:33:13 CEST Run: 326479 Event: 1328354
NULL pointer to FEDRawData for FED: id 434
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
%MSG
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisHLT  30-Jun-2023 08:33:13 CEST Run: 326479 Event: 1394020
NULL pointer to FEDRawData for FED: id 434
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
%MSG
----- Begin Fatal Exception 30-Jun-2023 08:33:33 CEST-----------------------
An exception of category 'CorruptedData' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1328354 stream: 1
   [1] Running path 'REPACKRAWoutput_step'
   [2] Prefetching for module PoolOutputModule/'REPACKRAWoutput'
   [3] Calling method for module SiStripClusters2ApproxClusters/'hltSiStripClusters2ApproxClusters'
Exception Message:
[SiStripNoises::getNoise] looking for SiStripNoises for a strip out of range: strip 768
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 30-Jun-2023 08:33:33 CEST-----------------------
An exception of category 'CorruptedData' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1402087 stream: 0
   [1] Running path 'REPACKRAWoutput_step'
   [2] Prefetching for module PoolOutputModule/'REPACKRAWoutput'
   [3] Calling method for module SiStripClusters2ApproxClusters/'hltSiStripClusters2ApproxClusters'
Exception Message:
[SiStripNoises::getNoise] looking for SiStripNoises for a strip out of range: strip 768
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 30-Jun-2023 08:33:33 CEST-----------------------
An exception of category 'CorruptedData' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1579493 stream: 2
   [1] Running path 'REPACKRAWoutput_step'
   [2] Prefetching for module PoolOutputModule/'REPACKRAWoutput'
   [3] Calling method for module SiStripClusters2ApproxClusters/'hltSiStripClusters2ApproxClusters'
Exception Message:
[SiStripNoises::getNoise] looking for SiStripNoises for a strip out of range: strip 768
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 30-Jun-2023 08:33:33 CEST-----------------------
An exception of category 'CorruptedData' occurred while
   [0] Processing  Event run: 326479 lumi: 7 event: 1394020 stream: 3
   [1] Running path 'REPACKRAWoutput_step'
   [2] Prefetching for module PoolOutputModule/'REPACKRAWoutput'
   [3] Calling method for module SiStripClusters2ApproxClusters/'hltSiStripClusters2ApproxClusters'
Exception Message:
[SiStripNoises::getNoise] looking for SiStripNoises for a strip out of range: strip 768

There are also warnings about NULL pointers to FEDRawData; I am not sure whether they are related or harmless:

%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisHLT  30-Jun-2023 08:33:13 CEST Run: 326479 Event: 1402087
NULL pointer to FEDRawData for FED: id 434
Note: further warnings of this type will be suppressed (this can be changed by enabling debugging printout)
(...)
%MSG-w SiStripRawToDigi:  SiStripRawToDigiModule:siStripDigisHLT@endStream  30-Jun-2023 08:33:33 CEST PostEndProcessBlock
[sistrip::RawToDigiUnpacker::createDigis] warnings:
NULL pointer to FEDRawData for FED (1)
@iarspider
Contributor Author

assign reconstruction

@cmsbuild
Contributor

New categories assigned: reconstruction

@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

A new Issue was created by @iarspider .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@mmusich
Contributor

mmusich commented Jun 30, 2023

Both this issue and its companion #42131 arise because a condition (SiStripNoises) is read outside its domain by this call:

cut_ = std::min<float>(seedCutMIPs * mip, seedCutSN * noiseObj_->getNoise(firstStrip + 1, noises_));

when it looks up the noise of strip 768 for DetId 369120277.
I suppose this was triggered by #41815 (tagging also @Ksavva1021).
I am not sure why the payload stored in the tag SiStripNoise_v2_prompt for run 326417 (which is the run used for wf 140.58) is missing this particular strip.

@aandvalenzuela
Contributor

Hello! There are three new workflows failing with the same issue for CMSSW_13_2 2023-07-13-2300 (latest DBG IB). I link the logs here: 159.03, 160.02 and 160.03.

@Ksavva1021
Contributor

Ksavva1021 commented Jul 18, 2023

There is a workaround for this (simply not using that strip), but the question remains why the payload stored in the tag SiStripNoise_v2_prompt for run 326417 (which is the run used for wf 140.58) is missing this particular strip, as @mmusich mentioned.

@makortel
Contributor

assign alca

FYI @cms-sw/trk-dpg-l2

@cmsbuild
Contributor

New categories assigned: alca

@perrotta,@consuegs,@francescobrivio,@saumyaphor4252,@tvami you have been requested to review this Pull request/Issue and eventually sign? Thanks

@tvami
Contributor

tvami commented Aug 1, 2023

I don't know how one can figure out why that strip is missing, since this is a story from 2018 (Nov 9, 2018)... Searching my Gmail for "326417", I find that the payload comes from an O2O from some HI running. The message from Raffaele says "after the first HI fills of yesterday, we saw, as expected from the online calibrations loaded one week ago, a reduction in the S/N particularly visible in TID and TECs. With this o2o, a new set of pedestal and noise values measured yesterday are deployed offline. We can profit of the next cosmics/collisions to quickly validate them. In case they will create troubles, a roll back is possible in 10 minutes. As far as i can tell, no large differences observed in pedestals, reduction of noise in TEC/TID and inner TIB layers of about 5%." which is followed by an unanswered question from Andrea Venturi saying "I am wondering if there is a way to produce S/N distributions selecting only (on-track) clusters in APVs that were zero-suppressed in the standard way. Just to be sure that the hybrid ZS and the clusterizer do not introduce a reduction of S and, therefore, of S/N."

--> So was that strip zero-suppressed (ZS) differently from the rest, and somehow didn't enter the payload? That's all I can offer.

Actually here is the tracker map of the noise in the relevant payload: https://cern.ch/u3ouy

I also pinged the strips conveners.

@mdelcourt

I checked a recent noise payload, and it seems that this strip has never existed: for this DetId (and similar modules), the strip range is 0 to 767.

@mmusich
Contributor

mmusich commented Aug 3, 2023

One would presume that protecting this call:

with the output of this filter:

theFilter->getSizes(detId, cluster, lp, ldir, hitStrips, hitPredPos);

would prevent reading off domain, but apparently this condition:

bool usable = (fs >= 1) & (fs + meas <= ns + 1);

is not restrictive enough.
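To make the edge case concrete, here is a minimal sketch rather than the CMSSW code itself. It assumes that `fs` is the 0-based first strip of the cluster, `meas` is the cluster width in strips, and `ns` is the number of strips on the module; the helper names `usable`, `noiseLookupStrip` and `inPayload` are hypothetical. Under those assumptions, a one-strip cluster on the last strip of a 768-strip module passes the condition, while the `firstStrip + 1` lookup falls outside the payload.

```cpp
// Hypothetical stand-in for the usability condition quoted above.
// Assumed meaning: fs = first strip (0-based), meas = cluster width,
// ns = number of strips on the module (valid indices 0 .. ns - 1).
constexpr bool usable(int fs, int meas, int ns) {
  return (fs >= 1) && (fs + meas <= ns + 1);
}

// Strip whose noise the seed cut reads: the one after the first strip.
constexpr int noiseLookupStrip(int fs) { return fs + 1; }

// True when the strip index lies inside the module's noise payload.
constexpr bool inPayload(int strip, int ns) { return strip >= 0 && strip < ns; }

constexpr int kStrips = 768;  // strip range 0..767, as reported in this thread

// A single-strip cluster on the last strip is accepted by the condition...
static_assert(usable(767, 1, kStrips), "edge cluster passes the filter");
// ...but the noise lookup then targets strip 768, which is outside the payload.
static_assert(!inPayload(noiseLookupStrip(767), kStrips), "lookup is off domain");
```

If the assumptions hold, the condition would need to be tightened (or the lookup guarded) for clusters that start on the last strip.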

@makortel
Contributor

makortel commented Aug 3, 2023

Given that the access in

cut_ = std::min<float>(seedCutMIPs * mip, seedCutSN * noiseObj_->getNoise(firstStrip + 1, noises_));

is with firstStrip + 1 == 768, and firstStrip == 767 seems to be a valid value for a strip (i.e. assuming on the caller side

observing the last strip of a module is ok), could it be the PeakFinderTest does not properly handle the edge case?

Actually, why does the PeakFinderTest inspect the noise of the strip next to the first strip of the cluster? (as opposed to e.g. the first strip, or "center strip" of the cluster)

@mmusich
Contributor

mmusich commented Aug 3, 2023

could it be the PeakFinderTest does not properly handle the edge case?

I think that's a possibility, but I am wondering why we observe this now (very similar code was already used in cmssw/RecoTracker/PixelLowPtUtilities/src/StripSubClusterShapeTrajectoryFilter.cc, which I think has been used all along for the strip-seeded iterations).

Actually, why does the PeakFinderTest inspect the noise of the strip next to the first strip of the cluster? (as opposed to e.g. the first strip, or "center strip" of the cluster)

That's also the question I asked myself, but I could not find an answer. I suspect all of this code dates from Run 1.

@slava77
Contributor

slava77 commented Aug 3, 2023

Do I understand correctly that this is present only in DBG_X ? (not in other builds)

@mmusich
Contributor

mmusich commented Aug 3, 2023

Do I understand correctly that this is present only in DBG_X ? (not in other builds)

A similar issue happens in ASAN_X: #42131

@makortel
Contributor

makortel commented Aug 3, 2023

Do I understand correctly that this is present only in DBG_X ? (not in other builds)

The exception occurs only in DBG_X builds because the verify() function (which does the check and throws the exception) is called only when EDM_ML_DEBUG is defined:

static float getNoise(uint16_t strip, const Range& range) {
#ifdef EDM_ML_DEBUG
  verify(strip, range);
#endif
  return getNoiseFast(strip, range);
}

To me it looks like the issue is there in all builds, but doesn't show any (technical) sign except in DBG_X and ASAN_X.
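The effect of a compile-time-only check can be illustrated with a small, self-contained sketch. Everything in it is an assumption for illustration: `NoiseRange`, `verifyStrip` and `getNoiseSketch` are hypothetical names, the payload is modelled as a plain vector with one value per strip, and a template flag stands in for the EDM_ML_DEBUG macro. Where the real unchecked read would be undefined behaviour, the sketch returns a sentinel so that it stays safe to run.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a per-module noise payload: one value per strip.
using NoiseRange = std::vector<float>;

// Throws when the strip index is outside the payload (the role verify() plays).
inline void verifyStrip(std::uint16_t strip, const NoiseRange& range) {
  if (static_cast<std::size_t>(strip) >= range.size()) {
    throw std::out_of_range("strip out of range: " + std::to_string(strip));
  }
}

// Debug is fixed at compile time, standing in for the EDM_ML_DEBUG macro.
template <bool Debug>
float getNoiseSketch(std::uint16_t strip, const NoiseRange& range) {
  if (Debug) {
    verifyStrip(strip, range);  // only debug builds detect the bad index
  }
  // An unchecked read past the end would be undefined behaviour; the sketch
  // returns a sentinel instead so the silent failure mode is visible.
  if (static_cast<std::size_t>(strip) >= range.size()) {
    return -1.0f;
  }
  return range[strip];
}
```

With a 768-entry payload, `getNoiseSketch<true>(768, noises)` throws, whereas `getNoiseSketch<false>(768, noises)` returns without any error being raised, which mirrors why only some builds report the problem.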

@mmusich
Contributor

mmusich commented Aug 7, 2023

type trk

@mmusich
Contributor

mmusich commented Aug 7, 2023

but I am wondering why we observe this now (very similar code was already used in cmssw/RecoTracker/PixelLowPtUtilities/src/StripSubClusterShapeTrajectoryFilter.cc, which I think has been used all along for the strip-seeded iterations).

Answering my own question: this doesn't happen there because of this check:

if (std::abs(hitPredPos) < 1.5f && hitStrips <= 2) {
  return true;
}

I propose the same kind of protection in #42486.
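As a rough illustration of how such an early return keeps narrow clusters away from the noise lookup, here is a hedged sketch. It is not the content of #42486: the `Decision` enum and `classify` function are hypothetical, and the variable names `hitPredPos` and `hitStrips` are simply copied from the quoted check without assuming more about their meaning than the check itself implies.

```cpp
#include <cmath>

// Hypothetical outcome of the pre-check: either the hit is accepted right
// away, or it proceeds to the peak-finder test that reads the strip noise.
enum class Decision { AcceptedEarly, NeedsPeakFinder };

// Mirrors the quoted condition: narrow clusters with a small predicted
// position are accepted before any noise lookup takes place.
inline Decision classify(float hitPredPos, int hitStrips) {
  if (std::abs(hitPredPos) < 1.5f && hitStrips <= 2) {
    return Decision::AcceptedEarly;
  }
  return Decision::NeedsPeakFinder;
}
```

Under this sketch, a one-strip cluster with a small predicted position never reaches the code path that asks for the noise of `firstStrip + 1`.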

@tvami
Contributor

tvami commented Aug 9, 2023
