HLT crash in run 359998: Unavailable Conditions of type HcalChannelQuality #39693

Closed
trtomei opened this issue Oct 11, 2022 · 31 comments

Comments

@trtomei
Contributor

trtomei commented Oct 11, 2022

Crash in Run 359998
http://cmsonline.cern.ch/cms-elog/1159020

with the following message:

[2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Unavailable Conditions of type HcalChannelQuality for cell (0x0) 

Unfortunately not reproducible yet. The file reconverted to ROOT is
/nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/run359998_ls0335.root,
and the relevant configurations are:

  • CMSSW_12_4_9
  • GT: 124X_dataRun3_HLT_v4
  • /cdaq/physics/Run2022/2e34/v1.4.0/HLT/V10

A copy of the configuration file is available in /nfshome0/hltpro/hilton_c2e36_35_04/hltpro/thiagoScratch/hlt.py

@cmsbuild
Contributor

A new Issue was created by @trtomei Thiago Tomei.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@francescobrivio
Contributor

francescobrivio commented Oct 11, 2022

From the AlCa point of view I can confirm that the GT 124X_dataRun3_HLT_v4 has been online for a while now and the tag HcalChannelQuality_v2.0_hlt was last modified on 2022-08-03, so I don't see any clear reason for this failure.
Maybe someone from @cms-sw/hcal-dpg-l2 can comment on the cell (0x0)?

@francescobrivio
Contributor

assign hcal-dpg

@missirol
Contributor

FYI: @cms-sw/hlt-l2 @silviodonato

@cmsbuild
Contributor

New categories assigned: hcal-dpg

@wang-hui,@georgia14,@igv4321 you have been requested to review this Pull request/Issue and eventually sign? Thanks

@missirol
Contributor

For the record, this online crash happened more than once (and it does not seem to be reproducible offline). Affected runs (afaik):

357898
359998

@missirol
Contributor

missirol commented Oct 11, 2022

@trtomei , please update the title of the issue with something like "HLT crash in run 359998: ...".

@trtomei trtomei changed the title from "Unavailable Conditions of type HcalChannelQuality" to "HLT crash in run 359998: Unavailable Conditions of type HcalChannelQuality" Oct 11, 2022

@wang-hui
Contributor

Hi @trtomei, could you please copy the config file to lxplus so that we (HCAL DPG) can try to reproduce the crash?

@trtomei
Contributor Author

trtomei commented Oct 13, 2022

Hi @wang-hui The files are available in /afs/cern.ch/user/t/tomei/public/issue39693 now!

@silviodonato
Contributor

silviodonato commented Oct 13, 2022

Hello @cms-sw/hcal-dpg-l2 @cms-sw/alca-l2, this crash happened again last night in run 360295. HLT was using CMSSW_12_4_10.
The crash happened 5 times (2022-10-13):

  • 1 time at 09:16:07 (fu-c2b03-33-01)
  • 1 time at 08:27:12 (fu-c2b03-12-01)
  • 3 times around 05:40
    • 05:39:07 (fu-c2b03-33-01)
    • 05:41:26 (fu-c2b03-30-01)
    • 05:44:57 (fu-c2b02-03-01)

f3mon_logtable_2022-10-13T07_53_34.976Z.txt

List of runs with the crashes:

357898
359998
360295

@silviodonato
Contributor

In Run 360330

   [2] Calling method for module CaloTowersCreator/'hltStoppedHSCPTowerMakerForAll'
Exception Message:
Requested conditions of type HcalChannelQuality for cell (0x45104408) (HE -17,8,1) got conditions for cell (0x0)
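
For readers who want to cross-check the raw identifiers in these messages, here is a minimal sketch of how a packed cell id could be unpacked. The bit layout is an assumption (det in bits 28-31, subdet in bits 25-27, depth in bits 20-23, z-side flag in bit 19, |ieta| in bits 10-18, iphi in bits 0-9, with subdet 2 meaning HE). It is not taken from this thread, and `DecodedHcalId` and `decodeHcalId` are illustrative names rather than CMSSW API. Under that assumption the decoded fields agree with the (HE -17,8,1) printed above, while the id 0x0 decodes to no valid detector at all.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative struct, not a CMSSW type.
struct DecodedHcalId {
  int det;     // detector field (4 is assumed to mean HCAL)
  int subdet;  // assumed enumeration: 1 = HB, 2 = HE, 3 = HO, 4 = HF
  int ieta;    // signed eta index
  int iphi;    // phi index
  int depth;   // depth segment
};

// Unpack a raw id under the ASSUMED bit layout described in the text above.
constexpr DecodedHcalId decodeHcalId(std::uint32_t rawId) {
  const int absEta = static_cast<int>((rawId >> 10) & 0x1FFu);
  const bool positiveZ = (rawId & 0x80000u) != 0;
  return DecodedHcalId{static_cast<int>((rawId >> 28) & 0xFu),
                       static_cast<int>((rawId >> 25) & 0x7u),
                       positiveZ ? absEta : -absEta,
                       static_cast<int>(rawId & 0x3FFu),
                       static_cast<int>((rawId >> 20) & 0xFu)};
}

// Cross-check against the id and coordinates printed in the exception message.
inline void checkLoggedId() {
  const DecodedHcalId id = decodeHcalId(0x45104408u);
  assert(id.det == 4 && id.subdet == 2);
  assert(id.ieta == -17 && id.iphi == 8 && id.depth == 1);
}
```

The null id 0x0 has a det field of 0 under this layout, which fits the picture of a conditions lookup being made for a cell that was never filled in.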

@mariadalfonso
Contributor

mariadalfonso commented Oct 14, 2022

> Hi @wang-hui The files are available in /afs/cern.ch/user/t/tomei/public/issue39693 now!

This particular event was investigated by @wang-hui.
The offline CPU code gives the following warning:

[1] %MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  13-Oct-2022 21:36:29 CEST Run: 359998 Event: 538274608
 bad SOI/maxTS in cell (HB 10,47,3)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1

see here https://github.com/cms-sw/cmssw/blob/master/RecoLocalCalo/HcalRecProducers/src/HBHEPhase1Reconstructor.cc#L520

There is a shift in the SOI. I do not see this condition in the hlt-gpu code, which is why it crashes on this event.
I will implement a fix today so that HLT will not crash.

Of course we need to understand why the electronics thinks this rec-hit is shifted by 25 ns!

@mariadalfonso
Contributor

> Hello @cms-sw/hcal-dpg-l2 @cms-sw/alca-l2 , this crash happened again this night in run 360295. HLT was using CMSSW_12_4_10. The crash happened 5 times (2022-10-13):
>
>   • 1 time at 09:16:07 (fu-c2b03-33-01)
>   • 1 time at 08:27:12 (fu-c2b03-12-01)
>   • 3 time around 05:40
>     • 05:39:07 (fu-c2b03-33-01)
>     • 05:41:26 (fu-c2b03-30-01)
>     • 05:44:57 (fu-c2b02-03-01)
>
> f3mon_logtable_2022-10-13T07_53_34.976Z.txt
>
> List of runs with the crashes:
>
> 357898
> 359998
> 360295

Hi, where can I find these events?
It would be good to have them copied somewhere so that we can classify all these exceptions.

@missirol
Contributor

missirol commented Oct 14, 2022

> where I can find the events here ?

They are available on the online GPU-development machines, e.g. gpu-c2a02-35-01.cms, at

/store/error_stream/run{357898,359998,360295}/*raw

For an example of how to rerun HLT directly on *.raw files, see #39045 (comment).

> I do not see this condition in the the hlt-gpu code so that's why crash on this.

FYI: @cms-sw/heterogeneous-l2

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

Seems to be happening a lot more frequently in recent runs:

720 crashes in run 360330

Run number: 360330
L1/HLT key: collisions2022/v249
HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.3/HLT/V1
CMSSW version: CMSSW_12_4_10

Most (if not all) of the crashes have the message: 
[2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Requested conditions of type HcalChannelQuality for cell (0x45104407) (HE -17,7,1) got conditions for cell (0x0)

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@francescobrivio
Contributor

francescobrivio commented Oct 14, 2022

> Seems to be happening a lot more frequently in recent runs:
>
> 720 crashes in run 360330
> Run number: 360330
> L1/HLT key: collisions2022/v249
> HLT Menu: /cdaq/physics/Run2022/2e34/v1.4.3/HLT/V1
> CMSSW version: CMSSW_12_4_10
>
> Most (if not all) of the crashes have the message:
> [2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
> Exception Message:
> Requested conditions of type HcalChannelQuality for cell (0x45104407) (HE -17,7,1) got conditions for cell (0x0)

Just to add a bit of information on "recent runs":
this morning there was the update of the HCAL conditions with a few hiccups exactly in the run range 360329–360333
(as described in this CMSTalk post) and which consequently caused some processing issues in Tier0 (see this CMSTalk post).

Since the crashes reported in this GH issue pre-date the errors I just described, I think the two things might be un-related, but I just wanted to add the information for completeness.

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

One thing that I can reproduce is that the soi (what is it ?) computed on GPU is "wrong" for the same event as for the CPU:

Begin processing the 18th record. Run 359998, Event 538274608, LumiSection 335 on stream 0 at 14-Oct-2022 17:38:13.244 CEST
%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  14-Oct-2022 17:38:13 CEST Run: 359998 Event: 538274608
 bad SOI/maxTS in cell (HB 10,47,3)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG

and

Begin processing the 18th record. Run 359998, Event 538274608, LumiSection 335 on stream 0 at 14-Oct-2022 17:39:00.806 CEST
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
gch: 6861, nchannelsf01HE: 4265, nchannelsf015: 4265, soi: -85
----- Begin Fatal Exception 14-Oct-2022 17:39:00 CEST-----------------------
An exception of category 'Conditions not found' occurred while
   [0] Processing  Event run: 359998 lumi: 335 event: 538274608 stream: 0
   [1] Running path 'Path'
   [2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Unavailable Conditions of type HcalChannelQuality for cell (0x0) 
----- End Fatal Exception -------------------------------------------------

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

Looking at the legacy code in RecoLocalCalo/HcalRecProducers/src/HBHEPhase1Reconstructor.cc

  • if the soi is bad, it prints a warning and sets a badSOI flag:
        const int soi = tsFromDB_ ? properties.paramTs->firstSample() : frame.presamples();
        const bool badSOI = !(maxTS >= 3 && soi > 0 && soi < maxTS - 1);
        if (badSOI) {
          edm::LogWarning("HBHEDigi") << " bad SOI/maxTS in cell " << cell
                                      << "\n expect maxTS >= 3 && soi > 0 && soi < maxTS - 1"
                                      << "\n got maxTS = " << maxTS << ", SOI = " << soi;
        }
  • the badSOI flag is set in the channelInfo:
        channelInfo->setChannelInfo(cell,
                                    pulseShapeID,
                                    nTSToCopy,
                                    fitSoi,
                                    soiCapid,
                                    darkCurrent,
                                    fcByPE,
                                    lambda,
                                    noisecorr,
                                    hwerr.first,
                                    hwerr.second,
                                    properties.taggedBadByDb || dropByZS || badSOI);
  • the last argument of setChannelInfo sets the dropped_ flag, which is read via the bool isDropped() method
  • which in turn makes the producer skip this channel:
        // If needed, add the channel info to the output collection
        const bool makeThisRechit = !channelInfo->isDropped();
        [...]
    
        // Reconstruct the rechit
        if (rechits && makeThisRechit) {
            [...]

Now the question is - how do we skip a "bad" channel in the GPU reconstruction ?
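
As a rough illustration of the skip logic described above, the following host-side sketch applies the same validity condition as the legacy `badSOI` flag and leaves out any channel that fails it. `ChannelFrame` and `channelsToReconstruct` are hypothetical names for this example only, and SOI is taken to mean the sample of interest. In a GPU kernel the analogous step would presumably be an early return for the affected channel. The actual fix is in the linked pull request and is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified per-channel summary; not a CMSSW data format.
struct ChannelFrame {
  std::uint32_t rawId;  // packed cell identifier
  int maxTS;            // number of time samples in the frame
  int soi;              // index of the sample of interest (SOI)
};

// Same condition as the legacy badSOI flag quoted above.
constexpr bool isBadSOI(int maxTS, int soi) {
  return !(maxTS >= 3 && soi > 0 && soi < maxTS - 1);
}

// Return the ids of the channels that would be reconstructed,
// skipping every channel whose SOI fails the check.
std::vector<std::uint32_t> channelsToReconstruct(const std::vector<ChannelFrame>& frames) {
  std::vector<std::uint32_t> kept;
  kept.reserve(frames.size());
  for (const ChannelFrame& frame : frames) {
    if (isBadSOI(frame.maxTS, frame.soi)) {
      continue;  // dropped channel: nothing is written for it
    }
    kept.push_back(frame.rawId);
  }
  return kept;
}

// SOI values seen in the logs of this thread, with maxTS = 8: both are rejected.
inline void checkLoggedValues() {
  assert(isBadSOI(8, -1));
  assert(isBadSOI(8, -85));
  assert(!isBadSOI(8, 3));
}
```

Keeping the condition in one small function means the CPU and GPU paths could share exactly the same definition of a bad channel, which is the property that was missing here.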

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

By the way, a source of problems is that soiSamples was uninitialised, hence the random value -87.
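
A minimal sketch of the initialisation point, assuming the SOI index is found by scanning per-sample flags (the thread does not show how `soiSamples` is actually filled, so `findSoi`, `kNoSoi` and the flag array are hypothetical). Starting from a fixed sentinel makes the "no SOI found" case deterministic, and a value of -1 then fails the `soi > 0` part of the legacy check instead of carrying whatever happened to be in memory.

```cpp
#include <cassert>
#include <cstddef>

// Sentinel meaning "no sample was flagged as SOI" (hypothetical name).
constexpr int kNoSoi = -1;

// Scan hypothetical per-sample SOI flags and return the index of the
// first flagged sample, or kNoSoi if none is flagged.
constexpr int findSoi(const bool* soiFlags, std::size_t nSamples) {
  int soi = kNoSoi;  // initialised up front, never left indeterminate
  for (std::size_t i = 0; i < nSamples; ++i) {
    if (soiFlags[i]) {
      soi = static_cast<int>(i);
      break;
    }
  }
  return soi;
}

// Small self-check with an 8-sample frame, with and without a flagged sample.
inline void checkFindSoi() {
  const bool withSoi[8] = {false, false, false, true, false, false, false, false};
  const bool withoutSoi[8] = {};
  assert(findSoi(withSoi, 8) == 3);
  assert(findSoi(withoutSoi, 8) == kNoSoi);
}
```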

@wang-hui
Contributor

> By the way, a source of problems is that soiSamples was uninitialised, hence the random value -87.

We discussed this issue in today's HCAL DPG meeting.
Our OPS colleagues are investigating possible data corruption in the digi.
Will let you know if they find something.

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

OK.

In the meantime, I've prepared what I think is a fix to skip the channels affected by this problem, trying to follow the same approach used in the legacy rechit reconstruction: #39738 .

@fwyzard
Contributor

fwyzard commented Oct 14, 2022

+heterogeneous

@fwyzard
Contributor

fwyzard commented Oct 15, 2022

The same error has been reported in runs 360393 and 360400.

Running with the candidate fix from #39740 lets all HLT jobs complete, with some HCAL-related messages:

%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  15-Oct-2022 10:42:05 CEST Run: 360400 Event: 26202582
 bad SOI/maxTS in cell (HB -12,28,1)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG
%MSG-w Invalid Data:  HcalRawToDigi:hltHcalDigis 15-Oct-2022 10:42:45 CEST  Run: 360400 Event: 44480955
The default QIE11 Collection has 8 samples per digi, while the current data has 17!  This data cannot be included with the default collection.
In order to store this data in the event, it must have a unique tag.  To accomplish this, provide two lists to HcalRawToDigi 
1) that specifies the number of samples and 2) that gives tags with which these data are saved.
For example in this case you might add 
process.hcalDigis.saveQIE11DataNSamples = cms.untracked.vint32( 17) 
process.hcalDigis.saveQIE11DataTags = cms.untracked.vstring( "MYDATA" )
%MSG
%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  15-Oct-2022 10:43:04 CEST Run: 360400 Event: 136235629
 bad SOI/maxTS in cell (HB 11,27,1)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG
%MSG-w HBHEDigi:  HBHEPhase1Reconstructor:hltHbherecoLegacy  15-Oct-2022 10:43:04 CEST Run: 360400 Event: 136235629
 bad SOI/maxTS in cell (HB -9,30,3)
 expect maxTS >= 3 && soi > 0 && soi < maxTS - 1
 got maxTS = 8, SOI = -1
%MSG

@mmusich
Contributor

mmusich commented Oct 24, 2022

@cms-sw/hcal-dpg-l2 please consider signing this issue.
@trtomei please consider closing this issue (as per today's joint ops document no more issues of this type were noticed in recent runs)

@wang-hui
Contributor

Hi @mmusich, the patch for this issue has been merged in #39738.
We (HCAL DPG) are happy with the patch.

@mmusich
Contributor

mmusich commented Oct 24, 2022

@wang-hui then please sign-off this issue.
Thanks.

@wang-hui
Contributor

+1

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@missirol
Contributor

please close

9 participants