Crash of DQM visualization second-instance client due to missing hltSiStripClusters2ApproxClusters #42878

Closed
syuvivida opened this issue Sep 26, 2023 · 10 comments


@syuvivida
Contributor

syuvivida commented Sep 26, 2023

Dear CMSSW core team,

In the HI collision runs during the night of Sep 26 (runs 374288-374294), our online DQM client visualization-live-secondInstance started crashing on some events, but not all of them. An example log can be seen here. The complaint is always about the missing hltSiStripClusters2ApproxClusters. The input of this client is streamEventDisplay: we double checked and did see hltSiStripClusters2ApproxClusters included in the event content.

Since our DQM framework restarts a failed client several times, this client still managed to process a few events in these runs. The output can be found at /eos/cms/store/group/visualization.

We are wondering whether this is because there is no protection for events that do not contain hltSiStripClusters2ApproxClusters. However, we do see PR #42776, which should resolve this issue and is integrated in 13_2_4.
Please let us know if our understanding is incorrect. Any input from you is highly appreciated!

@cmsbuild
Contributor

A new Issue was created by @syuvivida .

@Dr15Jones, @rappoccio, @smuzaffar, @makortel, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.


@makortel
Contributor

assign dqm

@cmsbuild
Contributor

New categories assigned: dqm

@tjavaid,@micsucmed,@nothingface0,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

One could see an example log here.

PR #42776 fixes a crash (typically a segfault) that occurs when the hltSiStripClusters2ApproxClusters collection exists but is empty. The error in the log file, by contrast, says that the hltSiStripClusters2ApproxClusters collection is missing.
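To make the empty-versus-missing distinction concrete, here is a toy Python model. It is not the CMSSW API: `get_product`, `first_cluster` and `ProductNotFound` are hypothetical names, and an event is modelled as a plain dict from module labels to lists.

```python
class ProductNotFound(Exception):
    """Hypothetical stand-in for the framework's missing-product error."""


def get_product(event, label):
    # Toy event model: a dict mapping module labels to collections (lists).
    if label not in event:
        raise ProductNotFound(f"product with label '{label}' not found")
    return event[label]


def first_cluster(event, label="hltSiStripClusters2ApproxClusters"):
    clusters = get_product(event, label)  # raises if the product is absent
    # Guard for the *empty* case: indexing clusters[0] would fail otherwise.
    if not clusters:
        return None
    return clusters[0]
```

In this model the empty-collection guard is never reached for an event that lacks the product entirely, so a protection of that kind cannot, by itself, prevent a "product not found" error.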

we double checked and did see hltSiStripClusters2ApproxClusters included in the event content.

How did you verify that?

@syuvivida
Contributor Author

@makortel
Sorry, this is now understood: the crash is due to the dataset mixing in some UPC trigger paths that do not produce RAW' (hltSiStripClusters2ApproxClusters). Therefore, when events are triggered by these paths, the collection is missing.
See the comment from Marco here

@mmusich
Contributor

mmusich commented Sep 26, 2023

This looks like a problem in the triggers populating the HIEventDisplay dataset.
The output module for that dataset does keep the necessary collection:

```python
hltOutputHIDQMEventDisplay = cms.OutputModule( "GlobalEvFOutputModule",
    use_compression = cms.untracked.bool( True ),
    compression_algorithm = cms.untracked.string( "ZSTD" ),
    compression_level = cms.untracked.int32( 3 ),
    lumiSection_interval = cms.untracked.int32( 0 ),
    SelectEvents = cms.untracked.PSet(  SelectEvents = cms.vstring( 'Dataset_HIEventDisplay' ) ),
    outputCommands = cms.untracked.vstring( 'drop *',
      'keep *_hltSiStripClusters2ApproxClusters_*_*',
      'keep DetIds_hltSiStripRawToDigi_*_*',
      'keep FEDRawDataCollection_rawDataRepacker_*_*',
      'keep FEDRawDataCollection_rawPrimeDataRepacker_*_*',
      'keep FEDRawDataCollection_source_*_*',
      'keep edmTriggerResults_*_*_*',
      'keep triggerTriggerEvent_*_*_*' ),
    psetMap = cms.untracked.InputTag( "hltPSetMap" )
)
```
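As a rough illustration of why the configuration above looks correct, the `outputCommands` list can be modelled in plain Python. This is a simplified sketch, not the framework's implementation: it assumes commands are applied in order with the last matching one winning, matches whole branch names of the form `type_label_instance_process` with `fnmatch` wildcards, and the product type names used below are hypothetical.

```python
from fnmatch import fnmatchcase


def is_kept(branch_name, output_commands):
    """Return True if branch_name survives the keep/drop list.

    Simplified model: commands are applied in order and the last matching
    one decides; shell-style wildcards are matched on the whole name.
    """
    kept = False
    for command in output_commands:
        action, pattern = command.split(maxsplit=1)
        if fnmatchcase(branch_name, pattern):
            kept = (action == "keep")
    return kept


# Copied from the HIEventDisplay output module configuration.
OUTPUT_COMMANDS = [
    'drop *',
    'keep *_hltSiStripClusters2ApproxClusters_*_*',
    'keep DetIds_hltSiStripRawToDigi_*_*',
    'keep FEDRawDataCollection_rawDataRepacker_*_*',
    'keep FEDRawDataCollection_rawPrimeDataRepacker_*_*',
    'keep FEDRawDataCollection_source_*_*',
    'keep edmTriggerResults_*_*_*',
    'keep triggerTriggerEvent_*_*_*',
]
```

A `keep` statement only selects a product that already exists in the event; it cannot create one. So the filter being right says nothing about whether every selected event actually carries the collection.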

but I suspect the problem is that in CMSHLT-2925 we mixed the paths that run the RAW' reconstruction (most of them) with paths that don't (e.g. HLT_HIUPC_DoubleMuOpen_NotMBHF2AND_MaxPixelCluster1000_v1).
So events triggered by such paths will not contain the expected collection, and this generates the runtime exception.
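The failure mode can be sketched with a small Python model of a dataset built as the OR of several paths. Assumptions, all hypothetical: `HLT_HypotheticalRawPrimePath_v1` stands in for the paths that run the RAW' reconstruction, and the collection is treated as present exactly when at least one accepting path produces it (real scheduling is more subtle).

```python
# Hypothetical mapping: does the path run the sequence producing the
# hltSiStripClusters2ApproxClusters collection?
PATH_PRODUCES_RAWPRIME = {
    "HLT_HypotheticalRawPrimePath_v1": True,  # hypothetical stand-in path
    "HLT_HIUPC_DoubleMuOpen_NotMBHF2AND_MaxPixelCluster1000_v1": False,
}


def selected_by_dataset(fired_paths, dataset_paths):
    # Dataset selection is an OR over its member paths.
    return any(path in dataset_paths for path in fired_paths)


def has_approx_clusters(fired_paths):
    # Simplification: the collection exists only if an accepting path produced it.
    return any(PATH_PRODUCES_RAWPRIME.get(path, False) for path in fired_paths)


def events_missing_collection(events, dataset_paths):
    """Indices of events written to the dataset without the collection."""
    return [
        index
        for index, fired in enumerate(events)
        if selected_by_dataset(fired, dataset_paths) and not has_approx_clusters(fired)
    ]
```

In this model, only events accepted solely by a path that skips the RAW' reconstruction end up in the dataset without the collection, which matches the intermittent pattern seen by the DQM client.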
This is being followed up at CMSHLT-2933.

@mmusich
Contributor

mmusich commented Sep 26, 2023

@cmsbuild
Contributor

New categories assigned: hlt

@mmusich,@missirol,@Martin-Grunewald you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Sep 26, 2023

For the record, the solution will have to come from outside of CMSSW, so this issue is misplaced.

@syuvivida
Contributor Author

syuvivida commented Sep 27, 2023

I agree. Closing this issue.
