Out of range exception from RPCAMCRawToDigi #38939

Closed
makortel opened this issue Aug 2, 2022 · 33 comments

Comments

@makortel
Contributor

makortel commented Aug 2, 2022

Workflow 136.8561 step 3 has been failing since CMSSW_12_5_X_2022-07-28-1100 with

----- Begin Fatal Exception 02-Aug-2022 14:38:05 CEST-----------------------
An exception of category 'OutOfRange' occurred while
   [0] Processing  Event run: 314890 lumi: 591 event: 497483740 stream: 2
   [1] Running path 'dqmoffline_step'
   [2] Prefetching for module L1TdeStage2CPPF/'l1tdeStage2Cppf'
   [3] Calling method for module RPCAMCRawToDigi/'rpcCPPFRawToDigi'
Exception Message:
Out-of-range input for RPCAMCLink::bf_set, position 0: 100
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_5_X_2022-08-02-1100/pyRelValMatrixLogs/run/136.8561_RunZeroBias_hBStarTk+RunZeroBias_hBStarTk+HLTDR2_2018_hBStar+RECODR2_2018reHLT_Offline_hBStar+HARVEST2018_hBStar/step3_RunZeroBias_hBStarTk+RunZeroBias_hBStarTk+HLTDR2_2018_hBStar+RECODR2_2018reHLT_Offline_hBStar+HARVEST2018_hBStar.log#/
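
The wording of the message suggests a range-checked bit-field setter: a value (here 100) did not fit into the field that starts at bit position 0. A minimal Python sketch of that pattern follows. The field layout, the mask and the function `bf_set` as written here are illustrative assumptions, not the actual `RPCAMCLink` code.

```python
class OutOfRange(ValueError):
    """Raised when a value does not fit into the target bit field."""


def bf_set(packed: int, mask: int, pos: int, min_value: int, value: int) -> int:
    """Return `packed` with the field selected by `mask` set to `value`.

    Hypothetical layout: the field stores (1 + value - min_value), so that
    a stored 0 can mean "unset". `mask` is the field's bit mask and `pos`
    the position of its lowest bit.
    """
    max_stored = mask >> pos  # largest number the field can hold
    stored = 1 + value - min_value
    if value < min_value or stored > max_stored:
        raise OutOfRange(f"Out-of-range input for bf_set, position {pos}: {value}")
    # Clear the field, then write the new value into it.
    return (packed & ~mask) | ((stored << pos) & mask)


# A 4-bit field at position 0 can store 1..15, i.e. values 0..14.
FIELD_MASK = 0x0F
word = bf_set(0, FIELD_MASK, 0, 0, 5)  # fits into the field
try:
    bf_set(word, FIELD_MASK, 0, 0, 100)  # too large for the field
    failed = False
except OutOfRange as err:
    failed = True
    message = str(err)
```

With a check of this kind, a single corrupted value in the raw data is enough to abort the job unless the caller handles the exception.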

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2022

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

makortel commented Aug 2, 2022

assign reconstruction

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2022

New categories assigned: reconstruction

@jpata,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor Author

makortel commented Aug 2, 2022

Seems that this was already reported in #38564 (comment)

@zhangcg123

@zhangcg123
Contributor

The same error can be reproduced simply by

cmsRun DQM/Integration/python/clients/l1tstage2emulator_dqm_sourceclient-live_cfg.py unitTest=True dataset=/ZeroBias/Commissioning2018-v1/RAW runNumber=314890 eventsPerLumi=-1

It looks like the error only occurs when dataset=/ZeroBias/Commissioning2018-v1/RAW is used as input.

@davidlange6
Contributor

davidlange6 commented Aug 2, 2022 via email

@makortel
Contributor Author

makortel commented Aug 2, 2022

assign l1

L1 would seem to be the more appropriate group to assign this to.

Thanks. I followed RPCAMCRawToDigi module being defined in EventFilter/RPCRawToDigi, and that package being assigned to reconstruction.

@cmsbuild
Contributor

cmsbuild commented Aug 2, 2022

New categories assigned: l1

@epalencia,@rekovic,@cecilecaillol you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mileva
Contributor

mileva commented Aug 10, 2022

Hi @makortel all,
The reason for the crash is that during run 314890 (used by workflow 136.8561_RunZeroBias) the CPPF data were considered corrupted. The cause was a firmware update, which led to some problems. Good CPPF data were restored after run 315764.

My personal advice is to change the input data with some recent zerobias run in order to test the workflow.

And from the other side - probably some sanity checks (if there are data, or if they are valid...) need to be added to the analyzer in order to avoid further crash of the code in such cases.
Best!
Roumyana (for RPCs)

@qliphy
Contributor

qliphy commented Aug 10, 2022

@mileva Is #38974 supposed to fix this issue?

@mileva
Contributor

mileva commented Aug 10, 2022

Hi @qliphy
No, #38974 is not supposed to fix this issue here.

#38974 is intended to fix the CPPF DAQ delay and the unpacked RPC digis, while the current issue relates to a comparison between the unpacked cppf digis vs emulated ones.

The RPCCPPF unpacker processes two different records:
TXRecord: when processing it, the unpacker fills the CPPFDigi collection (clusters that are sent to L1-EMTF), used by the colleagues for their #38564.
RXRecord: contains information on the initial RPC detector data and is used to fill the RPCDigi collection, which is the input for local reconstruction.
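
The split between the two record types can be pictured as a small dispatcher. This is only a toy model: the record classes, their fields and the `unpack` function are hypothetical and do not mirror the actual RPCAMCRawToDigi code.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class TXRecord:  # hypothetical: cluster information sent on to L1-EMTF
    cluster: int


@dataclass
class RXRecord:  # hypothetical: initial RPC detector data
    strip: int


@dataclass
class UnpackedEvent:
    cppf_digis: List[int] = field(default_factory=list)  # filled from TXRecord
    rpc_digis: List[int] = field(default_factory=list)   # filled from RXRecord


def unpack(records: List[Union[TXRecord, RXRecord]]) -> UnpackedEvent:
    """Route each record to the collection it feeds."""
    event = UnpackedEvent()
    for record in records:
        if isinstance(record, TXRecord):
            event.cppf_digis.append(record.cluster)
        elif isinstance(record, RXRecord):
            event.rpc_digis.append(record.strip)
    return event


event = unpack([TXRecord(cluster=3), RXRecord(strip=17), RXRecord(strip=18)])
```

In this picture a fix that touches only the RXRecord branch leaves the CPPFDigi collection, and anything that compares it with emulated digis, unchanged.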

In the particular case with the test of the ZeroBias workflow, the input run was bad for CPPF - the cppf data were corrupted and thus led to a crash of the L1 CPPF DQM module.

Best!
Roumyana

@makortel
Contributor Author

My personal advice is to change the input data with some recent zerobias run in order to test the workflow.

Thanks, adding @cms-sw/pdmv-l2 for that

@makortel
Contributor Author

assign pdmv

@cmsbuild
Contributor

New categories assigned: pdmv

@bbilin,@jordan-martins,@kskovpen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@kskovpen
Contributor

My personal advice is to change the input data with some recent zerobias run in order to test the workflow.

Thanks, adding @cms-sw/pdmv-l2 for that

One can use for example this input: https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/relval_steps.py#L483

@perrotta
Contributor

My personal advice is to change the input data with some recent zerobias run in order to test the workflow.

Thanks, adding @cms-sw/pdmv-l2 for that

One can use for example this input: https://github.com/cms-sw/cmssw/blob/master/Configuration/PyReleaseValidation/python/relval_steps.py#L483

@kskovpen can this be done, then? Having continuously failing workflows in the IB is far sub-optimal for the sake of checking the effect of newly integrated PRs on them.

@kskovpen
Contributor

@mileva @makortel if we want to move it to Run 3, which era should be used for this specific wf? I see that Run2_2018_highBetaStar was used in Run 2.

@mmusich
Contributor

mmusich commented Aug 23, 2022

if we want to move it to Run 3, which era should be used for this specific wf? I see that Run2_2018_highBetaStar was used in Run 2.

there's no equivalent (yet) for Run3 as we didn't have yet a high beta star run.
https://github.com/cms-sw/cmssw/tree/master/Configuration/Eras/python

I don't quite understand the point of changing the input of this wf, since I think it was expressly designed to test the reconstruction with the high beta star (tracking) setup

@mmusich
Contributor

mmusich commented Aug 23, 2022

@mileva

And from the other side - probably some sanity checks (if there are data, or if they are valid...) need to be added to the analyzer in order to avoid further crash of the code in such cases.

is there any downstream consumer of CPPF digis? if we know that the data is corrupt exactly in the run range in which we have the high beta star can the unpacker be removed in the sequence run in that era?

@mileva
Contributor

mileva commented Aug 23, 2022

I don't quite understand the point of changing the input of this wf

Hi @mmusich ,
In fact I tried to explain the reason for the crash with this workflow - namely the cppf data were corrupted in the input run, and the reason for the crash is not in the proposed pull request, but the data.
And I tried the same workflow with one of the recent runs with available data on eos to see that the software runs, nothing more.
I guess that the L1/CPPF/DQM code just needs some sanity checks to avoid such cases with corrupted data.

Best!
Roumyana

@mmusich
Contributor

mmusich commented Aug 23, 2022

@mileva

In fact I tried to explain the reason for the crash with this workflow - namely the cppf data were corrupted in the input run, and the reason for the crash is not in the proposed pull request, but the data.

yes, I understand, but changing the input data is NOT an option, unless we want to give up testing the high beta* reco...

@mileva
Contributor

mileva commented Aug 23, 2022

is there any downstream consumer of CPPF digis? if we know that the data is corrupt exactly in the run range in which we have the high beta star can the unpacker be removed in the sequence run in that era?

In fact all the 2018A data before run 315764 are affected. The issue happened somewhere in March, before the start of data taking.
The CPPF digis are called by the L1TStage2Emulator. But I think at that moment (2018A) L1 didn't use them and just produced the CPPF clusters on the fly from the RPCDigis in the emulation step. (CPPF concentrates the RPC digis in the endcap and clusterizes them.)
So, I guess there will be no problem removing the CPPFRPCunpacker for this particular workflow.

I just ping @efeyazgan for EMTF, in case I am missing something.

@kskovpen
Contributor

is there any downstream consumer of CPPF digis? if we know that the data is corrupt exactly in the run range in which we have the high beta star can the unpacker be removed in the sequence run in that era?

In fact all the 2018A data before run 315764 are affected. The issue happened somewhere in March, before the start of data taking. The CPPF digis are called by the L1TStage2Emulator. But I think at that moment (2018A) L1 didn't use them and just produced the CPPF clusters on the fly from the RPCDigis in the emulation step. (CPPF concentrates the RPC digis in the endcap and clusterizes them.) So, I guess there will be no problem removing the CPPFRPCunpacker for this particular workflow.

I just ping @efeyazgan for EMTF, in case I am missing something.

Does it imply modifying some specific step of this workflow or its deactivation in IBs? In the former case, how should it be modified?

@mileva
Contributor

mileva commented Aug 23, 2022

Just to be clear! I don't think the problem is in the workflow. The workflow shows that there is a problem with the particular pull request.
It might happen that the CPPF data will be corrupted again. So, some checks to avoid running over corrupted data in the L1/CPPF/DQM need to be implemented.
But I am not an expert and I could be wrong.

@kskovpen
Contributor

As @makortel has mentioned above, the problem is at step3 of 136.8561. Anyhow, if experts could comment on how relevant this wf is for Run3 (as it stands now), it would help to decide on a proper action.

@mmusich
Contributor

mmusich commented Aug 23, 2022

@kskovpen

Anyhow, if experts could comment on how relevant this wf is for Run3 (as it stands now), it would help to decide on a proper action.

this wf has no relevance whatsoever for run-3, but it is there to ensure we can still reconstruct properly the run2 high beta star data. I think someone with higher paygrade than me should decide if this is something that CMS wants to keep being able to do, but I don't see why that would not be the case.
Having said that, to me it seems that the right course of action is to provide these checks in the CPPF / RPC code in order to avoid crashing on bad input data. Such checks are customarily included in DPG / POG code to avoid failures at run time.
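
As a rough illustration of such a check, the sketch below validates each decoded value before it is stored and skips records that do not fit, instead of letting the exception end the job. It is written in Python for brevity; the function names, the allowed range and the skip counter are assumptions, not the code that eventually went into CMSSW.

```python
import logging
from typing import Iterable, List, Tuple

logger = logging.getLogger("cppf_unpacker_sketch")

# Hypothetical valid range for a link identifier field.
MIN_LINK, MAX_LINK = 0, 14


def decode_link(raw: int) -> int:
    """Decode one raw value, raising ValueError if it is out of range."""
    if not MIN_LINK <= raw <= MAX_LINK:
        raise ValueError(f"link value out of range: {raw}")
    return raw


def unpack_links(raw_values: Iterable[int]) -> Tuple[List[int], int]:
    """Return the valid links and the number of corrupted records skipped."""
    links: List[int] = []
    skipped = 0
    for raw in raw_values:
        try:
            links.append(decode_link(raw))
        except ValueError as err:
            # Corrupted input: report it and carry on with the next record.
            logger.warning("skipping corrupted record: %s", err)
            skipped += 1
    return links, skipped


links, skipped = unpack_links([3, 100, 7])
```

Whether a corrupted record should be dropped silently, counted, or flagged in a DQM histogram is a design choice for the detector group; the sketch only counts and logs it.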

@mmusich
Contributor

mmusich commented Aug 23, 2022

The CPPF digis are called by the L1TStage2Emulator. But I think at that moment (2018A) L1 didn't use them and just produced the CPPF clusters on the fly from the RPCDigis in the emulation step. (CPPF concentrates the RPC digis in the endcap and clusterizes them.)
So, I guess there will be no problem removing the CPPFRPCunpacker for this particular workflow.

by the way removing the CPPF unpacker results in

----- Begin Fatal Exception 23-Aug-2022 18:56:12 CEST-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 314890 lumi: 591 event: 497757635 stream: 0
   [1] Running path 'dqmoffline_step'
   [2] Calling method for module L1TStage2CPPF/'l1tStage2Cppf'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: std::vector<l1t::CPPFDigi>
Looking for module label: rpcCPPFRawToDigi
Looking for productInstanceName: 

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "SkipEvent = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------

so, there are downstream consumers.
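
The failure mode can be modelled in a few lines: once the producer is removed, any consumer that requests its product unconditionally fails. The toy event store below is purely illustrative. `Event`, `get`, `try_get` and the consumer functions are hypothetical stand-ins for the framework; only the product type and module label are taken from the log above.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (product type, module label)


class ProductNotFound(LookupError):
    pass


class Event:
    """Toy event: a dictionary of products keyed by type and label."""

    def __init__(self) -> None:
        self._products: Dict[Key, List[int]] = {}

    def put(self, key: Key, product: List[int]) -> None:
        self._products[key] = product

    def get(self, key: Key) -> List[int]:
        if key not in self._products:
            raise ProductNotFound(f"Found zero products matching {key}")
        return self._products[key]

    def try_get(self, key: Key) -> Optional[List[int]]:
        return self._products.get(key)


CPPF_KEY: Key = ("std::vector<l1t::CPPFDigi>", "rpcCPPFRawToDigi")


def strict_consumer(event: Event) -> int:
    # Fails if the producer did not run.
    return len(event.get(CPPF_KEY))


def tolerant_consumer(event: Event) -> int:
    # Treats a missing product as "nothing to monitor".
    digis = event.try_get(CPPF_KEY)
    return 0 if digis is None else len(digis)


event_without_unpacker = Event()
try:
    strict_consumer(event_without_unpacker)
    raised = False
except ProductNotFound:
    raised = True
```

Removing the unpacker is therefore only safe if every consumer of its product is removed as well or made tolerant of its absence.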

@kskovpen
Contributor

Shall we disable 136.8561 in IBs and wait for further indications from the relevant groups?

@perrotta
Contributor

I guess that the L1/CPPF/DQM code just needs some sanity checks to avoid such cases with corrupted data.

Trying to find a solution for this longstanding issue: @mileva, could you or someone in your group please commit to providing those sanity checks in the code? If not for pre5 (today-ish), they should be made available before we build the final 12_5_0, so that this particular workflow can continue to be tested in the cycle

@mileva
Contributor

mileva commented Aug 25, 2022

Hi @perrotta
I can try to have a look or ask some of the RPC colleagues. However, today I am not able to help, as I am in the mountains. When is the final 12_5_0 build scheduled - 20.09 or earlier?
Roumyana

@perrotta
Contributor

When is the final 12_5_0 build scheduled - 20.09 or earlier?

Sep 20, see https://twiki.cern.ch/twiki/bin/viewauth/CMS/CMSSW_12_5_0
However, I would aim for a fix well before that date, in order to have still a few IBs available in which the wf can be tested

@perrotta
Contributor

perrotta commented Sep 1, 2022

urgent
(To make it visible in the list of issues: it keeps breaking the IBs)

@perrotta
Contributor

perrotta commented Sep 6, 2022

Fixed by #39307
