`DeepTauId` failures in RelVals (`Incompatible shapes`) #44333

AdrianoDee · 2024-03-07T06:31:31Z

While running RelVals we are observing failures caused by a TensorFlow exception coming from the `DeepTauId` module. Some examples are listed below.

1) 2023 Data reHLT + reRECO

In HLTDR3_2023 step in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in 14_0_0_pre3 RelVals

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 11 event: 22076365 stream: 0
[1] Running path 'HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducerForVBFIsoTau'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

with the config here, that is what we get from wf 141.035 running L1REPACK:Full,HLT:@relval2024 (HLT pointing at GRun here). The error here. The wf on Stats2.

Also in the same step in 13_3_0_pre5 RunDisplacedJet2023C in a different path (HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6 ) run in HLT:@relval2023. The error here. The wf on Stats2.

2) 2022 Data reHLT + reRECO

Much rarer: in the AODNANORUN3_reHLT_2022 step, in deepTau2017v2p1ForMini, in RunJetMET2022D with 14_0_0. The error here. The wf on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 357735 lumi: 20 event: 32782226 stream: 0
[1] Running path 'NANOEDMAODoutput_step'
[2] Prefetching for module PoolOutputModule/'NANOEDMAODoutput'
[3] Prefetching for module SimpleCandidateFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[13] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,64] vs. [154]
[[{{node inner_muon_norm_1/FusedBatchNorm_1/Mul}}]]

3) MC 2023

In DigiPU_2023PU step in hltHpsPFTauDeepTauProducer in RelValTenTau_15_500 with 13_3_0_pre1 (at the moment the first occurrence I found). The error here. The wf on Stats2.

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 1 lumi: 18 event: 1707 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_OneProng_M5to80_v2'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

CPU

At the moment it appears that in all cases the jobs were running on Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz (or on a Gold one), Cascade Lake (see #44333 (comment)).


cmsbuild · 2024-03-07T06:31:48Z

cms-bot internal usage

cmsbuild · 2024-03-07T06:31:49Z

A new Issue was created by @AdrianoDee.

@Dr15Jones, @antoniovilela, @smuzaffar, @makortel, @sextonkennedy, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

AdrianoDee · 2024-03-07T06:31:54Z

assign hlt

AdrianoDee · 2024-03-07T06:32:04Z

assign pdmv

cmsbuild · 2024-03-07T06:32:09Z

New categories assigned: hlt,pdmv

@Martin-Grunewald,@mmusich,@AdrianoDee,@sunilUIET,@miquork you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich · 2024-03-07T06:49:11Z

@cms-sw/tau-pog-l2 FYI

mmusich · 2024-03-07T06:49:19Z

type tau

mmusich · 2024-03-07T06:50:26Z

just as an observation this path is not new (first included in the GRun menu in 2022, https://its.cern.ch/jira/browse/CMSHLT-2289)

EDIT but was touched recently in https://its.cern.ch/jira/browse/CMSHLT-3052

mmusich · 2024-03-07T07:09:04Z

@cms-sw/pdmv-l2

In data reHLT+reRECO RelVals we are observing some failures at HLTDR_2023 step in path HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7

Please help filling in some information:

- In which release is this happening?
- Is it reproducible?
- Does it affect all jobs of the relvals?
- Is there a pattern w.r.t. the CPU microarchitecture of the node on which the job lands?

Martin-Grunewald · 2024-03-07T07:54:30Z

I can't find it in the Dashboard. Since it is labelled HLTDR_2023, and the path in question is not in the Fake* menus, it must be in some 13_X release running the actual 2023 HLT with the 2023 version of that path.

AdrianoDee · 2024-03-07T07:58:48Z

Quick answers:

- this happened in both 14_0_0_pre3 and 14_0_0, but I'm tracking it back to older releases (I'll come back as soon as I find the first occurrence);
- it only happens in a fraction of the jobs, and the fraction itself is quite random (on the order of a few percent of the events failing).

For the reproducibility and the CPU pattern I'll need a moment to check those.

Martin-Grunewald · 2024-03-07T07:59:53Z

Hmm well, in 14_X, HLTDR_2023 should (now) run the Fake* menus, while the real HLT menus should be within HLTDR_2024.

mmusich · 2024-03-07T08:03:20Z

> in 14_X, HLTDR_2023 should (now) run the Fake* menus, while the real HLT menus should be within HLTDR_2024

Indeed, the configuration linked above has L1REPACK:Full,HLT:@relval2024, but in the absence of real 2024 data we're running the 2024 menu on 2023 data.

AdrianoDee · 2024-03-07T08:07:43Z

I see the same (similar) error

Fatal Exception (Exit code: 8001)
An exception of category 'InvalidRun' occurred while
[0] Processing Event run: 367131 lumi: 122 event: 206577729 stream: 1
[1] Running path 'HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6'
[2] Calling method for module DeepTauId/'hltHpsPFTauDeepTauProducer'
Exception Message:
error while running session: INVALID_ARGUMENT: Incompatible shapes: [0,1,1,38] vs. [92]
[[{{node inner_hadrons_norm_1/FusedBatchNorm_1/Mul}}]]

in 13_3_0_pre5 RunDisplacedJet2023C running L1REPACK:Full,HLT:@relval2023.

mmusich · 2024-03-07T08:18:13Z

> HLT_DoubleMediumDeepTauPFTauHPS30_L2NN_eta2p1_PFJet60_v6

This is a different path, so it points to a general problem with DeepTauId that is not specific to one path.

Dr15Jones · 2024-03-07T14:01:30Z

For context, it appears the exception comes from here:

cmssw/PhysicsTools/TensorFlow/src/TensorFlow.cc

Lines 272 to 275 in ff51428

    
    Status status = session->Run(runOptions, inputs, outputNames, {}, outputs, nullptr, threadPoolOptions);
    if (!status.ok()) {
      throw cms::Exception("InvalidRun") << "error while running session: " << status.ToString();
    }
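As a side note on the message itself: `Incompatible shapes: [A] vs. [B]` is what an element-wise op reports when its two operand shapes cannot be broadcast together. A minimal sketch of that check, assuming NumPy-style broadcasting rules; `broadcastCompatible` is a hypothetical helper written for illustration, not a CMSSW or TensorFlow function:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (illustration only): returns true if two shapes can be
// broadcast together. Shapes are aligned from the trailing dimension; each
// pair of sizes must be equal, or one of them must be 1.
bool broadcastCompatible(const std::vector<int64_t>& a, const std::vector<int64_t>& b) {
  const std::size_t n = std::max(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) {
    // Missing leading dimensions are treated as size 1.
    const int64_t da = i < a.size() ? a[a.size() - 1 - i] : 1;
    const int64_t db = i < b.size() ? b[b.size() - 1 - i] : 1;
    if (da != db && da != 1 && db != 1) {
      return false;
    }
  }
  return true;
}
```

With the shapes from the logs, `broadcastCompatible({0, 1, 1, 64}, {154})` and `broadcastCompatible({0, 1, 1, 38}, {92})` both return false under these rules: the trailing sizes (64 vs. 154, 38 vs. 92) differ and neither is 1. The leading 0 shows that the tensor reaching that node had an empty batch dimension.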

makortel · 2024-03-07T14:45:11Z

assign ml

makortel · 2024-03-07T14:45:21Z

assign reconstruction

cmsbuild · 2024-03-07T14:45:29Z

New categories assigned: ml,reconstruction

@jfernan2,@mandrenguyen,@valsdav,@wpmccormack you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel · 2024-03-15T22:21:14Z

This failure was now seen in Tier0 PromptReco https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5 .

mmusich · 2024-03-16T06:27:10Z

urgent

> This failure was now seen in Tier0 PromptReco https://cms-talk.web.cern.ch/t/update-t0-skim-config-for-2024-pp-collision/36794/5

> I can prepare a PR with guards to avoid the execution of the model with empty inputs, and in parallel investigate more deeply this TF behaviour.

@valsdav, we have established that this issue can affect Prompt Reconstruction and (potentially, when the new nodes for the HLT farm arrive) also online trigger operations. Please prepare PRs with guards to avoid the execution of the model with empty inputs.
Thank you.

Marco (as ORM)

mmusich · 2024-03-19T07:28:37Z

For the record, the proposed fixes are:

jfernan2 · 2024-03-20T09:24:57Z

+1
solved by #44455

valsdav · 2024-03-20T09:41:02Z

+ml

Basic guards to solve the empty-input problem in DeepTauId are in place, but the reason for the empty grid needs to be investigated with Tau experts.

A more general guard for empty inputs will be added (see #44481)

AdrianoDee · 2024-03-20T09:42:27Z

+pdmv
(really only the reporter)

mmusich · 2024-03-20T09:43:37Z

... hlt will sign once the 14.0.X PR is merged and tested in IBs.

mmusich · 2024-03-20T09:55:14Z

> but the reason for the empty grid needs to be investigated with Tau experts.

@cms-sw/reconstruction-l2 this looks like it needs a separate issue. Can you open one?

mmusich · 2024-03-25T13:40:11Z

+hlt

no issues observed after the 14.0.X PR got merged and tested in IBs.

cmsbuild · 2024-03-25T13:40:36Z

This issue is fully signed and ready to be closed.

makortel · 2024-03-25T18:30:37Z

@cmsbuild, please close

cmsbuild added the pending-assignment label Mar 7, 2024

cmsbuild added hlt-pending pending-signatures pdmv-pending and removed pending-assignment labels Mar 7, 2024

AdrianoDee changed the title ~~DeepTau failures in HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7~~ DeepTauId failures in HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in RelVals Mar 7, 2024

cmsbuild added the tau label Mar 7, 2024

AdrianoDee changed the title ~~DeepTauId failures in HLT_VBF_DoubleMediumDeepTauPFTauHPS20_eta2p1_v7 in RelVals~~ DeepTauId failures in RelVals Mar 7, 2024

AdrianoDee changed the title ~~DeepTauId failures in RelVals~~ DeepTauId failures in RelVals (Incompatible shapes) Mar 7, 2024

cmsbuild added reconstruction-pending ml-pending labels Mar 7, 2024

cmsbuild added the urgent label Mar 16, 2024

This was referenced Mar 18, 2024

DeepTau - Do not call TF inference with empty grid #44455

Merged

[backport] DeepTau - Do not call TF inference with empty grid #44456

Merged

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Mar 20, 2024

valsdav mentioned this issue Mar 20, 2024

Avoid TensorFlow empty inputs in central interface #44481

Open

cmsbuild added ml-approved and removed ml-pending labels Mar 20, 2024

cmsbuild added pdmv-approved and removed pdmv-pending labels Mar 20, 2024

jfernan2 mentioned this issue Mar 21, 2024

Empty grid for DeepTauId #44501

Open

cmsbuild added hlt-approved fully-signed and removed hlt-pending pending-signatures labels Mar 25, 2024

cmsbuild closed this as completed Mar 25, 2024

mmusich mentioned this issue Jun 4, 2024

HLT farm crash in run 381543 #45136

Closed

This was referenced Jun 5, 2024

Skip evaluation of TensorFlow model if inputs are empty #45139

Merged

[14_0_X] Skip evaluation of TF model if one of the input tensors is empty #45145

Merged
