
CUDA-related HLT crashes between run-359694 and run-359764 #39680

Closed
missirol opened this issue Oct 8, 2022 · 28 comments

missirol commented Oct 8, 2022

Note: this issue is hopefully just for documentation purposes, as a possible solution (#39619) has already been put in place.

Between Oct-1 and Oct-4 (2022), several runs [1] were affected by bursts of HLT crashes (a reproducer can be found in [3]).

Stack traces [4] and offline checks [2] pointed to issues with reconstruction running on GPU.

@fwyzard identified an issue in the ECAL-GPU unpacker, and fixed it in #39617 (12_6_X), #39618 (12_5_X), #39619 (12_4_X). With the latter update, crashes were not observed anymore when re-running the HLT offline on data from some of the affected runs.

In parallel, @Sam-Harper and the ECAL experts realised that the crashes coincided with runs in which ECAL suffered from data-integrity issues (see, for example, this DQM plot). On Oct-4, ECAL masked the channels affected by data-integrity errors (TT11 in EB-06), and no further online crashes of this kind have been observed since (despite HLT still running CMSSW_12_4_9).

There is a separate open issue (#39568) likely related to the ECAL-GPU unpacker, but it was checked that #39619 does not solve it, so the issue in #39568 is likely different from the issue discussed here.


[1] Affected runs:

359694
359699
359750
359751
359762
359763
359764

[2] Offline checks:

The crashes could be reproduced offline using error-stream files from some of the affected runs, but not reliably: they were more likely to occur when using more than one EDM stream, and they only occurred in offline tests if both the ECAL unpacking and the pixel reconstruction were offloaded to GPUs. A summary of the offline checks was compiled by @Sam-Harper in this document.
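
As an illustration of those two knobs, here is a minimal sketch (not an official recipe) of lines that could be appended to the hlt.py generated by the reproducer in [3] below, to increase the number of EDM streams and, separately, to force the ECAL unpacking back onto the CPU:

cat >> hlt.py <<@EOF
# run with more threads/streams (crashes were more likely with more than 1 EDM stream);
# if numberOfStreams is left unset, it defaults to the number of threads
process.options.numberOfThreads = 8
process.options.numberOfStreams = 8

# move the ECAL unpacking back to the CPU (this made the crash disappear)
del process.hltEcalDigis.cuda
del process.hltEcalUncalibRecHit.cuda
@EOF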

[3] Reproducer (requires access to one of the online GPU machines; it might need to be run multiple times to see the crash at runtime):

#!/bin/bash

# release: CMSSW_12_4_9

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 359694 > hlt.py

cat >> hlt.py <<@EOF

# # disable ECAL unpacking on GPU (this makes the crash disappear)
# del process.hltEcalDigis.cuda
# del process.hltEcalUncalibRecHit.cuda

process.options.numberOfThreads = 4
process.source.fileListMode = True
process.source.fileNames = [
   '/store/error_stream/run359694/run359694_ls0112_index000079_fu-c2b04-35-01_pid2742152.raw',
   '/store/error_stream/run359694/run359694_ls0112_index000090_fu-c2b04-35-01_pid2742152.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000141_fu-c2b02-35-01_pid2465574.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000142_fu-c2b02-35-01_pid2465574.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000175_fu-c2b02-16-01_pid2674062.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000195_fu-c2b02-16-01_pid2674062.raw',
]
@EOF

cmsRun hlt.py &> hlt.log
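
Since the crash does not show up on every attempt, one might also loop over the last step and check the log for the CMSSW fatal-signal message (a sketch; the exact wording of the stack trace may differ):

for i in 1 2 3 4 5; do
  cmsRun hlt.py &> hlt_${i}.log
  grep -q 'A fatal system signal has occurred' hlt_${i}.log && { echo "crash on attempt ${i}"; break; }
done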

[4] Example of a stack trace (from reproducer) in the attachment hlt.log.

cmsbuild commented Oct 8, 2022

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

missirol commented Oct 8, 2022

assign ecal-dpg,reconstruction,heterogeneous

FYI: @cms-sw/hlt-l2

cmsbuild commented Oct 8, 2022

New categories assigned: heterogeneous,reconstruction,ecal-dpg

@mandrenguyen,@fwyzard,@clacaputo,@simonepigazzini,@makortel,@jainshilpi,@thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks

trocino commented Oct 10, 2022

We also had 8 crashes in run 360075.
The corresponding files are on Hilton:

/store/error_stream/run360075
/store/error_stream/run360090

@thomreis

For both runs it is EB+08 that shows integrity errors.

@thomreis

Are there plans for a patch release to deploy the fix from #39619?

fwyzard commented Oct 11, 2022 via email

perrotta commented Oct 11, 2022

> Given that we have a likely fix since last week, why don't we have a patch release with it?

@fwyzard @thomreis we could build that release even now (it would be a full release, CMSSW_12_4_10).
However, at yesterday's joint ops meeting it was said that HLT would have preferred to wait until the end of the week, when the other expected fixes could also be ready and merged.
We can discuss, and possibly agree on a change of plan with respect to what was concluded yesterday, later this afternoon at the ORP meeting.

fwyzard commented Oct 11, 2022

@perrotta can't we build a CMSSW_12_4_9_patch2 patch release based on CMSSW_12_4_9_patch1, adding only the changes necessary to fix the HLT crashes, namely #39580, #39619, and #39681?

@thomreis

A CMSSW_12_4_9_patch2 now would allow ECAL to unmask EB-06 and avoid a possible masking of EB+08, so I think that is the better option.

@perrotta

We will discuss it this afternoon. To get ready for that discussion: why would you not prefer a full release, which would be far simpler (for us, of course)? A patch release on top of it, with the remaining HLT fixes, could then be quickly prepared later this week.

fwyzard commented Oct 11, 2022

Because a patch release is something that we should be able to build and deploy in a matter of hours, not days. I.e. once a fix is available, we should be able to request a patch release at the morning meeting and be using it online by the afternoon.

IMHO this year we seem to have lost the mindset of a data-taking experiment, where fixes are both frequent and urgent. Instead, it looks to me like we are still operating in MC-and-reprocessing mode, where "urgent fixes" arrive on a timescale of days or even weeks, and even for the more critical ones, well, Tier-0 can always be paused, and the offline reconstruction is always re-RECO'ed eventually.

I do appreciate that PRs are merged pretty frequently also in the 12.4.x branch, although I'm starting to think that this may actually be counterproductive: as soon as a non-patch change is merged, it becomes more complicated to build a patch release.
The upside is that such changes are tested in the IBs, though this doesn't always help prevent mistakes from going into production.

So, how do we get back the capability of building patch releases quickly and as needed?

fwyzard commented Oct 11, 2022

Case in point: if we build a CMSSW_12_4_10 release, the soonest it can be deployed online is likely Thursday morning (build overnight, Tier-0 replay and HLT tests on Wednesday, and, if all goes well, deployment on Thursday).

Had we been able to request a CMSSW_12_4_9_patch2 release this morning, we should have had a mechanism to get it ready by this afternoon :-/

@perrotta

Andrea, as of NOW a full release could be much faster: if you ask, we can even start building now, and you'll have it ready by this evening, or tomorrow at most. (Had you or HLT asked yesterday, instead of saying that it could be delayed until the end of the week, it would have been even faster.)

Making a patch release with the three PRs you are asking for would require some amount of manual intervention:

  • branch off from 12_4_9_patch1
  • add the commits from those three PRs alone, hoping that no rebase is needed
  • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

@perrotta

While you think about what is best for you, I will start building 12_4_10: it can always be stopped if you prefer something else.

fwyzard commented Oct 11, 2022

> (Had you or HLT asked yesterday, instead of saying that it could be delayed until the end of the week, it would have been even faster.)

This is a separate discussion that I am trying to have with TSG: why this was not asked.
Had I not been stuck in bed dealing with Covid, I would have.

> Making a patch release with the three PRs you are asking for would require some amount of manual intervention:
>
>   • branch off from 12_4_9_patch1
>   • add the commits from those three PRs alone, hoping that no rebase is needed
>   • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

I do understand that. And I don't think this is a viable procedure: we do need a way to make patch releases easier than that!

@perrotta
Copy link
Contributor

> While you think about what is best for you, I will start building 12_4_10: it can always be stopped if you prefer something else.

See #39694

davidlange6 commented Oct 11, 2022 via email

fwyzard commented Oct 11, 2022

I agree, and I guess the concern is the part where the PRs that are already merged in the 12.4.x branch need to be re-opened for the new target.

I think that, at least for simple PRs like the ones discussed here, it should be enough to do:

# backport #39580 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39580
git-cms-cherry-pick-pr 39580 CMSSW_12_4_X
git push my-cmssw -u backport_39580:backport_39580

# backport #39619 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39619
git-cms-cherry-pick-pr 39619 CMSSW_12_4_X
git push my-cmssw -u backport_39619:backport_39619

# backport #39681 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39681
git-cms-cherry-pick-pr 39681 CMSSW_12_4_X
git push my-cmssw -u backport_39681:backport_39681

then push the resulting branches and open the corresponding PRs.

Still, I acknowledge that it's not ideal, and more work than building a release from the current branch.

Maybe we could teach cmsbot to do the dirty work?

davidlange6 commented Oct 11, 2022 via email

fwyzard commented Oct 11, 2022

> If the original branches were based on 12_4_9_patch1 or older, all this is a no-op and one can just make a second PR with the same developer branch as originally (90%+ of the time it's true, I guess).

Agreed, and in this case all three are based on CMSSW_12_4_9 or CMSSW_12_4_9_patch1... but at least for two of them, the original branch is gone :-p

Still, one could use the official-cmssw:pull/39580/head etc. branches instead of recreating new ones.
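
For example (a sketch, assuming a remote named official-cmssw pointing to the upstream cms-sw/cmssw repository and a personal fork named my-cmssw, as in the commands above), GitHub's read-only pull/<N>/head refs can be fetched and reused directly:

# reuse the head of the original PR instead of recreating the branch
git fetch official-cmssw pull/39580/head:backport_39580
git push my-cmssw -u backport_39580:backport_39580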

perrotta commented Nov 2, 2022

@missirol can this be considered fixed, and the issue therefore be closed?

missirol commented Nov 2, 2022

I think so, but imho it would be better if the experts confirmed, e.g. @cms-sw/heterogeneous-l2 @cms-sw/ecal-dpg-l2.

thomreis commented Nov 2, 2022

+ecal-dpg-l2

We still see integrity errors from the HW in the ECAL from time to time, but the unpacker seems to handle them gracefully now.

thomreis commented Nov 2, 2022

+ecal-dpg

fwyzard commented Nov 2, 2022

+heterogeneous

missirol commented Nov 3, 2022

please close
