
CUDA-related HLT crashes between run-359694 and run-359764 #39680

Closed
missirol opened this issue Oct 8, 2022 · 28 comments

missirol commented Oct 8, 2022

Note: this issue is hopefully just for documentation purposes, as a possible solution (#39619) has already been put in place.

Between Oct-1 and Oct-4 (2022), several runs [1] were affected by bursts of HLT crashes (a reproducer can be found in [3]).

Stack traces [4] and offline checks [2] pointed to issues with reconstruction running on GPU.

@fwyzard identified an issue in the ECAL-GPU unpacker, and fixed it in #39617 (12_6_X), #39618 (12_5_X), #39619 (12_4_X). With the latter update, crashes were not observed anymore when re-running the HLT offline on data from some of the affected runs.

In parallel, @Sam-Harper and the ECAL experts realised that the crashes coincided with runs in which ECAL suffered from data-integrity issues (see, for example, this DQM plot). On Oct-4, ECAL masked the channels affected by data-integrity errors (TT11 in EB-06), and no further online crashes of this kind have been observed since (despite HLT still running CMSSW_12_4_9).

There is a separate open issue (#39568) likely related to the ECAL-GPU unpacker, but it was checked that #39619 does not solve it, so the issue in #39568 is likely different from the issue discussed here.


[1] Affected runs:

359694
359699
359750
359751
359762
359763
359764

[2] Offline checks:

The crashes could be reproduced offline using error-stream files from some of the affected runs, but not reliably: they were more likely to occur when using more than one EDM stream, and they only occurred in offline tests if both the ECAL unpacking and the pixel reconstruction were offloaded to GPUs. A summary of the offline checks was compiled by @Sam-Harper in this document.
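
As an illustration of those two knobs, here is a minimal sketch (not an official recipe) of lines that could be appended to the hlt.py generated by the reproducer in [3] below, to increase the number of EDM streams and, separately, to force the ECAL unpacking back onto the CPU:

cat >> hlt.py <<@EOF
# run with more threads/streams (crashes were more likely with more than 1 EDM stream);
# if numberOfStreams is left unset, it defaults to the number of threads
process.options.numberOfThreads = 8
process.options.numberOfStreams = 8

# move the ECAL unpacking back to the CPU (this made the crash disappear)
del process.hltEcalDigis.cuda
del process.hltEcalUncalibRecHit.cuda
@EOF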

[3] Reproducer (requires access to one of the online GPU machines; it might need to be run multiple times to see the crash at runtime):

#!/bin/bash

# release: CMSSW_12_4_9

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 359694 > hlt.py

cat >> hlt.py <<@EOF

# # disable ECAL unpacking on GPU (this makes the crash disappear)
# del process.hltEcalDigis.cuda
# del process.hltEcalUncalibRecHit.cuda

process.options.numberOfThreads = 4
process.source.fileListMode = True
process.source.fileNames = [
   '/store/error_stream/run359694/run359694_ls0112_index000079_fu-c2b04-35-01_pid2742152.raw',
   '/store/error_stream/run359694/run359694_ls0112_index000090_fu-c2b04-35-01_pid2742152.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000141_fu-c2b02-35-01_pid2465574.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000142_fu-c2b02-35-01_pid2465574.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000175_fu-c2b02-16-01_pid2674062.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000195_fu-c2b02-16-01_pid2674062.raw',
]
@EOF

cmsRun hlt.py &> hlt.log
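
Since the crash does not show up on every attempt, one might also loop over the last step and check the log for the CMSSW fatal-signal message (a sketch; the exact wording of the stack trace may differ):

for i in 1 2 3 4 5; do
  cmsRun hlt.py &> hlt_${i}.log
  grep -q 'A fatal system signal has occurred' hlt_${i}.log && { echo "crash on attempt ${i}"; break; }
done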

[4] Example of a stack trace (from reproducer) in the attachment hlt.log.

cmsbuild commented Oct 8, 2022

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

missirol commented Oct 8, 2022

assign ecal-dpg,reconstruction,heterogeneous

FYI: @cms-sw/hlt-l2

cmsbuild commented Oct 8, 2022

New categories assigned: heterogeneous,reconstruction,ecal-dpg

@mandrenguyen,@fwyzard,@clacaputo,@simonepigazzini,@makortel,@jainshilpi,@thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks

trocino commented Oct 10, 2022

We also had 8 crashes in run 360075.
The corresponding files are on Hilton:

/store/error_stream/run360075
/store/error_stream/run360090

@thomreis

For both runs it is EB+08 that shows integrity errors.

@thomreis

Are there plans for a patch release to deploy the fix from #39619?

fwyzard commented Oct 11, 2022 via email

perrotta commented Oct 11, 2022

> Given that we have a likely fix since last week, why don't we have a patch release with it?

@fwyzard @thomreis we could build that release even now (it would be a full release, CMSSW_12_4_10).
However, at yesterday's joint ops meeting it was said that HLT would have preferred to wait until the end of the week, when the other expected fixes could also be ready and merged.
We can discuss, and possibly agree on a change of plan with respect to what was concluded yesterday, later this afternoon at the ORP meeting.

fwyzard commented Oct 11, 2022

@perrotta can't we build a CMSSW_12_4_9_patch2 patch release based on CMSSW_12_4_9_patch1, adding only the changes necessary to fix the HLT crashes, namely #39580, #39619, and #39681?

@thomreis

A CMSSW_12_4_9_patch2 now would allow ECAL to unmask EB-06 and avoid a possible masking of EB+08, so I think that is the better option.

@perrotta

We will discuss it this afternoon. To get ready for that discussion: why would you not prefer a full release, which would be far simpler (for us, of course)? A patch release on top of it, with the remaining HLT fixes, could then be quickly prepared later this week.

fwyzard commented Oct 11, 2022

Because a patch release is something that we should be able to build and deploy in a matter of hours, not days. I.e. once a fix is available, we should be able to request a patch release at the morning meeting and be using it online by the afternoon.

IMHO this year we seem to have lost the mindset of a data-taking experiment, where fixes are both frequent and urgent. Instead, it looks to me like we are still operating in MC-and-reprocessing mode, where "urgent fixes" arrive on a timescale of days or even weeks, and even for the more critical ones, well, Tier-0 can always be paused, and the offline reconstruction is always re-RECO'ed eventually.

I do appreciate that PRs are merged pretty frequently also in the 12.4.x branch, although I'm starting to think that this may actually be counterproductive: as soon as a non-patch change is merged, it becomes more complicated to build a patch release.
The upside is that such changes are tested in the IBs, though this doesn't always help prevent mistakes from going into production.

So, how do we get back the capability of building patch releases quickly and as needed?

fwyzard commented Oct 11, 2022

Case in point: if we build a CMSSW_12_4_10 release, the soonest it can be deployed online is likely Thursday morning (build overnight, Tier-0 replay and HLT tests on Wednesday, and, if all goes well, deployment on Thursday).

Had we been able to request a CMSSW_12_4_9_patch2 release this morning, we should have had a mechanism to get it ready by this afternoon :-/

@perrotta

Andrea, as of NOW a full release could be much faster: if you ask, we can even start building now, and you'll have it ready by this evening, or tomorrow at most. (Had you or HLT asked yesterday, instead of saying that it could be delayed until the end of the week, it would have been even faster.)

Making a patch release with the three PRs you are asking for would require some amount of manual intervention:

  • branch off from 12_4_9_patch1
  • add the commits from those three PRs alone, hoping that no rebase is needed
  • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

@perrotta

While you think about what is best for you, I will start building 12_4_10: it can always be stopped if you prefer something else.

fwyzard commented Oct 11, 2022

> (Had you or HLT asked yesterday, instead of saying that it could be delayed until the end of the week, it would have been even faster.)

This is a separate discussion that I am trying to have with TSG: why this was not asked.
Had I not been stuck in bed dealing with Covid, I would have.

> Making a patch release with the three PRs you are asking for would require some amount of manual intervention:
>
>   • branch off from 12_4_9_patch1
>   • add the commits from those three PRs alone, hoping that no rebase is needed
>   • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

I do understand that. And I don't think this is a viable procedure: we do need a way to make patch releases easier than that!

@perrotta
Copy link
Contributor

> While you think about what is best for you, I will start building 12_4_10: it can always be stopped if you prefer something else.

See #39694

davidlange6 commented Oct 11, 2022 via email

fwyzard commented Oct 11, 2022

I agree, and I guess the concern is the part where the PRs that are already merged in the 12.4.x branch need to be re-opened for the new target.

I think that, at least for simple PRs like the ones discussed here, it should be enough to do:

# backport #39580 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39580
git-cms-cherry-pick-pr 39580 CMSSW_12_4_X
git push my-cmssw -u backport_39580:backport_39580

# backport #39619 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39619
git-cms-cherry-pick-pr 39619 CMSSW_12_4_X
git push my-cmssw -u backport_39619:backport_39619

# backport #39681 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39681
git-cms-cherry-pick-pr 39681 CMSSW_12_4_X
git push my-cmssw -u backport_39681:backport_39681

then push the resulting branches and open the corresponding PRs.

Still, I acknowledge that it's not ideal, and more work than building a release from the current branch.

Maybe we could teach cmsbot to do the dirty work?

davidlange6 commented Oct 11, 2022 via email

fwyzard commented Oct 11, 2022

> If the original branches were based on 12_4_9_patch1 or older, all this is a no-op and one can just make a second PR with the same developer branch as originally (90%+ of the time it's true, I guess).

Agreed, and in this case all three are based on CMSSW_12_4_9 or CMSSW_12_4_9_patch1... but at least for two of them, the original branch is gone :-p

Still, one could use the official-cmssw:pull/39580/head etc. branches instead of recreating new ones.
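
For example (a sketch, assuming a remote named official-cmssw pointing to the upstream cms-sw/cmssw repository and a personal fork named my-cmssw, as in the commands above), GitHub's read-only pull/<N>/head refs can be fetched and reused directly:

# reuse the head of the original PR instead of recreating the branch
git fetch official-cmssw pull/39580/head:backport_39580
git push my-cmssw -u backport_39580:backport_39580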

perrotta commented Nov 2, 2022

@missirol can this be considered fixed, and the issue therefore be closed?

missirol commented Nov 2, 2022

I think so, but imho it would be better if the experts confirmed, e.g. @cms-sw/heterogeneous-l2 @cms-sw/ecal-dpg-l2.

thomreis commented Nov 2, 2022

+ecal-dpg-l2

We still see integrity errors from the HW in the ECAL from time to time, but the unpacker seems to handle them gracefully now.

thomreis commented Nov 2, 2022

+ecal-dpg

fwyzard commented Nov 2, 2022

+heterogeneous

missirol commented Nov 3, 2022

please close
