CUDA-related HLT crashes between run-359694 and run-359764 #39680
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign ecal-dpg,reconstruction,heterogeneous FYI: @cms-sw/hlt-l2 |
New categories assigned: heterogeneous,reconstruction,ecal-dpg @mandrenguyen,@fwyzard,@clacaputo,@simonepigazzini,@makortel,@jainshilpi,@thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks |
So, just to keep track of things: it is still happening at a much lower rate (14 crashes in run 360090, thanks @trocino for reporting). |
We also had 8 in 360075.
|
For both runs it is EB+08 that shows integrity errors. |
Are there plans for a patch release to deploy the fix from #39619 ? |
Given that we have a likely fix since last week, why don't we have a patch
release with it ?
|
@fwyzard @thomreis we could build that release even now (it will be a full release, CMSSW_12_4_10). |
We will discuss it this afternoon. To get ready for that discussion: why don't you like a full release, which would be far simpler (for us, of course)? Then a patch release on top of it can be quickly prepared with the remaining HLT fixes later this week |
Because a patch release is something that we should be able to build and deploy in a matter of hours, not days. I.e. once a fix is available, we should be able to request a patch release at the morning meeting and be using it online by the afternoon. IMHO this year we seem to have lost the mindset of a data-taking experiment, where fixes are both frequent and urgent. Instead, it looks to me like we are still operating with a focus on MC and reprocessing, where "urgent fixes" arrive on the timescale of days or even weeks, and even for the more critical ones, well, Tier-0 can always be paused, and the offline reconstruction is always re-RECO'ed eventually. I do appreciate that PRs are merged pretty frequently also in the 12.4.x branch, although I'm starting to think that may actually be counterproductive: as soon as a non-patch change is made, it becomes more complicated to build a patch release. So, how do we get back the capability of building patch releases quickly and as needed? |
Case in point, if we build a If we could have requested a |
Andrea, as of NOW a full release could be much faster: if you ask, we can even start building now, and you'll have it ready by this evening, tomorrow at most. (Had you or HLT asked yesterday, instead of saying that it could be delayed till the end of the week, it would have been even faster.) Making a patch release with the three PRs you are asking for would require some amount of manual intervention:
• branch off from 12_4_9_patch1
• add the commits from those three PRs alone, hoping that no rebase is needed
• further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct
|
While you think about what is best for you, I'll start building 12_4_10: it can always be stopped, if you prefer something else. |
This is a separate discussion, which I am trying to have with TSG: why this was not asked.
I do understand that. And I don't think this is a viable procedure: we do need a way to make patch releases easier than that ! |
See #39694 |
What is actually complex here? You can open a milestone and accept PRs just as with any other release cycle (at least that's what was done in Run 2). The "hope" can then be tested via the release build. (If you are doing something that makes this procedure error-prone, then likely it's best not to make it a patch...)
|
I agree, and I guess the concern is the part where the PRs that are already merged in the 12.4.x branch need to be re-opened for the new target. I think that, at least for simple PRs like the ones discussed here, it should be enough to do
# backport #39580 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39580
git-cms-cherry-pick-pr 39580 CMSSW_12_4_X
git push my-cmssw -u backport_39580:backport_39580
# backport #39619 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39619
git-cms-cherry-pick-pr 39619 CMSSW_12_4_X
git push my-cmssw -u backport_39619:backport_39619
# backport #39681 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39681
git-cms-cherry-pick-pr 39681 CMSSW_12_4_X
git push my-cmssw -u backport_39681:backport_39681
then push the resulting branches and open the corresponding PRs. Still, I acknowledge that it's not ideal, and more work than building a release from the current branch. Maybe we could teach cmsbot to do the dirty work?
|
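To illustrate what such automation could look like, here is a minimal shell sketch (an assumption for illustration, not something cms-bot actually does) that simply loops the recipe above over the three PR numbers, reusing the same git-cms-cherry-pick-pr tool and my-cmssw remote:
# Sketch: loop the backport recipe over the three PRs discussed above.
# Assumes a CMSSW_12_4_X work area, git-cms-cherry-pick-pr on the PATH,
# and a personal fork configured as the "my-cmssw" remote.
set -e
for pr in 39580 39619 39681; do
  git checkout CMSSW_12_4_9_patch1 -b "backport_${pr}"
  git-cms-cherry-pick-pr "${pr}" CMSSW_12_4_X
  git push my-cmssw -u "backport_${pr}:backport_${pr}"
done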
If the original branches were based on 12_4_9_patch1 or older, all this is a no-op and one can just make a second PR with the same developer branch as originally (90%+ of the time it's true, I guess).
This of course should be automated, if that automation is thought to be less work to do and maintain than the manual steps above. It never seemed that way when I dealt with it. (But YMMV.)
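To make the "based on 12_4_9_patch1 or older" check concrete, a minimal git sketch follows; the official-cmssw remote name and the developer branch my-fix-branch are assumptions for illustration, not taken from this thread:
# Sketch: test whether a developer branch branched off at (or before) CMSSW_12_4_9_patch1.
# If so, the very same branch can back a second PR targeting the patch release.
base=$(git merge-base official-cmssw/CMSSW_12_4_X my-fix-branch)
if git merge-base --is-ancestor "$base" CMSSW_12_4_9_patch1; then
  echo "branch point already contained in CMSSW_12_4_9_patch1: reuse the same branch"
else
  echo "branch point is newer: cherry-pick onto CMSSW_12_4_9_patch1 instead"
fi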
|
Agreed, and in this case all three are based on Still, one could use the |
@missirol can this be considered fixed, and can the issue therefore be closed? |
I think so, but imho it would be better if the experts confirm, e.g. @cms-sw/heterogeneous-l2 @cms-sw/ecal-dpg-l2 . |
+ecal-dpg-l2 We still see integrity errors from the HW in the ECAL from time to time but the unpacker seems to handle them gracefully now. |
+ecal-dpg |
+heterogeneous |
please close |
Note: This issue is hopefully just for documentation purposes, as a possible solution (#39619) has already been put in place.
Between Oct-1 and Oct-4 (2022), several runs [1] were affected by bursts of HLT crashes (a reproducer can be found in [3]).
Stack traces [4] and offline checks [2] pointed to issues with reconstruction running on GPU.
@fwyzard identified an issue in the ECAL-GPU unpacker, and fixed it in #39617 (12_6_X), #39618 (12_5_X), #39619 (12_4_X). With the latter update, crashes were not observed anymore when re-running the HLT offline on data from some of the affected runs.
In parallel to this, it was realised by @Sam-Harper and ECAL experts that the crashes coincided with runs where ECAL suffered from data-integrity issues (see, for example, this DQM plot). On Oct-4, ECAL masked its channels (TT11 EB-6) affected by data-integrity errors, and since then no more online crashes of this kind have been observed thus far (despite HLT still running in CMSSW_12_4_9).
There is a separate open issue (#39568) likely related to the ECAL-GPU unpacker, but it was checked that #39619 does not solve it, so the issue in #39568 is likely different from the issue discussed here.
[1] Affected runs:
[2] Offline checks:
The crashes could be reproduced offline using error-stream files from some of the affected runs, but they were not entirely reproducible. They were more likely to occur when using more than 1 EDM stream, and would only occur in offline tests if both the ECAL unpacker and the pixel reconstruction were offloaded to GPUs. A summary of the checks done offline was compiled by @Sam-Harper in this document.
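Purely as an illustrative sketch (the file name hlt.py is an assumption, standing in for whatever configuration the reproducer in [3] dumps), the number of threads and EDM streams used when re-running the HLT offline can be raised through the standard process.options parameters:
# Sketch: re-run the dumped HLT configuration with several threads/streams,
# since the crashes were more likely with more than 1 EDM stream.
# "hlt.py" is a hypothetical name for the dumped configuration file.
cat >> hlt.py <<'EOF'
process.options.numberOfThreads = 8
process.options.numberOfStreams = 8
EOF
cmsRun hlt.py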
[3] Reproducer (requires access to one of the online GPU machines, might need to run it multiple times to see the crash at runtime):
[4] An example of a stack trace (from the reproducer) is in the attachment hlt.log.