
patch release of 13_0_X with #41467 for HLT #41475

Closed
missirol opened this issue May 1, 2023 · 33 comments

@missirol
Contributor

missirol commented May 1, 2023

This issue is meant to clarify the need for, and the timeline of, a 13_0_X patch release including #41467.

Right now, data-taking is disrupted by issues with the L1T menu (CMSLITOPS-411), which are causing hundreds of HLT crashes during collision runs.

  • So far, two types of crashes have been identified (see this comment): those from L1TRawToDigi (L1T unpacker) and HLTRecHitInAllL1RegionsProducer<T> (HLT module making use of L1T objects). Each type accounts for roughly 50% of the crashes.

  • The root cause of the problem is under investigation. It may, or may not, be corrupted data sent from L1T to HLT (as one of the error messages from L1TRawToDigi suggests [1]).

  • While there is no solution in place for the crashes in L1TRawToDigi, #41467 ("guard HLTRecHitInAllL1RegionsProducer<T> against empty collection of L1T candidates [13_0_X]") can fix the hundreds of crashes due to HLTRecHitInAllL1RegionsProducer<T>.

The conclusion is that HLT needs a patch release with #41467. There are at least two options.

  1. Build a patch release cherry-picking #41467 ("guard HLTRecHitInAllL1RegionsProducer<T> against empty collection of L1T candidates [13_0_X]") on top of CMSSW_13_0_3 (the latter is what HLT currently uses online).
  2. Build a patch release with the latest version of CMSSW_13_0_X (it would be 13_0_5_patch1).

Doing only 2) implies that HLT could switch to 13_0_5_patch1 only after T0 switches to 13_0_5, and I do not know the timeline of that. @malbouis (ORM) could clarify that.

@silviodonato @goys @fwyzard @trtomei @mzarucki @cms-sw/hlt-l2, do you think we should consider 1)?

Is 1) acceptable to @cms-sw/orp-l2?

[1]

fu-c2b01-30-01 	3665707 	FATAL 	[2] Calling method for module L1TRawToDigi/'hltGtStage2Digis'
Exception Message:
Corrupt RAW data from FED 1404, AMC 1. BX number -6 in BX header is outside of the BX range [0,0] defined in the block header.
@cmsbuild
Contributor

cmsbuild commented May 1, 2023

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@perrotta
Contributor

perrotta commented May 1, 2023

@missirol T0 also has to switch sooner rather than later to a release version with several fixes included.

CMSSW_13_0_5 includes the fix for the segmentation violation on Prompt Reco for lowPtGsfEleGsfTracks (cms-sw/cmssw#41442).

It does not include the fixes for the crash in ak4PFJets (cms-sw/cmssw#41397), the issue of the high memory consumption (cms-sw/cmssw#41457), and the HLT related patches mentioned in this issue.

For the crash in ak4PFJets the following patch is available (not a full bug fix, but it allows running without crashing): #41474

For the high memory consumption there is no solution identified so far.

As soon as #41474 gets merged (i.e. as soon as the just-merged master PR is successfully tested in the IBs), we can build a CMSSW_13_0_5_patch1. If all this is deemed enough, even though there is no solution yet for the high memory consumption in a few streams, we can make a T0 replay and deploy that new release. I would suggest doing so: that new release will certainly have fewer issues than the current 13_0_3.

@malbouis
Contributor

malbouis commented May 1, 2023

@missirol, @perrotta: the replay to test CMSSW_13_0_5 is set up and will be launched soon. If it is successful, Tier0 could switch to 13_0_5.

@germanfgv , FYI.

tagging also @saumyaphor4252 as incoming ORM for this week.

@mzarucki
Contributor

mzarucki commented May 1, 2023

Hi all,

From the FOG side, our preference would be to switch as soon as possible, to reduce the large number of online crashes (which cause quarantine of DAQ nodes and affect data-taking). Therefore, option 2) would work if it is possible on a short timescale (it also avoids switching releases twice in a relatively short time period); otherwise, we prefer option 1).

Best regards,
Mateusz on behalf of TSG FOG

@missirol
Contributor Author

missirol commented May 1, 2023

Okay, unless others feel strongly, let's see if it's possible to have 13_0_5_patch* built later today, and 13_0_5 deployed at T0 soon. At that point, HLT could install 13_0_5_patch* and switch to that.

The larger issue is the approach we are following, and this goes back to #39680 (comment).

There should be a procedure by which a simple HLT bugfix which is orthogonal to T0 can be deployed at HLT in, say, less than 12h, without HLT depending on (or putting pressure on) T0 when it does not need to (the fixes between 13_0_3 and 13_0_5 are certainly useful, but as far as we know HLT does not need them at the moment).

Of course, HLT does not mean to abuse this by asking for separate patch releases at every occasion. In this particular case, we are talking about hundreds of HLT crashes, which could be fixed online by tomorrow with a 13_0_3_patch1, but likely won't be, as we wait to build and deploy 13_0_5_patch1.

@malbouis
Contributor

malbouis commented May 1, 2023

@missirol I think this is a valid discussion to re-open. Indeed, for a change of release at Tier0 we do need to launch a replay, evaluate the results, and only if all is successful deploy it in production. This takes at least a full day, and it is not healthy to rush things if not really needed.

In my personal opinion, in case HLT needs a faster solution, it might be better not to wait for a deployment at Tier0 first, since HLT can run patch releases newer than Tier0's, as outlined in this policy from last year.

That being said, if HLT can afford to wait for a day or so before deploying CMSSW_13_0_5_patch1 (provided all goes well in the Tier0 replay), this might be the preferred way to go.

@missirol
Contributor Author

missirol commented May 1, 2023

Thanks for clarifying, @malbouis.

What makes more sense to me is to fix what we know we can fix, as soon as we can, minimising dependencies between HLT/T0/DQM. This means having a 13_0_3_patch1 built today (with #41467 and #41470 on top of 13_0_3; yes, also #41470 if it is not a problem) [*]. HLT can try to deploy it as soon as possible (T0 and DQM do not have to do anything).

I do not know the current LHC schedule: I think there are collisions today and tomorrow, I don't know after tomorrow.

The alternative is that (1) if there are no collisions on Wednesday, and (2) if we accept a few extra hundred HLT crashes tomorrow, and (3) if T0 deploys 13_0_5 by Wednesday, then HLT could just deploy 13_0_5_patch1 on Wednesday without the need for 13_0_3_patch1.

[*] Technically, I don't know if this requires action on my side. I can prepare the necessary branch on top of 13_0_3 if that helps.

@perrotta
Contributor

perrotta commented May 1, 2023

Dear all, the patch release with all the fixes we collected so far is building: #41478
Being a patch release, it won't take long to have it ready; I imagine it can be done well before this evening.

The fixes in it are (also) needed at T0, and that's why we are working on it even during holidays.

That said, if HLT believes that their patches are needed even earlier than that, a patch based on 13_0_3 + #41467 and #41470 can also get built quickly. I would normally ask @smuzaffar to prepare the queue for it (branching off CMSSW_13_0_3, behind the HEAD): but if you @missirol know how to do it, please go ahead and prepare the branch accordingly.

@missirol
Contributor Author

missirol commented May 1, 2023

@perrotta, thanks for considering making a patch release for HLT. I'm sorry it happened on a holiday.

I think all I can do is create a branch with "#41467 + #41470" cherry-picked on top of CMSSW_13_0_3: this is in
https://github.com/missirol/cmssw/commits/devel_hltFixesOnTopOf_13_0_3
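
(For reference, a branch like this can be prepared with plain git, roughly as follows; the remote names are placeholders, and the SHAs to cherry-pick would be the merge commits of the two PRs.)

# sketch: stack the two fixes on top of the CMSSW_13_0_3 tag
git fetch official-cmssw --tags
git checkout -b devel_hltFixesOnTopOf_13_0_3 CMSSW_13_0_3
git cherry-pick -m 1 <merge-commit-of-#41467> <merge-commit-of-#41470>
git push my-cmssw devel_hltFixesOnTopOf_13_0_3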

Should I make a PR to CMSSW_13_0_3_patchHLT?

@missirol
Contributor Author

missirol commented May 1, 2023

I opened #41479. (I can close it if somehow this is not the correct way to do it)

@smuzaffar
Contributor

@perrotta, I would suggest creating a branch CMSSW_13_0_3_patch based on CMSSW_13_0_3 and then opening/merging PRs #41467 + #41470 for both the CMSSW_13_0_X and CMSSW_13_0_3_patch branches. Once these PRs are merged in CMSSW_13_0_3_patch, you can start building the patch release.

If this sounds reasonable, then I can create the branch and open/backport these PRs for it.

@perrotta
Contributor

perrotta commented May 1, 2023

If this sounds reasonable, then I can create the branch and open/backport these PRs for it.

Yes, thank you @smuzaffar: that would help a lot!
Indeed, creating such a branch is what I couldn't figure out how to do... Are there instructions somewhere for doing so?
By the way, #41467 + #41470 are already merged in 13_0_X.

@perrotta
Contributor

perrotta commented May 1, 2023

FYI: release CMSSW_13_0_5_patch1 is ready: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_13_0_5_patch1

@smuzaffar
Contributor

@perrotta, haven't you already created CMSSW_13_0_3_patchHLT based on CMSSW_13_0_3? I think that would work for building the patch release.

@fwyzard
Contributor

fwyzard commented May 1, 2023

Hi, and sorry if I keep coming back to this, but things really, really look skewed in favour of the offline operations - while IMNSHO they should be skewed towards the online operations, if anything.

Tier-0 needed some fixes and a patch release, so, as @perrotta pointed out:

The fixes in it are (also) needed at T0, and that's why we are working on it even during holidays.

And, soon after:

FYI: release CMSSW_13_0_5_patch1 is ready: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_13_0_5_patch1

HLT also needs a different patch release, potentially on a shorter time scale due to the (small) data loss and (not-so-small) debugging effort incurred by the HLT crashes.
I really do acknowledge the efforts of the various people open to building the extra patch and looking into this; still, so far the result is a discussion of how to technically do it.

Can we take this opportunity to figure out a more sustainable way to build this kind of patch release, one that does not require the involvement of @smuzaffar every time?

It would be nice if the release managers had a way of doing it semi-automatically (at least a way to create the necessary branch, and even better if there were a way to cherry-pick the fixes from the main release branch).

It would also be great if the HLT L2s were allowed to do it, especially once offline moves to a newer release cycle, and 13.0.x is only used online... think of it as a way to bring in potential future release managers :-)

@malbouis
Contributor

malbouis commented May 1, 2023

Dear all, as a heads up, the replay for testing release 13_0_5 is showing a few paused jobs that are crashing due to high memory consumption, related to issue #41457. It looks like it will not be possible to deploy a new release at Tier0 before this issue is fixed.

There will be a daily JointOps Meeting tomorrow at 10h30 CERN time where we can discuss how to proceed.

@missirol
Contributor Author

missirol commented May 2, 2023

CMSSW_13_0_3_patch1 was built, and HLT will soon deploy it online. Thanks again to @cms-sw/orp-l2 and @smuzaffar for their availability to build this patch release on short notice.

The HLT fix discussed here is now available in CMSSW_13_0_3_patch1 and CMSSW_13_0_5_patch1.

Technically, this issue is solved, but #39680 (comment) and #41475 (comment) have to be addressed. Please let me know if this should be moved to a dedicated issue (and we close this one).

@silviodonato
Contributor

Thank you again @cms-sw/orp-l2 @smuzaffar for the prompt action in building CMSSW_13_0_3_patch1.
What is the procedure in case we urgently need a patch release again in the future?
Would it be possible to simplify the build of these special patch releases, starting from an existing release (e.g. CMSSW_13_0_3 + #1234) instead of the active branch (e.g. CMSSW_13_0_X)?

@smuzaffar
Contributor

smuzaffar commented May 11, 2023

The release/patch-release build procedure is simple. For releases which do not follow the official release branch (e.g. CMSSW_13_0_X, CMSSW_13_1_X, etc.), one needs to:

  • Create a special branch based on the previous release tag, e.g. in this case CMSSW_13_0_3_patchHLT was created based on the CMSSW_13_0_3 tag
  • Open PR(s) with the changes on top of this special branch
  • Run the PR tests explicitly requesting the bot to use the release as base instead of an IB, e.g. please test for CMSSW_13_0_3
  • Once the PR(s) are tested and merged, the release managers can open the release-build github issue. They have to explicitly instruct the bot which cmsdist/cmssw branch/tag to use for building the release, e.g. see Build CMSSW_13_0_3_patch1 #41482 (comment). For these releases the following parameters are required:
RELEASE_QUEUE: <release-queue>  #e.g. CMSSW_13_0_X
CMSSW_COMMIT: <cmssw-branch> #e.g. CMSSW_13_0_3_patchHLT
CMSDIST_COMMIT: REL/<old-full-release-name>/<architecture> # e.g. REL/CMSSW_13_0_3/el8_amd64_gcc11   
ARCHITECTURE: <architecture> #e.g. el8_amd64_gcc11   

These parameters are needed as the bot does not know which cmsdist/pkgtools tags/branches to use. Note that one can only build this type of release for a single architecture; for multiple architectures (if needed), the release managers need to open separate github issues (with the same issue title).
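
Concretely, for the patch release requested in this thread, the issue body would have carried roughly the following values (assembled from the examples above):

RELEASE_QUEUE: CMSSW_13_0_X
CMSSW_COMMIT: CMSSW_13_0_3_patchHLT
CMSDIST_COMMIT: REL/CMSSW_13_0_3/el8_amd64_gcc11
ARCHITECTURE: el8_amd64_gcc11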

Although the bot can build these releases, we should avoid building them: it is really hard to maintain these special branches, and it is not trivial to set up IBs for them (displaying 13_0_X IBs and 13_0_3+changes IBs on one IB dashboard page); that would need some non-trivial changes on the cms-bot and IB dashboard side.

@fwyzard
Contributor

fwyzard commented May 11, 2023

@smuzaffar I understand that it is extra work to handle these special requests, but IMHO the goal should be to make handling them as simple as a standard release, not to limit their use.

Some suggestions:

  • automate the creation of a CMSSW_X_Y_Z_patchN branch from any CMSSW_X_Y_Z release
  • make the bot automatically derive the settings (queue, cmsdist, commits, etc.) from the CMSSW_X_Y_Z release when building these CMSSW_X_Y_Z_patchN patch releases

I'm sure we can eventually automate more of this process, but this alone would already be a good start in simplifying the workflow.
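
For the second point, I imagine something along these lines (a purely illustrative sketch, not existing cms-bot code; the function and parameter names are made up, and the patch branch itself would presumably still need to be supplied):

import re

def derive_build_settings(patch_release, architecture, cmssw_branch):
    # e.g. ("CMSSW_13_0_3_patch1", "el8_amd64_gcc11", "CMSSW_13_0_3_patchHLT")
    m = re.match(r"^CMSSW_(\d+)_(\d+)_\d+(_patch\d+)?$", patch_release)
    if m is None:
        raise ValueError("unexpected release name: " + patch_release)
    base = re.sub(r"_patch\d+$", "", patch_release)  # CMSSW_13_0_3_patch1 -> CMSSW_13_0_3
    return {
        "RELEASE_QUEUE": "CMSSW_%s_%s_X" % (m.group(1), m.group(2)),  # CMSSW_13_0_X
        "CMSSW_COMMIT": cmssw_branch,                                 # the special patch branch
        "CMSDIST_COMMIT": "REL/%s/%s" % (base, architecture),         # REL/CMSSW_13_0_3/el8_amd64_gcc11
        "ARCHITECTURE": architecture,
    }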

@smuzaffar
Contributor

automate the creation of a CMSSW_X_Y_Z_patchN branch from any CMSSW_X_Y_Z release

@fwyzard, the branch needs to be created before starting the release process; it is needed for the PR tests too. Release managers can easily create the branch via the github web interface, by selecting the correct tag and typing in the new branch name.

make the bot automatically derive the settings (queue, cmsdist, commits, etc.) from the CMSSW_X_Y_Z release when building these CMSSW_X_Y_Z_patchN patch releases

I agree, most of these the bot should be able to derive from the release name. It still needs at least CMSSW_COMMIT: CMSSW_X_Y_Z_patch_Branch, though.

@davidlange6
Contributor

davidlange6 commented May 11, 2023 via email

@davidlange6
Contributor

davidlange6 commented May 11, 2023 via email

@smuzaffar
Contributor

smuzaffar commented May 11, 2023

Yes @davidlange6, it is not trivial; e.g. one can request to build a patch release using another patch release as base, like:

  • CMSSW_13_0_3_patch branch based on CMSSW_13_0_3
  • build CMSSW_13_0_3_patch1 using CMSSW_13_0_3_patch + PR1
  • build CMSSW_13_0_3_patch2 using CMSSW_13_0_3_patch + PR1 + PR2
  • we need to build CMSSW_13_0_3_patch1a using CMSSW_13_0_3_patch1 + PR3

That is why I think it is better for the release managers to explicitly tell the bot where to start from. I think CMSSW_COMMIT: CMSSW_X_Y_Z_patch_Branch should be enough for the bot to get the rest of the parameters.

Note that this is only true for patch releases, as the bot can then use the cmsdist/pkgtools of the full release. If we need to build a full release based on another release + some updated external (e.g. some data-file update), then we have to set up a proper IB, and the release needs to be named differently, e.g. CMSSW_13_0_HLT_X IBs to build a CMSSW_13_0_3_HLTn release.

@smuzaffar
Contributor

smuzaffar commented May 11, 2023

build a patch release using another patch release

Maybe for such releases it is better to just revert the change (PR2) and add the new fix (PR3) using the same CMSSW_13_0_3_patch branch. If that is acceptable, then yes, I can teach the bot to look for the default CMSSW_13_0_3_patch branch and, if not found, fall back to the CMSSW_13_0_X branch.
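
In pseudo-code, the lookup I have in mind would be something like this (an illustrative sketch only, not existing cms-bot code):

def pick_base_branch(base_release, existing_branches):
    # prefer the special patch branch if it exists (e.g. CMSSW_13_0_3_patch),
    # otherwise fall back to the release queue branch (e.g. CMSSW_13_0_X)
    patch_branch = base_release + "_patch"
    if patch_branch in existing_branches:
        return patch_branch
    prefix, major, minor = base_release.split("_")[:3]
    return "%s_%s_%s_X" % (prefix, major, minor)

# pick_base_branch("CMSSW_13_0_3", {"CMSSW_13_0_X"}) -> "CMSSW_13_0_X"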

@davidlange6
Contributor

davidlange6 commented May 11, 2023 via email

@fwyzard
Contributor

fwyzard commented May 11, 2023

I agree.

I would also suggest using CMSSW_n_n_n_patchX as the default name for a new patch branch, e.g. CMSSW_13_0_1_patchX made from CMSSW_13_0_1.

We might also need different branches (e.g. CMSSW_13_0_1_hltpatch1 vs CMSSW_13_0_1_patch1), but I think these would be exceedingly rare cases that can be handled on a case-by-case basis.

@makortel
Contributor

makortel commented Aug 9, 2023

Can this issue be closed now?

@perrotta
Contributor

perrotta commented Aug 9, 2023

Can this issue be closed now?

Yes

@perrotta perrotta closed this as completed Aug 9, 2023
@missirol
Contributor Author

missirol commented Aug 9, 2023

@cms-sw/orp-l2

For the sake of (my own) clarity, what is then the recommended procedure for a subsystem to request a patch release? Opening a CMSSW issue specifying the base release and the PRs to be added, or something else? Is the subsystem also in charge of making the corresponding PR if needed (e.g. like in #41479)? Is the recommended procedure written down somewhere?

Do we also agree that by default the name should be CMSSW_X_Y_Z_patchN? In #42268 there was some confusion about the name of the release (see also #42268 (comment)).

@perrotta
Contributor

perrotta commented Aug 9, 2023

@cms-sw/orp-l2

For the sake of (my own) clarity, what is then the recommended procedure for a subsystem to request a patch release? Opening a CMSSW issue specifying the base release and the PRs to be added, or something else?

Yes

Is the subsystem also in charge of making the corresponding PR if needed (e.g. like in #41479)? Is the recommended procedure written down somewhere?

Normally the release manager should be able to deal with it, but of course any help is welcome :-)

Do we also agree that by default the name should be CMSSW_X_Y_Z_patchN? In #42268 there was some confusion about the name of the release (see also #42268 (comment)).

This is what we agreed...

Of course, there can be special cases that must be dealt with specifically, for example:

  • What to do if a CMSSW_X_Y_Z_patchN already exists and you want to base your branch on a CMSSW_X_Y_Z_patchM with M < N?
  • What to do if what is requested in patchN is not in master? Probably the "patch" in the name can still be used, but it must be made clear that what is there will not be in the following release: a specific name can help.

@missirol
Contributor Author

Hi @perrotta, thanks for clarifying. Is the agreed procedure (going to be) written down somewhere (an ORP twiki, or elsewhere)? If not, I'm afraid these questions will come up over and over again. Thanks!

@mmusich
Contributor

mmusich commented Oct 9, 2023

thanks for clarifying. Is the agreed procedure (going to be) written down somewhere (an ORP twiki, or elsewhere)? If not, I'm afraid these questions will come up over and over again. Thanks!

@cms-sw/orp-l2 just to close the loop on this item from the TSG side: was this ever addressed on the ORP side, or shall we document it internally?
