patch release of 13_0_X with #41467 for HLT #41475
Comments
A new Issue was created by @missirol Marino Missiroli. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here
@missirol T0 also has to switch sooner rather than later to a release version with several fixes included. CMSSW_13_0_5 includes the fix for the segmentation violation in Prompt Reco for lowPtGsfEleGsfTracks (cms-sw/cmssw#41442). It does not include the fixes for the crash in ak4PFJets (cms-sw/cmssw#41397), the issue of the high memory consumption (cms-sw/cmssw#41457), or the HLT-related patches mentioned in this issue. For the crash in ak4PFJets the following patch is available (not a full bug fix, but it allows running without crashing): #41474. For the high memory consumption no solution has been identified so far. As soon as #41474 gets merged (i.e. as soon as the just-merged master PR is successfully tested in the IBs) we can build a CMSSW_13_0_5_patch1. If all this is deemed enough, even if there is no solution yet for the high memory consumption in a few streams, we can make a T0 replay and deploy that new release. I would suggest doing so: that new release will certainly have fewer issues than the current 13_0_3.
@missirol, @perrotta: the replay to test CMSSW_13_0_5 is set up and will be launched soon. If it is successful, Tier0 could switch to 13_0_5. @germanfgv, FYI. Tagging also @saumyaphor4252 as incoming ORM for this week.
Hi all, from the FOG side our preference would be to switch as soon as possible, to reduce the large number of online crashes (causing quarantine of DAQ nodes and affecting data-taking). Therefore, option 2) would work if possible on a short timescale (it also avoids switching releases twice in a relatively short time period); otherwise, we prefer option 1). Best regards,
Okay, unless others feel strongly, let's see if it's possible to have CMSSW_13_0_5_patch1 on a short timescale. The larger issue is the approach we are following, and this goes back to #39680 (comment). There should be a procedure by which a simple HLT bugfix which is orthogonal to T0 can be deployed at HLT in, say, less than 12h, without HLT depending on (or putting pressure on) T0 when it does not need to (the fixes between 13_0_3 and 13_0_5 are not relevant for HLT). Of course, HLT does not mean to abuse this by asking for separate patch releases at every occasion. In this particular case, we are talking about hundreds of HLT crashes, which could be fixed online by tomorrow with a 13_0_3-based patch release.
@missirol I think it is a valid discussion to be re-opened. Indeed, for a change of release at Tier0 we do need to launch a replay, evaluate the results and, only if all is successful, deploy it in production. This takes at least a full day, and it is not healthy to rush things if not really needed. In my personal opinion, in case HLT needs a faster solution it might be better not to wait for a deployment at Tier0 first, since HLT can run patch releases newer than Tier0, as outlined in this policy from last year. That being said, if HLT can afford to wait for a day or so before deploying CMSSW_13_0_5_patch1 (provided all goes well in the Tier0 replay), this might be the preferred way to go.
Thanks for clarifying, @malbouis. What makes more sense to me is to fix what we know we can fix, as soon as we can, minimising dependencies between HLT/T0/DQM. This means having a 13_0_3-based patch release for HLT as soon as possible. I do not know the current LHC schedule: I think there are collisions today and tomorrow, I don't know after tomorrow. The alternative is that (1) if there are no collisions on Wednesday, and (2) if we accept a few extra hundred HLT crashes tomorrow, and (3) if T0 deploys 13_0_5 quickly, then HLT can switch directly to 13_0_5_patch1 [*].

[*] Technically, I don't know if this requires action on my side. I can prepare the necessary branch on top of CMSSW_13_0_3, if needed.
Dear all, the patch release with all the fixes we collected so far is building: #41478. The fixes in it are (also) needed at T0, and that's why we are working on it even during holidays. That said, if HLT believes that their patches are needed even earlier than that, a patch based on 13_0_3 + #41467 and #41470 can also be built quickly. I normally ask @smuzaffar to prepare the queue for it (branching off CMSSW_13_0_3, behind the HEAD); but if you, @missirol, know how to do it, please go ahead and prepare the branch accordingly.
@perrotta, thanks for considering making a patch release for HLT. I'm sorry it happened on a holiday. I think all I can do is to create a branch with "#41467 + #41470" cherry-picked on top of CMSSW_13_0_3. Should I make a PR with that, and if so, against which branch?
I opened #41479. (I can close it if somehow this is not the correct way to do it.)
@perrotta, I would suggest creating a branch CMSSW_13_0_3_patch based on CMSSW_13_0_3, and then opening/merging PRs #41467 + #41470 for both the CMSSW_13_0_X and CMSSW_13_0_3_patch branches. Once these PRs are merged into CMSSW_13_0_3_patch, you can start building the patch release. If this sounds reasonable, I can create the branch and open/backport these PRs for it.
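For illustration, the branch-and-cherry-pick part of this procedure boils down to something like the following. This is only a minimal sketch: the branch name follows the proposal above, and the commit hashes are placeholders, not the commands actually used by the release managers.

```bash
# Sketch only: prepare a 13_0_3-based branch carrying the two HLT fixes.
git clone https://github.com/cms-sw/cmssw.git
cd cmssw

# Create the special branch starting from the existing release tag.
git checkout -b CMSSW_13_0_3_patch CMSSW_13_0_3

# Cherry-pick the fixes from the CMSSW_13_0_X branch; <sha-41467> and
# <sha-41470> stand for the merge commits of #41467 and #41470
# (-m 1 selects the mainline parent of each merge commit).
git cherry-pick -m 1 <sha-41467> <sha-41470>

# Push the branch so that backport PRs can be opened against it.
git push origin CMSSW_13_0_3_patch
```

In practice the fixes were brought in via backport PRs (as proposed above) rather than direct pushes; the sketch only illustrates the underlying git operations.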
Yes, thank you @smuzaffar: that would help a lot!
FYI: release CMSSW_13_0_5_patch1 is ready: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_13_0_5_patch1
@perrotta, haven't you already created CMSSW_13_0_3_patchHLT based on CMSSW_13_0_3? I think that would work for building the patch release.
Hi, and sorry if I keep coming back to this, but things really, really look skewed in favour of the offline operations - while IMNSHO they should be skewed towards the online operations, if any. Tier-0 needed some fixes and a patch release, so, as @perrotta pointed out, "the patch release with all the fixes we collected so far is building: #41478".
And, soon after: "FYI: release CMSSW_13_0_5_patch1 is ready".
HLT also needs a different patch release, potentially on a shorter time scale due to the (small) data loss and (not-so-small) debugging effort incurred by the HLT crashes. Can we take this opportunity to figure out a more sustainable way to build this kind of patch release, one that does not require the involvement of @smuzaffar every time? It would be nice if the release managers had a way of doing it semi-automatically (at least a way to create the necessary branch, and even better if there were a way to cherry-pick the fixes from the main release branch). It would also be great if the HLT L2s were allowed to do it, especially once offline moves to a newer release cycle and 13.0.x is only used online... think of it as a way to bring in potential future release managers :-)
Dear all, as a heads-up, the replay for testing release 13_0_5 is showing a few paused jobs that are crashing due to high memory consumption, related to issue #41457. It looks like it will not be possible to deploy a new release at Tier0 before this issue is fixed. There will be a daily JointOps meeting tomorrow at 10h30 CERN time where we can discuss how to proceed.
CMSSW_13_0_3_patch1 was built, and HLT will soon deploy it online. Thanks again to @cms-sw/orp-l2 and @smuzaffar for their availability to build this patch release on short notice. The HLT fix discussed here is now available in CMSSW_13_0_3_patch1 and CMSSW_13_0_5_patch1. Technically, this issue is solved, but #39680 (comment) and #41475 (comment) still have to be addressed. Please let me know if this should be moved to a dedicated issue (and we close this one).
Thank you again @cms-sw/orp-l2 @smuzaffar for the prompt action in building CMSSW_13_0_3_patch1.
The release/patch release build procedure is simple; for releases which do not follow the official release branch (e.g. CMSSW_13_0_X, CMSSW_13_1_X, etc.) one needs to create the special branch first and then pass the extra build parameters (e.g. the starting branch/commit) to the bot explicitly.
These parameters are needed because the bot does not know which base release and branch to start from. Although the bot can build these releases, we should avoid building them: it is really hard to maintain these special branches, and it is not trivial to set up IBs for them (displaying 13_0_X IBs and 13_0_3+changes IBs on one IB dashboard page). It needs some non-trivial changes on the cms-bot and IB dashboard side.
@smuzaffar I understand that it is extra work to handle these special requests - but IMHO the goal should be to make handling them as simple as a standard release, not to limit their use. Some suggestions:
I'm sure we can eventually automate more of this process, but already this would be a good start in simplifying this workflow.
@fwyzard, the branch needs to be created before starting the release process. This is needed for the PR tests too. Release managers can easily create the branch by going to the GitHub web interface, selecting the correct tag, and typing in the new branch name.
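For completeness, a command-line equivalent of that web-interface step could look like the sketch below. It assumes the GitHub CLI (gh) is installed and the user has push rights on the repository; the tag and branch names are just examples.

```bash
# Resolve the commit that the existing release tag points to ...
TAG_SHA=$(gh api repos/cms-sw/cmssw/commits/CMSSW_13_0_3 --jq .sha)

# ... then create the new branch at that commit via the GitHub API.
gh api repos/cms-sw/cmssw/git/refs \
  -f ref='refs/heads/CMSSW_13_0_3_patchX' \
  -f sha="$TAG_SHA"
```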
I agree, most of these the bot should be able to get from the release name. It still needs at least CMSSW_COMMIT: CMSSW_X_Y_Z_patch_Branch, though.
Following the usual convention (or any convention, for that matter) of CMSSW_X_Y_Z_patchX allows this to also be trivially deduced, no?
Ok, I realize it's not trivial: one has to check for the existence of such a branch and then fall back to the CMSSW_X_Y_X branch if it's not found. Perhaps still not complex, though?
Yes @davidlange6, it is not trivial: e.g. one can request to build a patch release using another patch release as base, like building CMSSW_13_0_3_patch2 on top of CMSSW_13_0_3_patch1 rather than CMSSW_13_0_3.
That is why I think it is better for release managers to explicitly tell the bot where to start from. Note that this is only true for patch releases, as the bot can then use the cmsdist/pkgtools of the full release. If we need to build a full release based on another release plus some updated externals (e.g. a data file update), then we have to set up proper IBs, and the release needs to be named differently, e.g. CMSSW_13_0_HLT_X IBs to build CMSSW_13_0_3_HLTn releases.
Maybe for such releases it is better to just revert the change (PR2) and add the new fix (PR3) using the same release branch.
I would suggest that the only workflows to support are
Build CMSSW_13_0_3_patch2 using CMSSW_13_0_3_patchX (in case the branch exists)
Build CMSSW_13_0_3_patch2 using CMSSW_13_0_X
All others are unneeded and just add complexity in my opinion.
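A purely illustrative sketch of how such a deduction could work, under the assumption that the special branches follow a CMSSW_X_Y_Z_patchX naming convention (this is not how cms-bot is actually implemented):

```bash
# Hypothetical helper: given a requested patch-release name, decide whether to
# start from a dedicated patch branch (if it exists) or from the cycle branch.
RELEASE="CMSSW_13_0_3_patch2"                        # requested release (example)
BASE="${RELEASE%_patch*}"                            # -> CMSSW_13_0_3
CYCLE="$(echo "$BASE" | sed -E 's/_[0-9]+$/_X/')"    # -> CMSSW_13_0_X
PATCH_BRANCH="${BASE}_patchX"                        # assumed branch naming

if git ls-remote --exit-code --heads \
     https://github.com/cms-sw/cmssw.git "$PATCH_BRANCH" >/dev/null 2>&1; then
  START_BRANCH="$PATCH_BRANCH"                       # dedicated branch exists
else
  START_BRANCH="$CYCLE"                              # fall back to the cycle branch
fi
echo "Would build $RELEASE starting from $START_BRANCH"
```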
I agree. I would also suggest using a common naming convention for these branches. We might also need different branches in some cases (e.g. one per base release).
Can this issue be closed now?
Yes
@cms-sw/orp-l2 For the sake of (my own) clarity, what is then the recommended procedure for a subsystem to request a patch release? Opening a CMSSW issue specifying the base release and the PRs to be added, or something else? Is the subsystem also in charge of making the corresponding PR if needed (e.g. like #41479)? Is the recommended procedure written down somewhere? Do we also agree that by default the name should be the usual CMSSW_X_Y_Z_patchN?
Yes
Normally the release manager should be able to deal with it, but of course any help is welcome :-)
This is what we agreed... Of course, there can be special cases that must be dealt with specifically, for example:
Hi @perrotta, thanks for clarifying. Is the agreed procedure (going to be) written down somewhere (ORP twiki, or elsewhere)? If not, I'm afraid these questions will come up over and over again. Thanks!
@cms-sw/orp-l2 Just to close the loop on this item from the TSG side: was this ever addressed on the ORP side, or shall we document it internally?
This issue is meant to clarify the need and timeline of a 13_0_X patch release including #41467.

Right now, data-taking is disrupted by issues with the L1T menu (CMSLITOPS-411), which are causing hundreds of HLT crashes during collisions runs.

So far, two types of crashes have been identified (see this comment): those from L1TRawToDigi (the L1T unpacker) and those from HLTRecHitInAllL1RegionsProducer<T> (an HLT module making use of L1T objects). Each type accounts for roughly 50% of the crashes.

The root cause of the problem is under investigation. It may, or may not, be corrupted data sent from L1T to HLT (as one of the error messages from L1TRawToDigi suggests [1]).

While there is no solution in place for the crashes in L1TRawToDigi, "guard HLTRecHitInAllL1RegionsProducer<T> against empty collection of L1T candidates [13_0_X]" #41467 can fix the hundreds of crashes due to HLTRecHitInAllL1RegionsProducer<T>.

The conclusion is that HLT needs a patch release with #41467. There are at least two options.

1. A patch release with "guard HLTRecHitInAllL1RegionsProducer<T> against empty collection of L1T candidates [13_0_X]" #41467 on top of CMSSW_13_0_3 (the latter is what HLT currently uses online).
2. A patch release from the head of CMSSW_13_0_X (would be 13_0_5_patch1).

Doing only 2) implies that HLT could switch to 13_0_5_patch1 only after T0 switches to 13_0_5, and I do not know the timeline of that. @malbouis (ORM) could clarify that.

@silviodonato @goys @fwyzard @trtomei @mzarucki @cms-sw/hlt-l2, do you think we should consider 1)?

Is 1) acceptable to @cms-sw/orp-l2?

[1]