[DQM] Prompt Matrix reconfiguration #4619
Changes for Prompt Matrix DQM reduction
Can one of the admins verify this patch? |
There are two options here:
I let @germanfgv and other Tier0 experts comment further. |
I agree that a 2018 pp collisions test would be better. We cannot start such a test yet, but we can do it later this week. In any case, it seems that RelVal shows some issues with cms-sw/cmssw#35605, so there is no point in running a replay. Also, please first propose this kind of change in the T0 HyperNews so everyone is aware. |
cms-sw/cmssw#35605 has been fixed; it was an unrelated problem for Run1 datasets. HyperNews announcement: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2278.html |
@jfernan2 I'm a bit confused. If we need cms-sw/cmssw#35605 to properly test this configuration, then that PR should already be merged into a release. We can only test CMSSW code that's available in /cvmfs/. If that's not the case, then what CMSSW release would you like us to use? |
OK @germanfgv sorry, I thought some PR could be added on top. So, @qliphy please, let's integrate cms-sw/cmssw#35605 to allow Tier0 for the test. |
Hi @jfernan2 this is not relevant for the pilot beams right? |
Right, this is not relevant. It is just for the purpose of saving computing resources. It can wait. |
@germanfgv PR #35605 was merged in CMSSW_12_1_0_pre5 |
Hi @jfernan2 I think we wanted to wait for |
@tvami @jfernan2 @germanfgv CMSSW_12_1_0 is now ready. |
@germanfgv I think the best way to test the cpu usage reduction with the new DQM Matrix is to use a 2018 pp run, as specified in: #4619 (comment)
|
@francescobrivio I made the necessary changes. I'll start the replay. |
looks good! thanks @germanfgv ! |
run replay please |
There are 16 repack workflows. |
There are 12 repack workflows. |
@francescobrivio it looks like there is an issue with the GTs. We are getting the following error, for both Express and Prompt:
Should we be using different GTs? |
There are 8 repack workflows. |
Hi @germanfgv
For 2018
|
GTs to fix issue with VeryForwardIdealGeometryRecord
@jfernan2 I'm not sure what you are referring to when you say that the memory issues have been clarified. Did I miss something? could you point me to the discussion? In any case, if there is a fix to the issue we can test to see if it works. Is it available in one of the pre-releases? |
I am sorry @germanfgv , I thought you were following the Tier0 thread that you initiated: |
@jfernan2 I think I missed those last few messages. From what I see, there are no changes that need to be tested, just the increase in the T0 memory limit (we did discuss this in our T0 internal meeting). In any case, I'd like to confirm that 2GB/core is enough to run the jobs. For that, I'll redeploy the replay. |
run replay please |
Thanks @germanfgv |
Replay testing PR '[DQM] Prompt Matrix reconfiguration' |
There are 17 repack workflows. |
After this later retry, we still have one MET Prompt Reco job exceeding the 2GB/core memory limit. @jfernan2 from what I understand, this may improve in the next 12_1_X release, is that right? |
@germanfgv From the RECO side I do not know; from the DQM side no memory improvements are expected: this PR already reduces the number of modules run, and hence the memory, w.r.t. 12_0_X |
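For reference, the 2GB/core check discussed above can be sketched in a few lines. This is only an illustration: the function names, the 4-core job, and the peak memory value are hypothetical and are not taken from the replay or from any Tier0 tool.

```python
def memory_limit_mb(cores, gb_per_core=2.0):
    """Total memory allowed for a job, in MB, given a per-core limit in GB."""
    return int(cores * gb_per_core * 1024)


def exceeds_limit(peak_rss_mb, cores, gb_per_core=2.0):
    """True if the job's peak memory (MB) is above the per-core based limit."""
    return peak_rss_mb > memory_limit_mb(cores, gb_per_core)


# Hypothetical 4-core job with a made-up peak memory of 8500 MB.
print(memory_limit_mb(4))      # limit implied by 2GB/core for 4 cores
print(exceeds_limit(8500, 4))  # whether the made-up peak is over that limit
```

The limit scales linearly with the number of cores, so the same job can pass or fail depending on how many cores it is assigned.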
is the log for the job available? |
@slava77 would a backport of this PR reduce the memory? |
I think so |
@slava77 You can find the jobs tarball here: I'll put it in AFS in a few minutes. |
@slava77 In case you prefer: |
Thanks.
|
Hi @germanfgv |
@jfernan2 great! I'll manually run the test to see if it's enough. |
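@jfernan2 The replay finished without problems. The 16GB of memory was enough to reconstruct the MET dataset. |
This is an automatic vacation reply from QQ Mail. Hello, I am currently on vacation and cannot reply to your email personally. I will reply as soon as possible after my vacation ends.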
|
Hi @germanfgv |
Hi @jfernan2, I'll make a clean PR adding these changes to the Replay config and the Production config. |
Following what was discussed here with a new configuration of PromptMatrix dmwm#4619 (comment)
Hi @jfernan2 my understanding is that this was added to the prod config, so I believe this PR can be closed, do you agree? |
Changes for Prompt Matrix DQM reduction:
Reduced:
Replay Request
Requestor
PPD-DQM
Describe the configuration
Release: CMSSW_12_1_X and PR [DQM] Prompt matrix redefinition cms-sw/cmssw#35605
Run: 317696
GTs:
expressGlobalTag: 120X_dataRun3_Express_Candidate_2021_09_30_18_52_55
promptrecoGlobalTag: 120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33
alcap0GlobalTag: 120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33
Additional changes:
This PR goes in sync with: [DQM] Prompt matrix redefinition cms-sw/cmssw#35605
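As a rough illustration, the requested settings can be gathered into one structure and sanity-checked before a replay is launched. The dictionary layout and the `check_replay_request` helper below are hypothetical; they mirror the labels used in this request and are not the Tier0 configuration API. The check that the Express and Prompt tags contain "Express" and "Prompt" is a heuristic based only on the tag names shown here.

```python
# Hypothetical summary of the replay request; keys mirror the labels above,
# not the real Tier0 replay configuration.
replay_request = {
    "release": "CMSSW_12_1_X",
    "run": 317696,
    "expressGlobalTag": "120X_dataRun3_Express_Candidate_2021_09_30_18_52_55",
    "promptrecoGlobalTag": "120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33",
    "alcap0GlobalTag": "120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33",
}


def check_replay_request(request):
    """Return a list of problems found in the request (empty if none)."""
    problems = []
    run = request.get("run")
    if not isinstance(run, int) or run <= 0:
        problems.append("run must be a positive integer")
    # Heuristic: each tag should be set and carry the expected keyword.
    expected = {
        "expressGlobalTag": "Express",
        "promptrecoGlobalTag": "Prompt",
        "alcap0GlobalTag": "Prompt",
    }
    for key, keyword in expected.items():
        tag = request.get(key, "")
        if not tag:
            problems.append(f"{key} is missing")
        elif keyword not in tag:
            problems.append(f"{key} does not look like a {keyword} tag")
    return problems


print(check_replay_request(replay_request))
```

A check like this only catches obvious slips, such as a swapped Express and Prompt tag; it cannot tell whether a GT actually contains the records a given release needs.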
Purpose of the test
Not sure if a replay is needed. PPD started a campaign to reduce the DQM load on Prompt matrix. This PR reduces the number of sequences as agreed on:
https://docs.google.com/presentation/d/1AF65xzq7T70Yt-0o_OV47YTFBv-kbvHZDlRmUSCZsAw/edit#slide=id.geedc047305_0_0
T0 Operations HyperNews thread
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2314.html
Thanks