[DQM] Prompt Matrix reconfiguration #4619

Closed
jfernan2 wants to merge 13 commits

@jfernan2 jfernan2 commented Oct 11, 2021

Changes for Prompt Matrix DQM reduction:

Reduced:

  • L1TMon to SingleMuon, SingleElectron/EG and ZB
  • TAU to SingleMuon, SingleElectron/EG and TAU
  • TRK to SingleMuon, SingleElectron/EG, DoubleMuon, JetHT, JetMET and ZB, by creating commonReduced which excludes TRK
  • CTPPS to DoubleMuon, SingleElectron/EG and ZB
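
The reduction listed above can be sketched as a lookup from sequence to allowed datasets. This is only an illustration: the sequence and dataset labels are copied from the list, while `RESTRICTED_SEQUENCES` and `dqm_sequences` are hypothetical names and not part of the Tier0 configuration API.

```python
# Hypothetical sketch: labels come from the list above; the structure and
# the helper function are illustrative, not the Tier0 configuration API.

# Sequences that now run only on the listed primary datasets.
RESTRICTED_SEQUENCES = {
    "L1TMon": {"SingleMuon", "SingleElectron/EG", "ZB"},
    "TAU": {"SingleMuon", "SingleElectron/EG", "TAU"},
    "TRK": {"SingleMuon", "SingleElectron/EG", "DoubleMuon", "JetHT", "JetMET", "ZB"},
    "CTPPS": {"DoubleMuon", "SingleElectron/EG", "ZB"},
}


def dqm_sequences(dataset, base=("commonReduced",)):
    """Return the base sequences plus every restricted sequence allowed on `dataset`."""
    extra = [seq for seq, allowed in RESTRICTED_SEQUENCES.items() if dataset in allowed]
    return list(base) + extra


for name in ("SingleMuon", "ZB", "MET"):
    print(name, dqm_sequences(name))
```

A dataset that appears in none of the lists (MET in this sketch) keeps only `commonReduced`, which is the point of splitting TRK out of the common sequence.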

Replay Request

Requestor
PPD-DQM

Describe the configuration

Purpose of the test

I am not sure whether a replay is needed. PPD has started a campaign to reduce the DQM load on the Prompt Matrix. This PR reduces the number of sequences, as agreed in:

https://docs.google.com/presentation/d/1AF65xzq7T70Yt-0o_OV47YTFBv-kbvHZDlRmUSCZsAw/edit#slide=id.geedc047305_0_0

T0 Operations HyperNews thread
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2314.html

Thanks

Changes for Prompt Matrix DQM reduction
@cmsdmwmbot

Can one of the admins verify this patch?

@francescobrivio
Contributor

There are two options here:

  1. Replay recent cosmics runs, as in "Replay testing CMSSW_12_1_0_pre4" (#4616):

    • Runs: 343082,344063
    • GTs:
      • expressGlobalTag: 120X_dataRun3_Express_v2
      • promptrecoGlobalTag: 120X_dataRun3_Prompt_v2
      • alcap0GlobalTag: 120X_dataRun3_Prompt_v2
  2. On the other hand, I believe most of these sequences are not really exercised with cosmics, so we could test with 2018 pp collisions (same as in "AlCaDB test of PCL workflows", #4602):

    • Runs: 317696
    • GTs:
      • expressGlobalTag: 120X_dataRun3_Express_Candidate_2021_09_30_18_52_55
      • promptrecoGlobalTag: 120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33
      • alcap0GlobalTag: 120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33

I let @germanfgv and other Tier0 experts comment further.
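
For reference, the two options can be written down side by side. This is a sketch only: the run numbers and Global Tags are the ones quoted above, while `ReplayOption` and `release_cycles` are hypothetical helpers and not the names used in ReplayOfflineConfiguration.py.

```python
from dataclasses import dataclass


# Hypothetical container; field names are illustrative only.
@dataclass(frozen=True)
class ReplayOption:
    runs: tuple
    express_gt: str
    promptreco_gt: str
    alcap0_gt: str


COSMICS = ReplayOption(
    runs=(343082, 344063),
    express_gt="120X_dataRun3_Express_v2",
    promptreco_gt="120X_dataRun3_Prompt_v2",
    alcap0_gt="120X_dataRun3_Prompt_v2",
)

PP_2018 = ReplayOption(
    runs=(317696,),
    express_gt="120X_dataRun3_Express_Candidate_2021_09_30_18_52_55",
    promptreco_gt="120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33",
    alcap0_gt="120X_dataRun3_Prompt_Candidate_2021_09_30_19_06_33",
)


def release_cycles(option):
    """Return the set of GT prefixes (the text before the first underscore)."""
    gts = (option.express_gt, option.promptreco_gt, option.alcap0_gt)
    return {gt.split("_", 1)[0] for gt in gts}


for label, option in (("cosmics", COSMICS), ("2018 pp", PP_2018)):
    print(label, option.runs, release_cycles(option))
```

The prefix check is a simple consistency test: all three Global Tags of an option should belong to the same cycle.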

@germanfgv
Contributor

I agree that a 2018 pp collisions test would be better. We cannot start such a test yet, but we can do it later in the week. In any case, it seems that RelVal shows some issues with cms-sw/cmssw#35605, so there is no point in running a replay.

Also, please propose this kind of change in the T0 HyperNews first, so that everyone is aware.

@jfernan2 jfernan2 changed the title from "Update ReplayOfflineConfiguration.py" to "[DQM] Prompt Matrix reconfiguration" on Oct 11, 2021
@jfernan2
Author

cms-sw/cmssw#35605 has been fixed; it was an unrelated problem for Run1 datasets.

Hypernews announcement: https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2278.html

@germanfgv
Contributor

@jfernan2 I'm a bit confused. If we need cms-sw/cmssw#35605 to properly test this configuration, then that PR should already be merged into a release. We can only test CMSSW code that's available in /cvmfs/.

If that's not the case, then what CMSSW release would you like us to use?

@jfernan2
Author

OK @germanfgv, sorry, I thought a PR could be added on top. So, @qliphy, please let's integrate cms-sw/cmssw#35605 so that Tier0 can run the test.
Thanks!

@tvami
Contributor

tvami commented Oct 23, 2021

Hi @jfernan2, this is not relevant for the pilot beams, right?
Since 12_1_0 is about to come out, I'd suggest waiting for it and then running the replay with the 121X GTs, maybe even on the pilot beam data?

@jfernan2
Author

Right, this is not relevant. It is just for saving computing resources. It can wait.
Thanks

@jfernan2
Author

jfernan2 commented Nov 3, 2021

@germanfgv PR #35605 was merged in CMSSW_12_1_0_pre5
https://github.com/cms-sw/cmssw/releases/tag/CMSSW_12_1_0_pre5
So, if you could perform the replay at some point, it would be appreciated.
Thanks

@tvami
Contributor

tvami commented Nov 3, 2021

Hi @jfernan2 I think we wanted to wait for CMSSW_12_1_0 to come out, i.e. tomorrow

@qliphy
Contributor

qliphy commented Nov 5, 2021

@tvami @jfernan2 @germanfgv CMSSW_12_1_0 is now ready.

@germanfgv
Contributor

@jfernan2 @tvami I updated the CMSSW version. Can you check the GTs and runs before triggering the replay?

@francescobrivio
Contributor

francescobrivio commented Nov 5, 2021

@germanfgv I think the best way to test the CPU usage reduction with the new DQM Matrix is to use a 2018 pp run, as specified in: #4619 (comment)
So you should update the following in the configuration:

  • run number
  • GTs
  • different scenarios

@germanfgv
Contributor

@francescobrivio I made the necessary changes. I'll start the replay.

@francescobrivio
Contributor

looks good! thanks @germanfgv !

@germanfgv
Contributor

run replay please

@cmsdmwmbot

There are 16 repack workflows.
There are 4 express workflows.
There are 1016 filesets not closed.
There are 769 paused jobs in the replay.

@cmsdmwmbot

There are 16 repack workflows.
There are 5 express workflows.
There are 1297 filesets not closed.
There are 3599 paused jobs in the replay.

@cmsdmwmbot

There are 12 repack workflows.
There are 5 express workflows.
There are 1314 filesets not closed.
There are 4380 paused jobs in the replay.
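
The successive bot reports above can be compared programmatically. A minimal sketch, assuming the report wording stays as shown; `parse_status` and the regular expression are illustrative and not part of any Tier0 tool.

```python
import re

# Matches lines such as "There are 16 repack workflows."
STATUS_RE = re.compile(r"There (?:are|is) (\d+) (.+?)\.?$")


def parse_status(report):
    """Map each counted item in a bot report to its integer count."""
    counts = {}
    for line in report.strip().splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


# The three reports quoted above, in order.
reports = [
    """There are 16 repack workflows.
There are 4 express workflows.
There are 1016 filesets not closed.
There are 769 paused jobs in the replay.""",
    """There are 16 repack workflows.
There are 5 express workflows.
There are 1297 filesets not closed.
There are 3599 paused jobs in the replay.""",
    """There are 12 repack workflows.
There are 5 express workflows.
There are 1314 filesets not closed.
There are 4380 paused jobs in the replay.""",
]

paused = [parse_status(r)["paused jobs in the replay"] for r in reports]
print(paused)
```

A paused-job count that keeps growing from one report to the next, as it does here, usually points to a systematic failure rather than isolated job problems.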

@germanfgv
Contributor

@francescobrivio it looks like there is an issue with the GTs. We are getting the following error, for both Express and Prompt:

An exception of category 'NoRecord' occurred while
   [0] Processing global begin Run run: 317696
   [1] Prefetching for module TotemTimingDQMSource/'totemTimingDQMSource'
   [2] Prefetching for EventSetup module CTPPSGeometryESModule/'ctppsGeometryESModule'
   [3] Calling method for EventSetup module CTPPSGeometryESModule/'ctppsGeometryESModule'
   [4] While getting dependent Record from Record VeryForwardRealGeometryRecord
Exception Message:
No "VeryForwardIdealGeometryRecord" record found in the EventSetup.

 The Record is delivered by an ESSource or ESProducer but there is no valid IOV for the synchronization value.
 Please check 
   a) if the synchronization value is reasonable and report to the hypernews if it is not.
   b) else check that all ESSources have been properly configured.

Should we be using different GTs?
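
The "no valid IOV for the synchronization value" message can be illustrated with a simplified interval-of-validity lookup. This is a toy model and not the CMSSW implementation: `find_iov` is a hypothetical helper, and the IOV boundaries other than run 317696 are invented for the example, since the contents of the real tag are not shown here.

```python
import bisect


def find_iov(since_runs, run):
    """Return the 'since' run of the IOV covering `run`, or None if none is valid.

    `since_runs` is a sorted list of first-valid run numbers; each IOV is
    assumed to last until the next one begins.
    """
    idx = bisect.bisect_right(since_runs, run) - 1
    return since_runs[idx] if idx >= 0 else None


# Invented tag whose first IOV starts after the replayed run: nothing covers it.
print(find_iov([340000, 343000], 317696))

# Invented tag whose first IOV starts at run 1: the replayed run is covered.
print(find_iov([1, 340000], 317696))
```

In this model, a record whose first IOV begins after the replayed run produces no payload for that run, which is one way to end up with the error above.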

@cmsdmwmbot

There are 8 repack workflows.
There are 5 express workflows.
There are 728 filesets not closed.
There are 1 paused jobs in the replay.

@tvami
Contributor

tvami commented Nov 6, 2021

Hi @germanfgv
yes, this VeryForwardIdealGeometryRecord was added recently.
Let's use these GTs

  • Express: 121X_dataRun3_Express_v11
  • Prompt: 121X_dataRun3_Prompt_v10

For 2018

  • Express: 121X_dataRun3_Express_Candidate_2021_11_06_15_02_30
  • Prompt: 121X_dataRun3_Prompt_Candidate_2021_11_06_15_02_45

GTs to fix issue with VeryForwardIdealGeometryRecord
@germanfgv
Contributor

@jfernan2 I'm not sure what you are referring to when you say that the memory issues have been clarified. Did I miss something? Could you point me to the discussion?

In any case, if there is a fix to the issue we can test to see if it works. Is it available in one of the pre-releases?

@jfernan2
Author

I am sorry @germanfgv , I thought you were following the Tier0 thread that you initiated:
https://hypernews.cern.ch/HyperNews/CMS/get/tier0-Ops/2314/3/1/1/1/1/1/1/1/1/1/1/1/1/1/1/1.html

@germanfgv
Contributor

@jfernan2 I think I missed those last few messages. From what I see, there are no changes that need to be tested, just the increase in the T0 memory limit (we did discuss this in our T0 internal meeting). In any case, I'd like to confirm that 2GB/core is enough to run the jobs. For that, I'll redeploy the replay.

@germanfgv
Contributor

run replay please

@jfernan2
Author

Thanks @germanfgv

@cmsdmwmbot

Replay testing PR '[DQM] Prompt Matrix reconfiguration'
An automatic replay has been requested by jfernan2.
Here is a brief description of the replay.
Github PR : #4619
PR author : jfernan2
Requestor : PPD-DQM
Injected runs : 317696
CMSSW release : CMSSW_12_1_0
Tier0 release : 3.0.1
ppScenario : ppEra_Run2_2018
Tier0 Config : https://cmst0.web.cern.ch/CMST0/tier0/offline_config/ReplayOfflineConfiguration_047.php
Contatiner ID : 1
Jenkins Build : https://cmssdt.cern.ch/dmwm-jenkins/job/DMWM-T0-PR-test-job/366/
Jira Issue : https://its.cern.ch/jira/browse/CMSTZDEV-702

@cmsdmwmbot

There are 17 repack workflows.
There are 5 express workflows.
There are 1128 filesets not closed.
There are 1 paused jobs in the replay.

@germanfgv
Contributor

After this latest retry, we still have one MET PromptReco job exceeding the 2GB/core memory limit. @jfernan2 from what I understand, this may improve in the next 12_1_X release, is that right?

@jfernan2
Author

@germanfgv From the RECO side I do not know. From the DQM side, no memory improvements are expected: this PR already reduces the number of modules run, and hence the memory, w.r.t. 12_0_X.

@slava77

slava77 commented Nov 29, 2021

> After this latest retry, we still have one MET PromptReco job exceeding the 2GB/core memory limit. @jfernan2 from what I understand, this may improve in the next 12_1_X release, is that right?

is the log for the job available?
(I've forgotten how to find the replay details on the web with wmstats)

@jfernan2
Author

@slava77 would a backport of this PR reduce the memory?
cms-sw/cmssw#36246
Thanks

@slava77

slava77 commented Nov 29, 2021

> @slava77 would a backport of this PR reduce the memory?
> cms-sw/cmssw#36246

I think so

@germanfgv
Contributor

> > After this latest retry, we still have one MET PromptReco job exceeding the 2GB/core memory limit. @jfernan2 from what I understand, this may improve in the next 12_1_X release, is that right?
>
> is the log for the job available? (I've forgotten how to find the replay details on the web with wmstats)

@slava77 You can find the jobs tarball here:
C1_vocms047.cern.ch-23926-0-log.tar.gz

I'll put it in AFS in a few minutes.

@germanfgv
Contributor

germanfgv commented Nov 29, 2021

@slava77 In case you prefer:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/DQMseq/job_23926/tarball

@slava77

slava77 commented Nov 29, 2021

> @slava77 In case you prefer: /afs/cern.ch/user/c/cmst0/public/PausedJobs/DQMseq/job_23926/tarball

Thanks.
I was mainly looking for

Job has exceeded maxPSS: 16000 MB
Job has PSS: 16823 MB
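
The two log lines can be related to the 2GB/core limit with a short calculation. A sketch only: it assumes that "2GB/core" means 2000 MB per core, so the core count below is inferred and not read from the job configuration.

```python
MAX_PSS_MB = 16000         # limit reported in the job log
OBSERVED_PSS_MB = 16823    # PSS reported in the job log
LIMIT_PER_CORE_MB = 2000   # "2GB/core", taken as 2000 MB (assumption)

# Number of cores implied by the total limit (inferred, not from the config).
cores = MAX_PSS_MB // LIMIT_PER_CORE_MB

# How far the job went over the limit, in MB and as a percentage.
excess_mb = OBSERVED_PSS_MB - MAX_PSS_MB
excess_pct = 100 * excess_mb / MAX_PSS_MB

print(f"implied cores: {cores}")
print(f"excess: {excess_mb} MB ({excess_pct:.1f}% over the limit)")
```

Under that assumption the job exceeded its limit by a few percent only, so a modest reduction in memory use is enough to bring it back under.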

@jfernan2
Author

jfernan2 commented Dec 6, 2021

Hi @germanfgv
I have switched to CMSSW_12_1_1 now that the memory issues have been reduced there

@germanfgv
Contributor

@jfernan2 great! I'll manually run the test to see if it's enough.

@germanfgv
Contributor

@jfernan2 The replay finished without problems. The 16GB of memory were enough to reconstruct the MET dataset.

@Jetmet

Jetmet commented Dec 8, 2021 via email

@jfernan2
Author

Hi @germanfgv
What are the next steps?
Thanks

@germanfgv
Contributor

Hi @jfernan2, I'll make a clean PR adding these changes to the Replay config and the Production config.

jhonatanamado added a commit to jhonatanamado/T0 that referenced this pull request Jan 19, 2022
Following what was discussed here with a new configuration of PromptMatrix
dmwm#4619 (comment)
jhonatanamado added a commit to jhonatanamado/T0 that referenced this pull request Jan 20, 2022
Following what was discussed here with a new configuration of PromptMatrix
dmwm#4619 (comment)
@tvami
Contributor

tvami commented Apr 21, 2022

Hi @jfernan2 my understanding is that this was added to the prod config, so I believe this PR can be closed, do you agree?

@Jetmet

Jetmet commented Apr 21, 2022 via email

9 participants