
Memory usage in AlCaLumiPixelsCounts jobs for run 382300 #45306

Closed
davidlange6 opened this issue Jun 25, 2024 · 25 comments


@davidlange6
Contributor

Tier-0 reports several jobs with high memory usage in run 382300. One example that reproduces is

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024F/AlCaHarvest/job_863561/02ce5b03-cdf6-4215-95c4-e4b3ef3ed8c1-0-1-logArchive.tar.gz

which goes to 3+ GB of RSS very quickly (e.g., at the start of event processing) and peaks around 6 GB.

This job writes 3 output files with (if I understand correctly) a total of about 200 MB per lumi section and no event data.

@cmsbuild
Contributor

cmsbuild commented Jun 25, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @davidlange6.

@antoniovilela, @Dr15Jones, @sextonkennedy, @smuzaffar, @makortel, @rappoccio can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

assign alca

@cmsbuild
Contributor

New categories assigned: alca

@saumyaphor4252, @perrotta, @consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks

@davidlange6
Contributor Author

I'm feeling confused - is this application doing more than copying parts of the lumiblock information into a new EDM file (e.g., removing the TriggerResults event products and some of the lumiblock products)?

E.g., the outputs appear to share common lumi products and are basically the same size as the input. For example, one output file copies out:

*Br    7 :recoPixelClusterCounts_alcaPCCIntegratorZeroBias_alcaPCCZeroBias_RECO.obj : *
*         | reco::PixelClusterCounts                                         *
*Entries :        3 : Total  Size= 1894371109 bytes  File Size  =  296267534 *
*Baskets :        3 : Basket Size=    4693387 bytes  Compression=   6.39     *
*............................................................................*
*Br    8 :recoPixelClusterCounts_alcaPCCIntegratorZeroBias_alcaPCCZeroBias_RECO.present : *
*         | Bool_t                                                           *
*Entries :        3 : Total  Size=       1248 bytes  File Size  =        471 *
*Baskets :        3 : Basket Size=       9386 bytes  Compression=   1.00     *
*............................................................................*

@Dr15Jones
Contributor

Just to reiterate what @davidlange6 found: when we read back the LuminosityBlock, the reco::PixelClusterCounts object stored in the lumi requires on average 1.9 GB / 3, i.e. more than 600 MB (dividing the total in-memory size reported by ROOT by the 3 lumis in the file). At a file boundary, the framework doesn't know whether the new file being read contains more of the same LuminosityBlock as the last file it read, so it holds all the LuminosityBlock products from the previous file in memory at the same time as it reads the LuminosityBlock products from the new file. That alone needs ~1.2 GB or so (see the worked arithmetic below).

If the reco::PixelClusterCounts for the different LuminosityBlocks are not roughly the same size (say one is 2x bigger than the others) then the memory requirements can get even worse.

It seems like reco::PixelClusterCounts is holding data PER EVENT, which scales poorly as the number of events in a LuminosityBlock increases.
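Spelling out that arithmetic (a rough estimate, using the total in-memory size from the ROOT dump above):

$$
\frac{1\,894\,371\,109\ \text{bytes}}{3\ \text{lumis}} \approx 630\ \text{MB per lumi},
\qquad
2 \times 630\ \text{MB} \approx 1.26\ \text{GB at a file boundary}
$$

per branch; the audit output further down shows two such branches (Random and ZeroBias), which roughly doubles that footprint.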

@davidlange6
Contributor Author

@Dr15Jones - I do not think there is any per-event data there. PixelClusterCounts is effectively holding two 2D histograms (hits per bx, per ROC and per module) and a 1D histogram (events per bx).
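A schematic sketch of that layout (illustrative only; the names and exact structure below are assumptions, not the actual DataFormats class):

```cpp
#include <cstdint>
#include <vector>

// Schematic sketch of what PixelClusterCounts effectively holds, as described
// above. All names here are hypothetical; the point is how the sizes scale.
constexpr std::size_t kNBX = 3564;  // bunch crossings per LHC orbit

struct PixelClusterCountsSketch {
  // "2D histogram": cluster counts per (module, bx) -> nModules * kNBX ints
  std::vector<int> countsPerModulePerBX;
  // "2D histogram": cluster counts per (ROC, bx) -> nROCs * kNBX ints;
  // with ~42k ROCs this is ~150M ints, i.e. ~600 MB, dominating the product
  std::vector<int> countsPerRocPerBX;
  // "1D histogram": number of events seen per bx -> kNBX ints
  std::vector<int> eventsPerBX;
  // bookkeeping: which module / ROC each row of the 2D vectors belongs to
  std::vector<uint32_t> moduleIDs;
  std::vector<uint32_t> rocIDs;
};
```

There is no per-event storage, but the per-(ROC, bx) vector still grows with the instrumented detector rather than with the number of events.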

@duff-ae
Contributor

duff-ae commented Jun 26, 2024

@Dr15Jones @davidlange6 Dear all, BRIL RC here. David is correct: we don't store per-event data, because for luminosity we are interested only in "effective" rates for every bx, which we can later rescale to the luminosity. The biggest change compared to the previous release is the per-ROC data, which increased the event and LS size. This data is extremely useful for the precision luminosity measurement. We can try to remove some modules or update the thresholds to decrease the event size, but it would be helpful if you could provide us with a realistic "target" that Tier-0 could tolerate.

@davidlange6
Contributor Author

Ok, so we've understood what is new and what is creating the problems.

Why is it useful to split the data from the input file into three pieces (e.g., a data-per-lumi product)? Or am I missing some other functionality happening in this process?

@Dr15Jones
Contributor

I made a trivial 'auditing' analyzer for PixelClusterCounts and had it dump information each lumi. For the files in question, the dumps were relatively consistent, with values like:

%MSG-s PixelClusterCountsAudit:  PixelClusterCountsAuditor:audit@beginLumi  26-Jun-2024 16:47:23 CEST Run: 382300 Lumi: 17
Branch: recoPixelClusterCounts_alcaPCCIntegratorRandom_alcaPCCRandom_RECO.
 readCounts: 6400944
 readRocCounts: 151398720
 readEvents: 3564
 readModID: 1796
 readRocID: 42480
%MSG
%MSG-s PixelClusterCountsAudit:  PixelClusterCountsAuditor:audit@beginLumi  26-Jun-2024 16:47:23 CEST Run: 382300 Lumi: 17
Branch: recoPixelClusterCounts_alcaPCCIntegratorZeroBias_alcaPCCZeroBias_RECO.
 readCounts: 6400944
 readRocCounts: 151423668
 readEvents: 3564
 readModID: 1796
 readRocID: 42487
%MSG

Given the values are ints, which are 4 bytes each, that is ~600 MB for each readRocCounts (151,398,720 × 4 bytes ≈ 605 MB; note that 42480 ROCs × 3564 bx = 151,398,720, consistent with per-(ROC, bx) storage).
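For reference, a minimal sketch of what such an auditing analyzer can look like in CMSSW (the accessor names on reco::PixelClusterCounts are assumed here, inferred from the log above; the actual auditor may differ):

```cpp
#include "DataFormats/Luminosity/interface/PixelClusterCounts.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/LuminosityBlock.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/one/EDAnalyzer.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

// At each lumi boundary, read the PixelClusterCounts product and log the
// sizes of its internal vectors.
class PixelClusterCountsAuditor
    : public edm::one::EDAnalyzer<edm::one::WatchLuminosityBlocks> {
public:
  explicit PixelClusterCountsAuditor(edm::ParameterSet const& ps)
      : token_(consumes<reco::PixelClusterCounts, edm::InLumi>(
            ps.getParameter<edm::InputTag>("src"))) {}

  void beginLuminosityBlock(edm::LuminosityBlock const& lumi,
                            edm::EventSetup const&) override {
    auto const& counts = lumi.get(token_);
    edm::LogSystem("PixelClusterCountsAudit")
        << "Run: " << lumi.run() << " Lumi: " << lumi.luminosityBlock()
        << "\n readCounts: " << counts.readCounts().size()  // assumed accessors
        << "\n readRocCounts: " << counts.readRocCounts().size()
        << "\n readEvents: " << counts.readEvents().size();
  }
  void endLuminosityBlock(edm::LuminosityBlock const&,
                          edm::EventSetup const&) override {}
  void analyze(edm::Event const&, edm::EventSetup const&) override {}

private:
  edm::EDGetTokenT<reco::PixelClusterCounts> token_;
};

DEFINE_FWK_MODULE(PixelClusterCountsAuditor);
```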

@duff-ae
Contributor

duff-ae commented Jun 26, 2024

@davidlange6 David, maybe I am missing something, but what is the third file? I thought there were 2 files: one for Zero-Bias and one for Random data. I don't understand why they are the same.

@davidlange6
Contributor Author

Maybe the third one is different; I did not check. I mean:

process.ALCARECOStreamAlCaPCCRandomOutPath,
process.ALCARECOStreamAlCaPCCZeroBiasOutPath,
process.ALCARECOStreamRawPCCProducerOutPath

Ah - the output of ALCARECOStreamRawPCCProducerOutPath is indeed small (2% of the others)

@duff-ae
Contributor

duff-ae commented Jun 27, 2024

@davidlange6 We have identified a few possible solutions to reduce the number of entries and will try to implement them as soon as possible. However, I have two questions:

  1. What should be the target rate reduction factor to safely operate Tier0?

  2. How much time do we realistically have to implement this fix?

I understand the urgency of finding a solution, but we want to avoid making any physically unmotivated cuts. Sorry for any inconvenience caused.

@davidlange6
Contributor Author

What difference would a rate change make? These objects are presumably roughly the same size whether the rate is 0.1 Hz or 2000 Hz, no?

As I asked above, do we need the processing step at all? (Maybe something to discuss with all groups at Monday's joint ops meeting.)

@duff-ae
Contributor

duff-ae commented Jun 27, 2024

Apologies for the confusion, I didn't mean trigger rates. I meant we could mask, for instance, some of the innermost BPix layers, which might be less useful for us; that alone could decrease the object size severalfold. Or we could adjust the threshold to cut some potentially noisy pixels, and so on. But it would be really helpful if you had some estimate of the required reduction factor for the object (2 times? 10?).

@davidlange6
Contributor Author

Not so much for me to answer - but nominally this workflow should run in 2 GB and currently takes ~6 GB.

@Dr15Jones
Contributor

Personally I'd say this data product should take less than 100 MB (that would be 25M entries in the vector) and preferably closer to 10 MB.

@duff-ae
Contributor

duff-ae commented Jun 30, 2024

I've prepared a fix that should reduce readRocCounts by a factor of 3564 (effectively removing the per-bx granularity). readCounts will remain unchanged. PR: #45348
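Schematically, the kind of reduction such a fix implies (an illustrative sketch with hypothetical names, not the actual patch in #45348):

```cpp
#include <cstddef>
#include <vector>

// Collapse per-(ROC, bx) counts into per-ROC totals: the vector shrinks by a
// factor of nBX (3564 bunch crossings per orbit), at the cost of losing the
// per-bx granularity of the ROC-level information.
std::vector<int> sumOverBX(std::vector<int> const& rocCountsPerBX,
                           std::size_t nROCs, std::size_t nBX /* 3564 */) {
  std::vector<int> rocTotals(nROCs, 0);
  for (std::size_t roc = 0; roc < nROCs; ++roc) {
    for (std::size_t bx = 0; bx < nBX; ++bx) {
      rocTotals[roc] += rocCountsPerBX[roc * nBX + bx];
    }
  }
  return rocTotals;
}
```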

@germanfgv
Contributor

The unmerged files that were the input to the original job will be removed by the usual Tier0 workflow. I copied them to this location so they can be used for testing the fix:

/eos/user/c/cmst0/public/PausedJobs/Run2024F/AlCaHarvest/input

@duff-ae
Contributor

duff-ae commented Jul 9, 2024

Dear all, the patch went into the CMSSW_14_0_11 release. Once it is tested at T0, please let us know whether it resolves the issue.

@makortel
Contributor

makortel commented Aug 7, 2024

@cms-sw/alca-l2 Since #45348 and #45369 have been merged (long ago), I guess we could close this issue?

@srimanob
Contributor

Kindly ping @cms-sw/alca-l2 to sign and close the issue. Thanks.

@perrotta
Contributor

+alca

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
