Memory usage in AlCaLumiPixelsCounts jobs for run 382300 #45306
Tier-0 reports several jobs with high memory usage in run 382300. One example that reproduces is

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024F/AlCaHarvest/job_863561/02ce5b03-cdf6-4215-95c4-e4b3ef3ed8c1-0-1-logArchive.tar.gz

which goes to 3+ GB of RSS very quickly (e.g. at the start of event processing) and peaks around 6 GB.

This job writes 3 output files with (if I understand correctly) a total of about 200 MB per lumi section and no event data.

Comments
cms-bot internal usage

A new Issue was created by @davidlange6. @antoniovilela, @Dr15Jones, @sextonkennedy, @smuzaffar, @makortel, @rappoccio can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here.

assign alca

New categories assigned: alca. @saumyaphor4252, @perrotta, @consuegs you have been requested to review this Pull request/Issue and eventually sign? Thanks.
I'm feeling confused - is this application doing more than copying out parts of the lumiblock information into a new EDM file (e.g., removing the TriggerResults event products and some of the lumiblock products)? The outputs appear to share common lumi products and are basically the same size as the input. For example, the output file copies out:
@davidlange6 Just to reiterate what @davidlange6 found, when we read back the LuminosityBlock, the … If the … It seems like …
@Dr15Jones - I do not think there is any per-event data there. PixelClusterCounts is effectively holding two 2D histograms (hits per bx, per ROC and per module) and a 1D histogram (events per bx).
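To make the size discussion concrete, here is a minimal sketch of a counting structure of the kind described above, with flattened vectors indexed by ROC (or module) and bunch crossing; the names and layout are illustrative, not the actual reco::PixelClusterCounts class.

```cpp
// Illustrative sketch only - not the actual CMSSW DataFormats class.
#include <cstdint>
#include <vector>

struct PixelCountsSketch {
  static constexpr std::size_t kNBX = 3564;  // bunch crossings per LHC orbit

  // "2D histogram" of hits per bx per ROC, flattened as [roc * kNBX + bx]
  std::vector<uint32_t> rocCounts;
  // "2D histogram" of hits per bx per module, flattened as [module * kNBX + bx]
  std::vector<uint32_t> moduleCounts;
  // "1D histogram" of events per bx
  std::vector<uint32_t> eventsPerBX = std::vector<uint32_t>(kNBX, 0);

  void addRoc() { rocCounts.resize(rocCounts.size() + kNBX, 0); }
  void addHit(std::size_t rocIndex, std::size_t bx) { ++rocCounts[rocIndex * kNBX + bx]; }
  void countEvent(std::size_t bx) { ++eventsPerBX[bx]; }
};
```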
@Dr15Jones @davidlange6 Dear all, BRIL RC is here. David is correct, we don't store per-event data, because for luminosity we are interested only in "effective" rates for every bx, which we can later rescale to the luminosity. The biggest change compared to the previous release is the per-ROC data, which increased the event and LS size. The data is extremely useful for precision luminosity measurement. We can try to remove some modules or update the thresholds to decrease the event size. But it would be helpful if you could provide us with a realistic "target" that Tier-0 could tolerate.
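For a rough sense of scale (not numbers measured from this job), a back-of-envelope estimate of why per-ROC, per-bx counters get large; the module and ROC counts below are assumptions for the Phase-1 pixel detector, and the counters are assumed to be 32-bit:

```cpp
// Back-of-envelope only; detector counts and counter width are assumptions.
#include <cstdio>

int main() {
  const long nModules = 1856;      // assumed Phase-1 pixel module count
  const long nRocsPerModule = 16;  // assumed ROCs per module
  const long nBX = 3564;           // bunch crossings per LHC orbit
  const long bytesPerCount = 4;    // 32-bit counters assumed

  const long entries = nModules * nRocsPerModule * nBX;  // ~1.1e8 entries
  std::printf("per-ROC/per-bx entries: %ld, ~%ld MB uncompressed per lumi section\n",
              entries, entries * bytesPerCount / (1024 * 1024));
  return 0;
}
```

Under these assumptions the per-ROC, per-bx vector alone reaches hundreds of MB uncompressed per lumi section, which even after compression sits well above the targets suggested further down the thread.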
OK, so we've understood what is new and what is creating the problem. Why is it useful to split the data from the input file into three pieces (e.g., a per-lumi data product)? Or am I missing some other functionality happening in this process?
I made a trivial 'auditing' analyzer for PixelClusterCounts and had it dump information each lumi. For the files in question, the dumps were relatively consistent, with values like:

given the values are:
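The auditor and its dump are not included in the thread; as an illustration only, here is a minimal sketch of what such a per-lumi auditing module could look like (the header path, product label, and accessor names are assumptions, not the code that was actually run):

```cpp
// Sketch of a per-lumi "auditing" analyzer; product type, accessors and includes
// are assumed, not taken from the actual auditor used in this thread.
#include "DataFormats/Luminosity/interface/PixelClusterCounts.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/LuminosityBlock.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/Framework/interface/one/EDAnalyzer.h"
#include "FWCore/MessageLogger/interface/MessageLogger.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"

class PixelClusterCountsAuditor : public edm::one::EDAnalyzer<edm::one::WatchLuminosityBlocks> {
public:
  explicit PixelClusterCountsAuditor(edm::ParameterSet const& ps)
      : token_(consumes<reco::PixelClusterCounts, edm::InLumi>(ps.getParameter<edm::InputTag>("src"))) {}

  void analyze(edm::Event const&, edm::EventSetup const&) override {}
  void beginLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) override {}

  void endLuminosityBlock(edm::LuminosityBlock const& lumi, edm::EventSetup const&) override {
    // Dump the counter-vector sizes once per lumi section to see where the memory goes.
    auto const& counts = lumi.get(token_);
    edm::LogPrint("PCCAudit") << "readCounts size: " << counts.readCounts().size()
                              << ", readRocCounts size: " << counts.readRocCounts().size();
  }

private:
  edm::EDGetTokenT<reco::PixelClusterCounts> token_;
};

DEFINE_FWK_MODULE(PixelClusterCountsAuditor);
```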
@davidlange6 David, maybe I am missing something: what is the third file? I thought there were 2 files, for Zero-Bias and Random data. I don't understand why they are the same.
Maybe the third thing is different, I did not check. I mean process.ALCARECOStreamAlCaPCCRandomOutPath. Ah - the output of ALCARECOStreamRawPCCProducerOutPath is indeed small (2% of the others).
@davidlange6 We have identified a few possible solutions to reduce the number of entries and will try to implement them as soon as possible. However, I have two questions:
I understand the urgency of finding a solution, but we want to avoid making any physically unmotivated cuts. Sorry for any inconvenience caused.
What difference would a rate change make? These objects are presumably roughly the same size regardless of having 0.1 Hz or 2000 Hz, no? As I asked above, do we need the processing step at all? (Maybe something to discuss with all groups at Monday's joint ops meeting.)
Apologies for the confusion, I didn't mean trigger rates. I meant that we could mask, for instance, some of the innermost BPix layers, which might be less useful for us, and that alone would decrease the object size severalfold. Or we could adjust the threshold to cut some potentially noisy pixels, and so on. But it would be really helpful if you could give some estimate of the required reduction factor for the object (2 times? 10?).
Not so much for me to answer - but nominally this workflow should run in 2 GB and currently takes ~6 GB.
Personally I'd say this data product should take less than 100 MB (that would be 25M entries in the vector) and preferably closer to 10 MB.
I've prepared a fix that should reduce readRocCounts by a factor of 3564 (effectively removing per-bx granularity). readCounts will remain unchanged. PR: #45348
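The actual change is in the linked PR; purely as an illustration of what "removing per-bx granularity" means for a flattened [roc][bx] counter (not necessarily how #45348 implements it):

```cpp
// Illustration only: collapse the bx dimension of a flattened [roc][bx] counter,
// shrinking it by a factor of nBX. Not a copy of the change in the linked PR.
#include <cstdint>
#include <vector>

std::vector<uint32_t> collapseBx(std::vector<uint32_t> const& rocByBx, std::size_t nBX = 3564) {
  std::vector<uint32_t> perRoc(rocByBx.size() / nBX, 0);
  for (std::size_t i = 0; i < rocByBx.size(); ++i) {
    perRoc[i / nBX] += rocByBx[i];  // one running total per ROC instead of 3564 bins
  }
  return perRoc;
}
```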
The unmerged files that were the input of the original job will be removed by the usual Tier-0 workflow. I copied them to this location so they can be used for testing the fix:
Dear all, the patch went into the CMSSW_14_0_11 release. Once it is tested at T0, please let us know if it resolves the issue.

Kindly ping @cms-sw/alca-l2 to sign and close the issue. Thanks.

+alca

This issue is fully signed and ready to be closed.

@cmsbuild, please close