Workflow PromptReco_Run381067_JetMET1 error in CMSSW_14_0_7 #45089
cms-bot internal usage |
A new Issue was created by @mpresill. @makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign core |
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
I looked at the log of the failure whose exception was in the issue description. The exception is reported after the input file was closed, and the job ultimately dies in a segfault with the following stack trace
Symptoms are the same as in #40132 (comment). Were any of the jobs retried in Tier0? |
FYI @pcanal |
That is strange. It reports problem with the |
It was not retried yet. Should this be re-tried? |
Probably not worth it. I tested the job locally (with a local input file), and it fails in the same way. |
I noticed many printouts from |
FYI, there is a new failed job for the same run, workflow PromptReco_Run381067_JetMET0, with the same error.
|
Following #40132 (comment) I tested 14_0_7 with the backport of that (#40132 (comment)), but the behavior was the same
|
Please note that root-project/root#14627 should only solve bug [1] from #40132 (comment) (request to expand to a negative size). The exception "TBufferFile::WriteByteCount bytecount too large (more than 1073741822)", i.e. bug [2] in the other issue, is not a bug but an intended exception, which tells you that your TFile contains a key that exceeds the maximum allowed size of 1 GB (root-project/root#6734). Workarounds would be to address root-project/root#6734, to make your object a bit smaller, or to reroute the exception through a custom error handler so that it does not become a fatal error. |
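For reference, the limit quoted in the exception text can be checked with plain arithmetic. The following is only a sketch in Python, not ROOT code: the name `MAX_BYTE_COUNT` and the helper `fits_in_byte_count` are made up for illustration, and the only input taken from the thread is the number 1073741822 in the error message.

```python
# Limit quoted in the exception text: "more than 1073741822".
# MAX_BYTE_COUNT is an illustrative name, not a ROOT identifier.
MAX_BYTE_COUNT = 1073741822

# The value is two bytes short of 1 GiB (2**30), i.e. 0x3FFFFFFE.
assert MAX_BYTE_COUNT == 2**30 - 2 == 0x3FFFFFFE


def fits_in_byte_count(nbytes: int) -> bool:
    """Return True if a serialized payload of nbytes stays within the quoted limit."""
    return 0 <= nbytes <= MAX_BYTE_COUNT


# A 1e9-byte payload is still below the limit; a full GiB is not.
print(fits_in_byte_count(10**9))   # True
print(fits_in_byte_count(2**30))   # False
```

So a single key has to stay just under 1 GiB of serialized data, which is what "make your object a bit smaller" amounts to.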
type root |
What exactly would the ROOT state be then at the point where it issues the error message |
I guess it would just skip saving that too-big object into the TFile, and continue with the rest of the objects. But it's just a guess; the best thing would be to try it out with a simple reproducer. |
FWIW, here is a stack trace to where the exception gets thrown
How could we get more context on what is causing the "too big object"? |
There seems to be an object or class that is stored in your TTree, whose Streamer is too big, at some point when AutoSave is called. For example, if I do this:
I will get a similar crash as yours, though not exactly through the same path. If you'd try to store this histogram in the TTree directly and call AutoSave, you might get closer to what you are seeing. So maybe, you would need to do a |
One detail to be noted about the output are sizes of the output files at the time the job died
57 GB of |
Inspecting
|
The |
I tried to call
This is as far as I got (until July). I have the input file and the job configuration on |
The stack trace seems to imply that it is the |
Is there a way to find out the number of baskets? Or would there be some other way to confirm (or disprove) this case? |
There is now a second Tier0 job with similar symptoms.
see link for more details. |
You can find the tarball of this job here:
|
Maybe time to remind about this question. |
tagging also @cms-sw/ppd-l2 and @youyingli |
Lots of skims are raw-reco outputs. This is one of them. Some other skims are quite a bit larger than this one on average in 2024. |
In case of Run 381067 Lumi 335 this skim selected ~60 % of the events of the |
So far in 2024, it's 120 TB compared to about 900 TB of raw data in JetMET1 and JetMET0 |
381067/335 appears to be the end of a rather anomalous set of lumi sections |
FYI, we failed the jobs reported by @mmusich, in agreement with T0 |
Hi @makortel , to quickly answer your question in #45089 (comment)
So I would naively conclude that the 3.6M per event that you mention above is acceptable for that skim, or at least in accordance with what we've seen in the past (assuming I didn't make any mistakes). :-) I don't know if it was mentioned before, but for completeness: this EXOHighMET skim was first introduced in 2022, and this is the original PR: #37749 |
Thanks @malbouis. My next question is then whether the rate (or acceptance fraction, ~60 %) of the skim is in line with expectations. Although @davidlange6 already wrote in #45089 (comment)
which I understand to mean that there was something anomalous in the data that resulted in that high acceptance fraction. If this is the case, I would not invest effort in trying to make the writing of these large files succeed. In this case, perhaps some protections would make sense so that the job wouldn't fail for the entire lumi? Would it be feasible to understand these conditions from the physics side and improve the filtering of this skim? I'm not sure if we could do something reliably at the framework level. |
tagging @afrankenthal as original author of #37749 |
assign pdmv |
New categories assigned: pdmv @AdrianoDee,@sunilUIET,@miquork,@kskovpen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Hi, I also looked at the Run 2 EXOHighMET skim, /MET/Run2018D-HighMET-PromptReco-v2/RAW-RECO: with 29.4 TB and 7906297 events, the size is approximately 3.72 MB/event. So the 3.6 MB for that file should not be an issue. I'm not sure why so many events are accumulated into a single file with a size of 50+ GB without any splitting. For DDT, we will contact EXO PAG and check if this skim is still needed, or if they can add additional filters in the trigger part or more stringent selections in this skim. |
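The per-event size quoted above can be reproduced with a short calculation. This is only a sketch, assuming decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes); the helper names are made up for illustration, and the 50 GB figure stands in for the "50+ GB" file mentioned in the comment.

```python
def mb_per_event(total_tb: float, n_events: int) -> float:
    """Average event size in MB for a dataset of total_tb terabytes (decimal units assumed)."""
    return total_tb * 1e12 / n_events / 1e6


def events_in_file(file_gb: float, event_mb: float) -> float:
    """Rough number of events that fit in a file of file_gb gigabytes."""
    return file_gb * 1e3 / event_mb


# Run 2 HighMET skim: 29.4 TB in 7906297 events.
run2_mb = mb_per_event(29.4, 7906297)
print(f"{run2_mb:.2f} MB/event")  # 3.72 MB/event

# Rough event count for a 50 GB output file at 3.6 MB/event.
n_events = events_in_file(50.0, 3.6)
print(f"about {n_events:.0f} events")
```

At that event size, a file of 50+ GB corresponds to roughly fourteen thousand events or more, which points at the lack of splitting rather than the per-event size.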
Hello, as @youyingli said I also don't understand the technical details of why the events are not getting properly split. This skim is a common EXO skim serving multiple analyses, so I'd guess it's very much still needed. Maybe there are filters we can implement without losing too much information, though. Needs to be discussed in EXO. |
We had 3 more paused jobs over the weekend. The tarballs can be found here:
We also copied the RAW input files here:
|
Dear all,
As reported in cms talk, we have a paused job for the workflow PromptReco_Run381067_JetMET1 in Run 381067, with the following error:
/afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FatalRootError/ByteCountTooLarge/vocms013.cern.ch-3587397-3-log.tar.gz
/afs/cern.ch/user/m/mpresill/public/ORM_May29/CMSSW_14_0_7/src/job/WMTaskSpace/cmsRun1/cmsRun1-stdout_original.log
Matteo (ORM)