Fatal Root Error opening CMSSW_14 files with CMSSW_12_4_9_patch1 #46634
A new Issue was created by @belforte. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Are these #41348 #45888 (comment) relevant ? |
The backport was merged in 12_4_20, while the example job used CMSSW_12_4_9_patch1. |
assign core |
New categories assigned: core @Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks |
No, those are for the other direction, >= 13_0_X reading files produced with 12_4_X. |
I don't think there is much we can do. From the software point of view the file is incompatible with the particular version of the software, and it just happens that in this specific case the software can be fixed (by moving to a recent 12_4_X). Well, one possible (but not necessarily good) way would be to disallow the use of 12_4_X with X < 20 in CRAB (and similarly for other release cycles where the backport was included). |
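For illustration, the "12_4_X with X < 20" check mentioned above could be sketched roughly as follows. This is only a sketch, not CRAB code: the names `MIN_PATCHED`, `parse_release` and `lacks_backport` are hypothetical, and the only threshold taken from this thread is 12_4_20; thresholds for other release cycles are not stated here and would have to be filled in.

```python
import re

# Assumption: only the 12_4_X threshold (20) is stated in this thread.
# Entries for other release cycles are unknown here and deliberately omitted.
MIN_PATCHED = {(12, 4): 20}

# Matches the leading "CMSSW_<major>_<minor>_<patch>" part of a release name,
# ignoring suffixes such as "_patch1" or "_pre3".
_RELEASE_RE = re.compile(r"^CMSSW_(\d+)_(\d+)_(\d+)")


def parse_release(name):
    """Return (major, minor, patch) for a name like 'CMSSW_12_4_9_patch1'."""
    m = _RELEASE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized release name: {name!r}")
    return tuple(int(g) for g in m.groups())


def lacks_backport(name):
    """True if the release is in a listed cycle but older than its threshold."""
    major, minor, patch = parse_release(name)
    threshold = MIN_PATCHED.get((major, minor))
    return threshold is not None and patch < threshold
```

With this sketch, `lacks_backport("CMSSW_12_4_9_patch1")` is true and `lacks_backport("CMSSW_12_4_20")` is false; release cycles not listed in `MIN_PATCHED` are never flagged.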
(and just for the record, the exception category is |
So I misunderstood what 12_4_x meant. OK. And yes, I am perfectly aware that Was it naive to assume "Fatal Root Error = corrupted file" ? Shall I make an exclusion for @haozturk what do you think ? I fear false positives more than false negatives here. Stefano |
I wrote that comment only for clarification for a future framework-minded reader, who might wonder the technical details of
I assume you mean something along "Fatal Root Error" in the context of I'd guess the "Fatal Root Error" during file open or file read is still more often caused by data corruption than by something else. However, we can't reliably tell whether a given failure is caused by data corruption or a problem in the code. In case of decompression errors, our experience tells us the cause is very likely data corruption. In case of other errors, who knows. In this particular case, the symptoms matched a known "problem in the code" case, and we can say with relatively high confidence that this is the cause. But technically exactly the same symptoms could also be caused by data corruption. That just seems a much less probable cause. So in a way we are building a "knowledge base" of the likely causes of various errors (which has some similarities to what we do in
To be practical, I think that would be a reasonable thing to do. I can't exclude that some day we would have a case where this particular problem is caused by corrupted data, but I hope the probability of such a case would be tiny. |
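A minimal sketch of what such an exclusion inside a "knowledge base" of likely causes might look like, assuming failures are described by exit code, log text and release name. Everything here (`Failure`, `Rule`, `RULES`, `classify`, the verdict strings) is hypothetical and not part of CRAB or CMSSW; only exit code 8020, the "Fatal Root Error" text and the 12_4_20 threshold come from this thread. Note that the sketch does not check which release produced the input file, which a real rule would probably need.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass
class Failure:
    exit_code: int
    log_text: str
    release: str  # release the job ran with, e.g. "CMSSW_12_4_9_patch1"


@dataclass
class Rule:
    description: str
    matches: Callable[[Failure], bool]
    verdict: str


def _is_fatal_root_error(f):
    return f.exit_code == 8020 and "Fatal Root Error" in f.log_text


def _old_12_4(release):
    # 12_4_X with X < 20, i.e. before the backport mentioned in this thread.
    m = re.match(r"^CMSSW_12_4_(\d+)", release)
    return m is not None and int(m.group(1)) < 20


# Ordered from most specific to most generic; the first match wins.
RULES = [
    Rule("Fatal Root Error with 12_4_X, X < 20",
         lambda f: _is_fatal_root_error(f) and _old_12_4(f.release),
         "release_incompatibility"),
    Rule("Fatal Root Error, no known code-side explanation",
         _is_fatal_root_error,
         "suspected_corruption"),
]


def classify(failure):
    """Return the verdict of the first matching rule, or 'unknown'."""
    for rule in RULES:
        if rule.matches(failure):
            return rule.verdict
    return "unknown"
```

Keeping the specific exclusion ahead of the generic rule means new known "problem in the code" cases can be added as extra entries without touching the corruption rule itself.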
Yeah. I'll build the knowledge base thingy. :-) |
+core |
This issue is fully signed and ready to be closed. |
We got a storm of exit code 8020 from CRAB jobs using CMSSW_12_4_9_patch1 to read
/ParkingDoubleMuonLowMass0/Run2024F-PromptReco-v1/MINIAOD
produced with CMSSW_14. See an example in https://cmsweb.cern.ch:8443/scheddmon/0120/qinju/241106_091326:qinju_crab_0_Run2024Fv1_MINIAOD/job_out.33.0.txt
The error is
as in #43882 (thanks @AdrianoDee for pointing to that)
But looking in that issue, it says that the fix has been backported to CMSSW_12_4_x.
Why did those jobs not exit with e.g. 8027 FormatIncompatibility ?
Exit with 8020 plus
Fatal Root Error
make those input files candidates for suspected file corruption, and they would enter the new "automatic fix" pipeline. We'd rather not have a whole good dataset go that way :-(