Fatal Root Error opening CMSSW_14 files with CMSSW_12_4_9_patch1 #46634

Closed
belforte opened this issue Nov 8, 2024 · 14 comments
@belforte

belforte commented Nov 8, 2024

We got a storm of exit code 8020 failures from CRAB jobs using CMSSW_12_4_9_patch1 to read /ParkingDoubleMuonLowMass0/Run2024F-PromptReco-v1/MINIAOD, which was produced with CMSSW_14.

See an example in https://cmsweb.cern.ch:8443/scheddmon/0120/qinju/241106_091326:qinju_crab_0_Run2024Fv1_MINIAOD/job_out.33.0.txt

The error is

== CMSSW:       [c] Fatal Root Error: @SUB=TList::Clear
== CMSSW: A list is accessing an object (0x148a6d75a980) already deleted (list name = TList)

as in #43882 (thanks @AdrianoDee for pointing to that)

But that issue says the fix has been backported to CMSSW_12_4_x.

Why did those jobs not exit with e.g. 8027 FormatIncompatibility?

Exit code 8020 plus Fatal Root Error makes those input files candidates for suspected file corruption, and they would enter the new "automatic fix" pipeline. We'd rather not have a whole good dataset go that way :-(

@belforte belforte changed the title from "Error opening CMSSW_14 files with CMSSW_12_4_9_patch1" to "Fatal Root Error opening CMSSW_14 files with CMSSW_12_4_9_patch1" Nov 8, 2024
@cmsbuild
Contributor

cmsbuild commented Nov 8, 2024

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Nov 8, 2024

A new Issue was created by @belforte.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@vlimant
Contributor

vlimant commented Nov 8, 2024

Are #41348 and #45888 (comment) relevant?

@makortel
Contributor

makortel commented Nov 8, 2024

> But that issue says the fix has been backported to CMSSW_12_4_x.

The backport was merged in 12_4_20, while the example job used CMSSW_12_4_9_patch1.

@makortel
Contributor

makortel commented Nov 8, 2024

assign core

@cmsbuild
Contributor

cmsbuild commented Nov 8, 2024

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

makortel commented Nov 8, 2024

> Are #41348 and #45888 (comment) relevant?

No, those are for the other direction: >= 13_0_X reading files produced with 12_4_X.

@makortel
Contributor

makortel commented Nov 8, 2024

> Exit code 8020 plus Fatal Root Error makes those input files candidates for suspected file corruption, and they would enter the new "automatic fix" pipeline. We'd rather not have a whole good dataset go that way :-(

I don't think there is much we can do. From the software point of view, the file is incompatible with this particular version of the software, and it just happens that in this specific case the software can be fixed (by moving to a recent 12_4_X).

Well, one possible (but not necessarily good) way would be to disallow the use of 12_4_X with X < 20 in CRAB (and similarly for other release cycles where the backport was included).
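
For illustration only, a minimal sketch of what such a release check might look like on the CRAB side. The function names and the `MIN_PATCHED_RELEASE` table are hypothetical; the only value taken from this thread is that the backport landed in 12_4_20. Thresholds for other release cycles are not given here and would have to come from the framework team.

```python
import re

# Hypothetical table: (major, minor) release cycle -> first Z that includes the backport.
# Only 12_4 is filled in, because this thread states the backport was merged in 12_4_20.
MIN_PATCHED_RELEASE = {(12, 4): 20}

# Matches names such as CMSSW_12_4_9_patch1 or CMSSW_12_4_20.
_RELEASE_RE = re.compile(r"^CMSSW_(\d+)_(\d+)_(\d+)(?:_.*)?$")


def parse_release(name):
    """Return (major, minor, z) as integers for a CMSSW_X_Y_Z[_suffix] name."""
    match = _RELEASE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized release name: {name}")
    return tuple(int(group) for group in match.groups())


def has_backport(name):
    """True/False if the cycle is in the table, None if the cycle is unknown."""
    major, minor, z = parse_release(name)
    minimum = MIN_PATCHED_RELEASE.get((major, minor))
    if minimum is None:
        return None
    return z >= minimum


print(has_backport("CMSSW_12_4_9_patch1"))  # release used by the failing jobs
print(has_backport("CMSSW_12_4_20"))        # release where the backport was merged
```

Returning `None` for cycles missing from the table keeps "unknown" distinct from "known to be too old", so a caller could choose to warn rather than block.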

@makortel
Contributor

makortel commented Nov 8, 2024

(and just for the record, the exception category is FileOpenError, and the Fatal Root Error is reported only as a part of the exception context)

@belforte
Author

belforte commented Nov 8, 2024

So I misunderstood what 12_4_x meant. OK.
Well... if it can't be fixed, it stays broken. If we now accept that CMSSW cannot always detect incompatible data/code, I will not mind adding a check inside CRAB, but I need a compatibility matrix from you. Personally I am all for "always use the latest Z in CMSSW_X_Y_Z", which would make such a check easy to do, but somehow this message needs to be adopted by Physics Coordination.

And yes, I am perfectly aware that the exception category is FileOpenError and that the Fatal Root Error is reported only as a part of the exception context! I have code which parses the exception context to help tell corrupted files apart from file-not-found or similar errors.

Was it naive to assume "Fatal Root Error = corrupted file"? Shall I make an exclusion for "already deleted (list name = TList)"?

@haozturk what do you think? I fear false positives more than false negatives here.

Stefano

@makortel
Contributor

makortel commented Nov 8, 2024

> And yes, I am perfectly aware that the exception category is FileOpenError and that the Fatal Root Error is reported only as a part of the exception context! I have code which parses the exception context to help tell corrupted files apart from file-not-found or similar errors.

I wrote that comment only as a clarification for a future framework-minded reader, who might wonder about the technical details of FileOpenError and Fatal Root Error here once the link to the job log no longer works.

> Was it naive to assume "Fatal Root Error = corrupted file"?

I assume you mean something along the lines of "Fatal Root Error" in the context of FileOpenError (8020) or FileReadError (8021) exceptions (and not the FatalRootError (8022) exception itself).

I'd guess a "Fatal Root Error" during file open or file read is still more often caused by data corruption than by something else. However, we can't reliably tell whether a given failure is caused by data corruption or by a problem in the code. For decompression errors, our experience is that the cause is very likely data corruption. For other errors, who knows.

In this particular case, the symptoms matched a known "problem in the code" case, and we can say with relatively high confidence that this is the cause. Technically, though, exactly the same symptoms could also be caused by data corruption. That just seems a much less probable cause.

So in a way we are building a "knowledge base" of the likely causes of various errors (which has some similarities to what we do in InitRootHandlers for ROOT messages that we know, from experience, must or must not lead to application termination).

> Shall I make an exclusion for "already deleted (list name = TList)"?

To be practical, I think that would be a reasonable thing to do. I can't exclude that some day we would have a case where this particular problem is caused by corrupted data, but I hope the probability of such a case is tiny.
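
As a rough sketch of the exclusion discussed above (not the actual CRAB parsing code, which is not shown in this thread): the function and list names below are hypothetical, and the pattern is built from the two log lines quoted at the top of this issue. The exit codes 8020 and 8021 are the ones named in this comment.

```python
import re

# Exit codes named in this thread.
FILE_OPEN_ERROR = 8020
FILE_READ_ERROR = 8021

# Hypothetical "knowledge base": exception-context patterns that experience says
# point to a code/version problem rather than to a corrupted file.
KNOWN_NOT_CORRUPTION = [
    # The TList::Clear symptom from this issue (old release reading newer files).
    re.compile(
        r"@SUB=TList::Clear.*?already deleted \(list name = TList\)",
        re.DOTALL,
    ),
]


def is_suspected_corruption(exit_code, exception_context):
    """Flag a file as suspect only for 8020/8021 with a Fatal Root Error
    in the exception context that matches no known non-corruption pattern."""
    if exit_code not in (FILE_OPEN_ERROR, FILE_READ_ERROR):
        return False
    if "Fatal Root Error" not in exception_context:
        return False
    return not any(p.search(exception_context) for p in KNOWN_NOT_CORRUPTION)


# Context copied from the job log quoted in this issue.
context = (
    "== CMSSW:       [c] Fatal Root Error: @SUB=TList::Clear\n"
    "== CMSSW: A list is accessing an object (0x148a6d75a980) "
    "already deleted (list name = TList)"
)
print(is_suspected_corruption(8020, context))
```

Keeping the exclusions in a list means each newly understood symptom becomes one more pattern, which leans toward avoiding false positives at the price of a possible rare false negative.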

@belforte
Author

belforte commented Nov 8, 2024

Yeah. I'll build the knowledge base thingy. :-)

@belforte belforte closed this as completed Nov 8, 2024
@makortel
Contributor

makortel commented Nov 8, 2024

+core

@cmsbuild
Contributor

cmsbuild commented Nov 8, 2024

This issue is fully signed and ready to be closed.
