Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exit code 8028 not being reported anymore #42040

Closed
6 tasks done
makortel opened this issue Jun 21, 2023 · 16 comments
Closed
6 tasks done

Exit code 8028 not being reported anymore #42040

makortel opened this issue Jun 21, 2023 · 16 comments

Comments

@makortel
Copy link
Contributor

makortel commented Jun 21, 2023

It was noticed in https://cms-talk.web.cern.ch/t/not-possible-to-open-2022g-data-files-crab-or-manual-xroot/25738/8 that CMSSW_13_0_5_patch2 reported exit code 8020 (FileOpenError) after a fallback file open fails instead of the expected 8028 (FallbackFileOpenError). Digging in history revealed the #28911 had a mistake (that I overlooked in the review) by throwing FileOpenError also when a fallback file open fails.

#28911 was merged in 11_1_0_pre8, so the fix needs to be backported to

@makortel
Copy link
Contributor Author

assign core

@cmsbuild
Copy link
Contributor

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@stlammel
Copy link

Thanks Matti!

  • Stephan

@makortel
Copy link
Contributor Author

The problematic piece of code is here

for (std::vector<std::string>::const_iterator it = fNames.begin(); it != fNames.end(); ++it) {
try {
std::unique_ptr<char[]> name(gSystem->ExpandPathName(it->c_str()));
filePtr = std::make_shared<InputFile>(name.get(), " Initiating request to open file ", inputType);
usedFallback_ = (it != fNames.begin());
break;
} catch (cms::Exception const& e) {
if (!skipBadFiles && std::next(it) == fNames.end()) {
InputFile::reportSkippedFile((*it), logicalFileName());
Exception ex(errors::FileOpenError, "", e);
ex.addContext("Calling RootInputFileSequence::initTheFile()");
std::ostringstream out;
out << "Input file " << (*it) << " could not be opened.";
ex.addAdditionalInfo(out.str());
//report previous exceptions when use other names to open file
for (auto const& s : exInfo)
ex.addAdditionalInfo(s);
throw ex;
} else {
exInfo.push_back("Calling RootInputFileSequence::initTheFile(): fail to open the file with name " + (*it));
}
}
}

The catch block should throw errors::FileOpenError only if there are no fallback files (e.g. fNames.size() == 1, or exInfo.empty()), and otherwise throw errors::FallbackFileOpenError.

We should also add a test to ensure this behavior.

@makortel
Copy link
Contributor Author

@wddgit is looking into this.

https://github.com/cms-sw/cmssw/blob/master/FWCore/Services/test/test_sitelocalconfig.sh has tests that set a custom site-local-config.xml and storage.json.

I think it would be good to have one test with one <catalog> element under <data-access> that ensures the exit code 8020 is used when there are no fallbacks, and another test with two <catalog> elements to ensure 8028 is used then.

Here is an example how to check the exit code is as expected

cmsRun -j PoolGuidTest_jobreport.xml ${LOCAL_TEST_DIR}/PoolGUIDTest_cfg.py file:PoolInputTest.root && die 'PoolGUIDTest_cfg.py PoolInputTest.root did not throw an exception' 1
GUID_EXIT_CODE=$(edmFjrDump --exitCode PoolGuidTest_jobreport.xml)
if [ "x${GUID_EXIT_CODE}" != "x8034" ]; then
echo "Inconsistent GUID test reported exit code ${GUID_EXIT_CODE} which is different from the expected 8034"
exit 1
fi

@wddgit
Copy link
Contributor

wddgit commented Jul 12, 2023

PR #42249 should fix this on the master branch.

@wddgit
Copy link
Contributor

wddgit commented Jul 26, 2023

@stlammel

Hi Stephan. When I try to backport the fix to 12_4_X, the new unit test starts to fail. It worked successfully for 13_3_X, 13_2_X, 13_1_X, 13_0_X, 12_6_X and 12_5_X. The last one Matti asked me to backport to was 12_4_X... At first glance this appears to be a failure with the unit test not the bug fix.

I am investigating, but if you understand what is going on that would be helpful. For some reason it is trying 2 file paths with the first catalog instead of only one. And the second one gives a fallback error. Probably something in the catalog code changed from 12_4_X to 12_5_X.

===== Test "TestFileOpenErrorExitCode" ====
27-Jul-2023 00:01:00 CEST  Initiating request to open file root://eoscms.cern.ch//eos/cms/store/FileThatDoesNotExist.root
%MSG-w XrdAdaptorInternal:  file_open 27-Jul-2023 00:01:01 CEST pre-events
Failed to open file at URL root://eoscms.cern.ch:1094//eos/cms/store/FileThatDoesNotExist.root?xrdcl.requuid=d40babe1-e7ff-421c-8a7f-6f8c2e77f9e8.
%MSG
%MSG-w XrdAdaptorInternal:  file_open 27-Jul-2023 00:01:01 CEST pre-events
Failed to open file at URL root://eoscms.cern.ch:1094//eos/cms/store/FileThatDoesNotExist.root?tried=&xrdcl.requuid=99ca406b-6122-40cb-b389-b80751e6fa27.
%MSG
27-Jul-2023 00:01:01 CEST  Initiating request to open file root://xrootd-cms.infn.it//store/FileThatDoesNotExist.root
%MSG-w XrdAdaptorInternal:  file_open 27-Jul-2023 00:02:43 CEST pre-events
Failed to open file at URL root://cms-xrd-global.cern.ch:1094//store/FileThatDoesNotExist.root?tried=+1213xrootd-redic.pi.infn.it&xrdcl.requuid=f809f011-147c-4a7d-892f-fbb7d0af25f5.
%MSG
%MSG-w XrdAdaptorInternal:  file_open 27-Jul-2023 00:02:43 CEST pre-events
Failed to open file at URL root://cms-xrd-global.cern.ch:1094//store/FileThatDoesNotExist.root?tried=+1213xrootd-redic.pi.infn.it,&xrdcl.requuid=a99ffa36-d13d-449b-8db1-a0bc29aaa2b5.
%MSG
----- Begin Fatal Exception 27-Jul-2023 00:02:43 CEST-----------------------
An exception of category 'FallbackFileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling XrdFile::open()
Exception Message:
Failed to open the file 'root://xrootd-cms.infn.it//store/FileThatDoesNotExist.root'
   Additional Info:
      [a] Calling RootInputFileSequence::initTheFile(): fail to open the file with name root://eoscms.cern.ch//eos/cms/store/FileThatDoesNotExist.root
      [b] Input file root://xrootd-cms.infn.it//store/FileThatDoesNotExist.root could not be opened.
      [c] XrdCl::File::Open(name='root://xrootd-cms.infn.it//store/FileThatDoesNotExist.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] No servers are available to read the file.
' (errno=3011, code=400). No additional data servers were found.
      [d] Last URL tried: root://cms-xrd-global.cern.ch:1094//store/FileThatDoesNotExist.root?tried=+1213xrootd-redic.pi.infn.it,&xrdcl.requuid=a99ffa36-d13d-449b-8db1-a0bc29aaa2b5
      [e] Problematic data server: cms-xrd-global.cern.ch:1094
      [f] Disabled source: cms-xrd-global.cern.ch:1094
----- End Fatal Exception -------------------------------------------------
Exit code after first run of test_fileOpenErrorExitCode_cfg.py is 8028
Unexpected cmsRun exit code after FileOpenError, exit code from jobReport 8028 which is different from the expected 8020

---> test TestFileOpenErrorExitCode had ERRORS

@stlammel
Copy link

Hallo Dave,
the second open is the fallback to the global redirector. Is 12_4_X maybe a CMSSW using the old PhEDEx/storage.xml i.e. not the new code? 8028, i.e. "FileOpenError with fallback" would be the right error for what happened.

  • Stephan

@wddgit
Copy link
Contributor

wddgit commented Jul 27, 2023

I had to modify the unit test a little for 12_4_X. I think the difference was it uses the older code. But the bug fix is the same as in all the other backports.

All the backports are submitted now. 6 PRs waiting for approval.

@makortel
Copy link
Contributor Author

makortel commented Aug 2, 2023

Is 12_4_X maybe a CMSSW using the old PhEDEx/storage.xml i.e. not the new code?

Correct, the Rucio catalog code has not yet been backported to 12_4_X, and therefore, strictly speaking, the tests of the 12_4_X backport (#42396) are incorrect. Given that the Rucio catalog was planned to be backported to 12_4_X, redoing the tests of #42396 for TFC seems wasted effort.

@stlammel @nhduongvn Would it be feasible to resume backporting the Rucio catalog (#37278 (comment))?

@nhduongvn
Copy link
Contributor

Yes, I can backport new developments to 12_4_X sometime next week so that we can include the support of chaining as well.

@wddgit
Copy link
Contributor

wddgit commented Aug 21, 2023

For the 12_4_X bug fix backport only, we decided to go ahead with the backport of the bug fix without the unit test. The plan is to add the unit test either in the 12_4_X Rucio backport (when that is ready) or as a separate PR after that. Then we can use the same unit test in 12_4_X as we are using in master and the other bug fix backports. Given that we are already running the unit test in master and we've manually verified the bug fix works in 12_4_X, that seems good enough.

@makortel
Copy link
Contributor Author

Backports of the fixes have been merged

@makortel
Copy link
Contributor Author

+core

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants