Problem overwriting/unlink simulation output file #44369

Open
fabferro opened this issue Mar 11, 2024 · 20 comments

Comments

@fabferro
Contributor

I'm running the PPS Full Simulation with a particle gun, but when I run it for the second time I get the following error:
----- Begin Fatal Exception 11-Mar-2024 15:47:36 CET-----------------------
An exception of category 'FatalRootError' occurred while
[0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
Additional Info:
[a] Fatal Root Error: @sub=TStorageFactorySystem::Unlink
Unsupported

----- End Fatal Exception -------------------------------------------------

The error disappears if I delete the output root file and re-run the simulation.
It started happening a few weeks ago; it never happened before.
It happens in CMSSW_14_0_0 but also in other releases.
It happens both with lxplus and lxplus7.
The file I'm running is https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py
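
The manual remedy described above (deleting the output ROOT file before re-running) could be scripted roughly as follows. This is only a minimal sketch, assuming the output sits at a path that ordinary filesystem calls can reach; the helper name `remove_stale_output` and the file name in the usage comment are hypothetical, not CMSSW names.

```python
import os


def remove_stale_output(path):
    """Delete a leftover output file so the next job does not have to overwrite it.

    Returns True if a file was removed, False if there was nothing to remove.
    """
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


# Hypothetical usage, right before launching cmsRun:
# remove_stale_output("my_output.root")
```

This only sidesteps the failing overwrite; it does not address why the overwrite fails.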

@cmsbuild
Contributor

cmsbuild commented Mar 11, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @fabferro.

@rappoccio, @makortel, @Dr15Jones, @smuzaffar, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

I'm not able to reproduce on lxplus8 or lxplus9 on either /tmp or on AFS.

Could you give more details, e.g. on what filesystem you are running? Are you using https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py exactly as it is, or do you change process.o1.fileName in any way?

@fabferro
Contributor Author

> I'm not able to reproduce on lxplus8 or lxplus9 on either /tmp or on AFS.
>
> Could you give more details, e.g. on what filesystem you are running? Are you using https://github.com/cms-sw/cmssw/blob/master/SimPPS/Configuration/test/pg_step1_GEN_SIM_2021.py exactly as it is, or do you change process.o1.fileName in any way?

I ran it as it is. I also tried modifying it, but nothing changes.

@fabferro
Contributor Author

Trying some differential analysis:
I installed two brand new releases (14_0_0 and 13_3_2) on the same machine (lxplus958) in the same shell.
The problem appears only in 14_0_0, not in 13_3_2. I also ran a RECO script and it does the same.
The output root file can't be re-written, as if it was locked.

@fabferro
Contributor Author

One more piece of information: it works fine with "pure" AFS, so it seems to be related to some bad interplay between EOS and CMSSW_14_0_0

@fabferro
Contributor Author

The last working release is CMSSW_14_0_0_pre1; _pre2 is the first one showing this issue.

@makortel
Contributor

I can reproduce when running the job in a directory on EOS (via the FUSE mount). A major difference between 14_0_0_pre1 and pre2 is that pre1 used ROOT 6.26, and pre2 uses ROOT 6.30.

Here is a stack trace for the exception

(gdb) where
#0  0x00007ffff5ead0f1 in __cxxabiv1::__cxa_throw (obj=0x7fffa3fb6b80, tinfo=0x7ffff79a3650 <typeinfo for edm::Exception>, dest=0x7ffff796d010 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007ffff173099d in (anonymous namespace)::RootErrorHandlerImpl(int, char const*, char const*) [clone .cold] () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007ffff6ceea5b in ErrorHandler () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCore.so
#3  0x00007ffff6c3e214 in TObject::Error(char const*, char const*, ...) const () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCore.so
#4  0x00007ffff238a56d in TStorageFactorySystem::Unlink(char const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#5  0x00007ffff238dba5 in TStorageFactoryFile::Initialize(char const*, char const*) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#6  0x00007ffff238dd54 in TStorageFactoryFile::TStorageFactoryFile(char const*, char const*, char const*, int, int, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolTFileAdaptor.so
#7  0x00007fffe94d20b9 in ?? ()
#8  0x00007fff00000000 in ?? ()
#9  0x00007fffa3aa1640 in ?? ()
#10 0x00007fffa3aa1640 in ?? ()
#11 0x00007fffffff2a90 in ?? ()
#12 0x00007fffbc5aa4f1 in ?? () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#13 0x00007fffffff2999 in ?? ()
#14 0x00007ffff3572920 in ?? ()
#15 0x00007fffea710062 in TClingCallFunc::IFacePtr() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libCling.so
#16 0x0000000400000000 in ?? ()
#17 0x00007fffe3ba22a0 in ?? ()
#18 0x00007fffa3a7c580 in ?? ()
#19 0x00007fffffff2a10 in ?? ()
#20 0x00007fffffff2ee0 in ?? ()
#21 0x00007fffffff2b50 in ?? ()
#22 0x00007ffff7154852 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libRIO.so
#23 0x00007ffff7153d69 in TFile::Open(char const*, char const*, char const*, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/external/el9_amd64_gcc12/lib/libRIO.so
#24 0x00007fffbc59a8cc in edm::RootOutputFile::RootOutputFile(edm::PoolOutputModule*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#25 0x00007fffbc589437 in edm::PoolOutputModule::reallyOpenFile() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#26 0x00007fffbc589591 in virtual thunk to edm::PoolOutputModule::openFile(edm::FileBlock const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libIOPoolOutput.so
#27 0x00007ffff7deb0b8 in edm::Schedule::openOutputFiles(edm::FileBlock&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#28 0x00007ffff7d4210d in edm::EventProcessor::openOutputFiles() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#29 0x00007ffff7d4776e in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_0/lib/el9_amd64_gcc12/libFWCoreFramework.so
#30 0x00000000004074f5 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#31 0x00007ffff6f0f96d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_0_pre2_SKYLAKEAVX512-el9_amd64_gcc12/build/CMSSW_14_0_0_pre2_SKYLAKEAVX512-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-c38983dfefd2a4afa504d4856ead176c/tbb-v2021.9.0/src/tbb/arena.cpp:688
#32 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#33 0x000000000040517c in main ()

The TStorageFactorySystem::Unlink() is called from

if (recreate) {
  if (!gSystem->AccessPathName(path, kFileExists))
    gSystem->Unlink(path);

and our TStorageFactorySystem::Unlink() is indeed implemented as
Int_t TStorageFactorySystem::Unlink(const char * /*name*/) {
  Error("Unlink", "Unsupported");
  return 1;
}

The TStorageFactorySystem is registered to ROOT in

mgr->AddHandler("TSystem", type, "TStorageFactorySystem", "IOPoolTFileAdaptor", "TStorageFactorySystem()");

mgr->AddHandler(
    "TSystem", type, "TStorageFactorySystem", "IOPoolTFileAdaptor", "TStorageFactorySystem(const char *,Bool_t)");

As for why the underlying filesystem makes a difference, I have no clue at the moment.

@makortel
Contributor

makortel commented Mar 12, 2024

Two possible workarounds

  1. Use AFS or "local disk" for running CMSSW instead of EOS
  2. Add process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root"))) to the configuration file

@makortel
Contributor

With gdb I found that when running on EOS, the path in


is root://eoshome-m.cern.ch/${PWD}<filename>. I'd bet this somehow makes ROOT's TUnixSystem not unlink the file, leading to our TStorageFactorySystem::Unlink() being called.

I checked the behavior on 14_0_0_pre1, and the path was just the <filename>.
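
To make the hypothesis concrete, here is a toy Python model of the suspected dispatch. It is not ROOT or CMSSW code: the classes `LocalSystem` and `StubRemoteSystem` and the rule in `pick_system` are hypothetical stand-ins. It only illustrates the suspected mechanism: a bare file name is handled by a helper that can delete files, while the same file addressed with a root:// prefix is handed to a handler whose unlink is unsupported.

```python
import os
from urllib.parse import urlparse


class LocalSystem:
    """Stand-in for a local-filesystem helper: unlink really deletes the file."""

    def unlink(self, path):
        os.remove(path)
        return 0


class StubRemoteSystem:
    """Stand-in for a handler whose unlink is not implemented."""

    def unlink(self, path):
        raise NotImplementedError("Unlink: Unsupported")


def pick_system(path):
    """Pick a handler from the path's URL scheme (hypothetical dispatch rule)."""
    scheme = urlparse(path).scheme
    return StubRemoteSystem() if scheme == "root" else LocalSystem()
```

In this model a bare `<filename>` (the pre1 case) is deleted normally, whereas a root://-prefixed path (the pre2 case) ends in the "Unsupported" error.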

@makortel
Contributor

type root

@cmsbuild added the root label Mar 12, 2024
@makortel
Contributor

makortel commented Mar 12, 2024

@pcanal Did ROOT get an ability to find out if a local file is on (CERN) EOS, and in which case it prepends the file path with root://eoshome-m.cern.ch/ (or similar) somewhere between 6.26 and 6.30?

@makortel
Contributor

This workaround seems to work too

process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))

@makortel
Contributor

Setting output file as file:<filename> does not have an impact.

@makortel
Contributor

makortel commented Mar 12, 2024

> Did ROOT get an ability to find out if a local file is on (CERN) EOS, and in which case it prepends the file path with root://eoshome-m.cern.ch/ (or similar) somewhere between 6.26 and 6.30?

I found root-project/root#11644. It pointed to another workaround, adding

TFile.CrossProtocolRedirects: 0

to $HOME/.rootrc.
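
For anyone who wants to apply this setting from a script, a minimal sketch follows. The helper `ensure_rootrc_setting` is hypothetical (not part of ROOT or CMSSW); it simply appends the line quoted above to a rootrc file unless an identical line is already present.

```python
import os

SETTING = "TFile.CrossProtocolRedirects: 0"


def ensure_rootrc_setting(rootrc_path, setting=SETTING):
    """Append `setting` to the rootrc file unless an identical line exists.

    Returns True if the file was modified, False otherwise.
    """
    content = ""
    if os.path.exists(rootrc_path):
        with open(rootrc_path) as f:
            content = f.read()
    if any(line.strip() == setting for line in content.splitlines()):
        return False
    with open(rootrc_path, "a") as f:
        # Start on a fresh line if the existing file lacks a trailing newline.
        if content and not content.endswith("\n"):
            f.write("\n")
        f.write(setting + "\n")
    return True


# Hypothetical usage:
# ensure_rootrc_setting(os.path.expanduser("~/.rootrc"))
```

Note that the sketch does not detect an existing `TFile.CrossProtocolRedirects` line with a different value; that case would need to be edited by hand.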

@pcanal
Contributor

pcanal commented Mar 12, 2024

> Did ROOT get an ability to find out if a local file is on (CERN) EOS, and in which case it prepends the file path with root://eoshome-m.cern.ch/ (or similar) somewhere between 6.26 and 6.30?

Yes in v6.28. (the PR you found).

@makortel
Contributor

@pcanal Is there a way to choose the behavior per TFile? (I'm thinking like allowing this redirection for input files, but disabling it for output files) From the PR I'd guess "no".

@pcanal
Contributor

pcanal commented Mar 13, 2024

If you know it is a local file and want to stay local, you use new TFile instead of TFile::Open

Projects
Status: Work in CMS