Question on how TFileService is supposed to interact with eos #46024

Open
mmusich opened this issue Sep 17, 2024 · 14 comments

@mmusich
Contributor

mmusich commented Sep 17, 2024

I have a naive question about the expected behavior of TFileService when it is configured to (over-)write files on EOS.
While trying to re-run some alignment-related jobs, @henriettepetersen reported a segmentation fault in SplitVertexResolution (stack trace below):

Thread 1 (Thread 0x7f0c5f6c8640 (LWP 947722) "cmsRun"):
#0  0x00007f0c5e9019ff in poll () from /lib64/libc.so.6
#1  0x00007f0c5a5bf09f in full_read.constprop () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007f0c5a5744ec in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#3  0x00007f0c5a574670 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f0c5ff8ba47 in TStreamerInfo::ForceWriteInfo(TFile*, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el9_amd64_gcc12/lib/libRIO.so
#6  0x00007f0c607642ae in TTree::BuildStreamerInfo(TClass*, void*, bool) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el9_amd64_gcc12/lib/libTree.so
#7  0x00007f0c60775f72 in TTree::BronchExec(char const*, char const*, void*, bool, int, int) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/external/el9_amd64_gcc12/lib/libTree.so
#8  0x00007f0bffa4acf5 in SplitVertexResolution::beginJob() () from /afs/cern.ch/cms/CAF/CMSALCA/ALCA_TRACKERALIGN/data/commonValidation/legacy_2024_releases/CMSSW_14_0_14/lib/el9_amd64_gcc12/pluginAlignmentOfflinevalidationPlugins.so
#9  0x00007f0c60c1d322 in edm::Worker::beginJob() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/libFWCoreFramework.so
#10 0x00007f0c60c21a59 in edm::WorkerManager::beginJob(edm::ProductRegistry const&, edm::eventsetup::ESRecordsToProductResolverIndices const&, edm::ProcessBlockHelperBase const&) () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/libFWCoreFramework.so
#11 0x00007f0c60b4079f in edm::EventProcessor::beginJob() () from /cvmfs/cms.cern.ch/el9_amd64_gcc12/cms/cmssw/CMSSW_14_0_14/lib/el9_amd64_gcc12/libFWCoreFramework.so
#12 0x000000000040746c in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#13 0x00007f0c5fd8096d in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_0_3-el9_amd64_gcc12/build/CMSSW_14_0_3-build/BUILD/el9_amd64_gcc12/external/tbb/v2021.9.0-d33db04d4520c6ff791eab900054e986/tbb-v2021.9.0/src/tbb/arena.cpp:688
#14 0x0000000000408ee2 in main::{lambda()#1}::operator()() const ()
#15 0x000000000040517c in main ()

(A reproducer is available at /afs/cern.ch/work/h/hpeterse/public/splitV_seg_fault: copy the folder locally into any recent CMSSW release area and run cmsRun validation_cfg.py config=validation.json.)

The issue seems to arise because a file with the same name already exists at the location we are trying to write to.
In particular, the segmentation fault originates here:

tree_->Branch("event", &event_, 64000, 2);

I can circumvent the issue by commenting out that line, but then at run time I see the following warning:

Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/cms/store/group/alca_trackeralign/AlignmentValidation/AlignmentValidation/2024_CDE_ReReco_mp3949_splitV_379525/SplitV/single/GT/compare2024/379525/SplitV.root not opened in write mode

What puzzles me is that when the output file path is local (e.g. in $PWD), there is no issue whatsoever, even if the file already exists there.
Also, I would have thought that because of this:

tFileDirectory_ = TFileDirectory("", "", TFile::Open(fileName_.c_str(), "RECREATE"), "");

the file would have been overwritten anyway.
Also, when trying to prepare a reproducer with a simple ROOT script:

#include "TFile.h"
#include "TTree.h"
#include <iostream>
#include "Alignment/OfflineValidation/src/pvTree.h"
#include "PhysicsTools/FWLite/interface/TFileService.h"
#include <vector>
#include <string>

int test_TTreeEOS() {
  // Define the file path to EOS (replace with your EOS path)
  const std::string eosFilePath = "/eos/cms/store/group/alca_trackeralign/musich/test.root";

  fwlite::TFileService outfile_ = fwlite::TFileService(eosFilePath);
    
  // Create a TTree and a branch  
  pvEvent event_;
  event_.pvs.clear();
  event_.nVtx = -1;

  TTree* tree_ = outfile_.make<TTree>("pvTree", "pvTree");
  tree_->Branch("event", &event_, 64000, 2);
  
  return 0;
}

I found that with this script I can overwrite the remote file as many times as I want.
Am I missing something trivial?

Cc: @TomasKello

@cmsbuild
Contributor

cmsbuild commented Sep 17, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mmusich.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign CommonTools/UtilAlgos

@cmsbuild
Contributor

New categories assigned: reconstruction

@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Since

tFileDirectory_ = TFileDirectory("", "", TFile::Open(fileName_.c_str(), "RECREATE"), "");

calls TFile::Open(), the I/O gets rerouted through our StorageFactory layer, which is also visible in the error message

Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/cms/store/group/alca_trackeralign/AlignmentValidation/AlignmentValidation/2024_CDE_ReReco_mp3949_splitV_379525/SplitV/single/GT/compare2024/379525/SplitV.root not opened in write mode

With the StorageFactory, root:// URLs cause our XrdAdaptor layer to be used for the actual I/O. At a quick look, the XrdAdaptor code appears able to handle writing files too, but I'd guess the writing part hasn't been tested much (since we use xrootd predominantly for reading data).

It may be worth noting here that writing to (CERN) EOS through the FUSE mount has "interesting" behavior as well, see #44369 (ROOT internally transforms the local-looking path into a root:// URL, while the StorageFactory layer continues to act as if the file were local).
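
To make the three cases in this thread easier to tell apart (a root:// URL, a local-looking path on the EOS FUSE mount, and a genuinely local file), here is a small hypothetical helper. It is not part of CMSSW or ROOT: the function name and the returned labels are made up for illustration, and the `/eos/` prefix check assumes the usual CERN mount point.

```python
from urllib.parse import urlparse


def classify_output_path(path: str) -> str:
    """Label an output file name by how it would likely reach storage.

    Hypothetical helper for illustration only:
      'xrootd'   - explicit root:// URL
      'eos-fuse' - local-looking path under /eos/ (assumed FUSE mount point)
      'local'    - anything else
    """
    if urlparse(path).scheme == "root":
        return "xrootd"
    if path.startswith("/eos/"):
        return "eos-fuse"
    return "local"
```

A job script could call this on the configured output name before starting cmsRun and print a notice when the result is not `local`, since those are the cases discussed here.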

@makortel
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Could you check whether adding

process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))

to the job configuration changes the behavior? (This prevents CMSSW from registering the StorageFactory + XrdAdaptor for the root protocol.)
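
For context, a sketch of where that line might sit in a cmsRun configuration. The process name and the output file name are placeholders and are not taken from the reproducer; only the AdaptorConfig line comes from the suggestion itself.

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("Demo")  # placeholder process name

# Output booked through TFileService; the path is a hypothetical example.
process.TFileService = cms.Service(
    "TFileService",
    fileName=cms.string("root://eoscms.cern.ch//eos/cms/store/user/someuser/SplitV.root"),
)

# Let ROOT handle root:// URLs natively instead of going through
# StorageFactory + XrdAdaptor.
process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))
```

The rest of the job (source, modules, paths) is omitted from this fragment.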

@makortel
Contributor

type root

@makortel
Contributor

@pcanal Could there be some error condition (or other assumption) in TStreamerInfo::ForceWriteInfo() that could lead it to segfault instead of reporting an error when writing to CERN EOS through xrootd?

@pcanal
Contributor

pcanal commented Sep 17, 2024

I can see two possibilities. One is that the ROOT build being used does not have the code from root-project/root#13842.

The other, more likely, is that writing to a file opened in read-only mode might not fail gracefully, i.e.

Warning in <TStorageFactoryFile::Write>: file root://eoscms.cern.ch//eos/....SplitV.root not opened in write mode

when/if the file was opened with "RECREATE" indicates that something 'bad' happened during TFile::Open (and undefined behavior might be a consequence thereof).

One possibility is that the file is seen as non-writable (for example, because of a permissions issue) and that some part of the logic in or around TFile::Open is silently falling back to opening the file in read-only mode.
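
One way to probe this hypothesis outside the framework is to check the handle right after opening it. The sketch below is a hypothetical helper (the name `require_writable` is made up); it only relies on the `IsZombie()` and `IsWritable()` methods that ROOT's TFile provides. Whether a StorageFactory-backed file reports `IsWritable()` consistently is not established in this thread.

```python
def require_writable(tfile, url):
    """Return tfile if it is usable for writing, otherwise raise OSError.

    `tfile` is expected to behave like a ROOT TFile, i.e. to provide
    IsZombie() and IsWritable(). `url` is only used in error messages.
    """
    # A failed open yields a null/None handle or a zombie object.
    if not tfile or tfile.IsZombie():
        raise OSError(f"could not open {url}")
    # A silent fallback to read-only mode would show up here.
    if not tfile.IsWritable():
        raise OSError(f"{url} was opened, but not in write mode")
    return tfile
```

In a PyROOT session this could be used as `f = require_writable(ROOT.TFile.Open(url, "RECREATE"), url)` before booking any tree, so that a read-only fallback surfaces as an explicit error rather than later in `TTree::Branch`.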

@mmusich
Contributor Author

mmusich commented Sep 18, 2024

Could you check whether adding

process.add_(cms.Service("AdaptorConfig", native=cms.untracked.vstring("root")))

to the job configuration changes the behavior? (This prevents CMSSW from registering the StorageFactory + XrdAdaptor for the root protocol.)

Indeed, adding this line to the configuration file prevents the segmentation fault. Thank you.

@jfernan2
Contributor

@mmusich is this issue solved? Thanks

@mmusich
Contributor Author

mmusich commented Oct 30, 2024

@jfernan2

is this issue solved?

I am not sure. With the workaround at #46024 (comment), this particular instance of the problem is solved, though I can't say whether the original behavior is a design feature or a bug.

@makortel moved this from New to Work in CMS in ROOT prioritization on Nov 21, 2024