Workflow PromptReco_Run381067_JetMET1 error in CMSSW_14_0_7 #45089

Open
mpresill opened this issue May 29, 2024 · 42 comments

Comments

@mpresill

Dear all,
As reported in cms talk, we have a paused job for the workflow PromptReco_Run381067_JetMET1 in Run 381067, with the following error:

28-May-2024 14:59:57 UTC  Closed file root://eoscms.cern.ch//eos/cms/tier0/store/data/Run2024E/JetMET1/RAW/v1/000/381/067/00000/e186cde3-4166-4946-9343-f0c908376153.root?eos.app=cmst0
----- Begin Fatal Exception 28-May-2024 15:00:07 UTC-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)
  • a tarball with all the logs is here:
    /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FatalRootError/ByteCountTooLarge/vocms013.cern.ch-3587397-3-log.tar.gz
  • the full stack trace can be found here: /afs/cern.ch/user/m/mpresill/public/ORM_May29/CMSSW_14_0_7/src/job/WMTaskSpace/cmsRun1/cmsRun1-stdout_original.log
  • I have not been able to reproduce the error offline so far, but I am now testing the full set of events locally.

Matteo (ORM)

@cmsbuild
Contributor

cmsbuild commented May 29, 2024

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @mpresill.

@makortel, @Dr15Jones, @sextonkennedy, @antoniovilela, @rappoccio, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@francescobrivio
Contributor

assign core

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

I looked at the log of the failure whose exception was in the issue description. The exception is reported after the input file was closed, and the job ultimately dies in a segfault with the following stack trace

Thread 1 (Thread 0x153d50106640 (LWP 1030) "cmsRun"):
#3  0x0000153d4a5c2720 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x0000153d50937a63 in TDirectoryFile::WriteKeys() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#6  0x0000153d509348bc in TDirectoryFile::SaveSelf(bool) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#7  0x0000153d50932d27 in TDirectoryFile::Save() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#8  0x0000153d50932bfc in TDirectoryFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#9  0x0000153d509506d6 in TFile::Close(char const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/external/el8_amd64_gcc12/lib/libRIO.so
#10 0x0000153d4a4d013e in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x0000153d4a4d0169 in TStorageFactoryFile::~TStorageFactoryFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x0000153cab9c0a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x0000153cab9c38e5 in edm::RootOutputFile::~RootOutputFile() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x0000153cab9c4deb in edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x0000153cab9c51b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x0000153cab9fa6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x0000153cab9fa811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x0000153d513eccb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x0000153d513f6d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x0000153d5132a2a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x0000153d51354749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x0000153d51346df7 in edm::EventProcessor::~EventProcessor() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()

Symptoms are the same as in #40132 (comment).

Was any of the jobs re-tried in Tier0?

@makortel
Contributor

FYI @pcanal

@pcanal
Contributor

pcanal commented May 29, 2024

That is strange. It reports a problem with the TFile object itself, i.e. one potential cause is that the TFile was already closed (or is being closed by another thread).

@mpresill
Author

Was any of the jobs re-tried in Tier0?

It was not retried yet. Should this be re-tried?

@makortel
Contributor

Was any of the jobs re-tried in Tier0?

It was not retried yet. Should this be re-tried?

Probably not worth it. I tested the job locally (with a local input file), and it fails in the same way.

@makortel
Contributor

I noticed many printouts from PileupJetIdProducer in the log (responsible for 767 MB of the 844 MB of cmsRun1-stdout.log). I opened a separate issue, #45099, about that.

@mpresill
Author

mpresill commented May 31, 2024

FYI, there is a new failed job for the same run, in the workflow PromptReco_Run381067_JetMET0, with the same error.

  • tarball with error log for this:
    /afs/cern.ch/user/c/cmst0/public/PausedJobs/Run2024E/FatalRootError/ByteCountTooLarge/vocms013.cern.ch-3587397-3-log.tar.gz

@makortel
Contributor

makortel commented Jun 1, 2024

Following #40132 (comment) I tested 14_0_7 with the backport of that (#40132 (comment)), but the behavior was the same

01-Jun-2024 00:21:10 CEST  Closed file file:e186cde3-4166-4946-9343-f0c908376153.root
----- Begin Fatal Exception 01-Jun-2024 00:21:20 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling EventProcessor::runToCompletion (which does almost everything after beginJob and before endJob)
   Additional Info:
      [a] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)

----- End Fatal Exception -------------------------------------------------

<cut>

Thread 1 (Thread 0x7f98df012680 (LWP 316619) "cmsRun"):
#3  0x00007f98dab6a720 in sig_dostack_then_abort () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007f98e1e0a48a in TDirectoryFile::WriteKeys (this=0x7f98caf31580) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:2187
#6  0x00007f98e1e08228 in TDirectoryFile::SaveSelf (this=0x7f98caf31580, force=false) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1634
#7  0x00007f98e1e07c3c in TDirectoryFile::Save (this=0x7f98caf31580) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1552
#8  0x00007f98e1e03c59 in TDirectoryFile::Close (this=0x7f98caf31580, option=0x7f98da1a6a26 "") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:568
#9  0x00007f98e1e1d640 in TFile::Close (this=0x7f98caf31580, option=0x7f98da1a6a26 "") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TFile.cxx:970
#10 0x00007f98da1a413e in TStorageFactoryFile::~TStorageFactoryFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#11 0x00007f98da1a4169 in TStorageFactoryFile::~TStorageFactoryFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolTFileAdaptor.so
#12 0x00007f9840175a97 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#13 0x00007f98401788e5 in edm::RootOutputFile::~RootOutputFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#14 0x00007f9840179deb in edm::PoolOutputModule::~PoolOutputModule() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#15 0x00007f984017a1b8 in virtual thunk to edm::PoolOutputModule::~PoolOutputModule() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#16 0x00007f98401af6f3 in std::_Sp_counted_deleter<edm::one::OutputModuleBase*, std::default_delete<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#17 0x00007f98401af811 in std::_Sp_counted_ptr_inplace<edm::maker::ModuleHolderT<edm::one::OutputModuleBase>, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginIOPoolOutputPlugins.so
#18 0x00007f98e2bc72a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#19 0x00007f98e2c89cb2 in std::_Rb_tree<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > >, std::_Select1st<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, edm::propagate_const<std::shared_ptr<edm::maker::ModuleHolder> > > >*) () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#20 0x00007f98e2c93d10 in std::_Sp_counted_ptr_inplace<edm::ModuleRegistry, std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#21 0x00007f98e2bc72a7 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#22 0x00007f98e2bf1749 in edm::Schedule::~Schedule() [clone .lto_priv.0] () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#23 0x00007f98e2be3df7 in edm::EventProcessor::~EventProcessor() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x0000000000405891 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#25 0x0000000000405199 in main ()

@ferdymercury

ferdymercury commented Jun 1, 2024

The first error in each job is the one reported by the (CMSSW) exception message, i.e.

* `Request to expand to a negative size, likely due to an integer overflow: 0x8001588e for a max of 0x7ffffffe.`

Please note that root-project/root#14627 should only solve bug [1] from #40132 (comment), i.e. the request to expand to a negative size.
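To make the quoted overflow message concrete, here is a small sketch (not taken from ROOT or CMSSW) showing why the requested size 0x8001588e reads as negative once it is treated as a signed 32-bit integer. The names `kMaxRequest`, `as_signed32` and `request_overflows` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Largest size the message says is allowed: 0x7ffffffe, i.e. INT32_MAX - 1.
constexpr std::uint32_t kMaxRequest = 0x7ffffffeu;

// Interpret a 32-bit pattern as a two's-complement signed value, using
// 64-bit arithmetic so the result does not depend on conversion rules.
constexpr std::int64_t as_signed32(std::uint32_t bits) {
    return bits > 0x7fffffffu
               ? static_cast<std::int64_t>(bits) - (std::int64_t{1} << 32)
               : static_cast<std::int64_t>(bits);
}

// True when a requested size is larger than the quoted maximum.
inline bool request_overflows(std::uint64_t requested_bytes) {
    assert(requested_bytes <= 0xffffffffull);  // sketch only covers 32-bit patterns
    return requested_bytes > kMaxRequest;
}
```

With these helpers, `as_signed32(0x8001588eu)` is a negative number and `request_overflows(0x8001588eull)` is true, which matches the wording "request to expand to a negative size".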

The exception "TBufferFile::WriteByteCount bytecount too large (more than 1073741822)", i.e. bug [2] in the other issue, is not a bug but an intentional exception: it tells you that your TFile contains a key that exceeds the maximum allowed size of 1 GB (root-project/root#6734).

Workarounds would be to address root-project/root#6734, to make your object a bit smaller, or to reroute the exception with a custom error handler so that it does not throw a fatal error.
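For orientation, the limit in the message is 1073741822 bytes, which is 2^30 - 2, i.e. just under 1 GiB. The sketch below only restates that arithmetic; it does not call ROOT. The names `kMaxByteCount`, `fits_in_one_key` and `max_elements` are hypothetical, and the assumption that a byte count must be strictly below the limit is a guess about the boundary, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Limit quoted in the error message: 1073741822 == 0x3FFFFFFE == 2^30 - 2.
constexpr std::uint64_t kMaxByteCount = 1073741822ull;
static_assert(kMaxByteCount == 0x3FFFFFFEull, "same value in hex");
static_assert(kMaxByteCount == (1ull << 30) - 2ull, "just under 1 GiB");

// Assumption: a byte count is accepted only when strictly below the limit.
inline bool fits_in_one_key(std::uint64_t streamed_bytes) {
    return streamed_bytes < kMaxByteCount;
}

// How many fixed-size elements fit below the limit (ignoring any headers).
inline std::uint64_t max_elements(std::uint64_t element_bytes) {
    assert(element_bytes > 0);
    return kMaxByteCount / element_bytes;
}
```

For example, `max_elements(8)` gives the rough number of 8-byte values (such as doubles) that a single streamed object could hold before hitting the limit.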

@makortel
Contributor

makortel commented Jun 3, 2024

type root

cmsbuild added the root label Jun 3, 2024

@makortel
Contributor

makortel commented Jun 3, 2024

to reroute the exception with a custom error handler so that it does not throw a fatal error.

What exactly would the ROOT state be then at the point where it issues the error message bytecount too large but the execution would continue without the conversion of the error message into an exception?

@ferdymercury

ferdymercury commented Jun 3, 2024

I guess it would just skip saving that too-big object into the TFile and continue with the rest of the objects. But it's just a guess; the best thing would be to try it out with a simple reproducer.
Maybe trying to save a TH2 with a lot of bins in X and Y does the trick for the reproducer.

makortel moved this to Needs debugging in ROOT prioritization Jun 3, 2024

@makortel
Contributor

makortel commented Jun 4, 2024

FWIW, here is a stack trace to where the exception gets thrown:

(gdb) where
#0  0x00007ffff55222f1 in __cxxabiv1::__cxa_throw (obj=0x7ffff1885280, tinfo=0x7ffff79a6610 <typeinfo for edm::Exception>, dest=0x7ffff7970020 <edm::Exception::~Exception()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:81
#1  0x00007fffefb2c979 in (anonymous namespace)::RootErrorHandlerImpl(int, char const*, char const*) [clone .cold] () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/pluginFWCoreServicesPlugins.so
#2  0x00007ffff690bc54 in ErrorHandler(Int_t, const char *, const char *, typedef __va_list_tag __va_list_tag *) (level=3000, location=0x7ffff4001f3c "TBufferFile::WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)", ap=0x7fffffff2218)
    at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/foundation/src/TError.cxx:148
#3  0x00007ffff6848126 in TObject::DoError (this=0x7fffe568ce80, level=3000, location=0x7ffff7026d6b "WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)", va=0x7fffffff2218) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/base/src/TObject.cxx:943
#4  0x00007ffff684838f in TObject::Error (this=0x7fffe568ce80, location=0x7ffff7026d6b "WriteByteCount", fmt=0x7ffff7026d48 "bytecount too large (more than %d)") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/base/src/TObject.cxx:980
#5  0x00007ffff6d898b7 in TBufferFile::SetByteCount (this=0x7fffe568ce80, cntpos=243, packInVersion=true) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:349
#6  0x00007ffff68db51b in TObjArray::Streamer (this=0x7fffea5b7758, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/cont/src/TObjArray.cxx:489
#7  0x00007ffff692f31a in TClass::StreamerTObjectInitialized (pThis=0x7fffee723c00, object=0x7fffea5b7758, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/meta/src/TClass.cxx:6817
#8  0x00007ffff784270b in TClass::Streamer (this=0x7fffee723c00, obj=0x7fffea5b7758, b=..., onfile_class=0x0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/core/meta/inc/TClass.h:610
#9  0x00007ffff6d8e832 in TBufferFile::WriteFastArray (this=0x7fffe568ce80, start=0x7fffea5b7758, cl=0x7fffee723c00, n=1, streamer=0x0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:2354
#10 0x00007ffff700808c in TStreamerInfo::WriteBufferAux<char**> (this=0x7fff9de73a00, b=..., arr=@0x7fffffff2f30: 0x7fffffff2f28, compinfo=0x7fff9e4397c8, first=0, last=1, narr=1, eoffset=0, arrayMode=0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TStreamerInfoWriteBuffer.cxx:628
#11 0x00007ffff6e7e814 in TStreamerInfoActions::GenericWriteAction (buf=..., addr=0x7fffea5b7600, config=0x7fff9e4397b0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TStreamerInfoActions.cxx:202
#12 0x00007ffff6d94a07 in TStreamerInfoActions::TConfiguredAction::operator() (this=0x7fffbab24100, buffer=..., object=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/inc/TStreamerInfoActions.h:123
#13 0x00007ffff6d927ec in TBufferFile::ApplySequence (this=0x7fffe568ce80, sequence=..., obj=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:3679
#14 0x00007ffff6d925d4 in TBufferFile::WriteClassBuffer (this=0x7fffe568ce80, cl=0x7fff9df9ff00, pointer=0x7fffea5b7600) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TBufferFile.cxx:3648
#15 0x00007ffff78c3650 in TTree::Streamer (this=0x7fffea5b7600, b=...) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/tree/tree/src/TTree.cxx:9623
#16 0x00007ffff6e4c82f in TKey::TKey (this=0x7fffde449940, obj=0x7fffea5b7600, name=0x7fffea5b7619 "Events", bufsize=265751, motherDir=0x7fffdff2d980) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TKey.cxx:249
#17 0x00007ffff6e11b27 in TFile::CreateKey (this=0x7fffdff2d980, mother=0x7fffdff2d980, obj=0x7fffea5b7600, name=0x7fffea5b7619 "Events", bufsize=265751) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TFile.cxx:1031
#18 0x00007ffff6dfd754 in TDirectoryFile::WriteTObject (this=0x7fffdff2d980, obj=0x7fffea5b7600, name=0x0, option=0x7ffff78e0f89 "", bufsize=0) at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/io/io/src/TDirectoryFile.cxx:1965
#19 0x00007ffff78aae26 in TTree::AutoSave (this=0x7fffea5b7600, option=0x7fff551533e8 "FlushBaskets") at /build/muz/140xRoot/w/BUILD/el8_amd64_gcc12/lcg/root/6.30.03-055353a34b1a9cd5e46334f7a05af86c/root-6.30.03/tree/tree/src/TTree.cxx:1522
#20 0x00007fff551452b9 in edm::RootOutputFile::finishEndFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#21 0x00007fff55130a55 in edm::PoolOutputModule::finishEndFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#22 0x00007fff55130ba3 in edm::PoolOutputModule::reallyCloseFile() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libIOPoolOutput.so
#23 0x00007ffff7c848e5 in edm::Schedule::closeOutputFiles() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#24 0x00007ffff7bda60d in edm::EventProcessor::closeOutputFiles() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#25 0x00007ffff7bdfc9a in edm::EventProcessor::runToCompletion() () from /build/muz/140xRoot/w/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_7_ROOT15687/lib/el8_amd64_gcc12/libFWCoreFramework.so
#26 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#27 0x00007ffff5d719ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:688
#28 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#29 0x000000000040517c in main ()

How could we get more context on what is causing the "too big object"?

@ferdymercury

ferdymercury commented Jun 4, 2024

There seems to be an object or class that is stored in your TTree, whose Streamer is too big, at some point when AutoSave is called.

For example, if I do this:

TFile a("/tmp/big.root","RECREATE");
TH2D h("h", "h", 100000, 0, 10, 100000, 0, 10);
h.Write();

I get a crash similar to yours, though not through exactly the same path. If you tried to store this histogram in the TTree directly and called AutoSave, you might get closer to what you are seeing.

So maybe you would need to do a tree->Print(), or to review which branches and classes you have stored in this tree. One of them has a Streamer that stores too much data.
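As a rough size check for the histogram reproducer above, the bin contents of a 2D histogram of doubles can be estimated without ROOT. This is a sketch under stated assumptions: 8 bytes per bin, two extra bins per axis for underflow and overflow, and no headers or compression. The names `th2d_content_bytes` and `exceeds_limit` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMaxByteCount = 1073741822ull;  // limit from the error message
constexpr std::uint64_t kBytesPerBin = 8;               // one double per bin

// Bytes needed for the bin contents of a 2D histogram of doubles.
// Each axis carries two extra bins (underflow and overflow).
inline std::uint64_t th2d_content_bytes(std::uint64_t nbins_x, std::uint64_t nbins_y) {
    assert(nbins_x > 0 && nbins_y > 0);
    return (nbins_x + 2) * (nbins_y + 2) * kBytesPerBin;
}

// Assumption: anything at or above the quoted limit triggers the error.
inline bool exceeds_limit(std::uint64_t bytes) { return bytes >= kMaxByteCount; }
```

With 100000 bins per axis the contents alone come to about 80 GB, far beyond the limit and also a lot of memory to allocate. Under the same assumptions, `th2d_content_bytes(12000, 12000)` is about 1.15e9 bytes and already exceeds the limit, so a smaller histogram may be enough for a reproducer.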

@makortel
Contributor

makortel commented Jun 4, 2024

One detail worth noting about the output is the sizes of the output files at the time the job died:

16M     SKIMStreamEXODelayedJetMET.root
8.8M    SKIMStreamEXODisappTrk.root
57G     SKIMStreamEXOHighMET.root
6.6G    SKIMStreamEXOSoftDisplacedVertices.root
1.3G    SKIMStreamJetHTJetPlusHOFilter.root
1.3M    SKIMStreamLogErrorMonitor.root
545M    SKIMStreamLogError.root
56K     SKIMStreamTeVJet.root
572M    write_ALCARECO.root
6.0G    write_AOD.root
21M     write_DQMIO.root
829M    write_MINIAOD.root
64M     write_NANOAOD.root

57 GB of SKIMStreamEXOHighMET.root is highly suspicious (@cms-sw/pdmv-l2)

@makortel
Contributor

makortel commented Jun 4, 2024

Inspecting SKIMStreamEXOHighMET.root it seems the job was writing that one when it failed

[mkortela@cmsdev42 /build/mkortela/debug/issue45089/CMSSW_14_0_7/src]$ root -l -n SKIMStreamEXOHighMET.root
root [0]
Attaching file SKIMStreamEXOHighMET.root as _file0...
Warning in <TFile::Init>: file SKIMStreamEXOHighMET.root probably not closed, trying to recover
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:MetaData at address 61136680973
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:ParameterSets at address 61136746357
Info in <TFile::Recover>: SKIMStreamEXOHighMET.root, recovered key TTree:Parentage at address 61136752149
Warning in <TFile::Init>: successfully recovered 3 keys
(TFile *) 0x4a153f0
root [1] .ls
TFile**         SKIMStreamEXOHighMET.root
 TFile*         SKIMStreamEXOHighMET.root
  KEY: TTree    MetaData;1
  KEY: TTree    ParameterSets;1
  KEY: TTree    Parentage;1
root [2]

@makortel
Contributor

makortel commented Jun 5, 2024

57 GB of SKIMStreamEXOHighMET.root is highly suspicious

The SKIMStreamEXOHighMET selects 16084 events in the job, so the average event size is 3.6 MB. Is this really justified? @cms-sw/pdmv-l2
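The average quoted above can be re-derived with a few lines of arithmetic. This is a sketch, assuming the 57G in the listing means 57 GiB (as `du -h` style output usually does) and that the whole file size is attributed to the 16084 selected events. The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr double kMiB = 1024.0 * 1024.0;

// Average on-disk bytes per selected event.
inline double average_event_bytes(std::uint64_t file_bytes, std::uint64_t n_events) {
    assert(n_events > 0);
    return static_cast<double>(file_bytes) / static_cast<double>(n_events);
}

// Convert a byte count to MiB for easier reading.
inline double to_mib(double bytes) { return bytes / kMiB; }
```

Calling `to_mib(average_event_bytes(57ull << 30, 16084))` gives a value a little above 3.6, consistent with the 3.6 MB per event mentioned in the comment.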

@makortel
Contributor

makortel commented Jun 6, 2024

you would need to do a tree->Print()

I tried to call TTree::Print() a little before TTree::AutoSave() is called, but that leads to the same error message (which we turn into an exception):

----- Begin Fatal Exception 06-Jun-2024 08:03:20 CEST-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Calling RootOutputFile::finishEndFile() while closing SKIMStreamEXOHighMET.root
   Additional Info:
      [a] While calling Print() for Events
      [b] Fatal Root Error: @SUB=TBufferFile::WriteByteCount
bytecount too large (more than 1073741822)

----- End Fatal Exception -------------------------------------------------

This is as far as I got (until July). I have the input file and the job configuration on cmsdev42 in /build/mkortela/debug/issue45089/CMSSW_14_0_7/src in case that would be of help.

@pcanal
Contributor

pcanal commented Jun 6, 2024

How could we get more context on what is causing the "too big object"?

The stack trace seems to imply that it is the TTree itself that is too large. This could happen in two cases:
(a) (unlikely) Some large basket(s) are somehow not flushed properly and are stored with the TTree
(b) (more likely) The TTree reached the maximum number of baskets in total (each basket costs the storage of its size and location). Roughly 50 million baskets would go over the 1 GB threshold.
(c) something I am missing :)
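To put a rough number on case (b), the sketch below converts the byte-count limit from the error message into a basket count. The 20 bytes of bookkeeping per basket (a 4-byte size, an 8-byte first-entry index and an 8-byte file offset) is an assumption about the layout, not something stated in this thread, and the function names are hypothetical.

```python
# Limit quoted in the error message ("bytecount too large (more than 1073741822)").
MAX_BYTECOUNT = 1_073_741_822

# ASSUMPTION (not from the thread): the bookkeeping stored with the tree for each
# basket is a 4-byte size + an 8-byte first entry + an 8-byte file offset.
ASSUMED_BYTES_PER_BASKET = 4 + 8 + 8


def baskets_to_reach_limit(limit: int = MAX_BYTECOUNT,
                           bytes_per_basket: int = ASSUMED_BYTES_PER_BASKET) -> int:
    """Number of baskets whose bookkeeping alone fills the byte-count limit."""
    return limit // bytes_per_basket


def implied_mean_basket_size(file_bytes: int, n_baskets: int) -> float:
    """Mean on-disk basket size if the whole file consisted of n_baskets baskets."""
    return file_bytes / n_baskets


n_baskets = baskets_to_reach_limit()
# 61136680973 is the address of the first recovered key in the ROOT session above,
# used here as a stand-in for the amount of data written before the failure.
mean_size = implied_mean_basket_size(61_136_680_973, n_baskets)
print(f"{n_baskets:,} baskets, ~{mean_size:.0f} bytes per basket")
```

Under these assumptions the limit corresponds to about 54 million baskets. For the file in question, that would imply baskets averaging only about 1.1 kB on disk, which is one number that could be checked against the actual basket sizes.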

@makortel
Contributor

(b) (more likely) The TTree reached the maximum number of baskets in total (each basket costs the storage of its size and location). Roughly 50 million baskets would go over the 1 GB threshold.

Is there a way to find out the number of baskets? Or would there be some other way to confirm (or disprove) this case?

@mmusich
Contributor

mmusich commented Jul 24, 2024

There is now a second Tier0 job with similar symptoms.
We have a paused job for the workflow PromptReco_Run383367_JetMET0 in Run 383367, with the following error:

"An exception of category 'FatalRootError' occurred while [0] Calling RootOutputFile::finishEndFile() while closing SKIMStreamEXOHighMET.root Additional Info: [a] While calling writeTree() for Events [b] Fatal Root Error: @SUB=TBufferFile::WriteByteCount bytecount too large (more than 1073741822)"

see link for more details.

@germanfgv
Contributor

There is now a second Tier0 job with similar symptoms. We have a paused job for the workflow PromptReco_Run383367_JetMET0 in Run 383367, with the following error:

"An exception of category 'FatalRootError' occurred while [0] Calling RootOutputFile::finishEndFile() while closing SKIMStreamEXOHighMET.root Additional Info: [a] While calling writeTree() for Events [b] Fatal Root Error: @SUB=TBufferFile::WriteByteCount bytecount too large (more than 1073741822)"

see link for more details.

You can find the tarball of this job here:

/eos/user/c/cmst0/public/PausedJobs/Run2024F/ByteCountTooLarge/job_5131750/vocms0313.cern.ch-5131750-3-log.tar.gz

@makortel
Contributor

57 GB of SKIMStreamEXOHighMET.root is highly suspicious

The SKIMStreamEXOHighMET selects 16084 events in the job, so the average event size is 3.6 MB. Is this really justified? @cms-sw/pdmv-l2

Maybe time to remind about this question.

@mmusich
Contributor

mmusich commented Jul 24, 2024

Maybe time to remind about this question.

tagging also @cms-sw/ppd-l2 and @youyingli

@malbouis
Contributor

Thanks @makortel and @mmusich. We (PPD) will follow this up with PdmV/DDT and EXO.

@davidlange6
Contributor

Lots of skims are RAW-RECO outputs, and this is one of them. Some other skims are quite a bit larger than this one on average in 2024.

@makortel
Contributor

In the case of Run 381067, Lumi 335, this skim selected ~60 % of the events of the JetMET1 PD.

@davidlange6
Contributor

So far in 2024, it's 120 TB, compared to about 900 TB of raw data in JetMET1 and JetMET0.

@davidlange6
Contributor

381067/335 appears to be the end of a rather anomalous set of lumi sections

@anpicci

anpicci commented Jul 30, 2024

FYI, we failed the jobs reported by @mmusich, in agreement with T0

@malbouis
Contributor

malbouis commented Aug 5, 2024

Hi @makortel, to quickly answer your question in #45089 (comment):
Looking at the sizes of this skim in past years/eras, we see:

  • skim from 2022F: /JetMET/Run2022G-EXOHighMET-PromptReco-v1/RAW-RECO (3.8 TB) ==> # of events: 766776 ==> 4.96 MB / event
  • skim from 2023D: /JetMET0/Run2023D-EXOHighMET-PromptReco-v1/RAW-RECO (4.9 TB) ==> # of events: 1047355 ==> 4.67 MB / event
  • skim from 2024F: /JetMET0/Run2024F-EXOHighMET-PromptReco-v1/RAW-RECO (34.9 TB) ==> # of events: 7010213 ==> 4.97 MB / event

So I would naively conclude that the 3.6 MB per event that you mention above is acceptable for that skim, or at least in line with what we've seen in the past (assuming I didn't make any mistakes). :-)
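For anyone who wants to redo the per-event numbers above, here is a minimal sketch of the division. It assumes TB and MB are decimal units (10^12 and 10^6 bytes), which the thread does not state; the helper name is illustrative.

```python
# (dataset size in TB, number of events), as quoted in the list above.
datasets = {
    "/JetMET/Run2022G-EXOHighMET-PromptReco-v1/RAW-RECO": (3.8, 766_776),
    "/JetMET0/Run2023D-EXOHighMET-PromptReco-v1/RAW-RECO": (4.9, 1_047_355),
    "/JetMET0/Run2024F-EXOHighMET-PromptReco-v1/RAW-RECO": (34.9, 7_010_213),
}


def mb_per_event(size_tb: float, n_events: int) -> float:
    """Average event size in MB, assuming decimal TB and MB (assumption)."""
    return size_tb * 1e12 / n_events / 1e6


for name, (size_tb, n_events) in datasets.items():
    print(f"{name}: {mb_per_event(size_tb, n_events):.2f} MB / event")
```

The results agree with the quoted values to within 0.01 MB, the small differences coming from rounding versus truncation of the last digit.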

I don't know if it was mentioned before, but for the record: this EXOHighMET skim was first introduced in 2022. The original PR is #37749.

@makortel
Contributor

makortel commented Aug 5, 2024

Thanks @malbouis. My next question is then whether the rate (or acceptance fraction, ~60 %) of the skim is in line with expectations.

Although @davidlange6 already wrote in #45089 (comment)

381067/335 appears to be the end of a rather anomalous set of lumi sections

which I understand to mean that there was something anomalous in the data that resulted in that high acceptance fraction. If this is the case, I would not invest effort in trying to make the writing of such large files succeed. In this case, perhaps some protections would make sense so that the job wouldn't fail for the entire lumi? Would it be feasible to understand these conditions from the physics side and improve the filtering of this skim? I'm not sure if we could do something reliably at the framework level.

@mmusich
Contributor

mmusich commented Aug 5, 2024

Would it be feasible to understand these conditions from the physics side and improve the filtering of this skim? I'm not sure if we could do something reliably at the framework level.

tagging @afrankenthal as original author of #37749

@makortel
Contributor

makortel commented Aug 5, 2024

assign pdmv

@cmsbuild
Contributor

cmsbuild commented Aug 5, 2024

New categories assigned: pdmv

@AdrianoDee,@sunilUIET,@miquork,@kskovpen you have been requested to review this Pull request/Issue and eventually sign? Thanks

@youyingli
Contributor

57 GB of SKIMStreamEXOHighMET.root is highly suspicious

The SKIMStreamEXOHighMET selects 16084 events in the job, so the average event size is 3.6 MB. Is this really justified? @cms-sw/pdmv-l2

Hi, I also looked at the Run 2 EXOHighMET skim, /MET/Run2018D-HighMET-PromptReco-v2/RAW-RECO, with 29.4 TB and 7906297 events, which gives approximately 3.72 MB/event. So the 3.6 MB per event for that file should not be an issue. I'm not sure why so many events are accumulated into a single file of 50+ GB without any splitting. On the DDT side, we will contact the EXO PAG and check whether this skim is still needed, or whether they can add additional filters in the trigger part or more stringent selections in this skim.

@afrankenthal
Contributor

Hello, like @youyingli, I also don't understand the technical details of why the events are not getting properly split. This skim is a common EXO skim serving multiple analyses, so I'd guess it's very much still needed. Maybe there are filters we can implement without losing too much information, though. This needs to be discussed in EXO.

@jeyserma

We had 3 more paused jobs over the weekend. The tarballs can be found here:

/eos/user/c/cmst0/public/PausedJobs/Run2024G/ByteCountTooLarge/job_673494/vocms0314.cern.ch-673494-3-log.tar.gz
/eos/user/c/cmst0/public/PausedJobs/Run2024G/ByteCountTooLarge/job_673495/vocms0314.cern.ch-673495-3-log.tar.gz
/eos/user/c/cmst0/public/PausedJobs/Run2024G/ByteCountTooLarge/job_673917/vocms0314.cern.ch-673917-3-log.tar.gz

We also copied the RAW input files here:

/eos/user/c/cmst0/public/PausedJobs/Run2024G/PromptReco_Run384382_JetMET0/
/eos/user/c/cmst0/public/PausedJobs/Run2024G/PromptReco_Run384382_JetMET1/

@makortel moved this from "Needs debugging in CMS" to "Work in ROOT" in ROOT prioritization on Sep 10, 2024
Projects
Status: Work in ROOT