CMSSW tests failing again with Fatal Root Error: @SUB=Minuit2 #43577

Closed
aandvalenzuela opened this issue Dec 15, 2023 · 24 comments

Comments

@aandvalenzuela
Contributor

Hello,

Since we moved to ROOT 6.30, we are again seeing errors like the one reported in #42979, with the following exception:

----- Begin Fatal Exception 15-Dec-2023 07:53:04 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing global end Run run: 1
   [1] Calling method for module DQMGenericClient/'postProcessorMuonTrack'
   Additional Info:
      [a] Fatal Root Error: @SUB=Minuit2
VariableMetricBuilder Initial matrix not pos.def.

In this case, we see the errors on multiple archs:

  • el8_aarch64_gcc12: RelVals 25202.2 and 1365.0.
  • el8_ppc64le_gcc12: RelVal 25202.15.
  • slc7_amd64_gcc12: Unit test PrimaryVertex (module Alignment/OfflineValidation).

#43106 fixed this issue in the past by using a likelihood fit instead of a chi-square fit, but it seems to be back in ROOT 6.30.
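To illustrate why the fit type can matter for sparsely populated DQM histograms, here is a minimal sketch in plain C++ (no ROOT dependency) of the two objective functions. The function names `neymanChi2` and `poissonNll` are made up for this example, and this is not the code changed in #43106. In ROOT itself the switch is made with the "L" option of `TH1::Fit`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Neyman chi-square: each bin is weighted by 1/n_i, i.e. the variance is
// estimated from the observed count. Bins with n_i == 0 have no defined
// weight, so this sketch skips them.
double neymanChi2(const std::vector<double>& observed,
                  const std::vector<double>& expected) {
  double chi2 = 0.0;
  for (std::size_t i = 0; i < observed.size(); ++i) {
    if (observed[i] <= 0.0) continue;  // empty bin is ignored
    const double diff = observed[i] - expected[i];
    chi2 += diff * diff / observed[i];
  }
  return chi2;
}

// Binned Poisson negative log-likelihood ratio (Baker-Cousins form):
// sum over bins of mu_i - n_i + n_i * ln(n_i / mu_i).
// It stays well defined for empty bins, which contribute mu_i.
// Assumes every expected[i] is strictly positive.
double poissonNll(const std::vector<double>& observed,
                  const std::vector<double>& expected) {
  double nll = 0.0;
  for (std::size_t i = 0; i < observed.size(); ++i) {
    const double n = observed[i];
    const double mu = expected[i];
    nll += mu - n;
    if (n > 0.0) nll += n * std::log(n / mu);
  }
  return nll;
}
```

For observed counts {0, 0, 5, 0} and an expectation of {2, 2, 5, 2}, `neymanChi2` returns 0 because only the populated bin is used, while `poissonNll` returns 6, so the empty bins still constrain the model.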

Thanks!

FYI @guitargeek, @smuzaffar

@cmsbuild
Contributor

cmsbuild commented Dec 15, 2023

cms-bot internal usage

@cmsbuild
Contributor

A new Issue was created by @aandvalenzuela Andrea Valenzuela.

@Dr15Jones, @antoniovilela, @smuzaffar, @sextonkennedy, @rappoccio, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor

assign core

@makortel
Contributor

type root

@cmsbuild
Contributor

New categories assigned: core

@Dr15Jones,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

Did we see these errors the last time the ROOT 6.30 build was tested on these architectures? On the other hand, since the failures are not widespread, maybe that hints towards a random component in the cause?

@makortel
Contributor

makortel commented Dec 15, 2023

On el8_aarch64_gcc12 CMSSW_14_0_X_2023-12-13-2300, two workflows crashed in a way that looks possibly related:

11024.0 step 3

Other TBB threads are in tbb::detail::r1::futex_wait()

Thread 1 (Thread 0x4000269bd080 (LWP 63750) "cmsRun"):
#3  0x000040002bf5ce8c in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00004001acff001c in ?? ()
#6  0x0000400039d3c590 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#7  0x0000400039d3c590 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#8  0x0000400039d3cc3c in cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#9  0x0000400039e18844 in cling::MetaSema::actOnxCommand(llvm::StringRef, llvm::StringRef, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#10 0x0000400039e25134 in cling::MetaParser::isXCommand(cling::MetaSema::ActionResult&, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#11 0x0000400039e26ff4 in cling::MetaParser::isCommand(cling::MetaSema::ActionResult&, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#12 0x0000400039e122ec in cling::MetaProcessor::process(llvm::StringRef, cling::Interpreter::CompilationResult&, cling::Value*, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#13 0x0000400039c602dc in HandleInterpreterException(cling::MetaProcessor*, char const*, cling::Interpreter::CompilationResult&, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#14 0x0000400039c72d40 in TCling::ProcessLine(char const*, TInterpreter::EErrorCode*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#15 0x0000400039c732b4 in TCling::ProcessLineSynch(char const*, TInterpreter::EErrorCode*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#16 0x0000400025f398a8 in TApplication::ExecuteFile(char const*, int*, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#17 0x0000400039c71dd8 in TCling::ExecuteMacro(char const*, TInterpreter::EErrorCode*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#18 0x0000400025f8ec5c in TROOT::Macro(char const*, int*, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#19 0x0000400025f72a90 in TPluginManager::LoadHandlerMacros(char const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#20 0x0000400025f730ec in TPluginManager::LoadHandlersFromPluginDirs(char const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#21 0x0000400025f74164 in TPluginManager::FindHandler(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#22 0x000040002579899c in ROOT::Math::Factory::CreateMinimizer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#23 0x000040002579a270 in ROOT::Fit::FitConfig::CreateMinimizer() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#24 0x00004000257b1dc0 in ROOT::Fit::Fitter::DoInitMinimizer() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#25 0x00004000257b827c in bool ROOT::Fit::Fitter::DoMinimization<ROOT::Fit::PoissonLikelihoodFCN<ROOT::Math::IBaseFunctionMultiDimTempl<double>, ROOT::Math::IParametricFunctionMultiDimTempl<double> > >(std::unique_ptr<ROOT::Fit::PoissonLikelihoodFCN<ROOT::Math::IBaseFunctionMultiDimTempl<double>, ROOT::Math::IParametricFunctionMultiDimTempl<double> >, std::default_delete<ROOT::Fit::PoissonLikelihoodFCN<ROOT::Math::IBaseFunctionMultiDimTempl<double>, ROOT::Math::IParametricFunctionMultiDimTempl<double> > > >, ROOT::Math::IBaseFunctionMultiDimTempl<double> const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#26 0x00004000257b5590 in ROOT::Fit::Fitter::DoBinnedLikelihoodFit(bool, ROOT::EExecutionPolicy const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#27 0x000040002c60a3a4 in TFitResultPtr HFit::Fit<TH1>(TH1*, TF1*, Foption_t&, ROOT::Math::MinimizerOptions const&, char const*, ROOT::Fit::DataRange&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libHist.so
#28 0x000040002c5fff14 in ROOT::Fit::FitObject(TH1*, TF1*, Foption_t&, ROOT::Math::MinimizerOptions const&, char const*, ROOT::Fit::DataRange&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libHist.so
#29 0x000040002c6c97e0 in TH1::Fit(TF1*, char const*, char const*, double, double) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libHist.so
#30 0x000040006b49055c in PVFitter::runFitter() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/libRecoVertexBeamSpotProducer.so
#31 0x000040006b4871d8 in BeamFitter::runPVandTrkFitter() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/libRecoVertexBeamSpotProducer.so
#32 0x00004000c5aa0e94 in AlcaBeamMonitor::globalEndLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/pluginAlcaBeamMonitor.so
#33 0x00004000c5aad9fc in virtual thunk to edm::one::impl::LuminosityBlockCacheHolder<edm::one::EDProducerBase, alcabeammonitor::BeamSpotInfo>::doEndLuminosityBlock_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/pluginAlcaBeamMonitor.so
#34 0x0000400024dadcc4 in edm::one::EDProducerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/libFWCoreFramework.so

Current Modules:
Module: AlcaBeamMonitor:AlcaBeamMonitor (crashed)
Module: none
Module: none
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc12/CMSSW_14_0_X_2023-12-13-2300/pyRelValMatrixLogs/run/11024.0_TTbar_13+2018PU/step3_TTbar_13+2018PU.log#/

11025.0 step 3

Other TBB threads are in tbb::detail::r1::futex_wait

Thread 3 (Thread 0x400053389250 (LWP 2727347) "cmsRun"):
#3  0x000040000fdece8c in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00004001aafe001c in ?? ()
#6  0x000040001dcac590 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#7  0x000040001dcac590 in cling::Interpreter::RunFunction(clang::FunctionDecl const*, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#8  0x000040001dcacc3c in cling::Interpreter::EvaluateInternal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, cling::CompilationOptions, cling::Value*, cling::Transaction**, unsigned long) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#9  0x000040001dd88844 in cling::MetaSema::actOnxCommand(llvm::StringRef, llvm::StringRef, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#10 0x000040001dd95134 in cling::MetaParser::isXCommand(cling::MetaSema::ActionResult&, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#11 0x000040001dd96ff4 in cling::MetaParser::isCommand(cling::MetaSema::ActionResult&, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#12 0x000040001dd822ec in cling::MetaProcessor::process(llvm::StringRef, cling::Interpreter::CompilationResult&, cling::Value*, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#13 0x000040001dbd02dc in HandleInterpreterException(cling::MetaProcessor*, char const*, cling::Interpreter::CompilationResult&, cling::Value*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#14 0x000040001dbe2d40 in TCling::ProcessLine(char const*, TInterpreter::EErrorCode*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#15 0x000040001dbe32b4 in TCling::ProcessLineSynch(char const*, TInterpreter::EErrorCode*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#16 0x0000400009e898a8 in TApplication::ExecuteFile(char const*, int*, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#17 0x000040001dbe1dd8 in TCling::ExecuteMacro(char const*, TInterpreter::EErrorCode*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCling.so
#18 0x0000400009edec5c in TROOT::Macro(char const*, int*, bool) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#19 0x0000400009ec2a90 in TPluginManager::LoadHandlerMacros(char const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#20 0x0000400009ec30ec in TPluginManager::LoadHandlersFromPluginDirs(char const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#21 0x0000400009ec4164 in TPluginManager::FindHandler(char const*, char const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libCore.so
#22 0x00004000096e899c in ROOT::Math::Factory::CreateMinimizer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#23 0x00004000096ea270 in ROOT::Fit::FitConfig::CreateMinimizer() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#24 0x0000400009701dc0 in ROOT::Fit::Fitter::DoInitMinimizer() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#25 0x000040000970827c in bool ROOT::Fit::Fitter::DoMinimization<ROOT::Fit::PoissonLikelihoodFCN<ROOT::Math::IBaseFunctionMultiDimTempl<double>, ROOT::Math::IParametricFunctionMultiDimTempl<double> > >(std::unique_ptr<ROOT::Fit::PoissonLikelihoodFCN<ROOT::Math::IBaseFunctionMultiDimTempl<double>, ROOT::Math::IParametricFunctionMultiDimTempl<double> >, std::default_delete<ROOT::Fit::PoissonLikelihoodFCN<ROOT::Math::IBaseFunctionMultiDimTempl<double>, ROOT::Math::IParametricFunctionMultiDimTempl<double> > > >, ROOT::Math::IBaseFunctionMultiDimTempl<double> const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#26 0x0000400009705590 in ROOT::Fit::Fitter::DoBinnedLikelihoodFit(bool, ROOT::EExecutionPolicy const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libMathCore.so
#27 0x000040001054a3a4 in TFitResultPtr HFit::Fit<TH1>(TH1*, TF1*, Foption_t&, ROOT::Math::MinimizerOptions const&, char const*, ROOT::Fit::DataRange&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libHist.so
#28 0x000040001053ff14 in ROOT::Fit::FitObject(TH1*, TF1*, Foption_t&, ROOT::Math::MinimizerOptions const&, char const*, ROOT::Fit::DataRange&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libHist.so
#29 0x00004000106097e0 in TH1::Fit(TF1*, char const*, char const*, double, double) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/external/el8_aarch64_gcc12/lib/libHist.so
#30 0x000040004f3c055c in PVFitter::runFitter() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/libRecoVertexBeamSpotProducer.so
#31 0x000040004f3b71d8 in BeamFitter::runPVandTrkFitter() () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/libRecoVertexBeamSpotProducer.so
#32 0x00004000ab9f0e94 in AlcaBeamMonitor::globalEndLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/pluginAlcaBeamMonitor.so
#33 0x00004000ab9fd9fc in virtual thunk to edm::one::impl::LuminosityBlockCacheHolder<edm::one::EDProducerBase, alcabeammonitor::BeamSpotInfo>::doEndLuminosityBlock_(edm::LuminosityBlock const&, edm::EventSetup const&) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/pluginAlcaBeamMonitor.so
#34 0x0000400008cfdcc4 in edm::one::EDProducerBase::doEndLuminosityBlock(edm::LumiTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms-ib.cern.ch/sw/aarch64/nweek-02815/el8_aarch64_gcc12/cms/cmssw/CMSSW_14_0_X_2023-12-13-2300/lib/el8_aarch64_gcc12/libFWCoreFramework.so

Module: AlcaBeamMonitor:AlcaBeamMonitor (crashed)
Module: none
Module: none
Module: none

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_aarch64_gcc12/CMSSW_14_0_X_2023-12-13-2300/pyRelValMatrixLogs/run/11025.0_ZEE_13+2018PU/step3_ZEE_13+2018PU.log#/

@makortel
Contributor

makortel commented Dec 15, 2023

> • slc7_amd64_gcc12: Unit test PrimaryVertex (module Alignment/OfflineValidation).

Somehow this unit test failure seems to be specific to slc7. The test started to fail on CMSSW_14_0_X_2023-12-13-1100 (where we deployed ROOT 6.30), and has failed on every slc7 IB since then, but not in any other IB.

@guitargeek
Contributor

If this happens so often, then maybe we should come back to @Dr15Jones's suggestion in #42979 (comment).

Since ROOT 6.30, the default minimizer is Minuit2. Unlike the legacy Minuit, it reports fit failures as ROOT errors, which CMSSW turns into exceptions by default.
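As a rough illustration of that mechanism, the sketch below shows a handler that converts any message at error severity or above into an exception. It is plain C++ with hypothetical names (`Severity`, `FatalRootErrorSketch`, `handleMessage`); it is not the CMSSW error handler and does not use the ROOT API.

```cpp
#include <stdexcept>
#include <string>

// Severity levels, loosely modelled on an info/warning/error scale.
// The names are hypothetical and local to this sketch.
enum class Severity { Info, Warning, Error, Fatal };

// Hypothetical exception type standing in for a 'FatalRootError'.
struct FatalRootErrorSketch : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Anything reported at Error level or above becomes an exception, so a
// single failed fit is enough to abort the whole job.
void handleMessage(Severity level, const std::string& location,
                   const std::string& message) {
  if (level >= Severity::Error) {
    throw FatalRootErrorSketch("@SUB=" + location + "\n" + message);
  }
  // Lower severities would only be logged; logging is omitted here.
}
```

With such a policy, a minimizer that stays silent on a failed fit never trips the handler, while one that reports the same failure at error level does, even though the fit outcome is the same.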

Maybe it's not reasonable to expect that all fits in the DQM plots should succeed?

@makortel
Contributor

> • slc7_amd64_gcc12: Unit test PrimaryVertex (module Alignment/OfflineValidation).
>
> Somehow this unit test failure seems to be specific to slc7. The test started to fail on CMSSW_14_0_X_2023-12-13-1100 (where we deployed ROOT 6.30), and has failed on every slc7 IB since then, but not in any other IB.

I see @smuzaffar fixed this particular problem in #43588 by using the likelihood fit instead of chi-square (as was done earlier for DQMGenericClient).

@makortel
Contributor

> Maybe it's not reasonable to expect that all fits in the DQM plots should succeed?

@cms-sw/dqm-l2 Could you comment?

@makortel
Contributor

Here is one on slc7_amd64_gcc12, CMSSW_14_0_X_2024-01-11-2300, workflow 25208.0 step 4:

----- Begin Fatal Exception 12-Jan-2024 02:59:52 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing end ProcessBlock
   [1] Calling method for module PFJetDQMPostProcessor/'pfJetDQMPostProcessor'
   Additional Info:
      [a] Fatal Root Error: @SUB=Minuit2
VariableMetricBuilder Initial matrix not pos.def.

----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc12/CMSSW_14_0_X_2024-01-11-2300/pyRelValMatrixLogs/run/25208.0_SMS-T1tttt_mGl-1500_mLSP-100_13/step4_SMS-T1tttt_mGl-1500_mLSP-100_13.log#/
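For readers unfamiliar with the wording of the message in these exceptions: "pos.def." stands for positive definite, a property the error matrix of a fit is expected to have near a minimum. The sketch below is a generic linear-algebra check using a Cholesky factorisation, written in plain C++. The name `isPositiveDefinite` is hypothetical, and this is not a description of how Minuit2 performs its own check.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Returns true if the symmetric square matrix `a` is positive definite,
// tested by attempting a Cholesky factorisation a = L * L^T.
// The factorisation exists with a strictly positive diagonal exactly when
// the matrix is positive definite.
bool isPositiveDefinite(const Matrix& a) {
  const std::size_t n = a.size();
  Matrix l(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = 0; j <= i; ++j) {
      double sum = a[i][j];
      for (std::size_t k = 0; k < j; ++k) sum -= l[i][k] * l[j][k];
      if (i == j) {
        if (sum <= 0.0) return false;  // non-positive pivot
        l[i][i] = std::sqrt(sum);
      } else {
        l[i][j] = sum / l[j][j];
      }
    }
  }
  return true;
}
```

For example, the matrix {{2, 1}, {1, 2}} passes the check, while {{1, 2}, {2, 1}} fails at the second pivot (1 - 4 < 0).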

@guitargeek
Contributor

There is also some discussion here:
#43106 (comment)

@dan131riley

el8_amd64_gcc12 CMSSW_14_0_X_2024-01-11-1100 11602.0 step4

----- Begin Fatal Exception 11-Jan-2024 22:34:14 CET-----------------------
An exception of category 'FatalRootError' occurred while
   [0] Processing end ProcessBlock
   [1] Calling method for module PFClient_JetRes/'pfJetResClient'
   Additional Info:
      [a] Fatal Root Error: @SUB=Minuit2
VariableMetricBuilder Initial matrix not pos.def.

----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/buildlogs/raw/el8_amd64_gcc12/CMSSW_14_0_X_2024-01-11-1100/pyRelValMatrixLogs/run/11602.0_SingleElectronPt35+2021/step4_SingleElectronPt35+2021.log

@makortel
Contributor

So, should we continue changing fits from chi-square to likelihood as we find these issues?

@makortel
Contributor

@vgvassilev Do you have any thoughts about the stack traces that include cling above (#43577 (comment))? They continue appearing randomly on ARM.

@vgvassilev
Contributor

Can we run with valgrind to make sure there are no obvious memory errors?

@makortel
Contributor

> Can we run with valgrind to make sure there are no obvious memory errors?

I can give it a try on slc7 x86.

@makortel
Contributor

Given that the exception message mentioned in the issue description was demoted to an error message in #43726, I would close this issue, as the tests should no longer fail because of this exception, and follow up on the crashes on ARM in a separate issue.
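One way such a demotion could look is sketched below in plain C++17: a lookup that decides, per message, whether to throw or only log. The names `Action`, `kDemotedFragments` and `classify` are hypothetical, and the details of what #43726 actually changed are not shown in this thread, so this should be read as an assumption about the general approach only.

```cpp
#include <array>
#include <string_view>

// What to do with a reported error message.
enum class Action { Throw, LogOnly };

// Hypothetical list of message fragments that are logged rather than
// turned into exceptions. The single entry is the message from this issue.
constexpr std::array<std::string_view, 1> kDemotedFragments = {
    "Initial matrix not pos.def."};

// Returns LogOnly if the message contains any demoted fragment,
// otherwise Throw.
Action classify(std::string_view message) {
  for (std::string_view fragment : kDemotedFragments) {
    if (message.find(fragment) != std::string_view::npos) {
      return Action::LogOnly;
    }
  }
  return Action::Throw;
}
```

Matching on a message fragment keeps every other error at its original severity, so unrelated failures would still stop the job.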

@makortel
Contributor

New issue is in #43802

@makortel
Contributor

+core

@makortel
Contributor

@please close

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor

@cmsbuild, please close
