Add wfs with HLT as separate step #37603

Closed

kskovpen
Contributor

PR description:

Introduce an additional set of Run 3 wfs in which the HLT step is separated from DIGI, following the studies mentioned in #37564. As suggested in that discussion, the GEN and DIGI2RAW collections are dropped from the output file at the HLT step. Dropping the SIM collections as well would make sense, but the HARVESTING step then crashes in EcalDQMonitorClient:ecalMonitorClient. Maybe @cms-sw/dqm-l2 has an idea why this happens.
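
To make the effect of the drop statements concrete, here is a simplified emulation of keep/drop branch selection, in which the last matching command decides each branch. This is only a sketch: `select_branches` is a hypothetical helper, the branch names are illustrative rather than read from a real file, and `keep *` stands in for the FEVTDEBUGHLT event content.

```python
from fnmatch import fnmatchcase


def select_branches(branches, commands):
    """Emulate keep/drop selection: for each branch the last matching command wins."""
    selected = []
    for branch in branches:
        keep = False  # nothing is kept unless a command says so
        for command in commands:
            action, pattern = command.split(None, 1)
            if fnmatchcase(branch, pattern):
                keep = action == "keep"
        if keep:
            selected.append(branch)
    return selected


# Illustrative branch names in type_module_instance_process form (assumptions).
branches = [
    "edmHepMCProduct_generatorSmeared__GEN",
    "PCaloHits_g4SimHits_EcalHitsEB_SIM",
    "FEDRawDataCollection_rawDataCollector__DIGI2RAW",
    "edmTriggerResults_TriggerResults__HLT",
]

# "keep *" is a stand-in for the event content; the drops are the ones from this PR.
commands = ["keep *", "drop *_*_*_GEN", "drop *_*_*_DIGI2RAW"]
kept = select_branches(branches, commands)
print(kept)
```

With these inputs only the SIM and HLT branches survive; adding `drop *_*_*_SIM` to the list would remove the SIM branch too, which is the configuration that triggers the HARVESTING crash described above.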

PR validation:

Ran the new wfs.

If this PR is a backport, please specify the original PR and why you need to backport that PR:

Not a backport.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37603/29370

@cmsbuild
Contributor

A new Pull Request was created by @kskovpen for master.

It involves the following packages:

  • Configuration/PyReleaseValidation (pdmv, upgrade)

@jordan-martins, @bbilin, @wajidalikhan, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen can you please review it and eventually sign? Thanks.
@makortel, @kpedro88, @Martin-Grunewald, @missirol, @fabiocos, @slomeo this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@kskovpen
Contributor Author

test parameters:

  • workflow = 12424.0

@kskovpen
Contributor Author

please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-37603/29371

@cmsbuild
Contributor

Pull request #37603 was updated. @jordan-martins, @bbilin, @wajidalikhan, @cmsbuild, @AdrianoDee, @srimanob, @kskovpen can you please check and sign again.

@kskovpen
Contributor Author

test parameters:

  • workflow = 12424.0

@kskovpen
Contributor Author

please test

@jfernan2
Contributor

@kskovpen could you please quote the error you get in EcalDQMonitorClient:ecalMonitorClient and a recipe to reproduce it?
Thanks

'-n':'10',
'--eventcontent':'FEVTDEBUGHLT',
'--geometry' : geom,
'--outputCommands' : '"drop *_*_*_GEN,drop *_*_*_DIGI2RAW"'
Contributor Author

Thanks @jfernan2 ! You can reproduce the DQM crash by replacing the drop statements here with:
"drop *_*_*_GEN,drop *_*_*_SIM,drop *_*_*_DIGI2RAW" and running 12424.0. The error message at the last step is:

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.

Sun Apr 17 20:58:01 CEST 2022
Thread 2 (Thread 0x7f071393e700 (LWP 1782) "cmsRun"):
#0 0x00007f073c5e41d9 in waitpid () from /lib64/libpthread.so.0
#1 0x00007f07362665e7 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f073626712a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f073cbe2bf4 in std::execute_native_thread_routine (__p=0x7f07323e4600) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4 0x00007f073c5dcea5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f073c305b0d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f073a458540 (LWP 1577) "cmsRun"):
#0 0x00007f073c2faddd in poll () from /lib64/libc.so.6
#1 0x00007f073626689f in full_read.constprop () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#2 0x00007f07362671fc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#3 0x00007f0736269a3b in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginFWCoreServicesPlugins.so
#4 <signal handler called>
#5 0x00007f06ee55dcbc in EcalCondObjectContainer::find(unsigned int) const () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libDQMEcalMonitorClient.so
#6 0x00007f06ee55ba65 in ecaldqm::IntegrityClient::producePlots(ecaldqm::DQWorkerClient::ProcessType) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libDQMEcalMonitorClient.so
#7 0x00007f06ee5bb5b4 in EcalDQMonitorClient::runWorkers(dqm::implementation::IGetter&, ecaldqm::DQWorkerClient::ProcessType) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginDQMEcalMonitorClientPlugins.so
#8 0x00007f06ee5bbf8d in EcalDQMonitorClient::dqmEndJob(dqm::implementation::IBooker&, dqm::implementation::IGetter&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginDQMEcalMonitorClientPlugins.so
#9 0x00007f06ee5bea34 in non-virtual thunk to DQMEDHarvester::endProcessBlockProduce(edm::ProcessBlock&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/pluginDQMEcalMonitorClientPlugins.so
#10 0x00007f073ed9f1e0 in edm::one::EDProducerBase::doEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#11 0x00007f073ed88a80 in edm::WorkerT<edm::one::EDProducerBase>::implDoEndProcessBlock(edm::ProcessBlockPrincipal const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#12 0x00007f073ec95567 in decltype ({parm#1}()) edm::convertException::wrap<edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}>(edm::Worker::runModule<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*)::{lambda()#1}) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#13 0x00007f073ec95960 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >(std::__exception_ptr::exception_ptr const*, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3>::Context const*) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#14 0x00007f073ec95f0a in void edm::SerialTaskQueueChain::actionToRun<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#15 0x00007f073ec95fe1 in edm::SerialTaskQueue::QueuedTask<edm::SerialTaskQueueChain::push<edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&>(tbb::detail::d1::task_group&, edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::ProcessBlockPrincipal, (edm::BranchActionType)3> >::execute()::{lambda()#1}&)::{lambda()#1}>::execute() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#16 0x00007f073eee6055 in tbb::detail::d1::function_task<edm::SerialTaskQueue::spawn(edm::SerialTaskQueue::TaskBase&)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreConcurrency.so
#17 0x00007f073d444a59 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7f06b7054c00, this=0x7f0738eafe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0-slc7_amd64_gcc11/build/CMSSW_12_3_0-build/BUILD/slc7_amd64_gcc11/external/tbb/v2021.4.0-0929d4245541a9360696e439234c1bfc/tbb-v2021.4.0/src/tbb/task_dispatcher.h:322
#18 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7f0738eafe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0-slc7_amd64_gcc11/build/CMSSW_12_3_0-build/BUILD/slc7_amd64_gcc11/external/tbb/v2021.4.0-0929d4245541a9360696e439234c1bfc/tbb-v2021.4.0/src/tbb/task_dispatcher.h:463
#19 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0-slc7_amd64_gcc11/build/CMSSW_12_3_0-build/BUILD/slc7_amd64_gcc11/external/tbb/v2021.4.0-0929d4245541a9360696e439234c1bfc/tbb-v2021.4.0/src/tbb/task_dispatcher.cpp:168
#20 0x00007f073ec624c3 in edm::EventProcessor::endProcessBlock(bool, bool) () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#21 0x00007f073ec667f9 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_3_0/lib/slc7_amd64_gcc11/libFWCoreFramework.so
#22 0x000000000040a18d in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#23 0x00007f073d432898 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_3_0-slc7_amd64_gcc11/build/CMSSW_12_3_0-build/BUILD/slc7_amd64_gcc11/external/tbb/v2021.4.0-0929d4245541a9360696e439234c1bfc/tbb-v2021.4.0/src/tbb/arena.cpp:698
#24 0x000000000040afd9 in main::{lambda()#1}::operator()() const ()
#25 0x00000000004096fc in main ()

Current Modules:

Module: EcalDQMonitorClient:ecalMonitorClient (crashed)
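
For anyone repeating the recipe, the two `--outputCommands` values differ only by the SIM entry. A small sketch that builds both strings from a list of process names follows; `drop_commands` is a hypothetical helper written for this comment and is not part of PyReleaseValidation.

```python
def drop_commands(process_names):
    """Build a quoted --outputCommands value that drops all products of the given processes."""
    drops = ",".join(f"drop *_*_*_{name}" for name in process_names)
    return f'"{drops}"'


# Value used in this PR: HARVESTING runs through.
working = drop_commands(["GEN", "DIGI2RAW"])

# Value that reproduces the crash in EcalDQMonitorClient:ecalMonitorClient.
crashing = drop_commands(["GEN", "SIM", "DIGI2RAW"])

print(working)
print(crashing)
```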

@@ -20,6 +20,10 @@
'2021PU',
'2021Design',
'2021DesignPU',
'2021HLT',
Contributor

We should not change the order here. New workflows should go at the end, i.e. after 2024.

Contributor Author

OK, will append it to the end.

@srimanob
Contributor

@kskovpen
Thanks for the PR. Do we need to define a new workflow, or can we just assign a new offset for it? With a new offset, you don't need to define a new Digi step: just drop HLT from it and add HLT as a separate step in relvals_upgrade.
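
The suggestion amounts to taking HLT out of the Digi step's sequence and running it as its own step. A rough sketch of that split on a cmsDriver-style `-s` value is shown here; both the helper `split_out_step` and the example sequence string are assumptions for illustration, since the actual Digi sequence is defined in PyReleaseValidation and is not quoted in this thread.

```python
def split_out_step(sequence, step_name):
    """Split a comma-separated '-s' value into (remaining steps, extracted step)."""
    parts = sequence.split(",")
    # A step may carry an argument after ':', e.g. 'HLT:@relval2021'.
    extracted = [p for p in parts if p.split(":", 1)[0] == step_name]
    remaining = [p for p in parts if p.split(":", 1)[0] != step_name]
    return ",".join(remaining), ",".join(extracted)


# Assumed example value, not copied from the PR.
digi_sequence = "DIGI:pdigi_valid,L1,DIGI2RAW,HLT:@relval2021"
digi_only, hlt_only = split_out_step(digi_sequence, "HLT")
print(digi_only)
print(hlt_only)
```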

@kskovpen
Contributor Author

Thanks @srimanob. I also thought that defining a full batch of new wfs would probably be overkill. I can put it in the offset wfs instead.

@cmsbuild
Contributor

-1

Failed Tests: RelVals
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-875d06/23978/summary.html
COMMIT: 1024290
CMSSW: CMSSW_12_4_X_2022-04-16-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/37603/23978/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-875d06/23978/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-875d06/23978/git-merge-result

RelVals

----- Begin Fatal Exception 17-Apr-2022 19:09:27 UTC-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'step3.root'
   Additional Info:
      [a] Input file file:step3.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 17-Apr-2022 19:09:28 UTC-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'step3.root'
   Additional Info:
      [a] Input file file:step3.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------
----- Begin Fatal Exception 17-Apr-2022 19:11:09 UTC-----------------------
An exception of category 'FileOpenError' occurred while
   [0] Constructing the EventProcessor
   [1] Constructing input source of type PoolSource
   [2] Calling RootInputFileSequence::initTheFile()
   [3] Calling StorageFactory::open()
   [4] Calling File::sysopen()
Exception Message:
Failed to open the file 'step3.root'
   Additional Info:
      [a] Input file file:step3.root could not be opened.
      [b] open() failed with system error 'No such file or directory' (error code 2)
----- End Fatal Exception -------------------------------------------------

@kskovpen
Contributor Author

OK, epic fail. I will create a few offset wfs.

@kskovpen
Contributor Author

Update: instead of creating a bunch of alternative wfs, add one test wf (11634.601) where the HLT step is separated out from DIGI.
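
For readers less familiar with the numbering, 11634.601 reads as base workflow 11634.0 plus a fractional offset of 0.601. The arithmetic can be sketched as below; `offset_workflow` is a hypothetical helper for illustration, not the PyReleaseValidation implementation, and it uses `Decimal` so the result does not pick up binary floating-point noise.

```python
from decimal import Decimal


def offset_workflow(base, offset):
    """Return the workflow number obtained by adding a fractional offset to a base number."""
    return str(Decimal(base) + Decimal(offset))


# Base workflow and offset as used for the single test wf in this PR.
test_wf = offset_workflow("11634.0", "0.601")
print(test_wf)
```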

@cmsbuild
Contributor

@@ -3486,7 +3486,7 @@ def gen2021HiMix(fragment,howMuch):
defaultDataSets['2026D49']='CMSSW_12_0_0_pre4-113X_mcRun4_realistic_v7_2026D49noPU-v'
defaultDataSets['2026D76']='CMSSW_12_0_0_pre4-113X_mcRun4_realistic_v7_2026D76noPU-v'
defaultDataSets['2026D77']='CMSSW_12_1_0_pre2-113X_mcRun4_realistic_v7_2026D77noPU-v'
defaultDataSets['2026D88']='CMSSW_12_3_0_pre5-123X_mcRun4_realistic_v4_2026D88noPU-v'
#defaultDataSets['2026D88']='CMSSW_12_2_0_pre3-122X_mcRun4_realistic_v4_2026D88noPU-v'
Contributor

Why do you need to disable this?

@@ -171,6 +171,7 @@ def condition(self, fragment, stepList, key, hasHarvest):
'GenSimHLBeamSpotHGCALCloseBy',
'Digi',
'DigiTrigger',
'HLT',
Contributor

HLT seems to be a very common name. Can it be more specific, e.g. HLTRun3?

@srimanob
Contributor

I see that Run3 FS is removed. Could you try with a clean IB release, i.e. the most recent one, CMSSW_12_4_X_2022-04-20-1100?

@kskovpen
Contributor Author

I am going to submit another PR. This one clashed with some specific 12_3_0 code. Closing this one.

@kskovpen closed this on Apr 20, 2022