Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable concurrent lumis and IOVs by default when number of streams is at least 2 #34231

Merged
merged 7 commits into from
Jul 3, 2021

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Jun 23, 2021

PR description:

Enable concurrent lumis and IOVs by default when number of streams is at least 2.

With the changes in this PR, the number of concurrent luminosity blocks will default to 2 if the number of streams is greater than 1. The number of concurrent IOVs will default to the number of concurrent luminosity blocks. These new defaults will apply if these parameters are not specified in the configuration or are specified to be zero. Previously, these parameters would default to 1.

If the number of concurrent luminosity blocks is greater than the number of streams, then it will be reset to the number of streams.

If the number of concurrent IOVs is greater than the number of concurrent luminosity blocks, then it will be reset to the number of concurrent luminosity blocks.

Merging this code will cause some existing configurations to start running processes with multiple concurrent luminosity blocks and IOVS. This is a significant change that will probably break some modules. I've tested this PR works in the Framework code. While testing this PR, I did not test outside the Framework beyond the limited runTheMatrix tests. Concurrent luminosity blocks have been implemented in the Framework since 2018 and concurrent IOVs have been implemented since 2019. There has been an ongoing effort since then to get code outside the Framework to support concurrent luminosity blocks and IOVs. Multiple people have participated in this. runTheMatrix has been running process with more than one concurrent luminosity block for a few months now. Known problems have been fixed. The plan of the Core group is that this PR represents the next step in this migration. It is expected there may be failures uncovered when this is merged. The way forward is approving this PR so we can identify those failures and then be able fix them.

PR validation:

Added new unit tests for the above. Fixed existing tests as necessary.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34231/23497

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

FWCore/Framework
FWCore/ParameterSet
IOMC/RandomEngine

@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @fabiocos this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@wddgit
Copy link
Contributor Author

wddgit commented Jun 23, 2021

please test

@cmsbuild
Copy link
Contributor

-1

Failed Tests: UnitTests
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-3d08d1/16214/summary.html
COMMIT: d844110
CMSSW: CMSSW_12_0_X_2021-06-23-1600/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34231/16214/install.sh to create a dev area with all the needed externals and cmssw changes.

Unit Tests

I found errors in the following unit tests:

---> test test_PixelBaryCentreTool had ERRORS

Comparison Summary

The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1264 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2785631
  • DQMHistoTests: Total failures: 3675
  • DQMHistoTests: Total nulls: 19
  • DQMHistoTests: Total successes: 2781915
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -45.703 KiB( 37 files compared)
  • DQMHistoSizes: changed ( 140.53 ): -44.531 KiB Hcal/DigiRunHarvesting
  • DQMHistoSizes: changed ( 140.53 ): -1.172 KiB RPC/DCSInfo
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

The differences are all in 140.53 that had different input files. The test_PixelBaryCentreTool test fails already in IBs.

@makortel
Copy link
Contributor

The changes look good to me, but I'd like @Dr15Jones to take a look as well (once he's back).

@wddgit Could you change the PR description for something along "Enable concurrent lumis and IOVs by default when number of streams is at least 2" to be more clear on the impact for release notes?

@wddgit wddgit changed the title Modify options setting concurrent lumis and IOVs (COULD BREAK THINGS!) Enable concurrent lumis and IOVs by default when number of streams is at least 2 Jun 24, 2021
@wddgit
Copy link
Contributor Author

wddgit commented Jun 24, 2021

I replaced the title with your suggested text and also added it as the first line of the description. That is better. Thanks.

nConcurrentLumis = 1;
nConcurrentRuns = 1;
}
if (nThreads > 1 or nStreams > 1) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this to be within an else block from if(dumpOptions)? That way we would only print the info once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. This seems better. We don't need the LogInfo stuff in the unit test output.

I viewed the dumpOptions as something for Framework unit tests. I am wondering if it could get used for other purposes... I tried to leave the other output unmodified, just in case there is something that uses it and depends on it being in the log file.

(I'll push all the commits with changes when I get them all done.)

nStreams = 1;
nConcurrentLumis = 1;
nConcurrentRuns = 1;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a LogWarning if this changes the settings. I know we didn't have that before, but that was probably bad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, I added the LogWarning. Thanks.

@@ -41,7 +41,9 @@ namespace edm {

std::shared_ptr<EventSetupProvider> EventSetupsController::makeProvider(ParameterSet& iPSet,
ActivityRegistry* activityRegistry,
ParameterSet const* eventSetupPset) {
ParameterSet const* eventSetupPset,
unsigned int nConcurrentLumis,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we change this to maxConcurrency instead? That better explains what is happening without forcing in the code at this point the policy that maxConcurrency == nConcurrentLumis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some improvements here. The behavior is exactly the same, but hopefully it is more readable and understandable. Below are the changes in EventProcessor.cc with appropriate changes also in the lower level code (I'll push all these commits as soon as I finish all of them). I noticed we no longer need the separate function setMaxConcurrentIOVs since I moved the code that resets things if there is a looper.

+++ b/FWCore/Framework/src/EventProcessor.cc
@@ -414,6 +414,9 @@ namespace edm {
         edm::LogInfo("ThreadStreamSetup") << "setting # threads " << nThreads << "\nsetting # streams " << nStreams;
       }
     }
+    // The number of concurrent IOVs is configured individually for each record in
+    // the class NumberOfConcurrentIOVs to values less than or equal to this.
+    unsigned int maxConcurrentIOVs = nConcurrentLumis;
 
     //Check that relationships between threading parameters makes sense
     /*
@@ -455,7 +458,7 @@ namespace edm {
     // intialize the event setup provider
     ParameterSet const& eventSetupPset(optionsPset.getUntrackedParameterSet("eventSetup"));
     esp_ = espController_->makeProvider(
-        *parameterSet, items.actReg_.get(), &eventSetupPset, nConcurrentLumis, dumpOptions);
+        *parameterSet, items.actReg_.get(), &eventSetupPset, maxConcurrentIOVs, dumpOptions);
 
     // initialize the looper, if any
     if (!loopers.empty()) {
@@ -466,7 +469,6 @@ namespace edm {
       // in presence of looper do not delete modules
       deleteNonConsumedUnscheduledModules_ = false;
     }
-    espController_->setMaxConcurrentIOVs(nStreams, nConcurrentLumis);

@@ -88,7 +88,9 @@ namespace edm {

std::shared_ptr<EventSetupProvider> makeProvider(ParameterSet&,
ActivityRegistry*,
ParameterSet const* eventSetupPset = nullptr);
ParameterSet const* eventSetupPset = nullptr,
unsigned int nConcurrentLumis = 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxConcurrency here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

@@ -16,11 +16,16 @@ namespace edm {

NumberOfConcurrentIOVs::NumberOfConcurrentIOVs() : numberConcurrentIOVs_(1) {}

void NumberOfConcurrentIOVs::readConfigurationParameters(ParameterSet const* eventSetupPset) {
void NumberOfConcurrentIOVs::readConfigurationParameters(ParameterSet const* eventSetupPset,
unsigned int nConcurrentLumis,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxConcurrency here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment above

@@ -36,6 +36,11 @@
<use name="FWCore/Utilities"/>
</bin>

<bin name="TestFWCoreFrameworkOptions" file="TestDriver.cpp">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is scram support now for specifying running a .sh script as part of a test without doing the TestDriver.cpp approach. I suggest switching to that since it reduces the number of executables we need.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smuzaffar I am not familiar with this (I've never seen it or forgot...). Is there is an example or documentation somewhere? Does it work the same way and set the same environmental variables as the old way? Here is the script I want to run:

#!/bin/bash

 function die { echo Failure $1: status $2 ; exit $2 ; }

 pushd ${LOCAL_TMP_DIR}

 echo testOptions1_cfg.py
 cmsRun -p ${LOCAL_TEST_DIR}/testOptions1_cfg.py >& testOptions1.log || die "cmsRun testOptions1_cfg.py" $?
 grep "Number of Streams = 1" testOptions1.log || die "Failed number of streams test" $?
 grep "Number of Concurrent Lumis = 1" testOptions1.log || die "Failed number of concurrent lumis test" $?
 grep "Number of Concurrent IOVs = 1" testOptions1.log || die "Failed number of concurrent IOVs test" $?

 echo testOptions2_cfg.py
 cmsRun -p ${LOCAL_TEST_DIR}/testOptions2_cfg.py >& testOptions2.log || die "cmsRun testOptions2_cfg.py" $?
 grep "Number of Streams = 5" testOptions2.log || die "Failed number of streams test" $?
 grep "Number of Concurrent Lumis = 4" testOptions2.log || die "Failed number of concurrent lumis test" $?
 grep "Number of Concurrent IOVs = 3" testOptions2.log || die "Failed number of concurrent IOVs test" $?

 echo testOptions3_cfg.py
 cmsRun -p ${LOCAL_TEST_DIR}/testOptions3_cfg.py >& testOptions3.log || die "cmsRun testOptions3_cfg.py" $?
 grep "Number of Streams = 6" testOptions3.log || die "Failed number of streams test" $?
 grep "Number of Concurrent Lumis = 2" testOptions3.log || die "Failed number of concurrent lumis test" $?
 grep "Number of Concurrent IOVs = 2" testOptions3.log || die "Failed number of concurrent IOVs test" $?

 echo testOptions4_cfg.py
 cmsRun -p ${LOCAL_TEST_DIR}/testOptions4_cfg.py >& testOptions4.log || die "cmsRun testOptions4_cfg.py" $?
 grep "Number of Streams = 6" testOptions4.log || die "Failed number of streams test" $?
 grep "Number of Concurrent Lumis = 6" testOptions4.log || die "Failed number of concurrent lumis test" $?
 grep "Number of Concurrent IOVs = 6" testOptions4.log || die "Failed number of concurrent IOVs test" $?

 echo testOptions5_cfg.py
 cmsRun -p ${LOCAL_TEST_DIR}/testOptions5_cfg.py >& testOptions5.log || die "cmsRun testOptions5_cfg.py" $?
 grep "Number of Streams = 1" testOptions5.log || die "Failed number of streams test" $?
 grep "Number of Concurrent Lumis = 1" testOptions5.log || die "Failed number of concurrent lumis test" $?
 grep "Number of Concurrent IOVs = 1" testOptions5.log || die "Failed number of concurrent IOVs test" $?

 echo testOptions6_cfg.py
 cmsRun -p ${LOCAL_TEST_DIR}/testOptions6_cfg.py >& testOptions6.log || die "cmsRun testOptions6_cfg.py" $?
 # Cannot run the grep tests because by default the options are not dumped.
 # You can however run this manually with a debugger and check (which was done)
 # And also just run it and see it doesn't crash...

 rm testOptions1.log
 rm testOptions2.log
 rm testOptions3.log
 rm testOptions4.log
 rm testOptions5.log
 rm testOptions6.log

 popd

 exit 0

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here

<test name="testFWCoreSharedMemoryMonitorThreadSignals" command="test_monitor_thread_signals.sh">
<flags PRE_TEST="testFWCoreSharedMemoryMonitorThread"/>
</test>

<test name="testFWCoreFrameworkNonEventOrdering" command="test_non_event_ordering.sh"/>
<test name="testFWCoreFramework1ThreadESPrefetch" command="run_test_1_thread_es_prefetching.sh"/>
<test name="testFWCoreFrameworkModuleDeletion" command="run_module_delete_tests.sh"/>

<test name="TestFWCoreIntegrationInterProcess" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/test_TestInterProcessProd_cfg.py"/>
<test name="TestFWCoreIntegrationPutOrMerge" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/putOrMergeTest_cfg.py"/>
<test name="TestFWCoreIntegrationInputSourceAlias" command="cmsRun ${LOCALTOP}//src/FWCore/Integration/test/inputSource_alias_Test_cfg.py"/>

Copy link
Contributor

@smuzaffar smuzaffar Jun 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wddgit , it works nearly the same way ( see examples of it https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PythonAnalysis/test/BuildFile.xml ) but it does not set the env variables set here https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/TestHelper.cc#L149-L164 . But most of these you can calculate within the script itself. So basically you just need to use

<test name="TestFWCoreFrameworkOptions" command="your-script-in-test.sh args"/>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wddgit , if the tests in your script do not depend on each other then better to have a small script e.g.

  • runtest.sh
 echo $1
 cmsRun -p ${CMSSW_BASE}/src/FWCore/Framework/test/$1 >& ${1}.log || die "cmsRun$1" $?
 grep "Number of Streams = $2" ${1}.log || die "Failed number of streams test" $?
 grep "Number of Concurrent Lumis = $3" ${1}.log || die "Failed number of concurrent lumis test" $?
 grep "Number of Concurrent IOVs = $4" ${1}.log || die "Failed number of concurrent IOVs test" $?

and then add multiple tests e.g.

<test name="TestFWCoreFrameworkOptions1" command="runtest.sh testOptions1_cfg.py 1 1 1"/>
<test name="TestFWCoreFrameworkOptions2" command="runtest.sh testOptions2_cfg.py 5 4 3"/>
...

this way scram can run these in parallel. Only thing you need to make sure that these tests do not write in to same output file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mostly did what was suggested here. The main difference is instead of putting so much in the BuildFile I passed only a single index to identify the test and put the config file names and expected results in arrays in the bash script. It seemed to be too much to put expected results in a BuildFile.

Overall, this seems much better. Thanks.

)
process.options = dict (
dumpOptions = True
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to do process.options.dumpOptions = True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It works. It is a little more concise. Thanks.


process.options = dict(
dumpOptions = True,
numberOfThreads = 6,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should probably keep the # threads to 4 or less as many IB VMs are only 4 threaded machines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I reduced the number of threads to 4 when it was greater than 4 in this file and the other configurations as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering. For each individual test I limited this to 4 concurrent threads. But does this resolve all the problems if scram is running multiple tests concurrently? I was wondering how that works. Does each test run on a different VM?

@@ -61,6 +73,8 @@ namespace edm {
description.addUntracked<std::vector<std::string>>("canDeleteEarly", emptyVector)
->setComment("Branch names of products that the Framework can try to delete before the end of the Event");

description.addUntracked<bool>("dumpOptions", false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should setComment here as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 152 to 155
LogAbsolute("Options") << "Number of Threads = " << nThreads;
LogAbsolute("Options") << "Number of Streams = " << nStreams;
LogAbsolute("Options") << "Number of Concurrent Lumis = " << nConcurrentLumis;
LogAbsolute("Options") << "Number of Concurrent Runs = " << nConcurrentRuns;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
LogAbsolute("Options") << "Number of Threads = " << nThreads;
LogAbsolute("Options") << "Number of Streams = " << nStreams;
LogAbsolute("Options") << "Number of Concurrent Lumis = " << nConcurrentLumis;
LogAbsolute("Options") << "Number of Concurrent Runs = " << nConcurrentRuns;
LogAbsolute("Options") << "Number of Threads = " << nThreads
<< "\nNumber of Streams = " << nStreams
<< "\nNumber of Concurrent Lumis = " << nConcurrentLumis
<< "\nNumber of Concurrent Runs = " << nConcurrentRuns;

This is much less work for the logging system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks.

wddgit added 4 commits June 29, 2021 20:26
The unit tests are still testing the same things.
Use new scram method of running a bash script, more
concise way to set parameters, renumber the config file
names to start at 0, keep the number of threads less than
4. Tests can run concurrently now. Deleted the last configuration
as it was doing the same thing as the first.
Also formatted a message logger call in a more efficient way
@makortel
Copy link
Contributor

makortel commented Jul 2, 2021

+1

The unit test fails in IBs, RelVals-INPUT are caused by timeouts, and the comparison differences are in MessageLogger.

@cmsbuild
Copy link
Contributor

cmsbuild commented Jul 2, 2021

This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Jul 3, 2021

+1

@qliphy
Copy link
Contributor

qliphy commented Jul 3, 2021

merge

@cmsbuild cmsbuild merged commit 3dba761 into cms-sw:master Jul 3, 2021
@Martin-Grunewald
Copy link
Contributor

Martin-Grunewald commented Jul 6, 2021

Looks like this PR creates problems in HLT validation tests, from the IB onward where this PR was merged, see, eg:

https://cmssdt.cern.ch/SDT/jenkins-artifacts/HLT-Validation/CMSSW_12_0_X_2021-07-03-1100/slc7_amd64_gcc900/runIB.log

which shows several TSG tests failing with exit status: 134

@Dr15Jones
Copy link
Contributor

@Martin-Grunewald wrote

Looks like this PR creates problems in HLT validation tests, from the IB onward where this PR was merged, see, eg:

The problem appears to be in the DQM: see https://cmssdt.cern.ch/SDT/jenkins-artifacts/HLT-Validation/CMSSW_12_0_X_2021-07-05-1100/slc7_amd64_gcc900/RelVal_RECO_PIon_MC.log

where the relevant bits of the log are

cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/0605d3d065eefb97a5ac4cfc511ce86e/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/src/DQMServices/Core/src/DQMStore.cc:433: void dqm::implementation::DQMStore::initLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t): Assertion `!assertLegacySafe_' failed.

and the traceback

#8  0x00002b3a2701e252 in __assert_fail () from /lib64/libc.so.6
#9  0x00002b3a3290868c in dqm::implementation::DQMStore::initLumi(unsigned int, unsigned int, unsigned long) () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libDQMServicesCore.so
#10 0x00002b3a32908762 in dqm::implementation::DQMStore::initLumi(unsigned int, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libDQMServicesCore.so
#11 0x00002b3a32908c0a in std::_Function_handler<void (edm::GlobalContext const&), dqm::implementation::DQMStore::DQMStore(edm::ParameterSet const&, edm::ActivityRegistry&)::{lambda(edm::GlobalContext const&)#2}>::_M_invoke(std::_Any_data const&, edm::GlobalContext const&) () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libDQMServicesCore.so
#12 0x00002b3a248f2598 in void edm::signalslot::Signal<void (edm::GlobalContext const&)>::emit<edm::GlobalContext const&>(edm::GlobalContext const&) const () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#13 0x00002b3a248fd400 in void edm::GlobalSchedule::processOneGlobalAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)0> >(edm::WaitingTaskHolder, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)0>::TransitionInfoType&, edm::ServiceToken const&, bool) () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#14 0x00002b3a248fd902 in void edm::beginGlobalTransitionAsync<edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)0> >(edm::WaitingTaskHolder, edm::Schedule&, edm::OccurrenceTraits<edm::LuminosityBlockPrincipal, (edm::BranchActionType)0>::TransitionInfoType&, edm::ServiceToken const&, std::vector<edm::SubProcess, std::allocator<edm::SubProcess> >&) () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so
#15 0x00002b3a248e4014 in edm::EventProcessor::beginLumiAsync(edm::IOVSyncValue const&, std::shared_ptr<void> const&, edm::WaitingTaskHolder)::{lambda(edm::LimitedTaskQueue::Resumer)#1}::operator()(edm::LimitedTaskQueue::Resumer)::{lambda()#1}::operator()() () from /cvmfs/cms-ib.cern.ch/nweek-02688/slc7_amd64_gcc900/cms/cmssw/CMSSW_12_0_X_2021-07-04-2300/lib/slc7_amd64_gcc900/libFWCoreFramework.so

@makortel
Copy link
Contributor

makortel commented Jul 6, 2021

Thanks @Martin-Grunewald. The failing tests are

14:03:36 cmsRun RelVal_RECO_GRun_MC.py >& RelVal_RECO_GRun_MC.log
14:09:07 exit status: 134

14:30:11 cmsRun RelVal_RECO_PIon_MC.py >& RelVal_RECO_PIon_MC.log
14:35:13 exit status: 134

14:43:09 cmsRun RelVal_RECO_PRef_MC.py >& RelVal_RECO_PRef_MC.log
14:47:56 exit status: 134

13:51:37 cmsRun RelVal_RECO_HIon_DATA.py >& RelVal_RECO_HIon_DATA.log
13:54:50 exit status: 134

They all fail with the following assertion

cmsRun: /data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/838f283a9208e8bb4a3e4f29303019ca/opt/cmssw/slc7_amd64_gcc900/cms/cmssw-patch/CMSSW_12_0_X_2021-07-03-1100/src/DQMServices/Core/src/DQMStore.cc:433: void dqm::implementation::DQMStore::initLumi(edm::RunNumber_t, edm::LuminosityBlockNumber_t, uint64_t): Assertion `!assertLegacySafe_' failed.

From https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/README.md#the-dqmstore it is a configuration option of DQMStore (defaulting to true). My vague recollection is that the option exists to support some DQM code that does not work with concurrent lumis.

None of these jobs print out a list of modules incompatible with concurrent lumis, so likely the assertion in DQMStore could be disabled here.

@cms-sw/dqm-l2 How about setting DQMStore.assertLegacySafe = False by default, and disabling the assertion only for those jobs that really need it? (we should migrate away from such "legacy code" anyway) As a quick fix that option could be set in these jobs explicitly.

@makortel
Copy link
Contributor

makortel commented Jul 6, 2021

@smuzaffar @Martin-Grunewald It seems that failures in HLT validation jobs are not signaled in the IB dashboard. Could we change that so that any problems there would become visible there (similar to how existence of failures in FWLite tests are shown)?

@smuzaffar
Copy link
Contributor

@smuzaffar @Martin-Grunewald It seems that failures in HLT validation jobs are not signaled in the IB dashboard. Could we change that so that any problems there would become visible there (similar to how existence of failures in FWLite tests are shown)?

that is correct @makortel . The tests are internally run by HLT driver script, I guess I can just search for 'exit status: ' lines in the logs to mark it passded/failed ... @Martin-Grunewald will this be enough or is there any thing else which will help identifying the state of the test?

@Martin-Grunewald
Copy link
Contributor

In the overall log file of the overall test (jenkins.log, runIB.log) look for that indeed!

@makortel
Copy link
Contributor

makortel commented Jul 6, 2021

Thanks @smuzaffar!

@jfernan2
Copy link
Contributor

jfernan2 commented Jul 6, 2021

@cms-sw/dqm-l2 How about setting DQMStore.assertLegacySafe = False by default, and disabling the assertion only for
those jobs that really need it? (we should migrate away from such "legacy code" anyway) As a quick fix that option could be set
in these jobs explicitly.

Hi, I have created a PR with this change:
#34366
Thanks for the suggestion!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants