Enable concurrent lumis and IOVs by default when number of streams is at least 2 #34231
Conversation
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34231/23497
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages: FWCore/Framework
@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here
please test
-1
Failed Tests: UnitTests
Unit Tests: I found errors in the following unit tests: ---> test test_PixelBaryCentreTool had ERRORS
Comparison Summary: The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons.
The differences are all in 140.53 that had different input files.
The changes look good to me, but I'd like @Dr15Jones to take a look as well (once he's back). @wddgit Could you change the PR description to something along the lines of "Enable concurrent lumis and IOVs by default when number of streams is at least 2", to be clearer on the impact for release notes?
I replaced the title with your suggested text and also added it as the first line of the description. That is better. Thanks. |
nConcurrentLumis = 1;
nConcurrentRuns = 1;
}
if (nThreads > 1 or nStreams > 1) {
Should we move this to be within an else block from if (dumpOptions)? That way we would only print the info once.
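A minimal stand-alone sketch of that restructuring (the function name, message text, and returning a string in place of the framework's LogInfo/LogAbsolute calls are all illustrative, not the actual framework code):

```cpp
#include <sstream>
#include <string>

// Illustrative sketch: emit the thread/stream summary only when dumpOptions
// is false, so the information is printed once rather than twice.
std::string reportOptions(bool dumpOptions, unsigned int nThreads, unsigned int nStreams) {
  std::ostringstream out;
  if (dumpOptions) {
    // the dumpOptions path prints its own detailed summary
    out << "Number of Threads = " << nThreads;
  } else if (nThreads > 1 or nStreams > 1) {
    // previously printed unconditionally, even alongside the dump
    out << "setting # threads " << nThreads << "\nsetting # streams " << nStreams;
  }
  return out.str();
}
```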
Done. This seems better. We don't need the LogInfo stuff in the unit test output.
I viewed the dumpOptions as something for Framework unit tests. I am wondering if it could get used for other purposes... I tried to leave the other output unmodified, just in case there is something that uses it and depends on it being in the log file.
(I'll push all the commits with changes when I get them all done.)
nStreams = 1;
nConcurrentLumis = 1;
nConcurrentRuns = 1;
}
We should probably add a LogWarning if this changes the settings. I know we didn't have that before, but that was probably bad.
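A hedged sketch of the clamp-and-warn idea (the helper name and struct are invented for illustration; only the shape of the logic comes from the discussion above):

```cpp
#include <algorithm>

// Illustrative helper, not the actual framework code: clamp a requested
// concurrency value to a limit and report whether it was actually changed,
// so the caller can emit a warning only when a configured setting was reduced.
struct ClampResult {
  unsigned int value;
  bool changed;
};

ClampResult clampConcurrency(unsigned int requested, unsigned int limit) {
  unsigned int v = std::min(requested, limit);
  return {v, v != requested};
}
```

In the real code the caller would presumably emit something like edm::LogWarning("Options") << ... when changed is true; that call and its wording are assumptions here.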
Done, I added the LogWarning. Thanks.
@@ -41,7 +41,9 @@ namespace edm {
 std::shared_ptr<EventSetupProvider> EventSetupsController::makeProvider(ParameterSet& iPSet,
                                                                         ActivityRegistry* activityRegistry,
-                                                                        ParameterSet const* eventSetupPset) {
+                                                                        ParameterSet const* eventSetupPset,
+                                                                        unsigned int nConcurrentLumis,
What if we change this to maxConcurrency instead? That better explains what is happening without forcing in the code at this point the policy that maxConcurrency == nConcurrentLumis.
I made some improvements here. The behavior is exactly the same, but hopefully it is more readable and understandable. Below are the changes in EventProcessor.cc with appropriate changes also in the lower level code (I'll push all these commits as soon as I finish all of them). I noticed we no longer need the separate function setMaxConcurrentIOVs since I moved the code that resets things if there is a looper.
+++ b/FWCore/Framework/src/EventProcessor.cc
@@ -414,6 +414,9 @@ namespace edm {
edm::LogInfo("ThreadStreamSetup") << "setting # threads " << nThreads << "\nsetting # streams " << nStreams;
}
}
+ // The number of concurrent IOVs is configured individually for each record in
+ // the class NumberOfConcurrentIOVs to values less than or equal to this.
+ unsigned int maxConcurrentIOVs = nConcurrentLumis;
//Check that relationships between threading parameters makes sense
/*
@@ -455,7 +458,7 @@ namespace edm {
// intialize the event setup provider
ParameterSet const& eventSetupPset(optionsPset.getUntrackedParameterSet("eventSetup"));
esp_ = espController_->makeProvider(
- *parameterSet, items.actReg_.get(), &eventSetupPset, nConcurrentLumis, dumpOptions);
+ *parameterSet, items.actReg_.get(), &eventSetupPset, maxConcurrentIOVs, dumpOptions);
// initialize the looper, if any
if (!loopers.empty()) {
@@ -466,7 +469,6 @@ namespace edm {
// in presence of looper do not delete modules
deleteNonConsumedUnscheduledModules_ = false;
}
- espController_->setMaxConcurrentIOVs(nStreams, nConcurrentLumis);
@@ -88,7 +88,9 @@ namespace edm {
 std::shared_ptr<EventSetupProvider> makeProvider(ParameterSet&,
                                                  ActivityRegistry*,
-                                                 ParameterSet const* eventSetupPset = nullptr);
+                                                 ParameterSet const* eventSetupPset = nullptr,
+                                                 unsigned int nConcurrentLumis = 0,
maxConcurrency here as well.
See comment above
@@ -16,11 +16,16 @@ namespace edm {
 NumberOfConcurrentIOVs::NumberOfConcurrentIOVs() : numberConcurrentIOVs_(1) {}
-void NumberOfConcurrentIOVs::readConfigurationParameters(ParameterSet const* eventSetupPset) {
+void NumberOfConcurrentIOVs::readConfigurationParameters(ParameterSet const* eventSetupPset,
+                                                         unsigned int nConcurrentLumis,
maxConcurrency here.
See comment above
FWCore/Framework/test/BuildFile.xml (outdated)
@@ -36,6 +36,11 @@
   <use name="FWCore/Utilities"/>
 </bin>
+<bin name="TestFWCoreFrameworkOptions" file="TestDriver.cpp">
There is scram support now for specifying running a .sh script as part of a test without doing the TestDriver.cpp approach. I suggest switching to that since it reduces the number of executables we need.
@smuzaffar I am not familiar with this (I've never seen it, or forgot...). Is there an example or documentation somewhere? Does it work the same way and set the same environment variables as the old way? Here is the script I want to run:
#!/bin/bash
function die { echo Failure $1: status $2 ; exit $2 ; }
pushd ${LOCAL_TMP_DIR}
echo testOptions1_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions1_cfg.py >& testOptions1.log || die "cmsRun testOptions1_cfg.py" $?
grep "Number of Streams = 1" testOptions1.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 1" testOptions1.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 1" testOptions1.log || die "Failed number of concurrent IOVs test" $?
echo testOptions2_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions2_cfg.py >& testOptions2.log || die "cmsRun testOptions2_cfg.py" $?
grep "Number of Streams = 5" testOptions2.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 4" testOptions2.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 3" testOptions2.log || die "Failed number of concurrent IOVs test" $?
echo testOptions3_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions3_cfg.py >& testOptions3.log || die "cmsRun testOptions3_cfg.py" $?
grep "Number of Streams = 6" testOptions3.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 2" testOptions3.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 2" testOptions3.log || die "Failed number of concurrent IOVs test" $?
echo testOptions4_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions4_cfg.py >& testOptions4.log || die "cmsRun testOptions4_cfg.py" $?
grep "Number of Streams = 6" testOptions4.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 6" testOptions4.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 6" testOptions4.log || die "Failed number of concurrent IOVs test" $?
echo testOptions5_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions5_cfg.py >& testOptions5.log || die "cmsRun testOptions5_cfg.py" $?
grep "Number of Streams = 1" testOptions5.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 1" testOptions5.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 1" testOptions5.log || die "Failed number of concurrent IOVs test" $?
echo testOptions6_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions6_cfg.py >& testOptions6.log || die "cmsRun testOptions6_cfg.py" $?
# Cannot run the grep tests because by default the options are not dumped.
# You can however run this manually with a debugger and check (which was done)
# And also just run it and see it doesn't crash...
rm testOptions1.log
rm testOptions2.log
rm testOptions3.log
rm testOptions4.log
rm testOptions5.log
rm testOptions6.log
popd
exit 0
Here:

cmssw/FWCore/SharedMemory/test/BuildFile.xml, lines 17 to 19 in 98e05a9:

<test name="testFWCoreSharedMemoryMonitorThreadSignals" command="test_monitor_thread_signals.sh">
  <flags PRE_TEST="testFWCoreSharedMemoryMonitorThread"/>
</test>

cmssw/FWCore/Framework/test/BuildFile.xml, lines 393 to 395 in 98e05a9:

<test name="testFWCoreFrameworkNonEventOrdering" command="test_non_event_ordering.sh"/>
<test name="testFWCoreFramework1ThreadESPrefetch" command="run_test_1_thread_es_prefetching.sh"/>
<test name="testFWCoreFrameworkModuleDeletion" command="run_module_delete_tests.sh"/>

cmssw/FWCore/Integration/test/BuildFile.xml, lines 447 to 451 in 98e05a9:

<test name="TestFWCoreIntegrationInterProcess" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/test_TestInterProcessProd_cfg.py"/>
<test name="TestFWCoreIntegrationPutOrMerge" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/putOrMergeTest_cfg.py"/>
<test name="TestFWCoreIntegrationInputSourceAlias" command="cmsRun ${LOCALTOP}//src/FWCore/Integration/test/inputSource_alias_Test_cfg.py"/>
@wddgit, it works nearly the same way (see examples of it at https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PythonAnalysis/test/BuildFile.xml), but it does not set the env variables set here: https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/TestHelper.cc#L149-L164. Most of these you can calculate within the script itself. So basically you just need to use
<test name="TestFWCoreFrameworkOptions" command="your-script-in-test.sh args"/>
@wddgit, if the tests in your script do not depend on each other, then it is better to have a small script, e.g. runtest.sh:
echo $1
cmsRun -p ${CMSSW_BASE}/src/FWCore/Framework/test/$1 >& ${1}.log || die "cmsRun$1" $?
grep "Number of Streams = $2" ${1}.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = $3" ${1}.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = $4" ${1}.log || die "Failed number of concurrent IOVs test" $?
and then add multiple tests e.g.
<test name="TestFWCoreFrameworkOptions1" command="runtest.sh testOptions1_cfg.py 1 1 1"/>
<test name="TestFWCoreFrameworkOptions2" command="runtest.sh testOptions2_cfg.py 5 4 3"/>
...
This way scram can run these in parallel. The only thing you need to make sure of is that these tests do not write to the same output file.
I mostly did what was suggested here. The main difference is instead of putting so much in the BuildFile I passed only a single index to identify the test and put the config file names and expected results in arrays in the bash script. It seemed to be too much to put expected results in a BuildFile.
Overall, this seems much better. Thanks.
)
process.options = dict (
    dumpOptions = True
)
You should be able to do process.options.dumpOptions = True
Done. It works. It is a little more concise. Thanks.
process.options = dict(
    dumpOptions = True,
    numberOfThreads = 6,
Should probably keep the # threads to 4 or less, as many IB VMs are only 4-threaded machines.
Done. I reduced the number of threads to 4 when it was greater than 4 in this file and the other configurations as well.
Just wondering. For each individual test I limited this to 4 concurrent threads. But does this resolve all the problems if scram is running multiple tests concurrently? I was wondering how that works. Does each test run on a different VM?
@@ -61,6 +73,8 @@ namespace edm {
   description.addUntracked<std::vector<std::string>>("canDeleteEarly", emptyVector)
       ->setComment("Branch names of products that the Framework can try to delete before the end of the Event");
+  description.addUntracked<bool>("dumpOptions", false);
Should setComment here as well.
Done.
LogAbsolute("Options") << "Number of Threads = " << nThreads;
LogAbsolute("Options") << "Number of Streams = " << nStreams;
LogAbsolute("Options") << "Number of Concurrent Lumis = " << nConcurrentLumis;
LogAbsolute("Options") << "Number of Concurrent Runs = " << nConcurrentRuns;
Suggested change:
LogAbsolute("Options") << "Number of Threads = " << nThreads
                       << "\nNumber of Streams = " << nStreams
                       << "\nNumber of Concurrent Lumis = " << nConcurrentLumis
                       << "\nNumber of Concurrent Runs = " << nConcurrentRuns;
This is much less work for the logging system.
Done. Thanks.
The unit tests are still testing the same things. Use the new scram method of running a bash script, use a more concise way to set parameters, renumber the config file names to start at 0, and keep the number of threads at 4 or less. Tests can run concurrently now. Deleted the last configuration as it was doing the same thing as the first.
Also formatted a message logger call in a more efficient way.
+1. The unit test fails in IBs and RelVals-INPUT are caused by timeouts; the comparison differences are in MessageLogger.
This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
merge |
Looks like this PR creates problems in HLT validation tests, from the IB onward where this PR was merged; see, e.g.: which shows several TSG tests failing with exit status: 134
@Martin-Grunewald wrote
The problem appears to be in the DQM: see https://cmssdt.cern.ch/SDT/jenkins-artifacts/HLT-Validation/CMSSW_12_0_X_2021-07-05-1100/slc7_amd64_gcc900/RelVal_RECO_PIon_MC.log where the relevant bits of the log are
and the traceback
Thanks @Martin-Grunewald. The failing tests are
They all fail with the following assertion
From https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/README.md#the-dqmstore it is a configuration option of [...]. None of these jobs print out a list of modules incompatible with concurrent lumis, so likely the assertion in [...].
@cms-sw/dqm-l2 How about setting [...]?
@smuzaffar @Martin-Grunewald It seems that failures in HLT validation jobs are not signaled in the IB dashboard. Could we change that so that any problems there would become visible (similar to how the existence of failures in FWLite tests is shown)?
That is correct @makortel. The tests are internally run by the HLT driver script; I guess I can just search for 'exit status: ' lines in the logs to mark it passed/failed. @Martin-Grunewald will this be enough, or is there anything else that would help identify the state of the test?
Indeed, look for that in the overall log file of the overall test (jenkins.log, runIB.log)!
Thanks @smuzaffar!
Hi, I have created a PR with this change: |
I was supposed to do this at the same time as cms-sw#35302 that followed cms-sw#34231 and the conclusion in cms-sw#33436
PR description:
Enable concurrent lumis and IOVs by default when number of streams is at least 2.
With the changes in this PR, the number of concurrent luminosity blocks will default to 2 if the number of streams is greater than 1. The number of concurrent IOVs will default to the number of concurrent luminosity blocks. These new defaults will apply if these parameters are not specified in the configuration or are specified to be zero. Previously, these parameters would default to 1.
If the number of concurrent luminosity blocks is greater than the number of streams, then it will be reset to the number of streams.
If the number of concurrent IOVs is greater than the number of concurrent luminosity blocks, then it will be reset to the number of concurrent luminosity blocks.
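The defaulting rules above can be sketched as a small stand-alone function (the names and the struct are illustrative, not the actual EventProcessor code; here a value of 0 means the parameter was not set in the configuration):

```cpp
#include <algorithm>

struct Concurrency {
  unsigned int streams;
  unsigned int lumis;
  unsigned int iovs;
};

// Illustrative sketch of the defaulting rules described in the PR description.
Concurrency resolveDefaults(unsigned int nStreams, unsigned int nLumis, unsigned int nIOVs) {
  // Concurrent lumis default to 2 when there is more than one stream.
  if (nLumis == 0)
    nLumis = (nStreams > 1) ? 2u : 1u;
  // Never more concurrent lumis than streams.
  nLumis = std::min(nLumis, nStreams);
  // Concurrent IOVs default to, and are capped at, the number of concurrent lumis.
  if (nIOVs == 0)
    nIOVs = nLumis;
  nIOVs = std::min(nIOVs, nLumis);
  return {nStreams, nLumis, nIOVs};
}
```

For example, under these rules a configuration with 4 streams and nothing else set would get 2 concurrent lumis and 2 concurrent IOVs, while a single-stream job keeps everything at 1.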
Merging this code will cause some existing configurations to start running processes with multiple concurrent luminosity blocks and IOVs. This is a significant change that will probably break some modules. I've tested that this PR works in the Framework code. While testing this PR, I did not test outside the Framework beyond the limited runTheMatrix tests. Concurrent luminosity blocks have been implemented in the Framework since 2018 and concurrent IOVs since 2019. There has been an ongoing effort since then to get code outside the Framework to support concurrent luminosity blocks and IOVs; multiple people have participated in this. runTheMatrix has been running processes with more than one concurrent luminosity block for a few months now, and known problems have been fixed. The plan of the Core group is that this PR represents the next step in this migration. It is expected that there may be failures uncovered when this is merged. The way forward is approving this PR so we can identify those failures and then be able to fix them.
PR validation:
Added new unit tests for the above. Fixed existing tests as necessary.