Enable concurrent lumis and IOVs by default when number of streams is at least 2 #34231
Conversation
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34231/23497
A new Pull Request was created by @wddgit (W. David Dagenhart) for master. It involves the following packages: FWCore/Framework
@makortel, @smuzaffar, @cmsbuild, @Dr15Jones can you please review it and eventually sign? Thanks. cms-bot commands are listed here
please test
-1
Failed Tests: UnitTests
Unit Tests: I found errors in the following unit tests: ---> test test_PixelBaryCentreTool had ERRORS
Comparison Summary: The workflows 140.53 have different files in step1_dasquery.log than the ones found in the baseline. You may want to check and retrigger the tests if necessary. You can check it in the "files" directory in the results of the comparisons.
The differences are all in 140.53 that had different input files.
The changes look good to me, but I'd like @Dr15Jones to take a look as well (once he's back). @wddgit Could you change the PR description to something along the lines of "Enable concurrent lumis and IOVs by default when number of streams is at least 2", to be clearer on the impact for release notes?
I replaced the title with your suggested text and also added it as the first line of the description. That is better. Thanks. |
nConcurrentLumis = 1;
nConcurrentRuns = 1;
}
if (nThreads > 1 or nStreams > 1) {
Should we move this to be within an else block from if (dumpOptions)? That way we would only print the info once.
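A minimal stand-alone sketch of that restructuring (the function name, message text, and returning a string in place of the framework's LogInfo/LogAbsolute calls are all illustrative, not the actual framework code):

```cpp
#include <sstream>
#include <string>

// Illustrative sketch: emit the thread/stream summary only when dumpOptions
// is false, so the information is printed once rather than twice.
std::string reportOptions(bool dumpOptions, unsigned int nThreads, unsigned int nStreams) {
  std::ostringstream out;
  if (dumpOptions) {
    // the dumpOptions path prints its own detailed summary
    out << "Number of Threads = " << nThreads;
  } else if (nThreads > 1 or nStreams > 1) {
    // previously printed unconditionally, even alongside the dump
    out << "setting # threads " << nThreads << "\nsetting # streams " << nStreams;
  }
  return out.str();
}
```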
Done. This seems better. We don't need the LogInfo stuff in the unit test output.
I viewed the dumpOptions as something for Framework unit tests. I am wondering if it could get used for other purposes... I tried to leave the other output unmodified, just in case there is something that uses it and depends on it being in the log file.
(I'll push all the commits with changes when I get them all done.)
nStreams = 1;
nConcurrentLumis = 1;
nConcurrentRuns = 1;
}
We should probably add a LogWarning if this changes the settings. I know we didn't have that before, but that was probably bad.
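A hedged sketch of the clamp-and-warn idea (the helper name and struct are invented for illustration; only the shape of the logic comes from the discussion above):

```cpp
#include <algorithm>

// Illustrative helper, not the actual framework code: clamp a requested
// concurrency value to a limit and report whether it was actually changed,
// so the caller can emit a warning only when a configured setting was reduced.
struct ClampResult {
  unsigned int value;
  bool changed;
};

ClampResult clampConcurrency(unsigned int requested, unsigned int limit) {
  unsigned int v = std::min(requested, limit);
  return {v, v != requested};
}
```

In the real code the caller would presumably emit something like edm::LogWarning("Options") << ... when changed is true; that call and its wording are assumptions here.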
Done, I added the LogWarning. Thanks.
@@ -41,7 +41,9 @@ namespace edm {
 std::shared_ptr<EventSetupProvider> EventSetupsController::makeProvider(ParameterSet& iPSet,
                                                                         ActivityRegistry* activityRegistry,
-                                                                        ParameterSet const* eventSetupPset) {
+                                                                        ParameterSet const* eventSetupPset,
+                                                                        unsigned int nConcurrentLumis,
What if we change this to maxConcurrency instead? That better explains what is happening without forcing in the code at this point the policy that maxConcurrency == nConcurrentLumis.
I made some improvements here. The behavior is exactly the same, but hopefully it is more readable and understandable. Below are the changes in EventProcessor.cc with appropriate changes also in the lower level code (I'll push all these commits as soon as I finish all of them). I noticed we no longer need the separate function setMaxConcurrentIOVs since I moved the code that resets things if there is a looper.
+++ b/FWCore/Framework/src/EventProcessor.cc
@@ -414,6 +414,9 @@ namespace edm {
edm::LogInfo("ThreadStreamSetup") << "setting # threads " << nThreads << "\nsetting # streams " << nStreams;
}
}
+ // The number of concurrent IOVs is configured individually for each record in
+ // the class NumberOfConcurrentIOVs to values less than or equal to this.
+ unsigned int maxConcurrentIOVs = nConcurrentLumis;
//Check that relationships between threading parameters makes sense
/*
@@ -455,7 +458,7 @@ namespace edm {
// intialize the event setup provider
ParameterSet const& eventSetupPset(optionsPset.getUntrackedParameterSet("eventSetup"));
esp_ = espController_->makeProvider(
- *parameterSet, items.actReg_.get(), &eventSetupPset, nConcurrentLumis, dumpOptions);
+ *parameterSet, items.actReg_.get(), &eventSetupPset, maxConcurrentIOVs, dumpOptions);
// initialize the looper, if any
if (!loopers.empty()) {
@@ -466,7 +469,6 @@ namespace edm {
// in presence of looper do not delete modules
deleteNonConsumedUnscheduledModules_ = false;
}
- espController_->setMaxConcurrentIOVs(nStreams, nConcurrentLumis);
@@ -88,7 +88,9 @@ namespace edm {
 std::shared_ptr<EventSetupProvider> makeProvider(ParameterSet&,
                                                  ActivityRegistry*,
-                                                 ParameterSet const* eventSetupPset = nullptr);
+                                                 ParameterSet const* eventSetupPset = nullptr,
+                                                 unsigned int nConcurrentLumis = 0,
maxConcurrency here as well.
See comment above
@@ -16,11 +16,16 @@ namespace edm {
 NumberOfConcurrentIOVs::NumberOfConcurrentIOVs() : numberConcurrentIOVs_(1) {}
-void NumberOfConcurrentIOVs::readConfigurationParameters(ParameterSet const* eventSetupPset) {
+void NumberOfConcurrentIOVs::readConfigurationParameters(ParameterSet const* eventSetupPset,
+                                                         unsigned int nConcurrentLumis,
maxConcurrency here.
See comment above
FWCore/Framework/test/BuildFile.xml (outdated)
@@ -36,6 +36,11 @@
   <use name="FWCore/Utilities"/>
 </bin>
+<bin name="TestFWCoreFrameworkOptions" file="TestDriver.cpp">
There is scram support now for specifying running a .sh script as part of a test without doing the TestDriver.cpp approach. I suggest switching to that since it reduces the number of executables we need.
@smuzaffar I am not familiar with this (I've never seen it, or forgot...). Is there an example or documentation somewhere? Does it work the same way and set the same environment variables as the old way? Here is the script I want to run:
#!/bin/bash
function die { echo Failure $1: status $2 ; exit $2 ; }
pushd ${LOCAL_TMP_DIR}
echo testOptions1_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions1_cfg.py >& testOptions1.log || die "cmsRun testOptions1_cfg.py" $?
grep "Number of Streams = 1" testOptions1.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 1" testOptions1.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 1" testOptions1.log || die "Failed number of concurrent IOVs test" $?
echo testOptions2_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions2_cfg.py >& testOptions2.log || die "cmsRun testOptions2_cfg.py" $?
grep "Number of Streams = 5" testOptions2.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 4" testOptions2.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 3" testOptions2.log || die "Failed number of concurrent IOVs test" $?
echo testOptions3_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions3_cfg.py >& testOptions3.log || die "cmsRun testOptions3_cfg.py" $?
grep "Number of Streams = 6" testOptions3.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 2" testOptions3.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 2" testOptions3.log || die "Failed number of concurrent IOVs test" $?
echo testOptions4_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions4_cfg.py >& testOptions4.log || die "cmsRun testOptions4_cfg.py" $?
grep "Number of Streams = 6" testOptions4.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 6" testOptions4.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 6" testOptions4.log || die "Failed number of concurrent IOVs test" $?
echo testOptions5_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions5_cfg.py >& testOptions5.log || die "cmsRun testOptions5_cfg.py" $?
grep "Number of Streams = 1" testOptions5.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = 1" testOptions5.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = 1" testOptions5.log || die "Failed number of concurrent IOVs test" $?
echo testOptions6_cfg.py
cmsRun -p ${LOCAL_TEST_DIR}/testOptions6_cfg.py >& testOptions6.log || die "cmsRun testOptions6_cfg.py" $?
# Cannot run the grep tests because by default the options are not dumped.
# You can however run this manually with a debugger and check (which was done)
# And also just run it and see it doesn't crash...
rm testOptions1.log
rm testOptions2.log
rm testOptions3.log
rm testOptions4.log
rm testOptions5.log
rm testOptions6.log
popd
exit 0
Here:

cmssw/FWCore/SharedMemory/test/BuildFile.xml, lines 17 to 19 in 98e05a9:

<test name="testFWCoreSharedMemoryMonitorThreadSignals" command="test_monitor_thread_signals.sh">
  <flags PRE_TEST="testFWCoreSharedMemoryMonitorThread"/>
</test>

cmssw/FWCore/Framework/test/BuildFile.xml, lines 393 to 395 in 98e05a9:

<test name="testFWCoreFrameworkNonEventOrdering" command="test_non_event_ordering.sh"/>
<test name="testFWCoreFramework1ThreadESPrefetch" command="run_test_1_thread_es_prefetching.sh"/>
<test name="testFWCoreFrameworkModuleDeletion" command="run_module_delete_tests.sh"/>

cmssw/FWCore/Integration/test/BuildFile.xml, lines 447 to 451 in 98e05a9:

<test name="TestFWCoreIntegrationInterProcess" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/test_TestInterProcessProd_cfg.py"/>
<test name="TestFWCoreIntegrationPutOrMerge" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/putOrMergeTest_cfg.py"/>
<test name="TestFWCoreIntegrationInputSourceAlias" command="cmsRun ${LOCALTOP}//src/FWCore/Integration/test/inputSource_alias_Test_cfg.py"/>
@wddgit, it works nearly the same way (see examples of it at https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PythonAnalysis/test/BuildFile.xml), but it does not set the env variables set here: https://github.com/cms-sw/cmssw/blob/master/FWCore/Utilities/src/TestHelper.cc#L149-L164. Most of these you can calculate within the script itself. So basically you just need to use
<test name="TestFWCoreFrameworkOptions" command="your-script-in-test.sh args"/>
@wddgit, if the tests in your script do not depend on each other, then it is better to have a small script, e.g. runtest.sh:
echo $1
cmsRun -p ${CMSSW_BASE}/src/FWCore/Framework/test/$1 >& ${1}.log || die "cmsRun$1" $?
grep "Number of Streams = $2" ${1}.log || die "Failed number of streams test" $?
grep "Number of Concurrent Lumis = $3" ${1}.log || die "Failed number of concurrent lumis test" $?
grep "Number of Concurrent IOVs = $4" ${1}.log || die "Failed number of concurrent IOVs test" $?
and then add multiple tests e.g.
<test name="TestFWCoreFrameworkOptions1" command="runtest.sh testOptions1_cfg.py 1 1 1"/>
<test name="TestFWCoreFrameworkOptions2" command="runtest.sh testOptions2_cfg.py 5 4 3"/>
...
This way scram can run these in parallel. The only thing you need to make sure of is that these tests do not write to the same output file.
I mostly did what was suggested here. The main difference is instead of putting so much in the BuildFile I passed only a single index to identify the test and put the config file names and expected results in arrays in the bash script. It seemed to be too much to put expected results in a BuildFile.
Overall, this seems much better. Thanks.
)
process.options = dict (
    dumpOptions = True
)
You should be able to do process.options.dumpOptions = True
Done. It works. It is a little more concise. Thanks.
process.options = dict(
    dumpOptions = True,
    numberOfThreads = 6,
Should probably keep the # threads to 4 or less, as many IB VMs are only 4-threaded machines.
Done. I reduced the number of threads to 4 when it was greater than 4 in this file and the other configurations as well.
Just wondering. For each individual test I limited this to 4 concurrent threads. But does this resolve all the problems if scram is running multiple tests concurrently? I was wondering how that works. Does each test run on a different VM?
@@ -61,6 +73,8 @@ namespace edm {
   description.addUntracked<std::vector<std::string>>("canDeleteEarly", emptyVector)
       ->setComment("Branch names of products that the Framework can try to delete before the end of the Event");
+  description.addUntracked<bool>("dumpOptions", false);
Should setComment here as well.
Done.
LogAbsolute("Options") << "Number of Threads = " << nThreads;
LogAbsolute("Options") << "Number of Streams = " << nStreams;
LogAbsolute("Options") << "Number of Concurrent Lumis = " << nConcurrentLumis;
LogAbsolute("Options") << "Number of Concurrent Runs = " << nConcurrentRuns;
Suggested change:
LogAbsolute("Options") << "Number of Threads = " << nThreads
                       << "\nNumber of Streams = " << nStreams
                       << "\nNumber of Concurrent Lumis = " << nConcurrentLumis
                       << "\nNumber of Concurrent Runs = " << nConcurrentRuns;
This is much less work for the logging system.
Done. Thanks.
The unit tests are still testing the same things. Use the new scram method of running a bash script, use a more concise way to set parameters, renumber the config file names to start at 0, and keep the number of threads at 4 or less. Tests can run concurrently now. Deleted the last configuration as it was doing the same thing as the first.
Also formatted a message logger call in a more efficient way.
+1. The unit test fails in IBs and RelVals-INPUT are caused by timeouts; the comparison differences are in MessageLogger.
This pull request is fully signed and it will be integrated in one of the next master IBs (but tests are reportedly failing). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)
+1
merge |
Looks like this PR creates problems in HLT validation tests, from the IB onward where this PR was merged; see, e.g.: which shows several TSG tests failing with exit status: 134
@Martin-Grunewald wrote
The problem appears to be in the DQM: see https://cmssdt.cern.ch/SDT/jenkins-artifacts/HLT-Validation/CMSSW_12_0_X_2021-07-05-1100/slc7_amd64_gcc900/RelVal_RECO_PIon_MC.log where the relevant bits of the log are
and the traceback
Thanks @Martin-Grunewald. The failing tests are
They all fail with the following assertion
From https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/README.md#the-dqmstore it is a configuration option of [...]. None of these jobs print out a list of modules incompatible with concurrent lumis, so likely the assertion in [...].
@cms-sw/dqm-l2 How about setting [...]?
@smuzaffar @Martin-Grunewald It seems that failures in HLT validation jobs are not signaled in the IB dashboard. Could we change that so that any problems there would become visible (similar to how the existence of failures in FWLite tests is shown)?
That is correct @makortel. The tests are internally run by the HLT driver script; I guess I can just search for 'exit status: ' lines in the logs to mark it passed/failed. @Martin-Grunewald will this be enough, or is there anything else that would help identify the state of the test?
Indeed, look for that in the overall log file of the overall test (jenkins.log, runIB.log)!
Thanks @smuzaffar!
Hi, I have created a PR with this change: |
I was supposed to do this at the same time as cms-sw#35302 that followed cms-sw#34231 and the conclusion in cms-sw#33436
PR description:
Enable concurrent lumis and IOVs by default when number of streams is at least 2.
With the changes in this PR, the number of concurrent luminosity blocks will default to 2 if the number of streams is greater than 1. The number of concurrent IOVs will default to the number of concurrent luminosity blocks. These new defaults will apply if these parameters are not specified in the configuration or are specified to be zero. Previously, these parameters would default to 1.
If the number of concurrent luminosity blocks is greater than the number of streams, then it will be reset to the number of streams.
If the number of concurrent IOVs is greater than the number of concurrent luminosity blocks, then it will be reset to the number of concurrent luminosity blocks.
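The defaulting rules above can be sketched as a small stand-alone function (the names and the struct are illustrative, not the actual EventProcessor code; here a value of 0 means the parameter was not set in the configuration):

```cpp
#include <algorithm>

struct Concurrency {
  unsigned int streams;
  unsigned int lumis;
  unsigned int iovs;
};

// Illustrative sketch of the defaulting rules described in the PR description.
Concurrency resolveDefaults(unsigned int nStreams, unsigned int nLumis, unsigned int nIOVs) {
  // Concurrent lumis default to 2 when there is more than one stream.
  if (nLumis == 0)
    nLumis = (nStreams > 1) ? 2u : 1u;
  // Never more concurrent lumis than streams.
  nLumis = std::min(nLumis, nStreams);
  // Concurrent IOVs default to, and are capped at, the number of concurrent lumis.
  if (nIOVs == 0)
    nIOVs = nLumis;
  nIOVs = std::min(nIOVs, nLumis);
  return {nStreams, nLumis, nIOVs};
}
```

For example, under these rules a configuration with 4 streams and nothing else set would get 2 concurrent lumis and 2 concurrent IOVs, while a single-stream job keeps everything at 1.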
Merging this code will cause some existing configurations to start running processes with multiple concurrent luminosity blocks and IOVs. This is a significant change that will probably break some modules. I've tested that this PR works in the Framework code. While testing this PR, I did not test outside the Framework beyond the limited runTheMatrix tests. Concurrent luminosity blocks have been implemented in the Framework since 2018 and concurrent IOVs since 2019. There has been an ongoing effort since then to get code outside the Framework to support concurrent luminosity blocks and IOVs; multiple people have participated in this. runTheMatrix has been running processes with more than one concurrent luminosity block for a few months now, and known problems have been fixed. The plan of the Core group is that this PR represents the next step in this migration. It is expected that there may be failures uncovered when this is merged. The way forward is approving this PR so we can identify those failures and then be able to fix them.
PR validation:
Added new unit tests for the above. Fixed existing tests as necessary.