Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for EDAlias as a case in SwitchProducer #25907

Merged
merged 13 commits into from
Feb 20, 2019

Conversation

makortel
Copy link
Contributor

@makortel makortel commented Feb 11, 2019

In a heterogeneous workflow it may happen that the (legacy) CPU and non-CPU modules are naturally defined so differently that the final legacy CPU products are produced in a very asymmetric way (e.g.

  • CPU: producer A produces a1, a2, producer B produces `b1
  • non-CPU: producer D produces a1, producer C produces a2 and b1

). In order to be able to handle such cases flexibly and efficiently, this PR extends the SwitchProducer mechanism to allow EDAliases as cases (in addition to EDProducers). This required first some fixes to EDAliases:

  • Throw if an EDAlias defines two instances pointing to the same real product
    • Currently the OutputModule will complain if an alias and the alias-for branches are attempted to be kept. This check does not work with SwitchProducer, whose output branches are not exactly alias branches (even if they're very close to). The premise is that it does not make sense to have both the alias and alias-for branches.
  • If a job has an unscheduled EDProducer that registers a callback for new product registration such that the callback calls produces(), and that callback gets triggered by EDAlias (output) BranchDescriptions, the EDProducer BranchDescriptions have an incorrect on-demand status leading to a wrong ProductResolver and eventually a job to hang because of starvation in the TBB scheduler. This situation happens because the on-demand status for EDProducer BranchDescriptions is set before the EDAliases are processed in the Schedule. The fix is to move the setting of the on-demand status of BranchDescriptions to be done after the EDAlias output BranchDescriptions have been created.
    • The setting is also a bit more efficient now that it is done once instead of once per stream
  • Made the python EDAlias cloneable and to inherit from _Parameterizable
    • Latter reduces copy-paste and enables Modifier.toReplaceWith()

Adding the support for EDAlias cases in SwitchProducer required the following changes

  • SwitchProducer ProductResolvers can be set only after all EDAlias ProductResolvers have been set
  • Added a new (intermediate) base class DataManagingOrAliasProductResolver, from which the DataManaging, Alias, and SwitchBase ProductResolvers inherit from, in order to be able use both DataManaging and Alias ProductResolvers as the realProduct in SwitchBaseProductResolver
  • Added a special treatment for SwitchProducer in ProductSelector to detect the case when one attempts to keep in OutputModule both the real product, and the EDAlias-within-SwitchProducer. Similar check is already done for regular EDAliases (as mentioned above)
    • The same check does not work for SwitchProducers because there the BranchID of the SwitchProducer and the case product/alias are different, while in case of EDAlias the BranchID of the alias and the alias-for product as the same.
  • In python Process, separate SwitchProducers from the normal EDProducers. This change is needed in order to append elements to both @all_modules and @all_aliases vstrings when filling the ProcessDescription. This way the case-EDAlias can away from the Process.aliases_() (as the case-EDProducers are not in Process.producers_()).
  • As an additional sanity check, mark all case BranchDescriptions as transient in order to ignore them in OutputModule keep statements like keep *_foo*_*_*.
    • I also tried to "implicitly drop" all branches with '@' in ProductSelector, but that lead to SwitchProducer branches being dropped on inputs as they would be interpreted as branches depending on a dropped branch.

This PR also adds a generic mechanism to allow customized configuration entities to pass import statements to configuration dump. Related, currently the case-producer fragments end up in the configuration dump while they should not, this PR fixes that (as a byproduct).

Tested in CMSSW_10_5_X_2019-02-08-1100, no changes expected in current workflows.

Also add a test for the other kind of duplicate check
…cts are registered in a callback triggered by an EDAlias.

All BranchDescriptions have been created after processEDAliases(), so
let's move the call to BranchDescription::setOnDemand() there.
The clone() is needed to make SwitchProducer (that is clonable) to
support EDAliases as cases. Making EDAlias to inherit from
_Parameterizable reduces copy-paste.

Also add unit tests for using Modifier for EDAlias.
- In configuration need some special cases for EDAlias
- Need to process Switch aliases after all EDAliases in Principal,
  otherwise the aliased-for ProductResolvers would not necessarily exist
- Add DataManagingOrAliasProductResolver to be able to connect
  SwitchBaseProductResolver to AliasProductResolver
- Extend the duplicate product check in ProductSelector to disallow
  keeping both a real product, and EDAlias-within-SwitchProducer alias
  of the product
- Add a test to verify that having a producer and
  SwitchProducer-with-EDAlias pointing to that producer in the wrong
  oder in a Path leads to an exception
@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-25907/8386

  • This PR adds an extra 156KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @makortel (Matti Kortelainen) for master.

It involves the following packages:

DataFormats/Provenance
FWCore/Framework
FWCore/Integration
FWCore/ParameterSet

@cmsbuild, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@Martin-Grunewald, @rovere, @wddgit this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 11, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/33082/console Started: 2019/02/11 17:38

@cmsbuild
Copy link
Contributor

-1

Tested at: 194c177

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25907/33082/summary.html

I found follow errors while testing this PR

Failed tests: Build HeaderConsistency

  • Build:

I found compilation error when building:

Entering library rule at Mixing/Base
>> Compiling  /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_5_X_2019-02-11-1100/src/Mixing/Base/src/PileUp.cc 
>> Compiling  /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_5_X_2019-02-11-1100/src/Mixing/Base/src/BMixingModule.cc 
>> Compiling  /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_5_X_2019-02-11-1100/src/Mixing/Base/src/SecondaryEventProvider.cc 
/build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_5_X_2019-02-11-1100/src/Mixing/Base/src/SecondaryEventProvider.cc: In constructor 'edm::SecondaryEventProvider::SecondaryEventProvider(std::vector&, edm::ProductRegistry&, std::shared_ptr)':
/build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_5_X_2019-02-11-1100/src/Mixing/Base/src/SecondaryEventProvider.cc:27:22: error: 'class edm::WorkerManager' has no member named 'setOnDemandProducts'
       workerManager_.setOnDemandProducts(preg, unscheduledLabels);
                      ^~~~~~~~~~~~~~~~~~~
gmake: *** [tmp/slc7_amd64_gcc700/src/Mixing/Base/src/MixingBase/SecondaryEventProvider.cc.o] Error 1
>> Building shared library tmp/slc7_amd64_gcc700/src/Mixing/Base/src/MixingBase/libMixingBase.so
c++: error: tmp/slc7_amd64_gcc700/src/Mixing/Base/src/MixingBase/SecondaryEventProvider.cc.o: No such file or directory


@cmsbuild
Copy link
Contributor

Comparison not run due to Build errors (RelVals and Igprof tests were also skipped)

FWCore/Framework/src/Schedule.cc Outdated Show resolved Hide resolved
FWCore/Integration/test/run_TestEDAlias.sh Show resolved Hide resolved
@cmsbuild
Copy link
Contributor

cmsbuild commented Feb 19, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/33193/console Started: 2019/02/19 19:10

@civanch
Copy link
Contributor

civanch commented Feb 19, 2019

@makortel , for me it is a bit strange that this PR trigger problems of these tests. May it be that the modification makes the tests to be sensitive and they cannot work in parallel at the same node?

@makortel
Copy link
Contributor Author

@civanch I fully agree that the unit test failures are strange. But the failures are somewhat random (e.g. in #25907 (comment) only SiStripDAQ_O2O_test failed, and 250202.181 because of file open errors), occur also in other PRs, and are not reproducible in local tests.

Also @smuzaffar commented in #25917 (comment)

We have noticed that when PR include too many packages (e.g. in this case over 700) then unit tests misbehave and are mostly killed by timeout

@fabiocos
Copy link
Contributor

I confirm that in my local test build I manage to execute successfully the failing tests. Let's try again...

@cmsbuild
Copy link
Contributor

-1

Tested at: 95d0376

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25907/33193/summary.html

I found follow errors while testing this PR

Failed tests: UnitTests

  • Unit Tests:

I found errors in the following unit tests:

---> test testTauEmbeddingProducers had ERRORS
---> test runtestPhysicsToolsNanoAOD had ERRORS
---> test runtestPhysicsToolsPatAlgos had ERRORS
---> test runtestRecoEgammaPhotonIdentification had ERRORS
---> test runtestRecoEgammaElectronIdentification had ERRORS
---> test SiStripDAQ_O2O_test had ERRORS

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-25907/33193/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 3 differences found in the comparisons
  • DQMHistoTests: Total files compared: 32
  • DQMHistoTests: Total histograms compared: 3098286
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3098088
  • DQMHistoTests: Total skipped: 197
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 31 files compared)
  • Checked 133 log files, 14 edm output root files, 32 DQM output files

@boudoul
Copy link
Contributor

boudoul commented Feb 20, 2019

Dear all, please note that the PR #25736 is also experiencing strange and random unit test failures ....

@fabiocos
Copy link
Contributor

code-checks

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-25907/8490

  • This PR adds an extra 256KB to repository

@fabiocos
Copy link
Contributor

@civanch @mdhildreth as far as I can see, the change under the simulation umbrella in Mixing/Base is just a direct consequence of the other updates (the method used has been replace by another one). The short matrix tests involving PU do not show any difference, and the random failures in the unit tests are not explicitly related to the SIM part, and looks most likely an issue with the test ssyetm itself (under investigation).

In agreement with @smuzaffar I will move forward with the merge of this PR, so as we may verify in an IB whether we observe similar problems (none so far).

@fabiocos
Copy link
Contributor

+1

@fabiocos
Copy link
Contributor

merge

to verify in an IB environment whether the observed random crashes appear as well, or whether the problem is related to the PR test environment

@cmsbuild cmsbuild merged commit 4fd4ff5 into cms-sw:master Feb 20, 2019
@makortel
Copy link
Contributor Author

Seems that all unit tests indeed succeeded in the IB.

@fabiocos
Copy link
Contributor

@makortel indeed this points towards an issue in the test system...

@makortel makortel deleted the switchProducerAlias branch December 6, 2021 21:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants