Rivet ParticleLevelProducer and HTXS multi-threading #30888

mseidel42 · 2020-07-23T16:56:37Z

PR description:

Multi-threading in nanoaod step was broken due to new multi-thread feature in the recent Rivet update 3.1.2, see #30836

Seems to work fine when initialization is done for the first event of each thread in produce.

~~As a bonus, the producers run as edm::stream modules now.~~
Producers remain edm::one modules until Rivet becomes fully thread-safe.

PR validation:

Standalone tests of ParticleLevelProducer and HTXS successful with 4 threads.
Workflow 1302.18 with 2 threads in step5 (nanoaod)

cmsbuild · 2020-07-23T16:57:00Z

The code-checks are being triggered in jenkins.

cmsbuild · 2020-07-23T17:02:29Z

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17273

This PR adds an extra 20KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

code-format:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17273/code-format.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17273/code-format.patch | patch -p1
You can also run scram build code-format to apply code format directly

makortel · 2020-07-23T18:36:26Z

GeneratorInterface/RivetInterface/plugins/ParticleLevelProducer.cc

+    analysisHandler_ = new Rivet::AnalysisHandler();
+
+    analysisHandler_->setIgnoreBeams(true);
+    analysisHandler_->addAnalysis(rivetAnalysis_);


I see the ProjectionHandler::getInstance() is called in the constructor of ProjectionApplier, and Analysis inherits from ProjectionApplier, and our RivetAnalysis inherits from Analysis.

Does anything inside AnalysisHandler object (RivetAnalysis, Analysis, ProjectionApplier, or any helper class they use) rely that exactly the same ProjectionHandler object is used every time?

Or actually conversely, can it be "proven" that for a given AnalysisHandler object, it does not matter which ProjectionHandler object it uses (i.e. they could be cached in arbitrary way) as long as the same one is used for the whole duration of the AnalysisHandle::analyze() call? I.e. essentially the ProjectionApplier could be changed to replace ProjectionHandler member with always calling ProjectionHandler::getInstance()? (I'm not actually suggesting that, it's just an attempt to help to think about consequences)

I'm asking because we do not guarantee an instance of a edm::stream module to be run on any specific thread. A module instance could run on a different thread on each event than on the previous event.

On a further thought, I suspect this approach won't work in general with TBB. The approach would probably work if each instance of the edm::stream module would process the first event in a different thread wrt others because of this assignment in the constructor of ProjectionApplier
https://gitlab.com/hepcedar/rivet/-/blob/release-3-1-x/src/Core/ProjectionApplier.cc#L13

But we don't guarantee that. If two instances process the first event on the same thread, their RivetAnalysis objects will hold a reference to the same ProjectionHandler.

Hi Matti,

I think the ProjectionHandler objects are in principle interchangeable, they should just not be used at the same time.
If you have a patch to Rivet in mind I could try it out.

But we don't guarantee that. If two instances process the first event on the same thread, their RivetAnalysis objects will hold a reference to the same ProjectionHandler.

Is there any way to provoke that behavior for testing?

I modified this PR to re-init when the standard "FS" projection is not found. But I am not sure if that would solve the problem that you described.

Cheers,
Markus

Hi Markus,

I think the ProjectionHandler objects are in principle interchangeable, they should just not be used at the same time.

That's good.

If you have a patch to Rivet in mind I could try it out.

I don't, but to me the solution the authors outlined in https://gitlab.com/hepcedar/rivet/-/issues/84, i.e. each AnalysisHandler having its own ProjectionHandler object would be the best.

We don't need AnalysisHandler to be thread-safe by itself, avoiding any global state (which now occurs through ProjectionHandler::getInstance() for it to be thread-friendly would be sufficient.

Is there any way to provoke that behavior for testing?

With a minimal configuration of just the ParticleLevelProducer and whatever modules it requires it could be possible to hack along

create a dummy module that is fast on EDM stream 0, and for other streams spins in CPU for a long time (e.g. seconds to tens of seconds)

put that module and the ParticleLevelProducer in a Path in the order of cms.Path(spinningModule, particleLevelProducer)

if ParticleLevelProducer requires any input, modules producing those can be either in a Task or in the same Path before spinningModule

Then it could happen that many or all of the ParticleLevelProducer stream instances get run in the same thread (because spinningModule is keeping the other threads busy).

I modified this PR to re-init when the standard "FS" projection is not found. But I am not sure if that would solve the problem that you described.

I suspect it will not solve that particular problem. Or at least I'd assume if two RivetAnalysis's are initialized with the same ProjectionHandler, the hasProjection("FS") would return true for both of them.

Hi Matti,

I think the calls to getProjection (when a projection is applied) and hasProjection follow the same structure, see below.
Thus, the hasProjection should return false when the ProjectionHandler for a specific thread was not initialized (or initialized in another thread). In that case, the re-initialization should fix the problem.

Best,
Markus

This is the part of the code throwing the error:

const Projection& ProjectionHandler::getProjection(const ProjectionApplier& parent, const string& name) const { MSG_TRACE("Searching for child projection '" << name << "' of " << &parent); NamedProjsMap::const_iterator nps = _namedprojs.find(&parent); if (nps == _namedprojs.end()) { std::ostringstream msg; msg << "No projections registered for parent " << &parent; throw Error(msg.str()); }

called by ProjectionApplier/Analysis here:

/// Get the named projection, specifying return type via a template argument. /// @todo Add SFINAE to require that PROJ inherit from Projection template <typename PROJ> const PROJ& getProjection(const std::string& name) const { const Projection& p = getProjHandler().getProjection(*this, name); return pcast<PROJ>(p); }

The hasProjection goes through very similar code:

bool ProjectionHandler::hasProjection(const ProjectionApplier& parent, const string& name) const { MSG_TRACE("Searching for child projection '" << name << "' of " << &parent); NamedProjsMap::const_iterator nps = _namedprojs.find(&parent); if (nps == _namedprojs.end()) return false; NamedProjs::const_iterator np = nps->second.find(name); return !(np == nps->second.end()); }

called by this function in ProjectionApplier/Analysis:

/// Does this applier have a projection registered under the name @a name? bool hasProjection(const std::string& name) const { return getProjHandler().hasProjection(*this, name); }

I think the behavior breaks already with ProjectionApplier::getProjHandle()

/// Get a reference to the ProjectionHandler for this thread. ProjectionHandler& getProjHandler() const { return _projhandler; }

where _projhandler is initialized in the constructor

ProjectionApplier::ProjectionApplier() : _allowProjReg(true), _owned(false), _projhandler(ProjectionHandler::getInstance()) { }

Therefore, if two ProjectionApplier objects are created in the same thread, they get the same ProjectionHandler object. If these ProjectionApplier objects are later used in different threads, their getProjHandler() continue to return a reference to the same (shared) ProjectionHandler object, and the ProjectionHandler::hasProjection() will give the same answer to both (unless its internal state has already broken because of the data races).

Hm, and if we patch Rivet to use gettimeofday microseconds instead of the thread id? That should give as a unique ProjectionHandler for each stream.

/// Singleton creation function static std::mutex mtx; static ProjectionHandler& getInstance() { // static ProjectionHandler _instance; // return _instance; std::unique_lock<std::mutex> lock(mtx); static map<std::thread::id,ProjectionHandler> _instances; return _instances[std::this_thread::get_id()]; }

I would much rather see the "proper fix" suggested in https://gitlab.com/hepcedar/rivet/-/issues/84#note_341649903, i.e. each ProjectionApplier object just having its own PojectionHandler object. It is simpler and works correctly. If we go to patch Rivet, we could even contribute it to upstream and help to close that issue.

Ok, let's see if that can be done, I hope there is no major roadblock for that.

Otherwise, we may just make this a one module again.

cmsbuild · 2020-07-24T08:41:47Z

The code-checks are being triggered in jenkins.

cmsbuild · 2020-07-24T08:47:25Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17281

This PR adds an extra 20KB to repository

cmsbuild · 2020-07-24T08:47:47Z

A new Pull Request was created by @intrepid42 (Markus Seidel) for master.

It involves the following packages:

GeneratorInterface/RivetInterface

@SiewYan, @mkirsano, @cmsbuild, @GurpreetSinghChahal, @agrohsje, @alberto-sanchez, @qliphy can you please review it and eventually sign? Thanks.
@alberto-sanchez, @agrohsje, @mkirsano this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

cmsbuild · 2020-07-27T13:23:08Z

The code-checks are being triggered in jenkins.

cmsbuild · 2020-07-27T13:27:39Z

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17320

This PR adds an extra 20KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

code-checks:
https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17320/code-checks.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17320/code-checks.patch | patch -p1
You can also run scram build code-checks to apply code checks directly

cmsbuild · 2020-08-27T08:01:04Z

The code-checks are being triggered in jenkins.

cmsbuild · 2020-08-27T08:08:50Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17922

This PR adds an extra 20KB to repository

cmsbuild · 2020-08-27T08:09:16Z

Pull request #30888 was updated. @SiewYan, @mkirsano, @cmsbuild, @GurpreetSinghChahal, @agrohsje, @alberto-sanchez, @qliphy can you please check and sign again.

silviodonato · 2020-09-01T10:59:29Z

please test

cmsbuild · 2020-09-01T10:59:46Z

The tests are being triggered in jenkins.

CMSSW_11_2_X_2020-08-31-2300/slc7_amd64_gcc820: https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/9034/console Started: 2020/09/01 13:00

silviodonato · 2020-09-01T11:02:20Z

@cms-sw/generators-l2 @Dr15Jones do you have further comments?

agrohsje · 2020-09-01T11:06:20Z

No additional comments from my side. Thanks.
+1

agrohsje · 2020-09-01T11:28:10Z

+1

cmsbuild · 2020-09-01T11:28:33Z

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

cmsbuild · 2020-09-01T12:11:55Z

+1
Tested at: 313863a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c47a0e/9034/summary.html
CMSSW: CMSSW_11_2_X_2020-08-31-2300
SCRAM_ARCH: slc7_amd64_gcc820

cmsbuild · 2020-09-01T12:11:58Z

Comparison job queued.

cmsbuild · 2020-09-01T13:43:30Z

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c47a0e/9034/summary.html

Comparison Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 35
DQMHistoTests: Total histograms compared: 2609667
DQMHistoTests: Total failures: 1
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2609644
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
Checked 149 log files, 22 edm output root files, 35 DQM output files

qliphy · 2020-09-01T14:17:44Z

+1

Rivet ParticleLevelProducer and HTXS multi-threading

4b098c7

cmsbuild added this to the CMSSW_11_2_X milestone Jul 23, 2020

cmsbuild added code-checks-pending comparison-pending generators-pending orp-pending pending-signatures tests-pending labels Jul 23, 2020

mseidel42 mentioned this pull request Jul 23, 2020

Job failures in NANO step #30836

Open

cmsbuild added code-checks-rejected and removed code-checks-pending labels Jul 23, 2020

makortel reviewed Jul 23, 2020

View reviewed changes

mseidel42 added 2 commits July 24, 2020 10:39

Check if FS projection is registered, otherwise re-init

fe60c1a

Apply code-format

86e34d1

cmsbuild added code-checks-pending and removed code-checks-rejected labels Jul 24, 2020

cmsbuild added code-checks-approved and removed code-checks-pending labels Jul 24, 2020

Add reinit for HTXS, delete discarded pointers

03bf652

cmsbuild added code-checks-pending and removed code-checks-approved labels Jul 27, 2020

cmsbuild added code-checks-rejected and removed code-checks-pending labels Jul 27, 2020

cmsbuild added code-checks-approved and removed code-checks-pending labels Aug 27, 2020

cmsbuild added tests-started and removed tests-pending labels Sep 1, 2020

cmsbuild added fully-signed generators-approved and removed generators-pending pending-signatures labels Sep 1, 2020

cmsbuild added tests-approved and removed tests-started labels Sep 1, 2020

cmsbuild added comparison-available and removed comparison-pending labels Sep 1, 2020

cmsbuild added orp-approved and removed orp-pending labels Sep 1, 2020

cmsbuild merged commit 2f3e951 into cms-sw:master Sep 1, 2020

This was referenced Sep 3, 2020

Revert "Revert "Update to Rivet 3.1.2 and Yoda 1.8.3"" cms-sw/cmsdist#6209

Merged

Revert "Revert "CSJet update for fastjet-contrib 1.044"" #31355

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rivet ParticleLevelProducer and HTXS multi-threading #30888

Rivet ParticleLevelProducer and HTXS multi-threading #30888

mseidel42 commented Jul 23, 2020 •

edited

Loading

cmsbuild commented Jul 23, 2020

cmsbuild commented Jul 23, 2020

makortel Jul 23, 2020

makortel Jul 23, 2020

mseidel42 Jul 24, 2020

makortel Jul 24, 2020

mseidel42 Jul 27, 2020

makortel Jul 28, 2020

mseidel42 Jul 28, 2020

makortel Jul 28, 2020

mseidel42 Jul 28, 2020

cmsbuild commented Jul 24, 2020

cmsbuild commented Jul 24, 2020

cmsbuild commented Jul 24, 2020

cmsbuild commented Jul 27, 2020

cmsbuild commented Jul 27, 2020

cmsbuild commented Aug 27, 2020

cmsbuild commented Aug 27, 2020

cmsbuild commented Aug 27, 2020

silviodonato commented Sep 1, 2020

cmsbuild commented Sep 1, 2020 •

edited

Loading

silviodonato commented Sep 1, 2020

agrohsje commented Sep 1, 2020

agrohsje commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

qliphy commented Sep 1, 2020

Rivet ParticleLevelProducer and HTXS multi-threading #30888

Rivet ParticleLevelProducer and HTXS multi-threading #30888

Conversation

mseidel42 commented Jul 23, 2020 • edited Loading

PR description:

PR validation:

cmsbuild commented Jul 23, 2020

cmsbuild commented Jul 23, 2020

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cmsbuild commented Jul 24, 2020

cmsbuild commented Jul 24, 2020

cmsbuild commented Jul 24, 2020

cmsbuild commented Jul 27, 2020

cmsbuild commented Jul 27, 2020

cmsbuild commented Aug 27, 2020

cmsbuild commented Aug 27, 2020

cmsbuild commented Aug 27, 2020

silviodonato commented Sep 1, 2020

cmsbuild commented Sep 1, 2020 • edited Loading

silviodonato commented Sep 1, 2020

agrohsje commented Sep 1, 2020

agrohsje commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

cmsbuild commented Sep 1, 2020

qliphy commented Sep 1, 2020

mseidel42 commented Jul 23, 2020 •

edited

Loading

cmsbuild commented Sep 1, 2020 •

edited

Loading