Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rivet ParticleLevelProducer and HTXS multi-threading #30888

Merged
merged 9 commits into from
Sep 1, 2020

Conversation

mseidel42
Copy link
Contributor

@mseidel42 mseidel42 commented Jul 23, 2020

PR description:

Multi-threading in nanoaod step was broken due to new multi-thread feature in the recent Rivet update 3.1.2, see #30836

Seems to work fine when initialization is done for the first event of each thread in produce.

As a bonus, the producers run as edm::stream modules now.
Producers remain edm::one modules until Rivet becomes fully thread-safe.

PR validation:

  • Standalone tests of ParticleLevelProducer and HTXS successful with 4 threads.
  • Workflow 1302.18 with 2 threads in step5 (nanoaod)

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17273

  • This PR adds an extra 20KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

analysisHandler_ = new Rivet::AnalysisHandler();

analysisHandler_->setIgnoreBeams(true);
analysisHandler_->addAnalysis(rivetAnalysis_);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the ProjectionHandler::getInstance() is called in the constructor of ProjectionApplier, and Analysis inherits from ProjectionApplier, and our RivetAnalysis inherits from Analysis.

Does anything inside AnalysisHandler object (RivetAnalysis, Analysis, ProjectionApplier, or any helper class they use) rely that exactly the same ProjectionHandler object is used every time?

Or actually conversely, can it be "proven" that for a given AnalysisHandler object, it does not matter which ProjectionHandler object it uses (i.e. they could be cached in arbitrary way) as long as the same one is used for the whole duration of the AnalysisHandle::analyze() call? I.e. essentially the ProjectionApplier could be changed to replace ProjectionHandler member with always calling ProjectionHandler::getInstance()? (I'm not actually suggesting that, it's just an attempt to help to think about consequences)

I'm asking because we do not guarantee an instance of a edm::stream module to be run on any specific thread. A module instance could run on a different thread on each event than on the previous event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a further thought, I suspect this approach won't work in general with TBB. The approach would probably work if each instance of the edm::stream module would process the first event in a different thread wrt others because of this assignment in the constructor of ProjectionApplier
https://gitlab.com/hepcedar/rivet/-/blob/release-3-1-x/src/Core/ProjectionApplier.cc#L13

But we don't guarantee that. If two instances process the first event on the same thread, their RivetAnalysis objects will hold a reference to the same ProjectionHandler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matti,

I think the ProjectionHandler objects are in principle interchangeable, they should just not be used at the same time.
If you have a patch to Rivet in mind I could try it out.

But we don't guarantee that. If two instances process the first event on the same thread, their RivetAnalysis objects will hold a reference to the same ProjectionHandler.

Is there any way to provoke that behavior for testing?

I modified this PR to re-init when the standard "FS" projection is not found. But I am not sure if that would solve the problem that you described.

Cheers,
Markus

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Markus,

I think the ProjectionHandler objects are in principle interchangeable, they should just not be used at the same time.

That's good.

If you have a patch to Rivet in mind I could try it out.

I don't, but to me the solution the authors outlined in https://gitlab.com/hepcedar/rivet/-/issues/84, i.e. each AnalysisHandler having its own ProjectionHandler object would be the best.

We don't need AnalysisHandler to be thread-safe by itself, avoiding any global state (which now occurs through ProjectionHandler::getInstance() for it to be thread-friendly would be sufficient.

Is there any way to provoke that behavior for testing?

With a minimal configuration of just the ParticleLevelProducer and whatever modules it requires it could be possible to hack along

  • create a dummy module that is fast on EDM stream 0, and for other streams spins in CPU for a long time (e.g. seconds to tens of seconds)
  • put that module and the ParticleLevelProducer in a Path in the order of cms.Path(spinningModule, particleLevelProducer)
  • if ParticleLevelProducer requires any input, modules producing those can be either in a Task or in the same Path before spinningModule

Then it could happen that many or all of the ParticleLevelProducer stream instances get run in the same thread (because spinningModule is keeping the other threads busy).

I modified this PR to re-init when the standard "FS" projection is not found. But I am not sure if that would solve the problem that you described.

I suspect it will not solve that particular problem. Or at least I'd assume if two RivetAnalysis's are initialized with the same ProjectionHandler, the hasProjection("FS") would return true for both of them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Matti,

I think the calls to getProjection (when a projection is applied) and hasProjection follow the same structure, see below.
Thus, the hasProjection should return false when the ProjectionHandler for a specific thread was not initialized (or initialized in another thread). In that case, the re-initialization should fix the problem.

Best,
Markus

This is the part of the code throwing the error:

  const Projection& ProjectionHandler::getProjection(const ProjectionApplier& parent, const string& name) const {
    MSG_TRACE("Searching for child projection '" << name << "' of " << &parent);
    NamedProjsMap::const_iterator nps = _namedprojs.find(&parent);
    if (nps == _namedprojs.end()) {
      std::ostringstream msg;
      msg << "No projections registered for parent " << &parent;
      throw Error(msg.str());
    }

called by ProjectionApplier/Analysis here:

    /// Get the named projection, specifying return type via a template argument.
    /// @todo Add SFINAE to require that PROJ inherit from Projection
    template <typename PROJ>
    const PROJ& getProjection(const std::string& name) const {
      const Projection& p = getProjHandler().getProjection(*this, name);
      return pcast<PROJ>(p);
    }

The hasProjection goes through very similar code:

  bool ProjectionHandler::hasProjection(const ProjectionApplier& parent, const string& name) const {
    MSG_TRACE("Searching for child projection '" << name << "' of " << &parent);
    NamedProjsMap::const_iterator nps = _namedprojs.find(&parent);
    if (nps == _namedprojs.end()) return false;
    NamedProjs::const_iterator np = nps->second.find(name);
    return !(np == nps->second.end());
  }

called by this function in ProjectionApplier/Analysis:

    /// Does this applier have a projection registered under the name @a name?
    bool hasProjection(const std::string& name) const {
      return getProjHandler().hasProjection(*this, name);
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the behavior breaks already with ProjectionApplier::getProjHandle()

    /// Get a reference to the ProjectionHandler for this thread.
    ProjectionHandler& getProjHandler() const {
      return _projhandler;
    }

where _projhandler is initialized in the constructor

  ProjectionApplier::ProjectionApplier()
    : _allowProjReg(true), _owned(false),
      _projhandler(ProjectionHandler::getInstance())
  {  }

Therefore, if two ProjectionApplier objects are created in the same thread, they get the same ProjectionHandler object. If these ProjectionApplier objects are later used in different threads, their getProjHandler() continue to return a reference to the same (shared) ProjectionHandler object, and the ProjectionHandler::hasProjection() will give the same answer to both (unless its internal state has already broken because of the data races).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, and if we patch Rivet to use gettimeofday microseconds instead of the thread id? That should give as a unique ProjectionHandler for each stream.

    /// Singleton creation function
    static std::mutex mtx;
    static ProjectionHandler& getInstance() {
      // static ProjectionHandler _instance;
      // return _instance;
      std::unique_lock<std::mutex> lock(mtx);
      static map<std::thread::id,ProjectionHandler> _instances;
      return _instances[std::this_thread::get_id()];
      
    }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much rather see the "proper fix" suggested in https://gitlab.com/hepcedar/rivet/-/issues/84#note_341649903, i.e. each ProjectionApplier object just having its own PojectionHandler object. It is simpler and works correctly. If we go to patch Rivet, we could even contribute it to upstream and help to close that issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, let's see if that can be done, I hope there is no major roadblock for that.

Otherwise, we may just make this a one module again.

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17281

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @intrepid42 (Markus Seidel) for master.

It involves the following packages:

GeneratorInterface/RivetInterface

@SiewYan, @mkirsano, @cmsbuild, @GurpreetSinghChahal, @agrohsje, @alberto-sanchez, @qliphy can you please review it and eventually sign? Thanks.
@alberto-sanchez, @agrohsje, @mkirsano this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17320

  • This PR adds an extra 20KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17922

  • This PR adds an extra 20KB to repository

@cmsbuild
Copy link
Contributor

Pull request #30888 was updated. @SiewYan, @mkirsano, @cmsbuild, @GurpreetSinghChahal, @agrohsje, @alberto-sanchez, @qliphy can you please check and sign again.

@silviodonato
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2020

The tests are being triggered in jenkins.

@silviodonato
Copy link
Contributor

@cms-sw/generators-l2 @Dr15Jones do you have further comments?

@agrohsje
Copy link

agrohsje commented Sep 1, 2020

No additional comments from my side. Thanks.
+1

@agrohsje
Copy link

agrohsje commented Sep 1, 2020

+1

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2020

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2020

+1
Tested at: 313863a
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c47a0e/9034/summary.html
CMSSW: CMSSW_11_2_X_2020-08-31-2300
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2020

Comparison job queued.

@cmsbuild
Copy link
Contributor

cmsbuild commented Sep 1, 2020

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-c47a0e/9034/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 35
  • DQMHistoTests: Total histograms compared: 2609667
  • DQMHistoTests: Total failures: 1
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2609644
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 34 files compared)
  • Checked 149 log files, 22 edm output root files, 35 DQM output files

@qliphy
Copy link
Contributor

qliphy commented Sep 1, 2020

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants