-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rivet ParticleLevelProducer and HTXS multi-threading #30888
Rivet ParticleLevelProducer and HTXS multi-threading #30888
Conversation
The code-checks are being triggered in jenkins. |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17273
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
analysisHandler_ = new Rivet::AnalysisHandler(); | ||
|
||
analysisHandler_->setIgnoreBeams(true); | ||
analysisHandler_->addAnalysis(rivetAnalysis_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the ProjectionHandler::getInstance()
is called in the constructor of ProjectionApplier
, and Analysis
inherits from ProjectionApplier
, and our RivetAnalysis
inherits from Analysis
.
Does anything inside AnalysisHandler
object (RivetAnalysis
, Analysis
, ProjectionApplier
, or any helper class they use) rely that exactly the same ProjectionHandler
object is used every time?
Or actually conversely, can it be "proven" that for a given AnalysisHandler
object, it does not matter which ProjectionHandler
object it uses (i.e. they could be cached in arbitrary way) as long as the same one is used for the whole duration of the AnalysisHandle::analyze()
call? I.e. essentially the ProjectionApplier
could be changed to replace ProjectionHandler
member with always calling ProjectionHandler::getInstance()
? (I'm not actually suggesting that, it's just an attempt to help to think about consequences)
I'm asking because we do not guarantee an instance of a edm::stream
module to be run on any specific thread. A module instance could run on a different thread on each event than on the previous event.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a further thought, I suspect this approach won't work in general with TBB. The approach would probably work if each instance of the edm::stream
module would process the first event in a different thread wrt others because of this assignment in the constructor of ProjectionApplier
https://gitlab.com/hepcedar/rivet/-/blob/release-3-1-x/src/Core/ProjectionApplier.cc#L13
But we don't guarantee that. If two instances process the first event on the same thread, their RivetAnalysis
objects will hold a reference to the same ProjectionHandler
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Matti,
I think the ProjectionHandler
objects are in principle interchangeable, they should just not be used at the same time.
If you have a patch to Rivet in mind I could try it out.
But we don't guarantee that. If two instances process the first event on the same thread, their RivetAnalysis objects will hold a reference to the same ProjectionHandler.
Is there any way to provoke that behavior for testing?
I modified this PR to re-init when the standard "FS" projection is not found. But I am not sure if that would solve the problem that you described.
Cheers,
Markus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Markus,
I think the
ProjectionHandler
objects are in principle interchangeable, they should just not be used at the same time.
That's good.
If you have a patch to Rivet in mind I could try it out.
I don't, but to me the solution the authors outlined in https://gitlab.com/hepcedar/rivet/-/issues/84, i.e. each AnalysisHandler
having its own ProjectionHandler
object would be the best.
We don't need AnalysisHandler
to be thread-safe by itself, avoiding any global state (which now occurs through ProjectionHandler::getInstance()
for it to be thread-friendly would be sufficient.
Is there any way to provoke that behavior for testing?
With a minimal configuration of just the ParticleLevelProducer
and whatever modules it requires it could be possible to hack along
- create a dummy module that is fast on EDM stream 0, and for other streams spins in CPU for a long time (e.g. seconds to tens of seconds)
- put that module and the
ParticleLevelProducer
in a Path in the order ofcms.Path(spinningModule, particleLevelProducer)
- if ParticleLevelProducer requires any input, modules producing those can be either in a Task or in the same Path before
spinningModule
Then it could happen that many or all of the ParticleLevelProducer
stream instances get run in the same thread (because spinningModule
is keeping the other threads busy).
I modified this PR to re-init when the standard "FS" projection is not found. But I am not sure if that would solve the problem that you described.
I suspect it will not solve that particular problem. Or at least I'd assume if two RivetAnalysis
's are initialized with the same ProjectionHandler
, the hasProjection("FS")
would return true
for both of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Matti,
I think the calls to getProjection (when a projection is applied) and hasProjection follow the same structure, see below.
Thus, the hasProjection should return false when the ProjectionHandler for a specific thread was not initialized (or initialized in another thread). In that case, the re-initialization should fix the problem.
Best,
Markus
This is the part of the code throwing the error:
const Projection& ProjectionHandler::getProjection(const ProjectionApplier& parent, const string& name) const {
MSG_TRACE("Searching for child projection '" << name << "' of " << &parent);
NamedProjsMap::const_iterator nps = _namedprojs.find(&parent);
if (nps == _namedprojs.end()) {
std::ostringstream msg;
msg << "No projections registered for parent " << &parent;
throw Error(msg.str());
}
called by ProjectionApplier/Analysis here:
/// Get the named projection, specifying return type via a template argument.
/// @todo Add SFINAE to require that PROJ inherit from Projection
template <typename PROJ>
const PROJ& getProjection(const std::string& name) const {
const Projection& p = getProjHandler().getProjection(*this, name);
return pcast<PROJ>(p);
}
The hasProjection
goes through very similar code:
bool ProjectionHandler::hasProjection(const ProjectionApplier& parent, const string& name) const {
MSG_TRACE("Searching for child projection '" << name << "' of " << &parent);
NamedProjsMap::const_iterator nps = _namedprojs.find(&parent);
if (nps == _namedprojs.end()) return false;
NamedProjs::const_iterator np = nps->second.find(name);
return !(np == nps->second.end());
}
called by this function in ProjectionApplier/Analysis:
/// Does this applier have a projection registered under the name @a name?
bool hasProjection(const std::string& name) const {
return getProjHandler().hasProjection(*this, name);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the behavior breaks already with ProjectionApplier::getProjHandle()
/// Get a reference to the ProjectionHandler for this thread.
ProjectionHandler& getProjHandler() const {
return _projhandler;
}
where _projhandler
is initialized in the constructor
ProjectionApplier::ProjectionApplier()
: _allowProjReg(true), _owned(false),
_projhandler(ProjectionHandler::getInstance())
{ }
Therefore, if two ProjectionApplier
objects are created in the same thread, they get the same ProjectionHandler
object. If these ProjectionApplier
objects are later used in different threads, their getProjHandler()
continue to return a reference to the same (shared) ProjectionHandler
object, and the ProjectionHandler::hasProjection()
will give the same answer to both (unless its internal state has already broken because of the data races).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, and if we patch Rivet to use gettimeofday microseconds instead of the thread id? That should give as a unique ProjectionHandler for each stream.
/// Singleton creation function
static std::mutex mtx;
static ProjectionHandler& getInstance() {
// static ProjectionHandler _instance;
// return _instance;
std::unique_lock<std::mutex> lock(mtx);
static map<std::thread::id,ProjectionHandler> _instances;
return _instances[std::this_thread::get_id()];
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would much rather see the "proper fix" suggested in https://gitlab.com/hepcedar/rivet/-/issues/84#note_341649903, i.e. each ProjectionApplier
object just having its own PojectionHandler
object. It is simpler and works correctly. If we go to patch Rivet, we could even contribute it to upstream and help to close that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, let's see if that can be done, I hope there is no major roadblock for that.
Otherwise, we may just make this a one
module again.
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17281
|
A new Pull Request was created by @intrepid42 (Markus Seidel) for master. It involves the following packages: GeneratorInterface/RivetInterface @SiewYan, @mkirsano, @cmsbuild, @GurpreetSinghChahal, @agrohsje, @alberto-sanchez, @qliphy can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
The code-checks are being triggered in jenkins. |
-code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17320
Code check has found code style and quality issues which could be resolved by applying following patch(s)
|
The code-checks are being triggered in jenkins. |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-30888/17922
|
Pull request #30888 was updated. @SiewYan, @mkirsano, @cmsbuild, @GurpreetSinghChahal, @agrohsje, @alberto-sanchez, @qliphy can you please check and sign again. |
please test |
The tests are being triggered in jenkins.
|
@cms-sw/generators-l2 @Dr15Jones do you have further comments? |
No additional comments from my side. Thanks. |
+1 |
This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
Comparison job queued. |
Comparison is ready Comparison Summary:
|
+1 |
PR description:
Multi-threading in nanoaod step was broken due to new multi-thread feature in the recent Rivet update 3.1.2, see #30836
Seems to work fine when initialization is done for the first event of each thread in
produce
.As a bonus, the producers run as edm::stream modules now.Producers remain edm::one modules until Rivet becomes fully thread-safe.
PR validation: