[cudadev][RFC] Next version of CUDA EDModule API #224
Conversation
Add runAcquire(), runProduce() functions … in better-defined way
Here is a first round of comments.
I've been trying to keep track of how the CUDA device and stream information gets propagated through a chain of SynchronizingEDProducers, but I'm getting lost...
}

  // Internal API
  void commit();
TaskContext::commit() is never called; is it intentional?
Somewhat intentional: there is no code in the standalone that uses the "intermediate task" feature. I consider the TaskContext more of a sketch here that would be ironed out with proper tests when deploying in CMSSW.
OK, so it should be used in a way similar to the AcquireTask, with something along the lines of
template <typename F>
void runTask(edm::StreamID streamID, edm::WaitingTaskWithArenaHolder holder, F func) {
TaskContext context(streamID, std::move(holder));
func(context);
context.commit();
}
and then a module would call
// do not move the holder as it is reused
runTask(event.streamID(), holder, [&](auto& ctx) { firstTask(event, eventSetup, ctx); });
runTask(event.streamID(), holder, [&](auto& ctx) { secondTask(event, eventSetup, ctx); });
runTask(event.streamID(), holder, [&](auto& ctx) { lastTask(event, eventSetup, ctx); });
?
I'm not really satisfied with the API for the "intermediate tasks".
The runAcquire() version would be straightforward to extend to something along the lines of
runAcquire(event.streamID(), std::move(holder), [&](AcquireContext& ctx) {
...
}).then([&](TaskContext& ctx) {
// to be run after the asynchronous work queued in runAcquire() has finished
});
but then there would be no similar way to queue a subsequent task in the TaskContext lambda above (without adding runTask() boilerplate).
One alternative would be to keep the current-style API, i.e.
void FooProducer::acquire(AcquireContext& ctx) {
...
ctx.pushNextTask([](TaskContext& ctx2) {
// to be run after the asynchronous work queued in acquire() has finished
ctx2.pushNextTask(nextFunctor);
});
}
AcquireContext context(streamID, std::move(holder));
func(context);
context.commit();
Browsing the code, this seems to be the only place where an AcquireContext object is ever constructed. Given this semantic, would it make sense to move the call to commit() (or even its code) directly into the AcquireContext's destructor? Similarly for ProduceContext and TaskContext(?).
The separate commit() function is one of the main motivations for this development: it moves code that could throw an exception out of the destructor. If func() throws an exception, commit() is not called, which further implies
- when the exception is not CUDA related, it avoids queuing the callback function (and also recording the event, but that probably has lesser importance)
- when the exception is CUDA related, it avoids throwing a second exception from queuing the callback function / recording the event, which would lead to program termination
(I'm also thinking of renaming the function to something more descriptive, maybe finish() or enqueueCallback()/recordEvent())
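The exception-safety argument above can be sketched in isolation (ContextSketch and runTaskSketch below are hypothetical stand-ins, not the actual classes):

```cpp
// Illustrative sketch of the separate-commit() pattern described above.
// The throwing work lives in commit(), never in the destructor, so stack
// unwinding after an exception from func() cannot trigger a second throw
// (which would call std::terminate()).
struct ContextSketch {
  bool committed = false;
  void commit() {  // may throw (e.g. a CUDA error while queuing the callback)
    committed = true;
  }
  ~ContextSketch() noexcept {}  // nothing throwing happens here
};

template <typename F>
bool runTaskSketch(F func) {
  ContextSketch ctx;
  func(ctx);     // if this throws, commit() below is never reached
  ctx.commit();  // queue the callback only on the success path
  return ctx.committed;
}
```

With this shape, a CUDA-related exception from func() propagates cleanly, and no callback is queued for work that was aborted.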
#include "CUDACore/AcquireContext.h"

namespace cms::cuda {
  void AcquireContext::commit() { holderHelper_.enqueueCallback(stream()); }
Why is this defined in the .cc file, instead of in the .h file like the other methods?
I can't think of any good reason now, so the definition could be moved to the header.
 * - setting the current device
 * - calling edm::WaitingTaskWithArenaHolder::doneWaiting() when necessary
 * - synchronizing between CUDA streams if necessary
 * Users should not, however, construct it explicitly.
Given this comment, should the constructor be private, and have as friends the places that can construct objects of this type?
Yes, I would attempt making the constructors private and adding friend declarations.
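A minimal sketch of that pattern (AcquireContextSketch and runAcquireSketch are illustrative names, not the actual classes): the context's constructor is private, and only the framework-side entry point is befriended.

```cpp
// Forward-declare the framework entry point so it can be named as a friend.
template <typename F>
int runAcquireSketch(int streamID, F func);

class AcquireContextSketch {
public:
  int streamID() const { return streamID_; }

private:
  explicit AcquireContextSketch(int streamID) : streamID_(streamID) {}  // users cannot construct this

  // only the framework entry point may construct a context
  template <typename F>
  friend int runAcquireSketch(int streamID, F func);

  int streamID_;
};

template <typename F>
int runAcquireSketch(int streamID, F func) {
  AcquireContextSketch ctx(streamID);  // OK: runAcquireSketch is a friend
  func(ctx);
  return ctx.streamID();
}
```

User code can still receive the context by reference inside the lambda; it just cannot create one on its own (`AcquireContextSketch direct{3};` would fail to compile).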
void SynchronizingEDProducer::acquire(edm::Event const& event,
                                      edm::EventSetup const& eventSetup,
                                      edm::WaitingTaskWithArenaHolder holder) {
  runAcquire(event.streamID(), std::move(holder), [&](auto& ctx) { acquire(event, eventSetup, ctx); });
I understand the encapsulation, though personally I would find it simpler to avoid the lambda:
Suggested change:

- runAcquire(event.streamID(), std::move(holder), [&](auto& ctx) { acquire(event, eventSetup, ctx); });
+ AcquireContext context(event.streamID(), std::move(holder));
+ acquire(event, eventSetup, context);
+ context.commit(); // could be moved into the AcquireContext destructor
OK, so this is intended to support the first point from cms-sw/cmssw#30266 (comment) .
Correct.
namespace impl {
  class FwkContextHolderHelper {
  public:
    FwkContextHolderHelper(edm::WaitingTaskWithArenaHolder waitingTaskHolder, int device)
        : waitingTaskHolder_{std::move(waitingTaskHolder)}, device_{device} {}

    template <typename F>
    void pushNextTask(F&& f);

    void replaceWaitingTaskHolder(edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
      waitingTaskHolder_ = std::move(waitingTaskHolder);
    }

    void enqueueCallback(cudaStream_t stream);

  private:
    edm::WaitingTaskWithArenaHolder waitingTaskHolder_;
    int device_;
  };
}
Isn't this duplicated from CUDACore/FwkContextHolderHelper.h?
It is, thanks for catching.
/**
 * The aim of this class is to do necessary per-event "initialization" in ExternalWork acquire():
 * - setting the current device
 * - calling edm::WaitingTaskWithArenaHolder::doneWaiting() when necessary
@makortel how is doneWaiting() called? Is it commit() that calls holderHelper_.enqueueCallback() that queues cudaContextCallback() that calls doneWaiting()?
> Is it commit() that calls holderHelper_.enqueueCallback() that queues cudaContextCallback() that calls doneWaiting()?

Correct. (Similar to how now ~ScopedContextAcquire() calls holderHelper_.enqueueCallback(), which calls cudaScopedContextCallback(), which calls doneWaiting().)
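That chain can be modeled without any CUDA (all names below are simplified stand-ins for the classes discussed above, with a plain callback queue standing in for the CUDA stream):

```cpp
#include <functional>
#include <vector>

// Stand-in for edm::WaitingTaskWithArenaHolder: doneWaiting() tells the
// framework that the asynchronous work has finished.
struct HolderSketch {
  bool done = false;
  void doneWaiting() { done = true; }
};

// Stand-in for a CUDA stream: callbacks queued behind the device work,
// executed when the "stream" reaches them.
struct StreamSketch {
  std::vector<std::function<void()>> callbacks;
  void run() {  // models the stream reaching the host callback
    for (auto& cb : callbacks) cb();
    callbacks.clear();
  }
};

// Stand-in for the context: commit() -> enqueueCallback() -> (stream runs)
// callback -> doneWaiting().
struct CallbackChainSketch {
  HolderSketch* holder;
  void enqueueCallback(StreamSketch& stream) {
    stream.callbacks.push_back([h = holder] { h->doneWaiting(); });
  }
  void commit(StreamSketch& stream) { enqueueCallback(stream); }
};
```

The key property this models: commit() only *queues* the notification; doneWaiting() fires later, once the stream has drained the asynchronous work ahead of the callback.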
 * - synchronizing between CUDA streams if necessary
 * and enforce that those get done in a proper way in RAII fashion.
 */
class AnalyzeContext : public impl::EDGetterContextBase {
AnalyzeContext is not used anywhere, but I assume that's because we don't have any EDAnalyzers.
Correct. It's mainly a copy-over from CMSSW (where there are tests using it), so I consider it more of a sketch here.
@makortel I understand the problem, and the implementation in the PR should indeed work.
    ss << "\n N(clusters) is " << clusters.nClusters() << " expected " << count.nClusters();
    ok = false;
  }
  cms::cuda::runProduce(iEvent.streamID(), [&](cms::cuda::ProduceContext& ctx) {
Why in this case do you run this explicitly, instead of using a cms::cuda::EDProducer?
I suppose it could, even if it doesn't do any CUDA work.
True.
With the proposed approach, what is the recommended pattern for an EDProducer that reads a CUDA product and produces a CPU product?
That would preferably be inheriting from cms::cuda::SynchronizingEDProducer, or, if that doesn't work out for some (likely complicated) reason, calling cms::cuda::runAcquire().
(Sorry, I sent the earlier comment without thinking it through.)
@@ -36,10 +33,7 @@ SiPixelDigisSoAFromCUDA::SiPixelDigisSoAFromCUDA(edm::ProductRegistry& reg)

void SiPixelDigisSoAFromCUDA::acquire(const edm::Event& iEvent,
                                      const edm::EventSetup& iSetup,
                                      edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
  // Do the transfer in a CUDA stream parallel to the computation CUDA stream
I didn't remember this comment, but re-reading it now, does it mean that the old code was trying to use two streams in parallel (hopefully for two independent operations)?
Yes. E.g. in the following DAG, where all modules launch CUDA work:

     /- B
A <--
     \- C

B and C consume the product(s) of A; one of B and C (the "first one") re-uses the CUDA stream of A, and the other gets another CUDA stream for its work. The other CUDA stream is synchronized to the completion of A's work with cudaStreamWaitEvent().
Yes, this is the general approach.
I was referring to the (deleted) comment that says

> Do the transfer in a CUDA stream parallel to the computation CUDA stream

which seemed to imply something like

A ----- --> B
      \
       `- --> T

A: first computation
B: second computation that uses the output of A
T: transfer the results of A in parallel to running B

I didn't look at any code, etc. - I just read this old comment and found it confusing.
Ah, thanks for the clarifications. This is certainly an old comment (in the original CMSSW code), and probably refers to the fact that with ScopedContextAcquire ctx{iEvent.streamID(), ...} this module gets a separate CUDA stream from the stream used to produce the input data product.
- auto const& ptracks = iEvent.get(tokenGPUTrack_);
- cms::cuda::ScopedContextProduce ctx{ptracks};
- auto const* tracks = ctx.get(ptracks).get();
+ cms::cuda::runProduce(iEvent.streamID(), [&](cms::cuda::ProduceContext& ctx) {
+   auto const* tracks = ctx.get(iEvent, tokenGPUTrack_).get();
    assert(tracks);
    ctx.emplace(iEvent, tokenGPUVertex_, gpuAlgo_.makeAsync(ctx.stream(), tracks, ptMin_));
+ });
Mhm, is this (where the user calls runProduce explicitly) a pattern we want to support?
Answered in #224 (comment)
Final comments: overall, I like the simplification this brings to the EDProducers written by the users. I don't much like the option of calling … Finally, I still do not understand how the context information gets propagated from the …
And that is because it is not propagated by design.
(Replying also here for future reference.) As we discussed privately, this is to experiment with point 2 in cms-sw/cmssw#30266 (comment).
I fully agree the … but the …
Yeah, it is annoying at minimum.
This PR should not add any additional atomic flags. If you refer to … I'm also wondering if we should really measure the benefit of these "asynchronously filled products", since supporting them appears to lead to various complications (still manageable, though; but to see what we really gain).
Yes - sorry, I meant "the implications of using the atomic flag in the new approach vs the old approach".
About the …

std::unique_ptr<cms::cuda::ESProduct<PixelCPEFast>> PixelCPEFastESProducer::produce(TkPixelCPERecord const& iRecord) {
  // all non-CUDA operations
  return cms::cuda::runForHost([&](cms::cuda::HostContext& ctx) {
    // operations to allocate (pinned) host memory that needs to live until the transfers to all devices have completed
    // HostContext provides enough "context" for pinned host memory allocations but nothing more
    // for use cases that do not need such "temporary (pinned) host memory", there would be a direct runForEachDevice() function
    return SomeTemporaryObjectCouldBeTuple(...);
  }).forEachDevice([](SomeTemporaryObjectCouldBeTuple const&, cms::cuda::ESContext& ctx) {
    // operations to transfer data and/or run kernels on one device
    // ESContext provides "context" for device memory allocations and running kernels
    return PixelCPEFast(...);
  });
}

Note that the … I have an earlier prototype of that, and should rebase it and make a separate PR (although there would also be a third PR, touching the user-facing memory allocation APIs, that sits between this PR and the ESProducer PR). It might be clearest to leave further discussion of the ESProducer API for that PR (I'll try to open it as soon as I can).
This prototype is now in #257. |
Made effectively obsolete by cms-sw/cmssw#39428 |
This PR provides a prototype of cms-sw/cmssw#30266 (comment).

In addition to interface changes, it fixes an issue with EDProducers that produce multiple CUDA products (I don't think we have any such EDProducers yet, though). Currently the cms::cuda::Product<T> wrapper stores the same CUDA stream for each of such products, and since those cms::cuda::SharedStreamPtr are stored separately, two (or more) consumers of those products can submit their work into the same CUDA stream, whereas the intention was that only one consumer of any of the products of an EDProducer could use the same CUDA stream. This PR solves this by adding an additional layer of indirection: the cms::cuda::Product<T> wrappers hold a shared_ptr to a helper object that contains the SharedStreamPtr and the logic for deciding which one of the consumers can share the CUDA stream.
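The indirection could look roughly like the following sketch (StreamSharingHelper and ProductSketch are hypothetical names, and a plain shared_ptr stands in for cms::cuda::SharedStreamPtr): every product of one EDProducer holds the same helper, and the helper's atomic flag hands the producer's stream to exactly one consumer.

```cpp
#include <atomic>
#include <memory>

// Hypothetical sketch of the helper-object indirection described above.
struct StreamSharingHelper {
  std::shared_ptr<int> stream;            // stand-in for cms::cuda::SharedStreamPtr
  std::atomic<bool> streamReused{false};  // set by the first consumer

  // Exactly one caller across all products/consumers of the producer gets
  // 'true' and may submit work to the producer's CUDA stream; the rest must
  // take a new stream (synchronized with cudaStreamWaitEvent() in the real code).
  bool tryReuseStream() { return !streamReused.exchange(true); }
};

template <typename T>
struct ProductSketch {
  T data;
  // every product of the same EDProducer holds the same helper object
  std::shared_ptr<StreamSharingHelper> helper;
};
```

With this layout, two consumers reading two *different* products of the same producer still contend on the single flag inside the shared helper, so only one of them reuses the stream, which is exactly the invariant the PR description says was broken before.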