
[cudadev][RFC] Next version of CUDA EDModule API #224

Closed
wants to merge 9 commits into from

Conversation

makortel
Collaborator

This PR provides a prototype of cms-sw/cmssw#30266 (comment).

In addition to interface changes, it fixes an issue with EDProducers that produce multiple CUDA products (I don't think we have any such EDProducers yet, though). Currently the cms::cuda::Product&lt;T&gt; wrapper stores the same CUDA stream for each such product, and since those cms::cuda::SharedStreamPtr are stored separately, two (or more) consumers of those products can submit their work into the same CUDA stream, whereas the intention was that only one consumer of any of the products of an EDProducer could use that CUDA stream. This PR solves this by adding an additional layer of indirection: the cms::cuda::Product&lt;T&gt; wrappers hold a shared_ptr to a helper object that contains the SharedStreamPtr and the logic for deciding which one of the consumers may share the CUDA stream.

Contributor

@fwyzard fwyzard left a comment

Here is a first round of comments.

I've been trying to keep track of how the CUDA device and stream information gets propagated through a chain of SynchronizingEDProducers, but I'm getting lost...

}

// Internal API
void commit();
Contributor

TaskContext::commit() is never called; is it intentional ?

Collaborator Author

Somewhat intentional; there is no code in the standalone setup that uses the "intermediate task" feature. I consider the TaskContext more of a sketch here that would be ironed out with proper tests when deploying in CMSSW.

Contributor

OK, so it should be used in a way similar to the AcquireTask, with something along the lines of

  template <typename F>
  void runTask(edm::StreamID streamID, edm::WaitingTaskWithArenaHolder holder, F func) {
    TaskContext context(streamID, std::move(holder));
    func(context);
    context.commit();
  } 

and then a module would call

// do not move the holder as it is reused
runTask(event.streamID(), holder, [&](auto& ctx) { firstTask(event, eventSetup, ctx); });
runTask(event.streamID(), holder, [&](auto& ctx) { secondTask(event, eventSetup, ctx); });
runTask(event.streamID(), holder, [&](auto& ctx) { lastTask(event, eventSetup, ctx); });

?

Collaborator Author

I'm not really satisfied with the API for the "intermediate tasks".

The runAcquire() version would be straightforward to extend to something along the lines of

runAcquire(event.streamID(), std::move(holder), [&](AcquireContext& ctx) {
  ...
}).then([&](TaskContext& ctx) {
  // to be run after the asynchronous work queued in runAcquire() has finished
});

but then there would be no similar way to queue a subsequent task in the TaskContext-lambda above (without adding runTask() boilerplate).

One alternative would be to keep the current-style API, i.e.

void FooProducer::acquire(AcquireContext& ctx) {
  ...
  ctx.pushNextTask([](TaskContext& ctx2) {
    // to be run after the asynchronous work queued in acquire() has finished
    ctx2.pushNextTask(nextFunctor);
  });
}

Comment on lines +39 to +41
AcquireContext context(streamID, std::move(holder));
func(context);
context.commit();
Contributor

Browsing the code, this seems the only place where an AcquireContext object is ever constructed.
Given this semantic, would it make sense to move the call to commit() (or even its code) directly into the AcquireContext's destructor ?

Similarly for ProduceContext and TaskContext (?).

Collaborator Author

The separate commit() function is one of the main motivations for this development: it moves code that could throw an exception out of the destructor. If func() throws an exception, commit() is not called, which further implies

  • when the exception is not CUDA related, avoids queuing the callback function (also recording event but that has probably lesser importance)
  • when the exception is CUDA related, avoids throwing a second exception from queuing the callback function / recording event that would lead to program termination

(I'm also thinking of renaming the function to something more descriptive, maybe finish() or enqueueCallback()/recordEvent())

#include "CUDACore/AcquireContext.h"

namespace cms::cuda {
void AcquireContext::commit() { holderHelper_.enqueueCallback(stream()); }
Contributor

Why is this defined in the .cc file, instead of the .h file like the other methods ?

Collaborator Author

I can't think of any good reason now, so the definition could be moved to the header.

* - setting the current device
* - calling edm::WaitingTaskWithArenaHolder::doneWaiting() when necessary
* - synchronizing between CUDA streams if necessary
* Users should not, however, construct it explicitly.
Contributor

Given this comment, should the constructor be private, and have as friends the places that can construct objects of this type ?

Collaborator Author

Yes, I would attempt making constructors private and adding friend declarations.

void SynchronizingEDProducer::acquire(edm::Event const& event,
edm::EventSetup const& eventSetup,
edm::WaitingTaskWithArenaHolder holder) {
runAcquire(event.streamID(), std::move(holder), [&](auto& ctx) { acquire(event, eventSetup, ctx); });
Contributor

I understand the encapsulation, though personally I would find it simpler to avoid the lambda:

Suggested change
runAcquire(event.streamID(), std::move(holder), [&](auto& ctx) { acquire(event, eventSetup, ctx); });
AcquireContext context(event.streamID(), std::move(holder));
acquire(event, eventSetup, context);
context.commit(); // could be moved into the AcquireContext destructor

Contributor

OK, so this is intended to support the first point from cms-sw/cmssw#30266 (comment) .

Collaborator Author

Correct.

Comment on lines +8 to +27
namespace impl {
class FwkContextHolderHelper {
public:
FwkContextHolderHelper(edm::WaitingTaskWithArenaHolder waitingTaskHolder, int device)
: waitingTaskHolder_{std::move(waitingTaskHolder)}, device_{device} {}

template <typename F>
void pushNextTask(F&& f);

void replaceWaitingTaskHolder(edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
waitingTaskHolder_ = std::move(waitingTaskHolder);
}

void enqueueCallback(cudaStream_t stream);

private:
edm::WaitingTaskWithArenaHolder waitingTaskHolder_;
int device_;
};
}
Contributor

Isn't this duplicated from CUDACore/FwkContextHolderHelper.h ?

Collaborator Author

It is, thanks for catching.

/**
* The aim of this class is to do necessary per-event "initialization" in ExternalWork acquire():
* - setting the current device
* - calling edm::WaitingTaskWithArenaHolder::doneWaiting() when necessary
Contributor

@makortel how is doneWaiting() called ?

Is it commit() that calls holderHelper_.enqueueCallback() that queues cudaContextCallback() that calls doneWaiting() ?

Collaborator Author

Is it commit() that calls holderHelper_.enqueueCallback() that queues cudaContextCallback() that calls doneWaiting() ?

Correct. (Similar to how currently ~ScopedContextAcquire() calls holderHelper_.enqueueCallback(), which calls cudaScopedContextCallback(), which calls doneWaiting().)

* - synchronizing between CUDA streams if necessary
* and enforce that those get done in a proper way in RAII fashion.
*/
class AnalyzeContext : public impl::EDGetterContextBase {
Contributor

AnalyzeContext is not used anywhere, but I assume that's because we don't have any EDAnalyzers.

Collaborator Author

Correct. It's mainly a copy-over from CMSSW (where there are tests using it), so I consider it more of a sketch here.

@fwyzard
Contributor

fwyzard commented Oct 25, 2021

This PR solves this by adding additional layer of indirection, i.e. the cms::cuda::Product<T> wrappers hold a shared_ptr into a helper object that contains the SharedStreamPtr and the logic of deciding which one of the customers can share the CUDA stream.

@makortel I understand the problem, and the implementation in the PR should indeed work.
However, I'm concerned about the use of what is effectively a shared_ptr&lt;shared_ptr&lt;cudaStreamOpaqueObject&gt;&gt;, and I'm still trying to think through the implications of the additional atomic flag.

ss << "\n N(clusters) is " << clusters.nClusters() << " expected " << count.nClusters();
ok = false;
}
cms::cuda::runProduce(iEvent.streamID(), [&](cms::cuda::ProduceContext& ctx) {
Contributor

Why in this case you run this explicitly, instead of using a cms::cuda::EDProducer ?

Collaborator Author

I suppose it could, even if it doesn't do any CUDA work.

Contributor

True.
With the proposed approach, what is the recommended pattern for an EDProducer that reads a CUDA product and produces a CPU product ?

Collaborator Author

That would be preferably inheriting from cms::cuda::SynchronizingEDProducer, or if that doesn't work out for some (likely complicated) reason, call cms::cuda::runAcquire().

(sorry, sent the earlier comment without thinking through)

@@ -36,10 +33,7 @@ SiPixelDigisSoAFromCUDA::SiPixelDigisSoAFromCUDA(edm::ProductRegistry& reg)

void SiPixelDigisSoAFromCUDA::acquire(const edm::Event& iEvent,
const edm::EventSetup& iSetup,
edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
// Do the transfer in a CUDA stream parallel to the computation CUDA stream
Contributor

I didn't remember this comment, but re-reading it now, does it mean that the old code was trying to use two streams in parallel (hopefully for two independent operations) ?

Collaborator Author

Yes. E.g. in the following DAG where all modules launch CUDA work

     /- B
A <--
     \- C

where B and C consume the product(s) of A, one of B and C (the "first" one) re-uses the CUDA stream of A, and the other gets another CUDA stream for its work. That other CUDA stream is synchronized with the completion of A's work with cudaStreamWaitEvent().

Contributor

Yes, this is the general approach.

I was referring to the (deleted) comment that says

Do the transfer in a CUDA stream parallel to the computation CUDA stream

which seemed to imply something like

A ------> B
  \
   `----> T

A: first computation
B: second computation that uses the output of A
T: transfer the results of A in parallel to running B

I didn't look at any code, etc. - I just read this old comment, and found it confusing.

Collaborator Author

Ah, thanks for the clarifications. This is certainly an old comment (in the original CMSSW code), and probably refers to the fact that with ScopedContextAcquire ctx{iEvent.streamID(), ...} this module gets a separate CUDA stream than the stream used to produce the input data product.

Comment on lines -61 to +67
-  auto const& ptracks = iEvent.get(tokenGPUTrack_);
-  cms::cuda::ScopedContextProduce ctx{ptracks};
-  auto const* tracks = ctx.get(ptracks).get();
-  assert(tracks);
-  ctx.emplace(iEvent, tokenGPUVertex_, gpuAlgo_.makeAsync(ctx.stream(), tracks, ptMin_));
+  cms::cuda::runProduce(iEvent.streamID(), [&](cms::cuda::ProduceContext& ctx) {
+    auto const* tracks = ctx.get(iEvent, tokenGPUTrack_).get();
+    assert(tracks);
+    ctx.emplace(iEvent, tokenGPUVertex_, gpuAlgo_.makeAsync(ctx.stream(), tracks, ptMin_));
+  });
Contributor

Mhm, is this (where the user calls runProduce explicitly) a pattern we want to support ?

Collaborator Author

Answered in #224 (comment)

@fwyzard
Contributor

fwyzard commented Oct 25, 2021

Final comments: overall, I like the simplification this brings to the EDProducers written by the users.

I don't like much the option of calling runProduce() explicitly; for the case of PixelVertexProducerCUDA, I'd rather split the EDProducer into a CPU version and a GPU version; for the case of CountValidator I did not understand why it is needed.

Finally, I still do not understand how the context information gets propagated from the acquire to the produce :-(

@fwyzard
Contributor

fwyzard commented Oct 25, 2021

Finally, I still do not understand how the context information gets propagated from the acquire to the produce :-(

And that is because it is not propagated by design.

@makortel
Collaborator Author

I've been trying to keep track of how the CUDA device and stream information gets propagated through a chain of SynchronizingEDProducers, but I'm getting lost...

(replying also here for future reference) As we discussed privately, this is to experiment with point 2 in cms-sw/cmssw#30266 (comment)

@makortel
Collaborator Author

I don't like much the option of calling runProduce() explicitly; for the case of PixelVertexProducerCUDA, I'd rather split the EDProducer into a CPU version and a GPU version;

I fully agree that PixelVertexProducerCUDA would need to be split into CPU and GPU versions. There may be other cases where such a split would be impractical (e.g. MixingModule when one digitizer would use the GPU), for which I'd actually prefer to keep PixelVertexProducerCUDA here as it is. (CMSSW can be a different story, and going to Alpaka would force that anyway.)

for the case of CountValidator I did not understand why it is needed.

CountValidator reads three numbers that are in host memory, but those are in products that are wrapped in cms::cuda::Product&lt;T&gt;. The T can be obtained from cms::cuda::Product&lt;T&gt; only via a *Context (to guarantee proper synchronization), and in this design a *Context can be obtained only by deriving from a base class or by calling e.g. runProduce(). I'd say this specific use is a shortcut for a prototype program that should not really be used in production (CMSSW). Ok, the necessary "converter modules" could be added here too.

@fwyzard
Contributor

fwyzard commented Oct 25, 2021

CountValidator reads three numbers that are in the host memory,

but the CountValidator case could also be implemented by inheriting from cms::cuda::EDProducer instead, right?

@makortel
Collaborator Author

However I'm concerned about the use of what effectively is a shared_ptr<shared_ptr<cudaStreamOpaqueObject>>,

Yeah, it is annoying at minimum.

and I'm still trying to think through the implications of the additional atomic flag.

This PR should not add any additional atomic flags. If you refer to StreamSharingHelper::mayReuseStream_, that flag is already used in ProductBase (for similar purpose).

I'm also thinking that we should really measure the benefit of these "asynchronously filled products", since supporting them appears to lead to various complications (still manageable, but we should see what we really gain).

@fwyzard
Contributor

fwyzard commented Oct 25, 2021

This PR should not add any additional atomic flags. If you refer to StreamSharingHelper::mayReuseStream_, that flag is already used in ProductBase (for similar purpose).

Yes - sorry, I meant "the implications of using the atomic flag in the new approach vs the old approach".

@makortel
Collaborator Author

About the runAcquire()/runProduce() API discussion it might be relevant that I'm thinking to add something along the following for ESProducers (where a base class would not really work)

std::unique_ptr<cms::cuda::ESProduct<PixelCPEFast>> PixelCPEFastESProducer::produce(TkPixelCPERecord const& iRecord) {
  // all non-CUDA operations
  return cms::cuda::runForHost([&](cms::cuda::HostContext& ctx) {
    // operations to allocate (pinned) host memory that needs to live until the transfers to all devices have completed
    // HostContext provides enough "context" for pinned host memory allocations but nothing more
    // for use cases that do not need such "temporary (pinned) host memory", there would be a direct runForEachDevice() function
    return SomeTemporaryObjectCouldBeTuple(...);
  }).forEachDevice([&](SomeTemporaryObjectCouldBeTuple const& tmp, cms::cuda::ESContext& ctx) {
    // operations to transfer data and/or run kernels on one device
    // ESContext provides "context" for device memory allocations and running kernels
    return PixelCPEFast(...);
  });
}

Note that this cms::cuda::ESProduct&lt;T&gt; is different from the cms::cuda::ESProduct&lt;T&gt; we have now (so there would be an intermediate transition period with a different name).

I have an earlier prototype of that, and should rebase it and make a separate PR (although there would also be a third one, touching the user-facing memory allocation APIs, that sits in between this PR and the ESProducer PR). It might be clearest to leave further discussion on the ESProducer API for that PR (I'll try to open it as soon as I can).

@makortel
Collaborator Author

I have an earlier prototype of that, and should rebase it and make a separate PR (although there would be also a third one touching the user-facing memory allocation APIs that is in between this PR and the ESProducer PR). It might be most clear to leave further discussion on the ESProducer API for that PR (I'll try to open it as soon as I can).

This prototype is now in #257.

@makortel
Collaborator Author

Made effectively obsolete by cms-sw/cmssw#39428

@makortel makortel closed this Sep 16, 2022
Labels: cuda, enhancement