[cudadev][RFC] Next version of CUDA EDModule API #224
Conversation
Add runAcquire(), runProduce() functions … in better-defined way
Here is a first round of comments.
I've been trying to keep track of how the CUDA device and stream information gets propagated through a chain of SynchronizingEDProducers, but I'm getting lost...
}

  // Internal API
  void commit();
TaskContext::commit() is never called; is it intentional?
Somewhat intentional: there is no code in the standalone that uses the "intermediate task" feature. I consider the TaskContext more of a sketch here that would be ironed out with proper tests when deploying in CMSSW.
OK, so it should be used in a way similar to the AcquireTask, with something along the lines of
template <typename F>
void runTask(edm::StreamID streamID, edm::WaitingTaskWithArenaHolder holder, F func) {
TaskContext context(streamID, std::move(holder));
func(context);
context.commit();
}
and then a module would call
// do not move the holder as it is reused
runTask(event.streamID(), holder, [&](auto& ctx) { firstTask(event, eventSetup, ctx); });
runTask(event.streamID(), holder, [&](auto& ctx) { secondTask(event, eventSetup, ctx); });
runTask(event.streamID(), holder, [&](auto& ctx) { lastTask(event, eventSetup, ctx); });
?
I'm not really satisfied with the API for the "intermediate tasks".
The runAcquire() version would be straightforward to extend to something along the lines of
runAcquire(event.streamID(), std::move(holder), [&](AcquireContext& ctx) {
...
}).then([&](TaskContext& ctx) {
// to be run after the asynchronous work queued in runAcquire() has finished
});
but then there would be no similar way to queue a subsequent task in the TaskContext lambda above (without adding runTask() boilerplate).
One alternative would be to keep the current-style API, i.e.
void FooProducer::acquire(AcquireContext& ctx) {
...
ctx.pushNextTask([](TaskContext& ctx2) {
// to be run after the asynchronous work queued in acquire() has finished
ctx2.pushNextTask(nextFunctor);
});
}
AcquireContext context(streamID, std::move(holder));
func(context);
context.commit();
Browsing the code, this seems to be the only place where an AcquireContext object is ever constructed. Given this semantic, would it make sense to move the call to commit() (or even its code) directly into the AcquireContext's destructor? Similarly for ProduceContext and TaskContext(?).
The separate commit() function is one of the main motivations for this development: it moves code that could throw an exception out of the destructor. If func() throws an exception, commit() is not called, which further implies
- when the exception is not CUDA related, it avoids queuing the callback function (and also recording the event, but that probably has lesser importance)
- when the exception is CUDA related, it avoids throwing a second exception from queuing the callback function / recording the event, which would lead to program termination
(I'm also thinking of renaming the function to something more descriptive, maybe finish() or enqueueCallback()/recordEvent())
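The exception-safety argument above can be sketched in isolation (ContextSketch and runTaskSketch below are hypothetical stand-ins, not the actual classes):

```cpp
// Illustrative sketch of the separate-commit() pattern described above.
// The throwing work lives in commit(), never in the destructor, so stack
// unwinding after an exception from func() cannot trigger a second throw
// (which would call std::terminate()).
struct ContextSketch {
  bool committed = false;
  void commit() {  // may throw (e.g. a CUDA error while queuing the callback)
    committed = true;
  }
  ~ContextSketch() noexcept {}  // nothing throwing happens here
};

template <typename F>
bool runTaskSketch(F func) {
  ContextSketch ctx;
  func(ctx);     // if this throws, commit() below is never reached
  ctx.commit();  // queue the callback only on the success path
  return ctx.committed;
}
```

With this shape, a CUDA-related exception from func() propagates cleanly, and no callback is queued for work that was aborted.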
#include "CUDACore/AcquireContext.h"

namespace cms::cuda {
  void AcquireContext::commit() { holderHelper_.enqueueCallback(stream()); }
Why is this defined in the .cc file, instead of in the .h file like the other methods?
I can't think of any good reason now, so the definition could be moved to the header.
 * - setting the current device
 * - calling edm::WaitingTaskWithArenaHolder::doneWaiting() when necessary
 * - synchronizing between CUDA streams if necessary
 * Users should not, however, construct it explicitly.
Given this comment, should the constructor be private, and have as friends the places that can construct objects of this type?
Yes, I would attempt making the constructors private and adding friend declarations.
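A minimal sketch of that pattern (AcquireContextSketch and runAcquireSketch are illustrative names, not the actual classes): the context's constructor is private, and only the framework-side entry point is befriended.

```cpp
// Forward-declare the framework entry point so it can be named as a friend.
template <typename F>
int runAcquireSketch(int streamID, F func);

class AcquireContextSketch {
public:
  int streamID() const { return streamID_; }

private:
  explicit AcquireContextSketch(int streamID) : streamID_(streamID) {}  // users cannot construct this

  // only the framework entry point may construct a context
  template <typename F>
  friend int runAcquireSketch(int streamID, F func);

  int streamID_;
};

template <typename F>
int runAcquireSketch(int streamID, F func) {
  AcquireContextSketch ctx(streamID);  // OK: runAcquireSketch is a friend
  func(ctx);
  return ctx.streamID();
}
```

User code can still receive the context by reference inside the lambda; it just cannot create one on its own (`AcquireContextSketch direct{3};` would fail to compile).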
void SynchronizingEDProducer::acquire(edm::Event const& event,
                                      edm::EventSetup const& eventSetup,
                                      edm::WaitingTaskWithArenaHolder holder) {
  runAcquire(event.streamID(), std::move(holder), [&](auto& ctx) { acquire(event, eventSetup, ctx); });
I understand the encapsulation, though personally I would find it simpler to avoid the lambda:
Suggested change:

- runAcquire(event.streamID(), std::move(holder), [&](auto& ctx) { acquire(event, eventSetup, ctx); });
+ AcquireContext context(event.streamID(), std::move(holder));
+ acquire(event, eventSetup, context);
+ context.commit(); // could be moved into the AcquireContext destructor
OK, so this is intended to support the first point from cms-sw/cmssw#30266 (comment) .
Correct.
namespace impl {
  class FwkContextHolderHelper {
  public:
    FwkContextHolderHelper(edm::WaitingTaskWithArenaHolder waitingTaskHolder, int device)
        : waitingTaskHolder_{std::move(waitingTaskHolder)}, device_{device} {}

    template <typename F>
    void pushNextTask(F&& f);

    void replaceWaitingTaskHolder(edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
      waitingTaskHolder_ = std::move(waitingTaskHolder);
    }

    void enqueueCallback(cudaStream_t stream);

  private:
    edm::WaitingTaskWithArenaHolder waitingTaskHolder_;
    int device_;
  };
}
Isn't this duplicated from CUDACore/FwkContextHolderHelper.h?
It is, thanks for catching.
/**
 * The aim of this class is to do necessary per-event "initialization" in ExternalWork acquire():
 * - setting the current device
 * - calling edm::WaitingTaskWithArenaHolder::doneWaiting() when necessary
@makortel how is doneWaiting() called? Is it commit() that calls holderHelper_.enqueueCallback() that queues cudaContextCallback() that calls doneWaiting()?
> Is it commit() that calls holderHelper_.enqueueCallback() that queues cudaContextCallback() that calls doneWaiting()?

Correct. (Similar to how now ~ScopedContextAcquire() calls holderHelper_.enqueueCallback(), which calls cudaScopedContextCallback(), which calls doneWaiting().)
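That chain can be modeled without any CUDA (all names below are simplified stand-ins for the classes discussed above, with a plain callback queue standing in for the CUDA stream):

```cpp
#include <functional>
#include <vector>

// Stand-in for edm::WaitingTaskWithArenaHolder: doneWaiting() tells the
// framework that the asynchronous work has finished.
struct HolderSketch {
  bool done = false;
  void doneWaiting() { done = true; }
};

// Stand-in for a CUDA stream: callbacks queued behind the device work,
// executed when the "stream" reaches them.
struct StreamSketch {
  std::vector<std::function<void()>> callbacks;
  void run() {  // models the stream reaching the host callback
    for (auto& cb : callbacks) cb();
    callbacks.clear();
  }
};

// Stand-in for the context: commit() -> enqueueCallback() -> (stream runs)
// callback -> doneWaiting().
struct CallbackChainSketch {
  HolderSketch* holder;
  void enqueueCallback(StreamSketch& stream) {
    stream.callbacks.push_back([h = holder] { h->doneWaiting(); });
  }
  void commit(StreamSketch& stream) { enqueueCallback(stream); }
};
```

The key property this models: commit() only *queues* the notification; doneWaiting() fires later, once the stream has drained the asynchronous work ahead of the callback.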
 * - synchronizing between CUDA streams if necessary
 * and enforce that those get done in a proper way in RAII fashion.
 */
class AnalyzeContext : public impl::EDGetterContextBase {
AnalyzeContext is not used anywhere, but I assume that's because we don't have any EDAnalyzers.
Correct. It's mainly a copy-over from CMSSW (where there are tests using it), so I consider it more of a sketch here.
@makortel I understand the problem, and the implementation in the PR should indeed work.
    ss << "\n N(clusters) is " << clusters.nClusters() << " expected " << count.nClusters();
    ok = false;
  }
  cms::cuda::runProduce(iEvent.streamID(), [&](cms::cuda::ProduceContext& ctx) {
Why in this case do you run this explicitly, instead of using a cms::cuda::EDProducer?
I suppose it could, even if it doesn't do any CUDA work.
True.
With the proposed approach, what is the recommended pattern for an EDProducer that reads a CUDA product and produces a CPU product?
That would preferably be inheriting from cms::cuda::SynchronizingEDProducer, or, if that doesn't work out for some (likely complicated) reason, calling cms::cuda::runAcquire().
(Sorry, I sent the earlier comment without thinking it through.)
@@ -36,10 +33,7 @@ SiPixelDigisSoAFromCUDA::SiPixelDigisSoAFromCUDA(edm::ProductRegistry& reg)

void SiPixelDigisSoAFromCUDA::acquire(const edm::Event& iEvent,
                                      const edm::EventSetup& iSetup,
                                      edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
  // Do the transfer in a CUDA stream parallel to the computation CUDA stream
I didn't remember this comment, but re-reading it now, does it mean that the old code was trying to use two streams in parallel (hopefully for two independent operations)?
Yes. E.g. in the following DAG, where all modules launch CUDA work:

     /- B
A <--
     \- C

B and C consume the product(s) of A; one of B and C (the "first one") re-uses the CUDA stream of A, and the other gets another CUDA stream for its work. The other CUDA stream is synchronized to the completion of A's work with cudaStreamWaitEvent().
Yes, this is the general approach.
I was referring to the (deleted) comment that says

> Do the transfer in a CUDA stream parallel to the computation CUDA stream

which seemed to imply something like

A ----- --> B
      \
       `- --> T

A: first computation
B: second computation that uses the output of A
T: transfer the results of A in parallel to running B

I didn't look at any code, etc. - I just read this old comment and found it confusing.
Ah, thanks for the clarifications. This is certainly an old comment (in the original CMSSW code), and probably refers to the fact that with ScopedContextAcquire ctx{iEvent.streamID(), ...} this module gets a separate CUDA stream from the stream used to produce the input data product.
- auto const& ptracks = iEvent.get(tokenGPUTrack_);
- cms::cuda::ScopedContextProduce ctx{ptracks};
- auto const* tracks = ctx.get(ptracks).get();
+ cms::cuda::runProduce(iEvent.streamID(), [&](cms::cuda::ProduceContext& ctx) {
+   auto const* tracks = ctx.get(iEvent, tokenGPUTrack_).get();
    assert(tracks);
    ctx.emplace(iEvent, tokenGPUVertex_, gpuAlgo_.makeAsync(ctx.stream(), tracks, ptMin_));
+ });
Mhm, is this (where the user calls runProduce explicitly) a pattern we want to support?
Answered in #224 (comment)
Final comments: overall, I like the simplification this brings to the EDProducers written by the users. I don't much like the option of calling … Finally, I still do not understand how the context information gets propagated from the …
And that is because it is not propagated by design.
(Replying also here for future reference.) As we discussed privately, this is to experiment with point 2 in cms-sw/cmssw#30266 (comment).
I fully agree the … but the …
Yeah, it is annoying at minimum.
This PR should not add any additional atomic flags. If you refer to … I'm also wondering if we should really measure the benefit of these "asynchronously filled products", since supporting them appears to lead to various complications (still manageable, though; but to see what we really gain).
Yes - sorry, I meant "the implications of using the atomic flag in the new approach vs the old approach".
About the …

std::unique_ptr<cms::cuda::ESProduct<PixelCPEFast>> PixelCPEFastESProducer::produce(TkPixelCPERecord const& iRecord) {
  // all non-CUDA operations
  return cms::cuda::runForHost([&](cms::cuda::HostContext& ctx) {
    // operations to allocate (pinned) host memory that needs to live until the transfers to all devices have completed
    // HostContext provides enough "context" for pinned host memory allocations but nothing more
    // for use cases that do not need such "temporary (pinned) host memory", there would be a direct runForEachDevice() function
    return SomeTemporaryObjectCouldBeTuple(...);
  }).forEachDevice([](SomeTemporaryObjectCouldBeTuple const&, cms::cuda::ESContext& ctx) {
    // operations to transfer data and/or run kernels on one device
    // ESContext provides "context" for device memory allocations and running kernels
    return PixelCPEFast(...);
  });
}

Note that the … I have an earlier prototype of that, and should rebase it and make a separate PR (although there would also be a third PR, touching the user-facing memory allocation APIs, that sits between this PR and the ESProducer PR). It might be clearest to leave further discussion of the ESProducer API for that PR (I'll try to open it as soon as I can).
This prototype is now in #257. |
Made effectively obsolete by cms-sw/cmssw#39428 |
This PR provides a prototype of cms-sw/cmssw#30266 (comment).

In addition to interface changes, it fixes an issue with EDProducers that produce multiple CUDA products (I don't think we have any such EDProducers yet, though). Currently the cms::cuda::Product<T> wrapper stores the same CUDA stream for each of such products, and since those cms::cuda::SharedStreamPtr are stored separately, two (or more) consumers of those products can submit their work into the same CUDA stream, whereas the intention was that only one consumer of any of the products of an EDProducer could use the same CUDA stream. This PR solves this by adding an additional layer of indirection: the cms::cuda::Product<T> wrappers hold a shared_ptr to a helper object that contains the SharedStreamPtr and the logic for deciding which one of the consumers can share the CUDA stream.
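The indirection could look roughly like the following sketch (StreamSharingHelper and ProductSketch are hypothetical names, and a plain shared_ptr stands in for cms::cuda::SharedStreamPtr): every product of one EDProducer holds the same helper, and the helper's atomic flag hands the producer's stream to exactly one consumer.

```cpp
#include <atomic>
#include <memory>

// Hypothetical sketch of the helper-object indirection described above.
struct StreamSharingHelper {
  std::shared_ptr<int> stream;            // stand-in for cms::cuda::SharedStreamPtr
  std::atomic<bool> streamReused{false};  // set by the first consumer

  // Exactly one caller across all products/consumers of the producer gets
  // 'true' and may submit work to the producer's CUDA stream; the rest must
  // take a new stream (synchronized with cudaStreamWaitEvent() in the real code).
  bool tryReuseStream() { return !streamReused.exchange(true); }
};

template <typename T>
struct ProductSketch {
  T data;
  // every product of the same EDProducer holds the same helper object
  std::shared_ptr<StreamSharingHelper> helper;
};
```

With this layout, two consumers reading two *different* products of the same producer still contend on the single flag inside the shared helper, so only one of them reuses the stream, which is exactly the invariant the PR description says was broken before.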