Reduce Alpaka event synchronization calls via EDMetadata #44841
Conversation
…as launched. Notably, the case where an EDProducer launches asynchronous work but does not produce any device-side data products continues to lead to a blocking alpaka::wait() being called at the end of produce().
…not been reused by a consuming module. If the consuming module does not produce any device-side products, the EDMetadata destructor should result in a blocking alpaka::wait() call.
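To make the blocking case concrete, here is a minimal stand-alone C++ model of the rule (not the CMSSW or alpaka API: the names `DeviceProduct`, `launchAsync`, `produceWithDeviceProduct` and `produceHostOnly` are made up, and `std::shared_future` stands in for the queue/event bookkeeping). A producer that emits a device-side product can hand its completion state to the consumer, while a producer that launches asynchronous work but emits only host-side data has to block before returning.

```cpp
#include <future>
#include <iostream>
#include <memory>
#include <vector>

// Stand-in for the completion event that the framework records on the queue.
using Done = std::shared_future<void>;

// A device-side product in this model carries its completion state with it,
// so a later consumer can synchronize against it.
struct DeviceProduct {
  std::shared_ptr<std::vector<int>> payload = std::make_shared<std::vector<int>>();
  Done done;
};

// Stand-in for enqueueing kernels on device::Event::queue().
Done launchAsync(std::shared_ptr<std::vector<int>> data) {
  return std::async(std::launch::async, [data] {
           for (auto& x : *data) x *= 2;
         }).share();
}

// Producer that emits a device-side product: the completion state travels
// with the product, so the call can return without blocking.
DeviceProduct produceWithDeviceProduct() {
  DeviceProduct p;
  *p.payload = {1, 2, 3};
  p.done = launchAsync(p.payload);
  return p;  // non-blocking
}

// Producer that launches asynchronous work but emits only host-side data:
// nothing downstream carries the completion state, so it must block before
// returning; this corresponds to the blocking alpaka::wait() in produce().
std::vector<int> produceHostOnly() {
  auto data = std::make_shared<std::vector<int>>(std::vector<int>{1, 2, 3});
  launchAsync(data).wait();  // blocking wait
  return *data;
}

int main() {
  DeviceProduct dev = produceWithDeviceProduct();
  dev.done.wait();  // the consumer synchronizes when it uses the product
  std::cout << (*dev.payload)[0] << ' ' << produceHostOnly()[0] << '\n';  // prints "2 2"
}
```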
cms-bot internal usage
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44841/40091
A new Pull Request was created by @makortel for master. It involves the following packages:
@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks. cms-bot commands are listed here
enable gpu
@cmsbuild, please test
@fwyzard This PR might warrant some further testing on the HLT side. Let me know how you want to proceed (e.g. I have a 14_0_X-based branch already too).
@@ -170,7 +170,7 @@ Also note that the `fillDescription()` function must have the same content for a
* All Event data products in the host memory space are guaranteed to be accessible for all operations (after the data product has been obtained from the `edm::Event` or `device::Event`).
* All EventSetup data products in the device memory space are guaranteed to be accessible only for operations enqueued in the `Queue` given by `device::Event::queue()` when accessed via the `device::EventSetup` (ED modules), or by `device::Record<TRecord>::queue()` when accessed via the `device::Record<TRecord>` (ESProducers).
* The EDM Stream does not proceed to the next Event until after all asynchronous work of the current Event has finished.
- * **Note**: currently this guarantee does not hold if the job has any EDModule that launches asynchronous work but does not explicitly synchronize or produce any device-side data products.
+ * **Note**: this implies if an EDProducer in its `produce()` function uses the `Event::queue()` or gets a device-side data product, and does not produce any device-side data products, the `produce()` call will be synchronous (i.e. will block the CPU thread until the asynchronous work finishes)
This update in the README actually holds (and is stronger) even without this PR; I just hadn't realized it.
My brain hurts, but I trust you on this :-)
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fda4d/39077/summary.html
Comparison Summary: there are some workflows for which there are errors in the baseline.
GPU Comparison Summary
The code changes look good to me. I will try to test the impact on an HLT workflow.
Here's the impact on the HLT (8 jobs, 32 threads / 24 streams each):
CMSSW_14_0_5_patch2
CMSSW_14_0_5_patch2 plus cms-sw/cmssw#44841
Whether the small improvement is real or a fluctuation is hard to say, but in any case it goes in the right direction.
+heterogeneous
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @sextonkennedy, @antoniovilela (and backports should be raised in the release meeting by the corresponding L2)
+1
PR description:
Prompted by #44769 (comment), this PR reduces the number of blocking `alpaka::wait()` calls (leading to `cudaEventSynchronize()` on NVIDIA GPUs) made via the EDMetadata destructor. To check the impact I used the reproducer of #44769 (even if it eventually leads to an assertion failure). Initially the reproducer led to 2470 calls to `cudaEventSynchronize()`.

The first commit avoids calling `alpaka::wait()` in `EDProducer::produce()` when no asynchronous work was launched (possible async work is identified by any use of `device::Event::queue()`, including consuming a device-side data product). In this case there is no need to synchronize the device and the host. This commit reduced the number of `cudaEventSynchronize()` calls to 2284 (-7.5 %).
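As a rough illustration of the bookkeeping implied by the first commit, here is a plain-C++ sketch (the names `ToyQueue` and `ToyMetadata` are invented; this is not the actual `EDMetadata` code): the queue is handed out through an accessor that remembers whether it was ever used, and the end-of-`produce()` synchronization is skipped otherwise.

```cpp
#include <future>
#include <iostream>

// A trivial stand-in for an asynchronous work queue.
struct ToyQueue {
  std::shared_future<void> lastWork;
  void enqueue() {
    lastWork = std::async(std::launch::async, [] { /* pretend kernel */ }).share();
  }
  void wait() const {
    if (lastWork.valid()) lastWork.wait();
  }
};

// Hand out the queue lazily and remember whether anyone asked for it; only
// then can asynchronous work have been launched.
class ToyMetadata {
public:
  ToyQueue& queue() {
    queueUsed_ = true;  // any access counts as "async work may have been launched"
    return queue_;
  }
  // Called at the end of produce(): block only if the queue was actually used.
  void synchronizeIfNeeded() const {
    if (queueUsed_) {
      queue_.wait();  // stands in for the blocking alpaka::wait()
      std::cout << "blocked on the queue\n";
    } else {
      std::cout << "no async work launched, skipping the wait\n";
    }
  }

private:
  ToyQueue queue_;
  bool queueUsed_ = false;
};

int main() {
  ToyMetadata noWork;
  noWork.synchronizeIfNeeded();  // prints "no async work launched, skipping the wait"

  ToyMetadata withWork;
  withWork.queue().enqueue();      // e.g. a kernel launch or a consumed device product
  withWork.synchronizeIfNeeded();  // prints "blocked on the queue"
}
```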
The second commit caches the alpaka::Event completion, so that `alpaka::wait()` does not need to be called again on an already-completed alpaka Event. This commit reduced the number of `cudaEventSynchronize()` calls to 2019 (-18 % wrt the starting point).
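The caching in the second commit can be sketched along these lines (plain C++ with invented names; `std::shared_future` stands in for an alpaka::Event, and the real code queries and waits on the actual device event): once a synchronization has completed, the result is remembered so that later synchronization requests become cheap no-ops.

```cpp
#include <future>
#include <iostream>

// Minimal model: remember once an "event" has been observed complete, so that
// later synchronization requests do not trigger another blocking call.
class CachedEvent {
public:
  explicit CachedEvent(std::shared_future<void> f) : work_(std::move(f)) {}

  void wait() {
    if (complete_) return;  // cached result: no second blocking call
    work_.wait();           // stands in for alpaka::wait() / cudaEventSynchronize()
    complete_ = true;
    ++blockingCalls;
  }

  static inline int blockingCalls = 0;

private:
  std::shared_future<void> work_;
  bool complete_ = false;
};

int main() {
  CachedEvent ev{std::async(std::launch::async, [] { /* pretend kernel */ }).share()};
  ev.wait();  // the first call may block
  ev.wait();  // subsequent calls are answered from the cached flag
  ev.wait();
  std::cout << "blocking calls: " << CachedEvent::blockingCalls << '\n';  // prints 1
}
```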
The third commit avoids calling `alpaka::wait()` in the destructor of an `EDMetadata` when the Queue of the `EDMetadata` object has been re-used by another `EDMetadata` (in an `EDProducer::produce()` that consumed the device-side data product holding the first `EDMetadata` object). Since the only purpose of the `alpaka::wait()` in `~EDMetadata()` is to ensure that all asynchronous work of an edm::Event has completed before the edm::Stream moves on to the next edm::Event, for any Queue it is enough to `wait()` on the alpaka::Event that was last recorded on that Queue. This commit reduced the number of `cudaEventSynchronize()` calls to 1462 (-40 % wrt the starting point).
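The principle that for any Queue it suffices to `wait()` on the last alpaka::Event recorded on it can be modelled as follows (plain C++, invented names, and a simple per-queue counter instead of the actual tracking of Queue re-use by consumers): only the holder of the most recently recorded event blocks in its destructor.

```cpp
#include <future>
#include <iostream>
#include <memory>

// Minimal model of "only the last event recorded on a queue needs a wait".
// Each ToyMetadata records one event on a shared queue; at destruction it
// blocks only if no later ToyMetadata has recorded an event on the same queue.

struct ToyQueue {
  int lastRecordedId = 0;  // id of the most recently recorded event
};

class ToyMetadata {
public:
  ToyMetadata(std::shared_ptr<ToyQueue> q, std::shared_future<void> work)
      : queue_(std::move(q)), work_(std::move(work)), eventId_(++queue_->lastRecordedId) {}

  ~ToyMetadata() {
    if (eventId_ == queue_->lastRecordedId) {
      work_.wait();  // stands in for the blocking alpaka::wait() in ~EDMetadata()
      std::cout << "waited on event " << eventId_ << '\n';
    } else {
      std::cout << "event " << eventId_ << " superseded, skipping the wait\n";
    }
  }

private:
  std::shared_ptr<ToyQueue> queue_;
  std::shared_future<void> work_;
  int eventId_;
};

int main() {
  auto queue = std::make_shared<ToyQueue>();
  auto work = [] { return std::async(std::launch::async, [] {}).share(); };

  ToyMetadata first{queue, work()};   // its queue gets re-used below
  ToyMetadata second{queue, work()};  // holds the last event recorded on the queue
  // Destruction order: second (waits), then first (skips the wait).
}
```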
There is one case left (on purpose) where `EDProducer::produce()` still calls `alpaka::wait()`. This happens when `produce()` launches asynchronous work (i.e. calls `device::Event::queue()`) but does not produce any device-side data products. I can't imagine a good use case for such an EDProducer (except maybe a DQM-style module, but that would need a separate module base class in any case, which could then deal with the necessary edm::Event-level synchronization).

I don't expect a huge impact from the reduction of these calls, as in the present cases in HLT the alpaka::Events should have completed by the time the `alpaka::wait()` is called. The stack traces in #44769 (comment) nevertheless show `cudaEventSynchronize()` acquiring a lock even when the CUDA event was complete at the time the event was recorded, so at the very least this PR should reduce the contention on that lock.

Resolves cms-sw/framework-team#902
PR validation:
The `HeterogeneousCore/Alpaka{Core,Test}` unit tests pass.

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:
Likely will be backported to 14_0_X