Workaround to produce exactly same data products in Serial and CUDA backends in Alpaka modules possibly used at HLT #44698
Conversation
…ackends in Alpaka modules possibly used at HLT To be reverted in the near future after the necessary metadata to deal with different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework.
cms-bot internal usage
+code-checks
Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-44698/39907
A new Pull Request was created by @makortel for master. It involves the following packages:
@mandrenguyen, @fwyzard, @perrotta, @consuegs, @saumyaphor4252, @cmsbuild, @jfernan2, @makortel can you please review it and eventually sign? Thanks. cms-bot commands are listed here
enable gpu
@cmsbuild, please test
urgent
+1
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-47488d/38758/summary.html
Comparison Summary:
GPU Comparison Summary:
+1
+alca
hi @makortel, a question about the workaround. Does the branch ID depend on the branch type or name, or only on its presence? I know it's not (even close to) a production environment, but I'm curious: what would happen if one mixed jobs running on the CPU and on AMD GPUs?
Both the BranchID and the ProductID depend on the friendly class name, the module label, the product instance name, and the process name (BranchID being a hash of those, and ProductID being an index into a container ordered by those). I.e. these "mock-up products" for the CPU-only case must reflect the other case exactly in order to reproduce the same behavior.
The CPU+AMD GPU job might have different ProductIDs than a CPU+NVIDIA GPU job, or not. It really depends on the other data products in the job (mainly their friendly class names) whether the NVIDIA vs. AMD difference in the type names makes a difference. If one required the same ProductIDs for CPU-only, CPU+NVIDIA, and CPU+AMD, then all three backends would have to declare the production of all three flavors of data products (which is why we don't do that). The workaround in this PR is thus very specific to the present HLT farm hardware.
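The dependence described above can be sketched schematically (all names and types here are hypothetical illustrations, not the CMSSW implementation): the BranchID-like value is a hash of the four branch attributes, while the ProductID-like value is a position in the ordered list of all declared branches, so adding or removing any branch can shift the IDs of the others.

```python
import hashlib

def branch_key(friendly_class_name, module_label, product_instance, process_name):
    """The four attributes that together identify a branch."""
    return (friendly_class_name, module_label, product_instance, process_name)

def branch_id(key):
    """BranchID-like value: a stable hash of the four attributes."""
    return int(hashlib.sha1("_".join(key).encode()).hexdigest(), 16)

def product_ids(branches):
    """ProductID-like values: 1-based indices into the ordered branch list."""
    return {key: i + 1 for i, key in enumerate(sorted(branches))}

# Hypothetical branches: a host-side product declared by every backend, and a
# device-side product declared only by the GPU backend. Declaring the extra
# branch shifts the index of the host-side product, which is exactly why
# CPU-only and CPU+GPU jobs must register identical sets of branches.
host = branch_key("SoAHostCollection", "pixelTracks", "", "HLT")
device = branch_key("SoADeviceCollection", "pixelTracks", "", "HLT")
cpu_only = [host]
cpu_gpu = [host, device]
```

In this toy model the same physical product (`host`) receives a different ProductID in the two jobs, mirroring the mismatch that breaks `edm::Ref` dereferencing.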
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @rappoccio, @antoniovilela, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
This PR is a temporary workaround for the issue discussed in #44643. In short, with an HLT menu that uses Alpaka modules, when some HLT processes use a GPU and some do not, the fact that the different backends of Alpaka modules produce different transient data products causes the ProductIDs to differ between the CPU-only and CPU+GPU processes. Presently there is not enough metadata available in the final streamer file for the framework to keep proper track of the various indices, which leads to dereferencing of edm::Ref failing in a subsequent job (for a longer description see #44643 (comment)). This PR works around the problem by making a subset(*) of the Alpaka EDProducers on the CPU serial backend register, for each "device-side data product" they produce (which in reality are the "host-side data products" directly), the production of the corresponding CUDA backend data product as well. This hack makes the CPU serial and CUDA backend EDProducers register the production of exactly the same data products, which leads to equal ProductIDs for the same data products between CPU-only and CPU+GPU jobs, circumventing the problem.
(*) ECAL, PF, and Phase 1 Pixel EDProducers, which are either used in the present HLT menu or planned to be used in the near future
The use of strings for the CUDA backend data products is ugly, but it avoids the need for a compile-time dependence on the CUDA backend code in the CPU serial backend code (I actually tried that first, but ran into compilation issues; probably our present build rules prevent the use of Alpaka CUDA backend types in the CPU serial backend code). At runtime the added explicit CUDA dependence will likely break on platforms that do not support CUDA (all our code directly depending on CUDA would be broken there anyway, so the temporary loss of functionality seems acceptable).
This PR is intended to be reverted when the necessary metadata to deal with the different ProductIDs in different HLT processes (CPU-only vs. CPU+GPU) gets propagated to the framework, whose details are being discussed in #44643.
PR validation:
I ran the example HLT job in #44643 (comment) for 1 event, without and with this PR, with and without a GPU. Then I ran
edmProvDump --productIDEntry 0 -a <file>
for both output files, and compared the lines that contained ProductID 2: (this shows the set of data products that are either stored in the file or are ancestors of the stored data products, produced by the re-HLT process). Without this PR the ProductIDs between CPU-only and CPU+GPU jobs show differences (#44643 (comment)).
With this PR the ProductIDs between CPU-only and CPU+GPU jobs are the same.
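The by-hand comparison described above could be automated along these lines (a sketch; the exact line format of the edmProvDump output is assumed, not taken from the tool's documentation):

```python
def productid2_lines(dump_text):
    """Collect the lines of an edmProvDump output that mention 'ProductID 2:'."""
    return sorted(line for line in dump_text.splitlines() if "ProductID 2:" in line)

def same_product_ids(dump_a, dump_b):
    """True when both dumps list the same 'ProductID 2:' lines."""
    return productid2_lines(dump_a) == productid2_lines(dump_b)
```

Feeding it the two dumps (CPU-only vs. CPU+GPU) would report a mismatch without this PR and agreement with it.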
If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:
To be backported to 14_0_X.