Runtime crash when forcing only pixel tracking+vertexing on serial_sync backend
#45708
Comments
A new Issue was created by @missirol. @Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign heterogeneous, reconstruction, hlt |
New categories assigned: heterogeneous,reconstruction,hlt @Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks |
Let's tag @cms-sw/tracking-pog-l2 |
Theoretically I'd expect it to work, at least from the framework point of view. |
type tracking |
@AdrianoDee FYI |
Compiling with debug symbols points the crash to occur in
|
Some additional information from a debugger session:
pIndex = 0
kl = 31
kk = 31
khh = 17
hoff = 256
phiBinner.off.m_v[hoff+kk] = 6504
phiBinner.content.m_capacity = 29601
so theoretically the
Looking then at the
hh.elements_ = 29601
# consistent with phiBinner.content.m_capacity
hh.phiBinnerStorageParameters_.addr_ = 0x7fff5393e580
phiBinner.content.m_v = 0x7fff5373e580
# phiBinner.content.m_v is exactly 2 MiB smaller than phiBinnerStorageParameters_.addr_ !
# ok, the "exactly 2 MiB" could be a coincidence
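The arithmetic behind the two observations above can be checked with a short sketch. The constants are copied from the debugger output; the helper names (`pointerGapBytes`, `offsetWithinCapacity`, `checkDebuggerValues`) are hypothetical and not part of CMSSW, and the range check assumes the offset value is used as an index into the content array.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the debugger session above.
constexpr std::uint64_t kStorageAddr = 0x7fff5393e580ULL;  // hh.phiBinnerStorageParameters_.addr_
constexpr std::uint64_t kContentPtr = 0x7fff5373e580ULL;   // phiBinner.content.m_v
constexpr std::uint64_t kOffsetValue = 6504;               // phiBinner.off.m_v[hoff+kk]
constexpr std::uint64_t kCapacity = 29601;                 // phiBinner.content.m_capacity

constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Byte distance between the storage address and the pointer actually used.
constexpr std::uint64_t pointerGapBytes() { return kStorageAddr - kContentPtr; }

// True if the offset read from the binner is below the content capacity
// (assumption: the offset is used as an index into the content array).
constexpr bool offsetWithinCapacity() { return kOffsetValue < kCapacity; }

// Hypothetical helper: runs the two checks at run time.
inline void checkDebuggerValues() {
  assert(pointerGapBytes() == 2 * kMiB);
  assert(offsetWithinCapacity());
}
```

So the index itself looks in range, while the pointer being dereferenced differs from the storage address by 0x200000 bytes.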
cmssw/HeterogeneousCore/AlpakaInterface/interface/FlexiStorage.h Lines 27 to 30 in 96d37fb
called from OneToManyAssocBase<...>::initStorage()
AFAICT initStorage() is called only in zeroAndInit kernel
and launchZero kernel
I see especially the device-to-host copy of cmssw/DataFormats/TrackingRecHitSoA/interface/alpaka/TrackingRecHitsSoACollection.h Lines 35 to 45 in 96d37fb
does not call initStorage(), or set the phiBinner.content.m_v in any other way.
I see the cmssw/HeterogeneousCore/AlpakaInterface/test/alpaka/testHistoContainer.dev.cc Lines 201 to 208 in 96d37fb
before inspecting the host-side data. I think the device-to-host copy of Given the comment
means the copyAsync() function must synchronize with alpaka::wait() before calling initStorage(). This might be sufficient at least for subsequent testing.
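The failure mode described here can be illustrated with a minimal, framework-free sketch. All types and functions below (`ContentView`, `Binner`, `Hits`, `rawCopy`, `reinitAfterCopy`) are hypothetical stand-ins, not the CMSSW or alpaka classes. The sketch assumes only what the discussion states: the binner holds a non-owning pointer into separately allocated storage, and the device-to-host copy transfers the bytes without re-pointing it. In the real asynchronous case the copy would also have to be complete (for example after alpaka::wait()) before the host-side re-initialisation.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical non-owning view, loosely modelled on the role of FlexiStorage.
struct ContentView {
  std::uint32_t* m_v = nullptr;
  int m_capacity = 0;
  void init(std::uint32_t* v, int capacity) {
    m_v = v;
    m_capacity = capacity;
  }
};

// Hypothetical binner that only points at externally owned storage.
struct Binner {
  ContentView content;
  void initStorage(std::uint32_t* storage, int capacity) { content.init(storage, capacity); }
};

// Hypothetical hit collection: the binner plus the storage it should point to.
struct Hits {
  Binner binner;
  std::vector<std::uint32_t> storage;
};

// Bytewise copy, as a plain buffer copy would do: the pointer value is
// copied verbatim and still refers to the source's storage.
inline void rawCopy(Hits const& src, Hits& dst) {
  std::memcpy(&dst.binner, &src.binner, sizeof(Binner));
  dst.storage = src.storage;  // the destination gets its own buffer
}

// The fix sketched in the discussion: re-point the view at the
// destination's own storage once the copy has completed.
inline void reinitAfterCopy(Hits& dst) {
  dst.binner.initStorage(dst.storage.data(), static_cast<int>(dst.storage.size()));
}
```

After `rawCopy` alone, the destination's `content.m_v` still holds the source address; calling `reinitAfterCopy` on the destination makes it point at the destination's own storage.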
For the longer term, assuming we'd want to remove this |
…ection<TrackerTraits> Fix to the device-to-host copy of TrackingRecHitsSoACollection<TrackerTraits>, in order to initialise the phiBinner data member on the host side. A more complete explanation of the issue is provided by @makortel in cms-sw#45708 (comment)
Shall we have a PR for this while a more thorough fix is developed concerning cms-sw/framework-team#989? |
missirol@62620da is my best-guess of a patch based on the explanations in #45708 (comment) (thanks @makortel for debugging the problem), but I don't know if it's correct. I checked that it avoids the crash, and the trigger results are the same (modulo what I think are the usual small GPU-vs-CPU discrepancies) when running pixel tracking+vertexing on CPU (as in the reproducer in the description) vs running all Alpaka modules on GPU, but so far I only tested on O(10) events. |
Fix along missirol@62620da is needed in any case. The cms-sw/framework-team#989 will only help to remove the
Could you point me to a timeline? Also, will the HLT use 14_0_X or 14_1_X for the HI data taking? (@missirol's test used 14_0_14, but my understanding is that 14_1_X would be the HI data-taking release cycle.) I'm asking early because whether or not the outcome of cms-sw/framework-team#989 needs to be backported affects how it will be done (a 14_1_X-only change could use C++20 features).
I'd believe the lines relate to
👍 A performance test (to see the cost of the |
please refer to this notice that any further tracking update hinges on this ticket to enter first.
HLT will use 14_1_X for actual data-taking, but we're still integrating updates in 14_0_X (and will continue doing so until we have |
I can confirm it (see #45743 (comment)). |
Sorry in advance for my ignorance...
I don't know how to remove L46; I thought the If I remove L47-51, the reproducer crashes as follows.
If I remove L48-51, the reproducer crashes as follows.
|
EDIT: it looks like these PRs generated the issue #45834, thus removing the |
A possibility for |
Thanks @makortel. I would suggest trying to adopt it for CMSSW 14.2.x, and sticking to the simpler bugfix for 14.0.x/14.1.x. |
Ok. |
+1 |
+heterogeneous |
@cmsbuild, please close |
This issue is fully signed and ready to be closed. |
The test in [1] crashes at runtime in CMSSW_14_0_14 when running on a machine with a GPU (I did not try on a machine without one). The test modifies a recent HLT pp menu by setting the backend of the Alpaka pixel-tracks and pixel-vertices SoA producers to "serial_sync" (in other words, offloading the pixel local reconstruction to GPUs, then forcing track and vertex reconstruction to run on CPU). This mimics the setup that the HIon group plans to implement in the lead-lead trigger menu of 2024 (see CMSHLT-3284) [*].

The stack trace is in [2]. The crash does not happen if one uses options.accelerators = ['cpu'].

Is [1] supposed to work? If so, what's going wrong?

[*] This 'mixed' approach (pixel local reconstruction on GPU, tracking and vertexing on CPU) has already been used in the 2023 HIon run, back then using the CUDA-based implementation of the pixel reconstruction. Pixel tracking is currently not offloaded to GPUs in the HIon menu because this leads to excessive GPU memory consumption (then, runtime crashes) in lead-lead events (at least with current data-taking conditions and current HLT hardware).
[1]
[2]