Segfaults in RecHitsSortedInPhi constructor in GPU workflows #40604
Comments
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Here is another one that points more clearly to the crash occurring in the sorting
|
Assign reconstruction,heterogeneous |
New categories assigned: heterogeneous,reconstruction @mandrenguyen,@fwyzard,@clacaputo,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
#40465 looks like a plausible culprit. Let me tag also @AdrianoDee. |
Let me have a look. |
@AdrianoDee Have you had a chance to take a look? In principle it would be good to have the crashes fixed for 13_0_0. |
@makortel you are right. I had a look but I didn't converge. On it in the next days. |
So, I still don't understand what is happening, but one strange thing is that I can't reproduce this in a single thread, and the crash occurs when any of the threads moves on to its next event (so at the 5th event for 4 threads, the 9th for 8, and so on). If this rings a bell for anybody, please let me know. Debugging is getting nasty without being able to run single-threaded (also, any suggestions on how to debug this better are very welcome). |
Have you tried valgrind? It will also work with multiple threads. Another thing to try would be to see if using 2 streams and 1 thread also leads to a crash. |
After taking a look at the code (which ultimately just sorts on floats that are stored as member data), the most likely culprit seems to be a NaN in at least one of the phi values. A NaN breaks sorting since
so from the transitive property of arithmetic, the sort would assume
|
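To make the NaN argument above concrete, here is a minimal C++ sketch. It is not taken from CMSSW: the helper names `equivalentUnderLess` and `containsNaN` are hypothetical and exist only for illustration. It shows that every `<` comparison involving a NaN is false, so a NaN looks "equivalent" to any other value, which violates the strict weak ordering that `std::sort` requires.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical helper: true if neither a < b nor b < a holds, which is how
// a comparison-based sort decides that two keys are "equivalent".
inline bool equivalentUnderLess(float a, float b) { return !(a < b) && !(b < a); }

// Hypothetical helper: true if any value in v is NaN. A check like this can
// be run on the sort keys before sorting to confirm or rule out the suspicion.
inline bool containsNaN(const std::vector<float>& v) {
  return std::any_of(v.begin(), v.end(), [](float x) { return std::isnan(x); });
}
```

With `nan = std::numeric_limits<float>::quiet_NaN()`, both `equivalentUnderLess(1.0f, nan)` and `equivalentUnderLess(nan, 2.0f)` are true while `equivalentUnderLess(1.0f, 2.0f)` is false, so "equivalence" is no longer transitive. Calling `std::sort` on such input is undefined behaviour, which can show up as out-of-bounds accesses inside the sort.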
Thanks @Dr15Jones I was noticing the same |
The problem is that
--- a/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
+++ b/CUDADataFormats/TrackingRecHit/interface/TrackingRecHitSoADevice.h
@@ -48,7 +48,11 @@ public:
   cms::cuda::host::unique_ptr<float[]> localCoordToHostAsync(cudaStream_t stream) const {
     auto ret = cms::cuda::make_host_unique<float[]>(4 * nHits(), stream);
     size_t rowSize = sizeof(float) * nHits();
-    cudaCheck(cudaMemcpyAsync(ret.get(), view().xLocal(), rowSize * 4, cudaMemcpyDefault, stream));
+
+    cudaCheck(cudaMemcpyAsync(ret.get(), view().xLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits(), view().yLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits() * 2, view().xerrLocal(), rowSize, cudaMemcpyDefault, stream));
+    cudaCheck(cudaMemcpyAsync(ret.get() + nHits() * 3, view().yerrLocal(), rowSize, cudaMemcpyDefault, stream));
     return ret;
   }  //move to utilities
|
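The diff above replaces one bulk copy of `4 * nHits()` floats with four per-column copies. The thread does not spell out the reason here, so the following is only a sketch under an assumption: that each SoA column is padded to an alignment boundary, so consecutive columns are not contiguous in memory. `PaddedSoA`, `copyBulk`, and `copyPerColumn` are hypothetical host-only stand-ins (using `std::memcpy` instead of `cudaMemcpyAsync`), not CMSSW code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an SoA buffer with 4 float columns, each padded
// to a multiple of `alignmentBytes` (assumed layout, not the CMSSW class).
struct PaddedSoA {
  std::size_t nHits;
  std::size_t stride;  // floats per column, including padding
  std::vector<float> storage;

  PaddedSoA(std::size_t n, std::size_t alignmentBytes)
      : nHits(n),
        stride((n * sizeof(float) + alignmentBytes - 1) / alignmentBytes * alignmentBytes / sizeof(float)),
        storage(4 * stride, -999.f) {}  // padding is filled with a sentinel

  float* column(std::size_t c) { return storage.data() + c * stride; }
  const float* column(std::size_t c) const { return storage.data() + c * stride; }
};

// Mirrors the removed line: one copy of 4 * nHits floats starting at column 0.
// With padded columns this picks up padding instead of columns 1..3.
inline std::vector<float> copyBulk(const PaddedSoA& soa) {
  std::vector<float> out(4 * soa.nHits);
  std::memcpy(out.data(), soa.column(0), 4 * soa.nHits * sizeof(float));
  return out;
}

// Mirrors the added lines: one copy per column, packed back to back.
inline std::vector<float> copyPerColumn(const PaddedSoA& soa) {
  std::vector<float> out(4 * soa.nHits);
  for (std::size_t c = 0; c < 4; ++c) {
    std::memcpy(out.data() + c * soa.nHits, soa.column(c), soa.nHits * sizeof(float));
  }
  return out;
}
```

For example, with `PaddedSoA soa(3, 128)` each column occupies 32 floats, so `copyBulk` returns sentinel padding where the second column should start, while `copyPerColumn` returns the four columns packed contiguously. On a real device buffer the padding would be uninitialized memory, which is one way garbage values could end up in the host-side coordinates.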
Proposed the fix in #40869 |
+heterogeneous |
Step 3 in a subset of the 10824.59x and 11634.59x workflows has been segfaulting in GPU IBs since CMSSW_13_0_X_2023-01-18-2300. Example stack trace:
https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc11/CMSSW_13_0_GPU_X_2023-01-23-2300/pyRelValMatrixLogs/run/10824.592_TTbar_13+2018_Patatrack_FullRecoGPU/step3_TTbar_13+2018_Patatrack_FullRecoGPU.log#/