Apparent incompatibility with current ATLAS GPU environment #1462
Comments
So the problem is weird and "seems" to indicate an ODR violation (i.e. an inconsistent build). In the debugger, when we get to the point where the 'bad' function is about to be called:
(and when we use `up` to get back to that frame), all the variables have correct values. When we step into the 'bad' function and immediately get to:
then everything is screwed up
|
I hit the same error on Perlmutter with the build
|
I am making a build that does not use the standalone versions of gcc13 and CUDA 12.4.1, to get another data point. |
I only get the error when running with Athena+Celeritas. I tried building Celeritas within the Athena environment, and all the unit tests succeeded.
EDIT: |
Have we tried a build without VecGeom just to rule out RDC weirdness? |
With Geant4, 29 tests fail: stack overflow, launch failure, launch failure in cudaFree.
|
@esseivaju you're sure you're building with the right CUDA arch? |
yes, building with |
I rebuilt AthSimulation + AthSimulationExternals on Perlmutter, but that made no difference. Building Celeritas within the AthSimulationExternals environment leads to the same failures in Athena and the Celeritas unit tests. I also see the invalid read reported by Philippe: invalid memory access
If I compile with |
My build ended up with the following failures:
(updated to remove test that failed due to an unrelated local mis-configuration resulting in not finding data files)
|
I've been seeing the same:

```
[ RUN ] TestTrackSortActionIdEm3Stepper.device_is_sorted
status: Celeritas core setup complete
info: Executing actions with additional debug checking
status: Celeritas core state initialization complete
warning: Cuda API error detected: cudaLaunchKernel returned (0x1)
warning: Cuda API error detected: cudaPeekAtLastError returned (0x1)
warning: Cuda API error detected: cudaGetLastError returned (0x1)
unknown file: Failure
C++ exception with description "radix_sort: failed on 2nd step: cudaErrorInvalidValue: invalid argument" thrown in the test body.
Writing diagnostic output because test failed
``` |
Digging into all this a little.... NVIDIA/cub#545 ? |
It does seem to be exactly that. In the build used for #1462 (comment), VecGeom was built with CUDA architecture 70 while Celeritas was built with 80. Rebuilding VecGeom with 80 makes |
Using multiple (consistent) architectures: |
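As an editorial aside, a hedged sketch of how the architecture consistency discussed above can be enforced at configure time, using CMake's standard CUDA support (the value 80 is just the example from this thread, not a recommendation):

```cmake
# Sketch: configure VecGeom, Celeritas, and the downstream framework with
# the SAME value; mixing e.g. 70 and 80 reproduced the radix_sort failure
# reported above (NVIDIA/cub#545).
set(CMAKE_CUDA_ARCHITECTURES "80" CACHE STRING "CUDA architectures for all projects")
```

Equivalently, pass `-DCMAKE_CUDA_ARCHITECTURES=80` on the `cmake` command line of every project in the stack so no library silently falls back to a different default.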
So back to the drawing board? 😞 |
@esseivaju figured out that the lxplus libraries were not built as they were meant to be and were indeed using the wrong CUDA architecture. |
Y'all are the best!! |
The Celeritas unit tests no longer crash because of GPU-related issues; however, AthSimulation still does. This is the output of running
The compute-sanitizer output doesn't show a Thrust kernel failing, even though the exception reported by Celeritas suggests that one is:
As mentioned on Slack, I don't think there is code generated for cuobjdump output
|
@esseivaju @pcanal @drbenmorgan Have we tried building with Also I remembered with #1433 we're actually not defaulting to using the same settings for GPU/CPU: perhaps we should use |
I've also sometimes seen errors like that ( |
That's a good point, we ought to try running this geometry in standalone to see. |
That's the GDML you dumped during the hackathon, right? We should be able to test that quickly |
Yeah I've uploaded it and just ran on wildstyle with the regression harness; seems to work fine 👀 |
I tried turning on
I also tried the regression problems, and they work fine, but they complete much faster than the old TileCal with 2 modules we were using before. Something must still be wrong with the Athena build... |
Damn. Thanks Julien, this is a mystery indeed... |
No luck with increasing the stack and heap size. I ran athena through CUDA error:
Host stack trace and exception: (it fails on the first CUDA runtime API call after the failing kernel so it's not very important)
instruction causing the crash (multiply and add):
I haven't linked the instruction to the line of code yet... Since this block should be executing on the first step and it's successful, maybe there is an actual bug in InitTracksExecutor? |
Gosh. Next steps: can we get ROOT (or even hepmc3?) enabled so we can write out the primaries? Can we try testing with ORANGE? |
ORANGE fails at
in the |
Well that's not good either... |
@esseivaju reports a failure with ORANGE and the Bpip geometry but at least we get one event or so:
@esseivaju can you verify whether vecgeom (debug off) works with the beampipe? |
I still see CUDA memory-related issues with Bpipe+VecGeom. I also have a dump of the Bpipe GDML that I'll send on Slack, since I can't push to https://gitlab.cern.ch/bmorgan/atlassim-6635/-/tree/main?ref_type=heads |
Does ATLAS have a "pretend everything is a solid block of aluminum" option??? Is there any weird potential "thread local" behavior outside of the usual Geant4 that could be causing issues? |
Nothing simpler than the beam pipe we're already using.
ATLAS has a lot of custom code for interfacing with Geant4. We're running single-threaded here, so |
From @esseivaju, regarding @amandalund's suggestion to run a single track slot: still getting a failure on the very first step:
This has got to be some sort of linking issue... |
SASS looks effectively the same between standalone/Celeritas ATLAS:
taken from
Building with shared libraries seems to fail due to the "split" RDC libraries:
seems like this is a postprocessing tool to check the libraries for missing symbols (like |
@esseivaju traced the failure to the first access of member data inside the
This suggests it's related to the BVH manager's store:

```cpp
namespace vecgeom {
inline namespace VECGEOM_IMPL_NAMESPACE {
inline std::vector<BVH *> hBVH;
#ifdef VECGEOM_ENABLE_CUDA
inline __device__ BVH *dBVH;
#endif
class BVHManager {
public:
  VECCORE_ATT_HOST_DEVICE
  static BVH const *GetBVH(int id)
  {
#ifdef VECCORE_CUDA_DEVICE_COMPILATION
    return &cuda::dBVH[id];
#else
    return hBVH[id];
#endif
  }
```

so there's
Possibly related documentation:
|
Just to clarify: the failure occurs in |
Confirmed by @esseivaju with #1481:
The change to use |
I've collected the various compiler and linker invocations from the build logs here. |
I have a build of AthSimulation working with Celeritas GPU offload. Required changes are:
|
To preserve the history of how Philippe and Julien figured this out:

@pcanal Thursday at 20:15
@esseivaju Thursday at 21:32
@pcanal Thursday at 21:32
@esseivaju Thursday at 21:36
@pcanal Thursday at 21:40
@esseivaju Thursday at 21:48 For reference, here's the diff causing that problem:

```diff
diff --git a/VecGeom/management/BVHManager.h b/VecGeom/management/BVHManager.h
index c760ec2d7..d3a061dd1 100644
--- a/VecGeom/management/BVHManager.h
+++ b/VecGeom/management/BVHManager.h
@@ -17,7 +17,7 @@ namespace vecgeom {
 inline namespace VECGEOM_IMPL_NAMESPACE {
 inline std::vector<BVH *> hBVH;
 #ifdef VECGEOM_ENABLE_CUDA
-inline __device__ BVH *dBVH;
+extern __device__ BVH *dBVH;
 #endif
 // Macro allowing downstream codes to use GetDeviceBVH
diff --git a/source/BVHManager.cu b/source/BVHManager.cu
index 8c03d103b..2a740594a 100644
--- a/source/BVHManager.cu
+++ b/source/BVHManager.cu
@@ -10,6 +10,9 @@ using vecgeom::cxx::CudaCheckError;
 namespace vecgeom {
 inline namespace cuda {
+
+__device__ BVH *dBVH;
+
 void *AllocateDeviceBVHBuffer(size_t n)
 {
   BVH *ptr = nullptr;
```

@esseivaju Friday at 18:29 libvecgeomcuda.a is a final library (i.e. it includes a dlink step) and libvecgeomcuda_static.a doesn't include a dlink step. Isn't that a problem if libvecgeomcuda.a is linked into libgeocel_final.a, which also includes a dlink step? (edited)

@esseivaju Friday at 19:21 Similarly, libvecgeomcuda.a is included when dlinking the final ATLAS shared lib; my understanding is that we should only link to libvecgeomcuda_static.a.

@pcanal Friday at 20:01 Yes, I think you are right. This is what I am trying to address with the first commit of https://github.com/celeritas-project/celeritas/pull/1489/commits, but despite all the other (then necessary) commits, I will have some issues when shared libraries are also part of the mix. A priori that first commit could solve the problem (in conjunction with the other commit, I think the fully static build works), but the fully shared build is broken 😞.

@esseivaju Friday at 20:04 I am using that PR when building Celeritas and AthSimulation. I still see libvecgeomcuda.a (and libvecgeomcuda_static.a) being picked up by the final ATLAS lib. Everything links successfully, but we still have the kernel crash. (edited)

@pcanal Friday at 20:38
@esseivaju Friday at 21:01 It might come from Geant4, since they depend on VecGeom and would need the final library 🤔

@esseivaju Saturday at 01:29 That was it! Removing
|
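An editorial sketch of the linking constraint discussed in the transcript above, expressed with CMake's native CUDA properties (target and file names here are illustrative, not the actual Celeritas or VecGeom build code): relocatable device code may be device-linked ("dlinked") exactly once, so intermediate libraries should stay un-dlinked and only one final target should resolve device symbols.

```cmake
# Sketch only: illustrative names, not the real build system.
# Intermediate library: compile RDC, but do NOT device-link here.
add_library(geocel_objs STATIC interact.cu geometry.cu)
set_target_properties(geocel_objs PROPERTIES
  CUDA_SEPARABLE_COMPILATION ON     # emit relocatable device code
  CUDA_RESOLVE_DEVICE_SYMBOLS OFF)  # leave the dlink step to the consumer

# Final library: link against the *non*-dlinked static archive (the
# analogue of libvecgeomcuda_static.a) and perform the single device-link
# step here. Pulling in an already-dlinked archive (libvecgeomcuda.a)
# would duplicate device symbols such as dBVH.
add_library(geocel_final SHARED frontend.cc)
target_link_libraries(geocel_final PRIVATE geocel_objs vecgeomcuda_static)
set_target_properties(geocel_final PROPERTIES
  CUDA_RESOLVE_DEVICE_SYMBOLS ON)
```

This mirrors the conclusion reached in the thread: only the one final target dlinks, so each `__device__` global resolves to a single instance.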
Next steps for @drbenmorgan are to rebuild atlas externals:
|
Thanks @sethrj, I'll get these done by early next week, pending the full VecGeom fix, but I can get other bits in place. |
Can we close this now that #1489 and the VecGeom patch are merged? |
When running on lxplus-gpu we observed a crash within one of the Celeritas tests checking Thrust functionality. With Celeritas using, for example, commit fe52da7 or commit aab88dd and
we get a failure in test/celeritas/track_TrackSort. compute-sanitizer reports