-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Two instances of a function-local static variable on gcc11 #39786
Comments
assign core, heterogeneous |
New categories assigned: heterogeneous,core @fwyzard,@Dr15Jones,@smuzaffar,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks |
A new Issue was created by @makortel Matti Kortelainen. @Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
Found this
from gcc documentation https://gcc.gnu.org/faq.html#dso. Also two open tickets that might be connected
In this case in both code paths leading to the cmssw/FWCore/PluginManager/src/SharedLibrary.cc Lines 34 to 35 in c81258c
|
I was surprised to see that without setting |
So, it seems to be an issue with GCC 11 ? (not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11) |
We see a low rate of segfaults that look to be dlopen related, so it seems at least plausible that this has been around for a while. |
I wanted to see the behavior with the cmssw/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h Lines 26 to 35 in ed92088
to actually use the caching allocator, and see similar behavior (i.e. I see two allocator instances created from AlpakaService , one in getHostCachingAllocator() and another in getDeviceCachingAllocator() ; and additional one via the PortableHostCollection constructor in the EDModule). This test case does not segfault though. I suppose the difference wrt. the cuda_async backend is that in the CUDA case the CUDAService destructor resets the CUDA runtime, which will cause the cudaEventDestroy() to crash.
(this test was done with |
As far as I can tell this behavior seems specific to GCC 11. On the same |
I think we are doing both:
and cmssw/FWCore/PluginManager/src/SharedLibrary.cc Lines 34 to 35 in ed92088
|
One thing I noticed while investigating this issue was that the
gcc10 )
I also noticed we are not running IBs for |
are there also two copies of |
vs
'u' is good
b is bad (for this case) |
while
vs
being 'b' and lower case I do expect two copies of this one (local instance) |
Currently, for GPU builds we only run tests for packages which has direct
We only have two |
Thanks. In this case the There is also cmssw/HeterogeneousCore/AlpakaInterface/test/BuildFile.xml Lines 1 to 6 in 42baf3c
where an executable is compiled for each Alpaka backend. For now running the resulting alpakaTestVecCudaAsync and alpakaTestVecSerialSync on a non-GPU machine or on a GPU machine works (because no device code gets run). But maybe it would make more sense to run the alpakaTestVecSerialSync only in non-GPU IB (or maybe in both?) and alpakaTestVecCudaAsync only in GPU IB?
Or would the whole GPU unit test setup benefit from some generalizations (thinking towards adding support for AMD GPUs etc)?
Yeah, makes sense. |
Thanks @VinInn.
Indeed, inserting a printout in |
@makortel , direct dependency with in the package ( either in |
looking to the code I would have not expected multiple copies of "host" either. OK for |
Good catch, thanks! @fwyzard, do you have any objections for removing the
(I believe the replication was not intentional) |
You're right, I'll make a PR. |
there is a gcc 11.3 out. maybe would be worth to test it to see if this issue has been fixed |
spent the afternoon in some monkey typing. Eventually
do not ask. It is fact (and a bug in the compiler, probably) |
more
|
This is ok as well
|
btw the default compiler is |
This also works: diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..a125c98996b8 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,8 @@ namespace cms::alpakatools {
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev>>,
+ typename = std::enable_if_t<cms::alpakatools::is_queue_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED on
and |
This also works: diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..d8a25cce69d0 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {
template <typename TDev,
typename TQueue,
- typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+ typename = std::enable_if_t<std::is_object_v<TDev> and std::is_object_v<TQueue>>>
class CachingAllocator {
public:
#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED (it's not a useful implementation, it's just a check) |
also this works (and IMHO , as argued in the above issue, I find it also a more adequate syntax)
|
But this would give a compile time error instead of failing to match the template parameters (OK, which would also give a different compile time error in this case). By the way, a curious observation: in my tests |
Ah, and just to exclude a possible culprit, I think the assembler and linker are not at fault here, the difference is there already in the GCC assembly output: gcc 10: .weak _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
.section .bss._ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,"awG",@nobits,_ZZN3cms11a
.align 32
.type _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, @gnu_unique_object
.size _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, 224
_ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator:
.zero 224 gcc 11: .local _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
.comm _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,224,32 |
yep. To report a bug (regression) one needs to reproduce it in a simpler example. I've failed until now. |
btw: why such a complex machinery to construct and destruct a bunch of allocators? |
OK, I think I found the source of the problem. In template <typename T>
constexpr bool is_device_v = is_device<T>::value; instead of template <typename T>
inline constexpr bool is_device_v = is_device<T>::value; So, as explained on StackOverflow, Adding the missing |
Fixed by #39826. |
Because a Doing |
I think a |
Ok, it may make sense. Not easy to reproduce with a small example anyhow. |
If you want to play with it, this is the reproducer I manage to cook:
|
Thanks. Useful to have a self contained example. |
I noticed something peculiar on
lxplus-gpu
that is nowadayscs9
. It can be reproduced with (e.g.)which results in
The stack trace of the segfault is
i.e. from
cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::QueueCudaRtNonBlocking>, void>::freeAllCached()
, but called from theCachingAllocator
destructor rather than fromcmssw/HeterogeneousCore/AlpakaServices/src/alpaka/AlpakaService.cc
Lines 80 to 82 in ed4c9be
With gdb I saw that the job has actually two
CachingAllocator
objects alive. Setting a breakpoint in theCachingAllocator
constructor showsi.e. it ends up being constructed the first time from
cmssw/HeterogeneousCore/AlpakaServices/src/alpaka/AlpakaService.cc
Line 75 in ed4c9be
and the second time from
cmssw/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h
Line 48 in c81258c
Both call
cmssw/HeterogeneousCore/AlpakaInterface/interface/getHostCachingAllocator.h
Lines 14 to 16 in c81258c
which should guarantee only a single object to be constructed
The text was updated successfully, but these errors were encountered: