Two instances of a function-local static variable on gcc11 #39786

makortel · 2022-10-20T08:24:14Z

I noticed something peculiar on lxplus-gpu that is nowadays cs9. It can be reproduced with (e.g.)

SCRAM_ARCH=el9_amd64_gcc11 cmsrel CMSSW_12_6_X_2022-10-19-2300
cd CMSSW_12_6_X_2022-10-19-2300/src
cmsenv
cmsRun $CMSSW_RELEASE_BASE/src/HeterogeneousCore/AlpakaTest/test/writer.py

which results in

%MSG-i CUDAService:  (NoModuleName) 20-Oct-2022 09:17:00 CEST pre-events
CUDA runtime version 11.5, driver version 11.8, NVIDIA driver version 520.61.05
CUDA device 0: Tesla T4 (sm_75)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 20-Oct-2022 09:17:01 CEST pre-events
AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - Tesla T4
%MSG
%MSG-i AlpakaService:  (NoModuleName) 20-Oct-2022 09:17:01 CEST pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
%MSG
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 20-Oct-2022 09:17:01.501 CEST

...

Fatal system signal has occurred during exit
Aborted (core dumped)

The stack trace of the segfault is

#0  0x00007fffeec03bcc in ?? () from /lib64/libcuda.so.1
#1  0x00007ffff04e3330 in ?? () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/external/el9_amd64_gcc11/lib/libcudart.so.11.0
#2  0x00007ffff051b190 in cudaEventDestroy () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/external/el9_amd64_gcc11/lib/libcudart.so.11.0
#3  0x00007fffc8bb1959 in std::_Sp_counted_ptr_inplace<alpaka::uniform_cuda_hip::detail::EventUniformCudaHipImpl<alpaka::ApiCudaRt>, std::allocator<alpaka::uniform_cuda_hip::detail::EventUniformCudaHipImpl<alpaka::ApiCudaRt> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
    () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#4  0x00007fffc8bbb0ca in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#5  0x00007fffc8bbb2a0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::~CachingAllocator() ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#6  0x00007ffff5d4e455 in __run_exit_handlers () from /lib64/libc.so.6
#7  0x00007ffff5d4e5d0 in exit () from /lib64/libc.so.6
#8  0x00007ffff5d36eb7 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff5d36f60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x00000000004097f5 in _start ()

i.e. from cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::QueueCudaRtNonBlocking, void>::freeAllCached(), but called from the CachingAllocator destructor rather than from

cmssw/HeterogeneousCore/AlpakaServices/src/alpaka/AlpakaService.cc

Lines 80 to 82 in ed4c9be

    
AlpakaService::~AlpakaService() {
  // clean up the caching memory allocators
  cms::alpakatools::getHostCachingAllocator<Queue>().freeAllCached();

With gdb I saw that the job actually has two CachingAllocator objects alive. Setting a breakpoint in the CachingAllocator constructor shows

(gdb) break  cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::CachingAllocator

#0  0x00007ffff077cce0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::CachingAllocator(alpaka::DevCpu const&, unsigned int, unsigned int, unsigned int, unsigned long, double, bool, bool) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
#1  0x00007ffff077a7db in alpaka_cuda_async::AlpakaService::AlpakaService(edm::ParameterSet const&, edm::ActivityRegistry&) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
#2  0x00007ffff078cd02 in edm::serviceregistry::ServiceMaker<alpaka_cuda_async::AlpakaService, edm::serviceregistry::AllArgsMaker<alpaka_cuda_async::AlpakaService, alpaka_cuda_async::AlpakaService> >::make(edm::ParameterSet const&, edm::ActivityRegistry&, edm::serviceregistry::ServicesManager&) const () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaServicesPluginsCudaAsync.so
#3  0x00007ffff7b20a3b in edm::serviceregistry::ServicesManager::MakerHolder::add(edm::serviceregistry::ServicesManager&) const ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#4  0x00007ffff7b20c11 in edm::serviceregistry::ServicesManager::createServiceFor(edm::serviceregistry::ServicesManager::MakerHolder const&) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#5  0x00007ffff7b20d70 in edm::serviceregistry::ServicesManager::createServices() () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#6  0x00007ffff7b22026 in edm::serviceregistry::ServicesManager::ServicesManager(edm::ServiceToken, edm::serviceregistry::ServiceLegacy, std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&, bool) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#7  0x00007ffff7b1ef7d in edm::ServiceRegistry::createSet(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&, edm::ServiceToken, edm::serviceregistry::ServiceLegacy, bool) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#8  0x00007ffff7e7ed97 in edm::ScheduleItems::initServices(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&, edm::ParameterSet&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy, bool) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#9  0x00007ffff7d8e717 in edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#10 0x00007ffff7d91105 in edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#11 0x000000000040a0bf in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#12 0x00007ffff62fb8a8 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/arena.cpp:698
#13 0x000000000040af8b in main::{lambda()#1}::operator()() const ()
#14 0x00000000004096ec in main ()

#0  0x00007fffc8bb9ea0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::CachingAllocator(alpaka::DevCpu const&, unsigned int, unsigned int, unsigned int, unsigned long, double, bool, bool) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#1  0x00007fffc8bbdcb5 in alpaka::BufCpu<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int> cms::alpakatools::traits::CachedBufAlloc<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void, void>::allocCachedBuf<alpaka::Vec<std::integral_constant<unsigned long, 1ul>, unsigned int> >(alpaka::DevCpu const&, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, alpaka::Vec<std::integral_constant<unsigned long, 1ul>, unsigned int> const&) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#2  0x00007fffc8bbde28 in std::enable_if<((is_queue_v<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >)&&(is_unbounded_array_v<std::byte []>))&&(!(is_array_v<std::remove_extent<std::byte []>::type>)), cms::alpakatools::detail::buffer_type<alpaka::DevCpu, std::byte [], void>::type>::type cms::alpakatools::make_host_buffer<std::byte [], alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >(alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#3  0x00007fffc8bbe0e5 in PortableHostCollection<portabletest::TestSoALayout<128ul, false> >::PortableHostCollection<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>(int, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> const&) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#4  0x00007fffc8bbe4c6 in alpaka_cuda_async::TestAlpakaTranscriber::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#5  0x00007ffff7ee8c58 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#6  0x00007ffff7ee715a in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#7  0x00007ffff7eb9599 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#8  0x00007ffff7eb972e in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#9  0x00007ffff7e1ed7c in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#10 0x00007ffff7bea819 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreConcurrency.so
#11 0x00007ffff630dae9 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7ffff44ff000, this=0x7ffff4503e00)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7ffff4503e00)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#13 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#14 0x00007ffff7d9fa67 in edm::FinalWaitingTask::wait() () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#15 0x00007ffff7d867a5 in edm::EventProcessor::processRuns() () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#16 0x00007ffff7d94646 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#17 0x000000000040a190 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#18 0x00007ffff62fb8a8 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/arena.cpp:698
#19 0x000000000040af8b in main::{lambda()#1}::operator()() const ()
#20 0x00000000004096ec in main ()

i.e. it ends up being constructed the first time from

cmssw/HeterogeneousCore/AlpakaServices/src/alpaka/AlpakaService.cc

Line 75 in ed4c9be

cms::alpakatools::getHostCachingAllocator<Queue>();

and the second time from

cmssw/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h

Line 48 in c81258c

auto& allocator = getHostCachingAllocator<alpaka::QueueCudaRtBlocking>();

Both call

cmssw/HeterogeneousCore/AlpakaInterface/interface/getHostCachingAllocator.h

Lines 14 to 16 in c81258c

    
inline CachingAllocator<alpaka_common::DevHost, TQueue>& getHostCachingAllocator() {
  // thread safe initialisation of the host allocator
  CMS_THREAD_SAFE static CachingAllocator<alpaka_common::DevHost, TQueue> allocator(

which should guarantee that only a single object is constructed.


makortel · 2022-10-20T08:24:27Z

assign core, heterogeneous

cmsbuild · 2022-10-20T08:24:33Z

New categories assigned: heterogeneous,core

@fwyzard,@Dr15Jones,@smuzaffar,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild · 2022-10-20T08:24:35Z

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel · 2022-10-20T08:29:17Z

Found this

If you use dlopen to explicitly load code from a shared library, you must do several things. First, export global symbols from the executable by linking it with the "-E" flag (you will have to specify this as "-Wl,-E" if you are invoking the linker in the usual manner from the compiler driver, g++). You must also make the external symbols in the loaded library available for subsequent libraries by providing the RTLD_GLOBAL flag to dlopen. The symbol resolution can be immediate or lazy.

Template instantiations are another, user visible, case of objects with vague linkage, which needs similar resolution. If you do not take the above precautions, you may discover that a template instantiation with the same argument list, but instantiated in multiple translation units, has several addresses, depending in which translation unit the address is taken. (This is not an exhaustive list of the kind of objects which have vague linkage and are expected to be resolved during linking & loading.)

from gcc documentation https://gcc.gnu.org/faq.html#dso. Also two open tickets that might be connected

In this case in both code paths leading to the CachingAllocator the underlying shared objects (pluginHeterogeneousCoreAlpakaServicesPluginsCudaAsync.so, pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so) are loaded with dlopen(). The dlopen() is called in

cmssw/FWCore/PluginManager/src/SharedLibrary.cc

Lines 34 to 35 in c81258c

    
SharedLibrary::SharedLibrary(const std::filesystem::path& iName)
    : libraryHandle_(::dlopen(iName.string().c_str(), RTLD_LAZY | RTLD_GLOBAL)), path_(iName) {

makortel · 2022-10-20T08:33:16Z

I was surprised to see that without setting SCRAM_ARCH on lxplus-gpu the el8_amd64_gcc10 gets picked up, and running it actually works. With el8_amd64_gcc11 I see the same behavior as in the issue description.

fwyzard · 2022-10-20T13:13:39Z

So, it seems to be an issue with GCC 11 ?

(not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11)

dan131riley · 2022-10-20T13:18:30Z

We see a low rate of segfaults that look to be dlopen related, so it seems at least plausible that this has been around for a while.

makortel · 2022-10-20T13:23:12Z

I wanted to see the behavior with the serial_sync backend (to see if gcc or nvcc in the link would make any difference). I hacked this piece

cmssw/HeterogeneousCore/AlpakaInterface/interface/CachedBufAlloc.h

Lines 26 to 35 in ed92088

    
//! The caching memory allocator implementation for the CPU device
template <typename TElem, typename TDim, typename TIdx, typename TQueue>
struct CachedBufAlloc<TElem, TDim, TIdx, alpaka::DevCpu, TQueue, void> {
  template <typename TExtent>
  ALPAKA_FN_HOST static auto allocCachedBuf(alpaka::DevCpu const& dev, TQueue queue, TExtent const& extent)
      -> alpaka::BufCpu<TElem, TDim, TIdx> {
    // non-cached, queue-ordered asynchronous host-only memory
    return alpaka::allocAsyncBuf<TElem, TIdx>(queue, extent);
  }
};

to actually use the caching allocator, and I see similar behavior (i.e. I see two allocator instances created from AlpakaService, one in getHostCachingAllocator() and another in getDeviceCachingAllocator(), and an additional one via the PortableHostCollection constructor in the EDModule). This test case does not segfault, though. I suppose the difference with respect to the cuda_async backend is that in the CUDA case the CUDAService destructor resets the CUDA runtime, which causes the later cudaEventDestroy() call to crash.

(this test was done with el9_amd64_gcc11)

makortel · 2022-10-20T13:23:55Z

So, it seems to be an issue with GCC 11 ?

(not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11)

As far as I can tell this behavior seems specific to GCC 11. On the same lxplus-gpu node, using el8_amd64_gcc10 works, whereas el8_amd64_gcc11 and el9_amd64_gcc11 crash.

fwyzard · 2022-10-20T13:25:32Z

First, export global symbols from the executable by linking it with the "-E" flag (you will have to specify this as "-Wl,-E" if you are invoking the linker in the usual manner from the compiler driver, g++). You must also make the external symbols in the loaded library available for subsequent libraries by providing the RTLD_GLOBAL flag to dlopen.

I think we are doing both:

$ scram tool info gcc-cxxcompiler | grep Wl
CXXSHAREDFLAGS+=-shared -Wl,-E
LDFLAGS+=-Wl,-E -Wl,--hash-style=gnu

and

cmssw/FWCore/PluginManager/src/SharedLibrary.cc

Lines 34 to 35 in ed92088

    
SharedLibrary::SharedLibrary(const std::filesystem::path& iName)
    : libraryHandle_(::dlopen(iName.string().c_str(), RTLD_LAZY | RTLD_GLOBAL)), path_(iName) {

makortel · 2022-10-20T14:00:00Z

One thing I noticed while investigating this issue was that the

cmssw/HeterogeneousCore/AlpakaTest/test/BuildFile.xml

Line 1 in ed92088

    
           <test name="testHeterogeneousCoreAlpakaTestWriteRead" command="testHeterogeneousCoreAlpakaTestWriteRead.sh"/>

test is not run in GPU_X IBs. @smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)

I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)

VinInn · 2022-10-20T17:23:50Z

are there also two copies of cms::alpakatools::host()::host?

VinInn · 2022-10-20T17:36:00Z

[innocent@lxplus8s12 CMSSW_12_6_X_2022-10-20-1100]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el8_amd64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-10-17-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

vs

[innocent@lxplus9s00 ~]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016380 b guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
00000000000163a0 b cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

'u' is good

"u"

The symbol is a unique global symbol. This is a GNU extension to the standard set of ELF symbol bindings. For such a symbol the dynamic linker will make sure that in the entire process there is just one symbol with this name and type in use.

b is bad (for this case)

VinInn · 2022-10-20T17:38:27Z

while

[innocent@lxplus8s12 CMSSW_12_6_X_2022-10-20-1100]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el8_amd64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-10-17-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'host$'
0000000000016370 b guard variable for cms::alpakatools::host()::host
0000000000016380 b cms::alpakatools::host()::host

vs

[innocent@lxplus9s00 ~]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'host$'
0000000000016480 b guard variable for cms::alpakatools::host()::host
0000000000016490 b cms::alpakatools::host()::host

being 'b' and lower case I do expect two copies of this one (local instance)

smuzaffar · 2022-10-20T19:01:28Z

@smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)

Currently, for GPU builds we only run tests for packages which have a direct cuda dependency. We can also add packages with a direct alpaka dependency.

I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)

We only have two ppc64le nodes which are already overloaded due to various IBs/Release and PR tests. So I am afraid we can not add ppc64le_gcc11

makortel · 2022-10-20T19:21:03Z

@smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)

Currently, for GPU builds we only run tests for packages which have a direct cuda dependency. We can also add packages with a direct alpaka dependency.

Thanks. In this case the HeterogeneousCore/AlpakaTest/test/BuildFile.xml does not declare any dependencies (because the test runs a script that runs cmsRun). Should the <test/> be made dependent on alpaka? Or would an overall dependence on alpaka be sufficient?

There is also

cmssw/HeterogeneousCore/AlpakaInterface/test/BuildFile.xml

Lines 1 to 6 in 42baf3c

    
<bin name="alpakaTestVec" file="alpaka/testVec.cc">
  <use name="alpaka"/>
  <use name="catch2"/>
  <use name="HeterogeneousCore/AlpakaInterface"/>
  <flags ALPAKA_BACKENDS="1"/>
</bin>

where an executable is compiled for each Alpaka backend. For now running the resulting alpakaTestVecCudaAsync and alpakaTestVecSerialSync on a non-GPU machine or on a GPU machine works (because no device code gets run). But maybe it would make more sense to run the alpakaTestVecSerialSync only in non-GPU IB (or maybe in both?) and alpakaTestVecCudaAsync only in GPU IB?

Or would the whole GPU unit test setup benefit from some generalizations (thinking towards adding support for AMD GPUs etc)?

I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)

We only have two ppc64le nodes which are already overloaded due to various IBs/Release and PR tests. So I am afraid we can not add ppc64le_gcc11

Yeah, makes sense.

makortel · 2022-10-20T19:42:34Z

Thanks @VinInn.

being 'b' and lower case I do expect two copies of this one (local instance)

Indeed, inserting a printout in cms::alpakatools::detail::enumerate_host() shows multiple printouts (3 in my test, which I presume are from libDataFormatsPortableTestObjectsSerialSync.so, libHeterogeneousCoreAlpakaServicesSerialSync.so, pluginHeterogeneousCoreAlpakaTestPluginsPortableSerialSync.so). And this happens also with gcc 10 (as your nm outputs pointed out).

smuzaffar · 2022-10-20T19:43:35Z

@makortel, a direct dependency within the package (either in Package/BuildFile.xml or Package/{plugins,test,bin}/BuildFile.xml) is good enough for the bot to check out that package and run all its tests. Note that we run all tests for these packages (just in case there are tests which process the output of some GPU-based test).

VinInn · 2022-10-21T08:51:46Z

Looking at the code, I would not have expected multiple copies of "host" either.
Somehow those functions get "hidden" linkage.
Maybe we should try to write a small demonstrator.

OK, for host() it is obvious: it is declared static, which in good old C means local to the compilation unit!

makortel · 2022-10-21T11:47:46Z

OK, for host() it is obvious: it is declared static, which in good old C means local to the compilation unit!

Good catch, thanks! @fwyzard, do you have any objections to removing the static from

cmssw/HeterogeneousCore/AlpakaInterface/interface/host.h

Line 24 in c81258c

static inline alpaka::DevCpu const& host() {

(I believe the replication was not intentional)

fwyzard · 2022-10-21T12:14:15Z

You're right, I'll make a PR.

fwyzard · 2022-10-21T12:30:08Z

See #39813 and #39814 .

VinInn · 2022-10-21T12:46:59Z

There is a gcc 11.3 out. Maybe it would be worth testing it to see if this issue has been fixed.

VinInn · 2022-10-21T15:48:59Z

spent the afternoon in some monkey typing. Eventually

[innocent@lxplus9s00 src]$ git diff
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..8f0a669b7e7 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,8 +83,8 @@ namespace cms::alpakatools {
    */

   template <typename TDev,
-            typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename TQueue> //,
+//            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

Do not ask. It is a fact (and probably a bug in the compiler).

VinInn · 2022-10-21T15:59:33Z

more

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..af3efa974a5 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {

   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> >> // and cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..3faf423b865 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,9 @@ namespace cms::alpakatools {

   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<
+                // cms::alpakatools::is_device_v<TDev> >> and
+                cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

VinInn · 2022-10-21T16:20:35Z

This is ok as well

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..2404d639246 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -82,9 +82,15 @@ namespace cms::alpakatools {
    *    - the `Queue` type can be either `Sync` _or_ `Async` on any allocation.
    */

+    template <typename TDev,
+              typename TQueue>
+    constexpr bool isGood =  cms::alpakatools::is_device_v<TDev> and
+                                     cms::alpakatools::is_queue_v<TQueue>;
   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t< isGood<TDev,TQueue>>>
+                // cms::alpakatools::is_device_v<TDev> >> and
+                // cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

VinInn · 2022-10-21T16:36:21Z

btw the default compiler is
gcc version 11.3.1 20220421 (Red Hat 11.3.1-2) (GCC)

fwyzard · 2022-10-21T21:09:48Z

This also works:

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..a125c98996b8 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,8 @@ namespace cms::alpakatools {
 
   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev>>,
+            typename = std::enable_if_t<cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED

on el8_amd64_gcc11:

$ nm -C ../lib/el8_amd64_gcc11/lib*.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000011380 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
00000000000113a0 u cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator

and cmsRun HeterogeneousCore/AlpakaTest/test/writer.py works fine.

fwyzard · 2022-10-21T21:28:11Z

This also works:

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..d8a25cce69d0 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {
 
   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<std::is_object_v<TDev> and std::is_object_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED

(it's not a useful implementation, it's just a check)

VinInn · 2022-10-22T09:02:53Z

This also works (and IMHO, as argued in the issue above, it is also a more appropriate syntax):

[innocent@lxplus9s00 src]$ git diff
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..551bc948853 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,9 +83,9 @@ namespace cms::alpakatools {
    */

   template <typename TDev,
-            typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename TQueue>
   class CachingAllocator {
+    static_assert(cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>);
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
     friend class alpaka_cuda_async::AlpakaService;
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/lib*.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000011380 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
00000000000113a0 u cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
0000000000016458 u guard variable for cms::cuda::allocator::getCachingHostAllocator()::allocator
0000000000016340 u guard variable for cms::cuda::allocator::getCachingDeviceAllocator()::allocator
0000000000016460 u cms::cuda::allocator::getCachingHostAllocator()::allocator
0000000000016360 u cms::cuda::allocator::getCachingDeviceAllocator()::allocator

fwyzard · 2022-10-22T09:33:59Z

But this would give a hard compile-time error from the static_assert instead of a failure to match the template parameters (which, OK, would also end in a compile-time error in this case, just a different one).

By the way, a curious observation: in my tests getCachingDeviceAllocator()::allocator is always compiled correctly (in terms of what we want it to do), only getCachingHostAllocator()::allocator shows the different behaviour in GCC 11 vs GCC 10.
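To illustrate the difference discussed above between constraining with enable_if and checking with static_assert, here is a minimal sketch. It is not the CMSSW code: is_ok_v, WithEnableIf, WithStaticAssert and the two can_name_* helpers are hypothetical names introduced only for this example, and std::is_integral_v stands in for the alpaka traits. The point it tries to show is that the enable_if form makes an invalid template-id detectable by SFINAE, while the static_assert form only fails once the class is actually instantiated.

```cpp
#include <type_traits>

// Hypothetical trait standing in for is_device_v / is_queue_v.
template <typename T>
inline constexpr bool is_ok_v = std::is_integral_v<T>;

// Variant A: the constraint lives in a defaulted template parameter.
// Naming WithEnableIf<T> for a bad T is a substitution failure.
template <typename T, typename = std::enable_if_t<is_ok_v<T>>>
class WithEnableIf {};

// Variant B: the constraint is checked inside the class body.
// Naming WithStaticAssert<T> is always valid; the assertion only fires
// when the class is instantiated (e.g. when an object is created).
template <typename T>
class WithStaticAssert {
  static_assert(is_ok_v<T>, "T does not satisfy is_ok_v");
};

// Detection helpers: can the template-id be named for a given T?
template <typename T, typename = void>
struct can_name_enable_if : std::false_type {};
template <typename T>
struct can_name_enable_if<T, std::void_t<WithEnableIf<T>>> : std::true_type {};

template <typename T, typename = void>
struct can_name_static_assert : std::false_type {};
template <typename T>
struct can_name_static_assert<T, std::void_t<WithStaticAssert<T>>> : std::true_type {};
```

With these definitions, can_name_enable_if<double> is false while can_name_static_assert<double> is true, so only the enable_if variant lets other templates react to an unsupported type without a hard error.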

fwyzard · 2022-10-22T09:38:18Z

Ah, and just to exclude a possible culprit: I think the assembler and linker are not at fault here, since the difference is already there in the GCC assembly output:

gcc 10:

        .weak   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
        .section        .bss._ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,"awG",@nobits,_ZZN3cms11a
        .align 32
        .type   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, @gnu_unique_object
        .size   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, 224
_ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator:
        .zero   224

gcc 11:

        .local  _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
        .comm   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,224,32

VinInn · 2022-10-22T09:52:04Z

Yep. To report a bug (regression) one needs to reproduce it in a simpler example. I have not managed to do that so far.

VinInn · 2022-10-22T10:06:18Z

btw: why such complex machinery to construct and destruct a bunch of allocators (in HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h)?
Wouldn't a simple std::vector have been enough?

fwyzard · 2022-10-23T09:37:57Z

OK, I think I found the source of the problem. In HeterogeneousCore/AlpakaInterface/interface/traits.h I used

  template <typename T>
  constexpr bool is_device_v = is_device<T>::value;

instead of

  template <typename T>
  inline constexpr bool is_device_v = is_device<T>::value;

So, as explained on StackOverflow, without the inline is_device_v has internal linkage, i.e. it is local to each translation unit, and it looks like recent versions of GCC make the template specialisations that depend on it local as well.

Adding the missing inline to the declarations fixes the problem.
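As a minimal sketch of the corrected pattern (not the actual contents of traits.h; DeviceTag, FakeDevice, Holder and getHolder are hypothetical names used only for illustration), the trait constant is declared inline constexpr and then used to constrain a class template whose specialisation ends up in a function-local static. Within a single translation unit this cannot reproduce the duplicated static; it only shows the shape of the declarations involved.

```cpp
#include <type_traits>

// Hypothetical tag and trait, standing in for the alpaka concepts.
struct DeviceTag {};

template <typename T>
struct is_device : std::is_base_of<DeviceTag, T> {};

// With `inline`, each specialisation of the variable template is a single
// entity with external linkage, shared by all translation units.
template <typename T>
inline constexpr bool is_device_v = is_device<T>::value;

struct FakeDevice : DeviceTag {};

// A class template constrained on the trait, as CachingAllocator is.
template <typename TDev, typename = std::enable_if_t<is_device_v<TDev>>>
struct Holder {
  TDev dev;
};

// A function-local static whose type depends on the constrained template,
// mirroring the accessor functions discussed in this thread.
template <typename TDev>
inline Holder<TDev>& getHolder() {
  static Holder<TDev> holder{};
  return holder;
}
```

To check the effect across shared libraries, one would still need to build the accessor into two libraries and inspect the symbols with nm, as done earlier in this thread.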

fwyzard · 2022-10-23T09:48:33Z

Fixed by #39826.

fwyzard · 2022-10-23T10:02:00Z

> btw: why such a complex machinery to construct and destruct a bunch of allocators? (in HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h) a simple std::vector would not have been enough?

Because a CachingAllocator does not have a default constructor, and is not copyable or movable.

Doing make_unique<Allocator[]>(size) does not work either, because of the lack of a default constructor.

VinInn · 2022-10-23T10:06:54Z

I think a std::vector<std::unique_ptr<Allocator>> should work.
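A minimal sketch of that suggestion follows, assuming a stand-in Allocator class with the properties described above (no default constructor, not copyable, not movable). The class, its int constructor argument and the makeAllocators helper are hypothetical and are not the CMSSW CachingAllocator interface. Only the unique_ptr handles are moved when the vector grows, so the allocator objects themselves never need to be copied or moved.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in: no default constructor, not copyable, not movable.
class Allocator {
public:
  explicit Allocator(int device) : device_{device} {}
  Allocator(Allocator const&) = delete;
  Allocator& operator=(Allocator const&) = delete;
  Allocator(Allocator&&) = delete;
  Allocator& operator=(Allocator&&) = delete;

  int device() const { return device_; }

private:
  int device_;
};

// Build one heap-allocated allocator per device index.
inline std::vector<std::unique_ptr<Allocator>> makeAllocators(std::size_t devices) {
  std::vector<std::unique_ptr<Allocator>> allocators;
  allocators.reserve(devices);
  for (std::size_t i = 0; i < devices; ++i) {
    // Each object is constructed in place on the heap with its own argument.
    allocators.push_back(std::make_unique<Allocator>(static_cast<int>(i)));
  }
  assert(allocators.size() == devices);
  return allocators;
}
```

The cost compared with a contiguous array is one extra heap allocation and one pointer indirection per allocator.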

VinInn · 2022-10-23T10:10:31Z

> Adding the missing inline to the declarations fixes the problem.

Ok, it may make sense. Not easy to reproduce with a small example anyhow.

fwyzard · 2022-10-23T10:25:36Z

> Ok, it may make sense. Not easy to reproduce with a small example anyhow.

If you want to play with it, this is the reproducer I managed to cook up:

`test.cc`

#include <type_traits>
#include <utility>  // std::declval

// concepts
namespace concepts {
  //! Tag used in class inheritance hierarchies that describes that a specific concept (TConcept)
  //! is implemented by the given base class (TBase).
  template <typename TConcept, typename TBase>
  struct Implements {};

  //! Checks whether the concept is implemented by the given class
  template <typename TConcept, typename TDerived>
  struct ImplementsConcept {
    template <typename TBase>
    static auto implements(Implements<TConcept, TBase>&) -> std::true_type;
    static auto implements(...) -> std::false_type;

    static constexpr auto value = decltype(implements(std::declval<TDerived&>()))::value;
  };

}  // namespace concepts

struct ConceptDev;
struct ConceptQueue;

// traits
template <typename T>
using is_device = concepts::ImplementsConcept<ConceptDev, T>;

template <typename T>
/* inline */ constexpr bool is_device_v = is_device<T>::value;

template <typename T>
using is_queue = concepts::ImplementsConcept<ConceptQueue, T>;

template <typename T>
/* inline */ constexpr bool is_queue_v = is_queue<T>::value;

// host
class DevCpu : public concepts::Implements<ConceptDev, DevCpu> {
private:
  void* impl = nullptr;
};

// queue
class QueueCpu : public concepts::Implements<ConceptQueue, QueueCpu> {
private:
  void* impl = nullptr;
};

// allocator
template <typename TDev, typename TQueue, typename = std::enable_if_t<is_device_v<TDev> and is_queue_v<TQueue>>>
//template <typename TDev, typename TQueue, typename = std::enable_if_t<concepts::ImplementsConcept<ConceptDev, TDev>::value and concepts::ImplementsConcept<ConceptQueue, TQueue>::value>>
class Allocator {
public:
  Allocator(TDev const& dev) : dev{dev} {}

private:
  TDev dev;
};

// access the allocator
template <typename TQueue, typename = std::enable_if_t<is_queue_v<TQueue>>>
//template <typename TQueue, typename = std::enable_if_t<concepts::ImplementsConcept<ConceptQueue, TQueue>::value>>
inline Allocator<DevCpu, TQueue>& getHostAllocator() {
  static Allocator<DevCpu, TQueue> allocator{DevCpu{}};
  return allocator;
}

void test() { getHostAllocator<QueueCpu>(); }

`Makefile`

.PHONY: all dump clean

all: dump

dump: dump.gcc10 dump.gcc11 dump.gcc12 dump.clang12

clean:
        rm -f test.gcc* test.clang*

CXXFLAGS := -std=c++17 -O2 -ftree-vectorize -msse3 -fPIC -pthread

# gcc 10
test.gcc10.ii: test.cc
        g++-10 $(CXXFLAGS) $< -E -o $@

test.gcc10.s: test.gcc10.ii
        g++-10 $(CXXFLAGS) $< -S -o $@

test.gcc10.o: test.gcc10.s
        g++-10 $(CXXFLAGS) $< -c -o $@

dump.gcc10: test.gcc10.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

# gcc 11
test.gcc11.ii: test.cc
        g++-11 $(CXXFLAGS) $< -E -o $@

test.gcc11.s: test.gcc11.ii
        g++-11 $(CXXFLAGS) $< -S -o $@

test.gcc11.o: test.gcc11.s
        g++-11 $(CXXFLAGS) $< -c -o $@

dump.gcc11: test.gcc11.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

# gcc 12
test.gcc12.ii: test.cc
        g++-12 $(CXXFLAGS) $< -E -o $@

test.gcc12.s: test.gcc12.ii
        g++-12 $(CXXFLAGS) $< -S -o $@

test.gcc12.o: test.gcc12.s
        g++-12 $(CXXFLAGS) $< -c -o $@

dump.gcc12: test.gcc12.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

# clang 12
test.clang12.ii: test.cc
        clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -E -o $@

test.clang12.s: test.clang12.ii
        clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -S -o $@

test.clang12.o: test.clang12.s
        clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -c -o $@

dump.clang12: test.clang12.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

By the way, clang generates a V (weak object) symbol, possibly with a default value, both with and without the inline.

VinInn · 2022-10-23T10:49:34Z

Thanks. Useful to have a self contained example.

cmsbuild added core-pending heterogeneous-pending pending-signatures labels Oct 20, 2022

VinInn mentioned this issue Oct 22, 2022: use of enabled_if in place of static_assert #39825 (Open)

This was referenced Oct 23, 2022:

Mark alpaka trait constants as inline #39826 (Merged)

Mark alpaka trait constants as inline [12.5.x] #39827 (Merged)

cmsbuild closed this as completed in #39826 Oct 24, 2022
