Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two instances of a function-local static variable on gcc11 #39786

Closed
makortel opened this issue Oct 20, 2022 · 41 comments · Fixed by #39826
Closed

Two instances of a function-local static variable on gcc11 #39786

makortel opened this issue Oct 20, 2022 · 41 comments · Fixed by #39826

Comments

@makortel
Copy link
Contributor

I noticed something peculiar on lxplus-gpu that is nowadays cs9. It can be reproduced with (e.g.)

SCRAM_ARCH=el9_amd64_gcc11 cmsrel CMSSW_12_6_X_2022-10-19-2300
cd CMSSW_12_6_X_2022-10-19-2300/src
cmsenv
cmsRun $CMSSW_RELEASE_BASE/src/HeterogeneousCore/AlpakaTest/test/writer.py

which results in

%MSG-i CUDAService:  (NoModuleName) 20-Oct-2022 09:17:00 CEST pre-events
CUDA runtime version 11.5, driver version 11.8, NVIDIA driver version 520.61.05
CUDA device 0: Tesla T4 (sm_75)
%MSG
%MSG-i AlpakaService:  (NoModuleName) 20-Oct-2022 09:17:01 CEST pre-events
AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - Tesla T4
%MSG
%MSG-i AlpakaService:  (NoModuleName) 20-Oct-2022 09:17:01 CEST pre-events
AlpakaServiceSerialSync succesfully initialised.
Found 1 device:
  - Intel(R) Xeon(R) Silver 4216 CPU @ 2.10GHz
%MSG
Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 20-Oct-2022 09:17:01.501 CEST

...

Fatal system signal has occurred during exit
Aborted (core dumped)

The stack trace of the segfault is

#0  0x00007fffeec03bcc in ?? () from /lib64/libcuda.so.1
#1  0x00007ffff04e3330 in ?? () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/external/el9_amd64_gcc11/lib/libcudart.so.11.0
#2  0x00007ffff051b190 in cudaEventDestroy () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/external/el9_amd64_gcc11/lib/libcudart.so.11.0
#3  0x00007fffc8bb1959 in std::_Sp_counted_ptr_inplace<alpaka::uniform_cuda_hip::detail::EventUniformCudaHipImpl<alpaka::ApiCudaRt>, std::allocator<alpaka::uniform_cuda_hip::detail::EventUniformCudaHipImpl<alpaka::ApiCudaRt> >, (__gnu_cxx::_Lock_policy)2>::_M_dispose()
    () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#4  0x00007fffc8bbb0ca in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#5  0x00007fffc8bbb2a0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::~CachingAllocator() ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#6  0x00007ffff5d4e455 in __run_exit_handlers () from /lib64/libc.so.6
#7  0x00007ffff5d4e5d0 in exit () from /lib64/libc.so.6
#8  0x00007ffff5d36eb7 in __libc_start_call_main () from /lib64/libc.so.6
#9  0x00007ffff5d36f60 in __libc_start_main_impl () from /lib64/libc.so.6
#10 0x00000000004097f5 in _start ()

i.e. from cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::QueueCudaRtNonBlocking>, void>::freeAllCached(), but called from the CachingAllocator destructor rather than from

AlpakaService::~AlpakaService() {
// clean up the caching memory allocators
cms::alpakatools::getHostCachingAllocator<Queue>().freeAllCached();

With gdb I saw that the job has actually two CachingAllocator objects alive. Setting a breakpoint in the CachingAllocator constructor shows

(gdb) break  cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::CachingAllocator

#0  0x00007ffff077cce0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::CachingAllocator(alpaka::DevCpu const&, unsigned int, unsigned int, unsigned int, unsigned long, double, bool, bool) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
#1  0x00007ffff077a7db in alpaka_cuda_async::AlpakaService::AlpakaService(edm::ParameterSet const&, edm::ActivityRegistry&) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
#2  0x00007ffff078cd02 in edm::serviceregistry::ServiceMaker<alpaka_cuda_async::AlpakaService, edm::serviceregistry::AllArgsMaker<alpaka_cuda_async::AlpakaService, alpaka_cuda_async::AlpakaService> >::make(edm::ParameterSet const&, edm::ActivityRegistry&, edm::serviceregistry::ServicesManager&) const () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaServicesPluginsCudaAsync.so
#3  0x00007ffff7b20a3b in edm::serviceregistry::ServicesManager::MakerHolder::add(edm::serviceregistry::ServicesManager&) const ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#4  0x00007ffff7b20c11 in edm::serviceregistry::ServicesManager::createServiceFor(edm::serviceregistry::ServicesManager::MakerHolder const&) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#5  0x00007ffff7b20d70 in edm::serviceregistry::ServicesManager::createServices() () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#6  0x00007ffff7b22026 in edm::serviceregistry::ServicesManager::ServicesManager(edm::ServiceToken, edm::serviceregistry::ServiceLegacy, std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&, bool) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#7  0x00007ffff7b1ef7d in edm::ServiceRegistry::createSet(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&, edm::ServiceToken, edm::serviceregistry::ServiceLegacy, bool) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libFWCoreServiceRegistry.so
#8  0x00007ffff7e7ed97 in edm::ScheduleItems::initServices(std::vector<edm::ParameterSet, std::allocator<edm::ParameterSet> >&, edm::ParameterSet&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy, bool) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#9  0x00007ffff7d8e717 in edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#10 0x00007ffff7d91105 in edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#11 0x000000000040a0bf in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#12 0x00007ffff62fb8a8 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/arena.cpp:698
#13 0x000000000040af8b in main::{lambda()#1}::operator()() const ()
#14 0x00000000004096ec in main ()

#0  0x00007fffc8bb9ea0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::CachingAllocator(alpaka::DevCpu const&, unsigned int, unsigned int, unsigned int, unsigned long, double, bool, bool) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#1  0x00007fffc8bbdcb5 in alpaka::BufCpu<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int> cms::alpakatools::traits::CachedBufAlloc<std::byte, std::integral_constant<unsigned long, 1ul>, unsigned int, alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void, void>::allocCachedBuf<alpaka::Vec<std::integral_constant<unsigned long, 1ul>, unsigned int> >(alpaka::DevCpu const&, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, alpaka::Vec<std::integral_constant<unsigned long, 1ul>, unsigned int> const&) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#2  0x00007fffc8bbde28 in std::enable_if<((is_queue_v<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >)&&(is_unbounded_array_v<std::byte []>))&&(!(is_array_v<std::remove_extent<std::byte []>::type>)), cms::alpakatools::detail::buffer_type<alpaka::DevCpu, std::byte [], void>::type>::type cms::alpakatools::make_host_buffer<std::byte [], alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> >(alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> const&, unsigned int) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#3  0x00007fffc8bbe0e5 in PortableHostCollection<portabletest::TestSoALayout<128ul, false> >::PortableHostCollection<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>(int, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false> const&) () from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#4  0x00007fffc8bbe4c6 in alpaka_cuda_async::TestAlpakaTranscriber::acquire(edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder) ()
   from /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#5  0x00007ffff7ee8c58 in edm::stream::doAcquireIfNeeded(edm::stream::impl::ExternalWork*, edm::Event const&, edm::EventSetup const&, edm::WaitingTaskWithArenaHolder&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#6  0x00007ffff7ee715a in edm::stream::EDProducerAdaptorBase::doAcquire(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*, edm::WaitingTaskWithArenaHolder&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#7  0x00007ffff7eb9599 in edm::Worker::runAcquire(edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#8  0x00007ffff7eb972e in edm::Worker::runAcquireAfterAsyncPrefetch(std::__exception_ptr::exception_ptr, edm::EventTransitionInfo const&, edm::ParentContext const&, edm::WaitingTaskWithArenaHolder) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#9  0x00007ffff7e1ed7c in edm::Worker::AcquireTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>, void>::execute() ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#10 0x00007ffff7bea819 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) ()
   from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreConcurrency.so
#11 0x00007ffff630dae9 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=0x7ffff44ff000, this=0x7ffff4503e00)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7ffff4503e00)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/task_dispatcher.h:463
#13 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...)
    at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/task_dispatcher.cpp:168
#14 0x00007ffff7d9fa67 in edm::FinalWaitingTask::wait() () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#15 0x00007ffff7d867a5 in edm::EventProcessor::processRuns() () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#16 0x00007ffff7d94646 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms-ib.cern.ch/week1/el9_amd64_gcc11/cms/cmssw-patch/CMSSW_12_6_X_2022-10-19-2300/lib/el9_amd64_gcc11/libFWCoreFramework.so
#17 0x000000000040a190 in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#18 0x00007ffff62fb8a8 in tbb::detail::r1::task_arena_impl::execute (ta=..., d=...) at /data/cmsbld/jenkins/workspace/build-any-ib/w/BUILD/el9_amd64_gcc11/external/tbb/v2021.5.0-c0dbb6bd7407c1b3ad4cee87bb02cbc1/tbb-v2021.5.0/src/tbb/arena.cpp:698
#19 0x000000000040af8b in main::{lambda()#1}::operator()() const ()
#20 0x00000000004096ec in main ()

i.e. it ends up being constructed the first time from

cms::alpakatools::getHostCachingAllocator<Queue>();

and the second time from
auto& allocator = getHostCachingAllocator<alpaka::QueueCudaRtBlocking>();

Both call
inline CachingAllocator<alpaka_common::DevHost, TQueue>& getHostCachingAllocator() {
// thread safe initialisation of the host allocator
CMS_THREAD_SAFE static CachingAllocator<alpaka_common::DevHost, TQueue> allocator(

which should guarantee only a single object to be constructed

@makortel
Copy link
Contributor Author

assign core, heterogeneous

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous,core

@fwyzard,@Dr15Jones,@smuzaffar,@makortel,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

Found this

  • If you use dlopen to explicitly load code from a shared library, you must do several things. First, export global symbols from the executable by linking it with the "-E" flag (you will have to specify this as "-Wl,-E" if you are invoking the linker in the usual manner from the compiler driver, g++). You must also make the external symbols in the loaded library available for subsequent libraries by providing the RTLD_GLOBAL flag to dlopen. The symbol resolution can be immediate or lazy.

Template instantiations are another, user visible, case of objects with vague linkage, which needs similar resolution. If you do not take the above precautions, you may discover that a template instantiation with the same argument list, but instantiated in multiple translation units, has several addresses, depending in which translation unit the address is taken. (This is not an exhaustive list of the kind of objects which have vague linkage and are expected to be resolved during linking & loading.)

from gcc documentation https://gcc.gnu.org/faq.html#dso. Also two open tickets that might be connected

In this case in both code paths leading to the CachingAllocator the underlying shared objects (pluginHeterogeneousCoreAlpakaServicesPluginsCudaAsync.so, pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so) are loaded with dlopen(). The dlopen() is called in

SharedLibrary::SharedLibrary(const std::filesystem::path& iName)
: libraryHandle_(::dlopen(iName.string().c_str(), RTLD_LAZY | RTLD_GLOBAL)), path_(iName) {

@makortel
Copy link
Contributor Author

I was surprised to see that without setting SCRAM_ARCH on lxplus-gpu the el8_amd64_gcc10 gets picked up, and running it actually works. With el8_amd64_gcc11 I see the same behavior as in the issue description.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 20, 2022

So, it seems to be an issue with GCC 11 ?

(not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11)

@dan131riley
Copy link

We see a low rate of segfaults that look to be dlopen related, so it seems at least plausible that this has been around for a while.

@makortel
Copy link
Contributor Author

makortel commented Oct 20, 2022

I wanted to see the behavior with the serial_sync backend (to see if gcc or nvcc in the link would make any difference). I hacked this piece

//! The caching memory allocator implementation for the CPU device
template <typename TElem, typename TDim, typename TIdx, typename TQueue>
struct CachedBufAlloc<TElem, TDim, TIdx, alpaka::DevCpu, TQueue, void> {
template <typename TExtent>
ALPAKA_FN_HOST static auto allocCachedBuf(alpaka::DevCpu const& dev, TQueue queue, TExtent const& extent)
-> alpaka::BufCpu<TElem, TDim, TIdx> {
// non-cached, queue-ordered asynchronous host-only memory
return alpaka::allocAsyncBuf<TElem, TIdx>(queue, extent);
}
};

to actually use the caching allocator, and see similar behavior (i.e. I see two allocator instances created from AlpakaService, one in getHostCachingAllocator() and another in getDeviceCachingAllocator(); and additional one via the PortableHostCollection constructor in the EDModule). This test case does not segfault though. I suppose the difference wrt. the cuda_async backend is that in the CUDA case the CUDAService destructor resets the CUDA runtime, which will cause the cudaEventDestroy() to crash.

(this test was done with el9_amd64_gcc11)

@makortel
Copy link
Contributor Author

So, it seems to be an issue with GCC 11 ?

(not saying that we should not look for a solution, just asking if you think it's something that was working with GCC 10 and broke with GCC 11)

As far as I can tell this behavior seems specific to GCC 11. On the same lxplus-gpu node, using el8_amd64_gcc10 works, whereas el8_amd64_gcc11 and el9_amd64_gcc11 crash.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 20, 2022

  • First, export global symbols from the executable by linking it with the "-E" flag (you will have to specify this as "-Wl,-E" if you are invoking the linker in the usual manner from the compiler driver, g++). You must also make the external symbols in the loaded library available for subsequent libraries by providing the RTLD_GLOBAL flag to dlopen.

I think we are doing both:

$ scram tool info gcc-cxxcompiler | grep Wl
CXXSHAREDFLAGS+=-shared -Wl,-E
LDFLAGS+=-Wl,-E -Wl,--hash-style=gnu

and

SharedLibrary::SharedLibrary(const std::filesystem::path& iName)
: libraryHandle_(::dlopen(iName.string().c_str(), RTLD_LAZY | RTLD_GLOBAL)), path_(iName) {

@makortel
Copy link
Contributor Author

One thing I noticed while investigating this issue was that the

<test name="testHeterogeneousCoreAlpakaTestWriteRead" command="testHeterogeneousCoreAlpakaTestWriteRead.sh"/>
test is not run in GPU_X IBs. @smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)

I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)

@VinInn
Copy link
Contributor

VinInn commented Oct 20, 2022

are there also two copies of cms::alpakatools::host()::host?

@VinInn
Copy link
Contributor

VinInn commented Oct 20, 2022

[innocent@lxplus8s12 CMSSW_12_6_X_2022-10-20-1100]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el8_amd64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-10-17-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

vs

[innocent@lxplus9s00 ~]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016380 b guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
00000000000163a0 b cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

'u' is good

"u"

The symbol is a unique global symbol. This is a GNU extension to the standard set of ELF symbol bindings. For such a symbol the dynamic linker will make sure that in the entire process there is just one symbol with this name and type in use.

b is bad (for this case)

@VinInn
Copy link
Contributor

VinInn commented Oct 20, 2022

while

[innocent@lxplus8s12 CMSSW_12_6_X_2022-10-20-1100]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el8_amd64_gcc10/cms/cmssw/CMSSW_12_6_X_2022-10-17-2300/lib/el8_amd64_gcc10/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'host$'
0000000000016370 b guard variable for cms::alpakatools::host()::host
0000000000016380 b cms::alpakatools::host()::host

vs

[innocent@lxplus9s00 ~]$ nm -C /cvmfs/cms-ib.cern.ch/nweek-02755/el9_amd64_gcc11/cms/cmssw/CMSSW_12_6_X_2022-10-18-2300/lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'host$'
0000000000016480 b guard variable for cms::alpakatools::host()::host
0000000000016490 b cms::alpakatools::host()::host

being 'b' and lower case I do expect two copies of this one (local instance)

@smuzaffar
Copy link
Contributor

smuzaffar commented Oct 20, 2022

@smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)

Currently, for GPU builds we only run tests for packages which has direct cuda dependency. We can also add package with direct alpaka dependency too

I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)

We only have two ppc64le nodes which are already overloaded due to various IBs/Release and PR tests. So I am afraid we can not add ppc64le_gcc11

@makortel
Copy link
Contributor Author

@smuzaffar what should be done in order to get it run in GPU_X IBs? (although that alone would not have helped to discover this issue because GPU_X IBs are run only for gcc10)

Currently, for GPU builds we only run tests for packages which has direct cuda dependency. We can also add package with direct alpaka dependency too

Thanks. In this case the HeterogeneousCore/AlpakaTest/test/BuildFile.xml does not declare any dependencies (because the runs a script that runs cmsRun). Should the <test/> be made dependent on alpaka? Or would an overall dependence on alpaka be sufficient?

There is also

<bin name="alpakaTestVec" file="alpaka/testVec.cc">
<use name="alpaka"/>
<use name="catch2"/>
<use name="HeterogeneousCore/AlpakaInterface"/>
<flags ALPAKA_BACKENDS="1"/>
</bin>

where an executable is compiled for each Alpaka backend. For now running the resulting alpakaTestVecCudaAsync and alpakaTestVecSerialSync on a non-GPU machine or on a GPU machine works (because no device code gets run). But maybe it would make more sense to run the alpakaTestVecSerialSync only in non-GPU IB (or maybe in both?) and alpakaTestVecCudaAsync only in GPU IB?

Or would the whole GPU unit test setup benefit from some generalizations (thinking towards adding support for AMD GPUs etc)?

I also noticed we are not running IBs for *_ppc64le_gcc11 (which might have revealed this issue as well)

We only have two ppc64le nodes which are already overloaded due to various IBs/Release and PR tests. So I am afraid we add ppc64le_gcc11 yet

Yeah, makes sense.

@makortel
Copy link
Contributor Author

makortel commented Oct 20, 2022

Thanks @VinInn.

being 'b' and lower case I do expect two copies of this one (local instance)

Indeed, inserting a printout in cms::alpakatools::detail::enumerate_host() shows multiple printouts (3 in my test, which I presume are from libDataFormatsPortableTestObjectsSerialSync.so, libHeterogeneousCoreAlpakaServicesSerialSync.so, pluginHeterogeneousCoreAlpakaTestPluginsPortableSerialSync.so). And this happens also with gcc 10 (as your nm outputs pointed out).

@smuzaffar
Copy link
Contributor

@makortel , direct dependency with in the package ( either in Package/BuildFile.xml or Package/{plugins,test,bin}/BuildFile.xml ) is good enough for bot to check out that package and run all its tests. Note that we run all tests for these package (just in case if there are tests which process the output of some GPU based test).

@VinInn
Copy link
Contributor

VinInn commented Oct 21, 2022

looking to the code I would have not expected multiple copies of "host" either.
somehow those functions get a "hidden" linking.
Maybe we should try to write a small demonstrator.

OK for host() is obvious: It is declared static that in old good C means local to the compilation unit!

@makortel
Copy link
Contributor Author

OK for host() is obvious: It is declared static that in old good C means local to the compilation unit!

Good catch, thanks! @fwyzard, do you have any objections for removing the static from

static inline alpaka::DevCpu const& host() {

(I believe the replication was not intentional)

@fwyzard
Copy link
Contributor

fwyzard commented Oct 21, 2022

You're right, I'll make a PR.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 21, 2022

See #39813 and #39814 .

@VinInn
Copy link
Contributor

VinInn commented Oct 21, 2022

there is a gcc 11.3 out. maybe would be worth to test it to see if this issue has been fixed

@VinInn
Copy link
Contributor

VinInn commented Oct 21, 2022

spent the afternoon in some monkey typing. Eventually

[innocent@lxplus9s00 src]$ git diff
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..8f0a669b7e7 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,8 +83,8 @@ namespace cms::alpakatools {
    */

   template <typename TDev,
-            typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename TQueue> //,
+//            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

do not ask. It is fact (and a bug in the compiler, probably)

@VinInn
Copy link
Contributor

VinInn commented Oct 21, 2022

more

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..af3efa974a5 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {

   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> >> // and cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..3faf423b865 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,9 @@ namespace cms::alpakatools {

   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<
+                // cms::alpakatools::is_device_v<TDev> >> and
+                cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

@VinInn
Copy link
Contributor

VinInn commented Oct 21, 2022

This is ok as well

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..2404d639246 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -82,9 +82,15 @@ namespace cms::alpakatools {
    *    - the `Queue` type can be either `Sync` _or_ `Async` on any allocation.
    */

+    template <typename TDev,
+              typename TQueue>
+    constexpr bool isGood =  cms::alpakatools::is_device_v<TDev> and
+                                     cms::alpakatools::is_queue_v<TQueue>;
   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t< isGood<TDev,TQueue>>>
+                // cms::alpakatools::is_device_v<TDev> >> and
+                // cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

@VinInn
Copy link
Contributor

VinInn commented Oct 21, 2022

btw the default compiler is
gcc version 11.3.1 20220421 (Red Hat 11.3.1-2) (GCC)

@fwyzard
Copy link
Contributor

fwyzard commented Oct 21, 2022

This also works:

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..a125c98996b8 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,8 @@ namespace cms::alpakatools {
 
   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev>>,
+            typename = std::enable_if_t<cms::alpakatools::is_queue_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED

on el8_amd64_gcc11:

$ nm -C ../lib/el8_amd64_gcc11/lib*.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000011380 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
00000000000113a0 u cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator

and cmsRun HeterogeneousCore/AlpakaTest/test/writer.py works fine.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 21, 2022

This also works:

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c01..d8a25cce69d0 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -84,7 +84,7 @@ namespace cms::alpakatools {
 
   template <typename TDev,
             typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename = std::enable_if_t<std::is_object_v<TDev> and std::is_object_v<TQueue>>>
   class CachingAllocator {
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED

(it's not a useful implementation, it's just a check)

@VinInn
Copy link
Contributor

VinInn commented Oct 22, 2022

also this works (and IMHO , as argued in the above issue, I find it also a more adequate syntax)

[innocent@lxplus9s00 src]$ git diff
diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index 589950ae6c0..551bc948853 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,9 +83,9 @@ namespace cms::alpakatools {
    */

   template <typename TDev,
-            typename TQueue,
-            typename = std::enable_if_t<cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>>>
+            typename TQueue>
   class CachingAllocator {
+    static_assert(cms::alpakatools::is_device_v<TDev> and cms::alpakatools::is_queue_v<TQueue>);
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
     friend class alpaka_cuda_async::AlpakaService;
[innocent@lxplus9s00 src]$ nm -C ../lib/el9_amd64_gcc11/lib*.so | egrep 'allocator$'
0000000000016420 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000016440 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000011380 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
00000000000113a0 u cms::alpakatools::getHostCachingAllocator<alpaka::QueueGenericThreadsBlocking<alpaka::DevCpu>, void>()::allocator
0000000000016458 u guard variable for cms::cuda::allocator::getCachingHostAllocator()::allocator
0000000000016340 u guard variable for cms::cuda::allocator::getCachingDeviceAllocator()::allocator
0000000000016460 u cms::cuda::allocator::getCachingHostAllocator()::allocator
0000000000016360 u cms::cuda::allocator::getCachingDeviceAllocator()::allocator

@fwyzard
Copy link
Contributor

fwyzard commented Oct 22, 2022

But this would give a compile time error instead of failing to match the template parameters (OK, which would also give a different compile time error in this case).

By the way, a curious observation: in my tests getCachingDeviceAllocator()::allocator is always compiled correctly (in terms of what we want it to do), only getCachingHostAllocator()::allocator shows the different behaviour in GCC 11 vs GCC 10.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 22, 2022

Ah, and just to exclude a possible culprit, I think the assembler and linker are not at fault here, the difference is there already in the GCC assembly output:

gcc 10:

        .weak   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
        .section        .bss._ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,"awG",@nobits,_ZZN3cms11a
        .align 32
        .type   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, @gnu_unique_object
        .size   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator, 224
_ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_L11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator:
        .zero   224

gcc 11:

        .local  _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator
        .comm   _ZZN3cms11alpakatools23getHostCachingAllocatorIN6alpaka27QueueGenericThreadsBlockingINS2_6DevCpuEEEvEERNS0_16CachingAllocatorIS4_T_NSt9enable_ifIXaaL_ZNS0_11is_device_vIS4_EEE10is_queue_vIS7_EEvE4typeEEEvE9allocator,224,32

@VinInn
Copy link
Contributor

VinInn commented Oct 22, 2022

yep. To report a bug (regression) one needs to reproduce it in a simpler example. I've failed until now.

@VinInn
Copy link
Contributor

VinInn commented Oct 22, 2022

btw: why such a complex machinery to construct and destruct a bunch of allocators?
(in HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h)
a simple std::vector would not have been enough?

@fwyzard
Copy link
Contributor

fwyzard commented Oct 23, 2022

OK, I think I found the source of the problem. In HeterogeneousCore/AlpakaInterface/interface/traits.h I used

  template <typename T>
  constexpr bool is_device_v = is_device<T>::value;

instead of

  template <typename T>
  inline constexpr bool is_device_v = is_device<T>::value;

So, as explained on StackOverflow, is_device_v is a local variable, and it looks like recent versions of GCC make the template specialisation based on it local as well.

Adding the missing inline to the declarations fixes the problem.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 23, 2022

Fixed by #39826.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 23, 2022

btw: why such a complex machinery to construct and destruct a bunch of allocators? (in HeterogeneousCore/AlpakaInterface/interface/getDeviceCachingAllocator.h) a simple std::vector would not have been enough?

Because a CachingAllocator does not have a default constructor, and is not copyable or movable.

Doing make_unique<Allocator[]>(size) does not work either, because of the lack of a default constructor.

@VinInn
Copy link
Contributor

VinInn commented Oct 23, 2022

I think a std::vector<std::unique_ptr<Allocator>> should work.

@VinInn
Copy link
Contributor

VinInn commented Oct 23, 2022

Adding the missing inline to the declarations fixes the problem.

Ok, it may make sense. Not easy to reproduce with a small example anyhow.

@fwyzard
Copy link
Contributor

fwyzard commented Oct 23, 2022

Ok, it may make sense. Not easy to reproduce with a small example anyhow.

If you want to play with it, this is the reproducer I manage to cook:

test.cc

#include <type_traits>

// concepts
namespace concepts {
  //! Tag used in class inheritance hierarchies that describes that a specific concept (TConcept)
  //! is implemented by the given base class (TBase).
  template <typename TConcept, typename TBase>
  struct Implements {};

  //! Checks whether the concept is implemented by the given class
  template <typename TConcept, typename TDerived>
  struct ImplementsConcept {
    template <typename TBase>
    static auto implements(Implements<TConcept, TBase>&) -> std::true_type;
    static auto implements(...) -> std::false_type;

    static constexpr auto value = decltype(implements(std::declval<TDerived&>()))::value;
  };

}  // namespace concepts

struct ConceptDev;
struct ConceptQueue;

// traits
template <typename T>
using is_device = concepts::ImplementsConcept<ConceptDev, T>;

template <typename T>
/* inline */ constexpr bool is_device_v = is_device<T>::value;

template <typename T>
using is_queue = concepts::ImplementsConcept<ConceptQueue, T>;

template <typename T>
/* inline */ constexpr bool is_queue_v = is_queue<T>::value;

// host
class DevCpu : public concepts::Implements<ConceptDev, DevCpu> {
private:
  void* impl = nullptr;
};

// queue
class QueueCpu : public concepts::Implements<ConceptQueue, QueueCpu> {
private:
  void* impl = nullptr;
};

// allocator
template <typename TDev, typename TQueue, typename = std::enable_if_t<is_device_v<TDev> and is_queue_v<TQueue>>>
//template <typename TDev, typename TQueue, typename = std::enable_if_t<concepts::ImplementsConcept<ConceptDev, TDev>::value and concepts::ImplementsConcept<ConceptQueue, TQueue>::value>>
class Allocator {
public:
  Allocator(TDev const& dev) : dev{dev} {}

private:
  TDev dev;
};

// access the allocator
template <typename TQueue, typename = std::enable_if_t<is_queue_v<TQueue>>>
//template <typename TQueue, typename = std::enable_if_t<concepts::ImplementsConcept<ConceptQueue, TQueue>::value>>
inline Allocator<DevCpu, TQueue>& getHostAllocator() {
  static Allocator<DevCpu, TQueue> allocator{DevCpu{}};
  return allocator;
}

void test() { getHostAllocator<QueueCpu>(); }

Makefile

.PHONY: all dump clean

all: dump

dump: dump.gcc10 dump.gcc11 dump.gcc12 dump.clang12

clean:
        rm -f test.gcc* test.clang*

CXXFLAGS := -std=c++17 -O2 -ftree-vectorize -msse3 -fPIC -pthread

# gcc 10
test.gcc10.ii: test.cc
        g++-10 $(CXXFLAGS) $< -E -o $@

test.gcc10.s: test.gcc10.ii
        g++-10 $(CXXFLAGS) $< -S -o $@

test.gcc10.o: test.gcc10.s
        g++-10 $(CXXFLAGS) $< -c -o $@

dump.gcc10: test.gcc10.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

# gcc 11
test.gcc11.ii: test.cc
        g++-11 $(CXXFLAGS) $< -E -o $@

test.gcc11.s: test.gcc11.ii
        g++-11 $(CXXFLAGS) $< -S -o $@

test.gcc11.o: test.gcc11.s
        g++-11 $(CXXFLAGS) $< -c -o $@

dump.gcc11: test.gcc11.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

# gcc 12
test.gcc12.ii: test.cc
        g++-12 $(CXXFLAGS) $< -E -o $@

test.gcc12.s: test.gcc12.ii
        g++-12 $(CXXFLAGS) $< -S -o $@

test.gcc12.o: test.gcc12.s
        g++-12 $(CXXFLAGS) $< -c -o $@

dump.gcc12: test.gcc12.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

# clang 12
test.clang12.ii: test.cc
        clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -E -o $@

test.clang12.s: test.clang12.ii
        clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -S -o $@

test.clang12.o: test.clang12.s
        clang++-12 -Wno-unused-command-line-argument $(CXXFLAGS) $< -c -o $@

dump.clang12: test.clang12.o
        nm -C $< | grep 'allocator$$' | grep --color -w '[[:alnum:]]'

By the way, clang generates a V weak object, possibly with a default value (both with and without the inline).

@VinInn
Copy link
Contributor

VinInn commented Oct 23, 2022

Thanks. Useful to have a self contained example.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants