allocating a host buffer in a dev.cc file causes a crash at the end of the job #42414

Closed
fwyzard opened this issue Jul 29, 2023 · 20 comments

@fwyzard
Contributor

fwyzard commented Jul 29, 2023

Calling cms::alpakatools::make_host_buffer<T>(queue) in a .dev.cc file compiled for the CUDA back-end causes a crash at the end of the job:

Fatal system signal has occurred during exit
Aborted (core dumped)

A simple reproducer is:

diff --git a/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlgo.dev.cc b/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlgo.dev.cc
index 6bdb36e0e57a..03d4d0eeb1bb 100644
--- a/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlgo.dev.cc
+++ b/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlgo.dev.cc
@@ -52,6 +52,12 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     auto workDiv = make_workdiv<Acc1D>(groups, items);
 
     alpaka::exec<Acc1D>(queue, workDiv, TestAlgoKernel{}, collection.view(), collection->metadata().size(), xvalue);
+
+    // unused
+    std::cerr << "############################################################################\n";
+    auto buffer = cms::alpakatools::make_host_buffer<int>(queue);
+    std::cerr << "unused buffer at " << buffer.data() << '\n';
+    std::cerr << "############################################################################\n";
   }
 
 }  // namespace ALPAKA_ACCELERATOR_NAMESPACE
scram b
cmsRun HeterogeneousCore/AlpakaTest/test/writer.py

fwyzard commented Jul 29, 2023

assign heterogeneous

@cmsbuild
Contributor

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented Jul 29, 2023

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

fwyzard commented Jul 29, 2023

@ywkao @borzari FYI

I finally managed to come up with a minimal reproducer of a problem you independently reported in the past weeks.

fwyzard commented Jul 29, 2023

Enabling debugging information for the CachingAllocator for CUDA host memory

#ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
          debug_(std::is_same_v<Device, alpaka::DevCpu> and std::is_same_v<Queue, alpaka::QueueCudaRtNonBlocking>)
#else
          debug_(false)
#endif

and instrumenting the freeAllCached() method

    void freeAllCached() {
      std::scoped_lock lock(mutex_);
      if (debug_)
        std::cout << alpaka::core::demangled<CachingAllocator<Device, Queue, void>> << "::freeAllCached()" << " - start" << std::endl;

...

      if (debug_)
        std::cout << alpaka::core::demangled<CachingAllocator<Device, Queue, void>> << "::freeAllCached()" << " - done" << std::endl;
    }

highlights that the problem is caused by freeAllCached() being called twice:

cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() - start
        alpaka::DevCpu AMD EPYC 7763 64-Core Processor                 freed 16384 bytes.
                  0 available blocks cached (0 bytes), 0 live blocks (0 bytes) outstanding.

cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() - done
cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() - start
        alpaka::DevCpu AMD EPYC 7763 64-Core Processor                 freed 256 bytes.
                  0 available blocks cached (0 bytes), 0 live blocks (0 bytes) outstanding.



Fatal system signal has occurred during exit
Aborted (core dumped)

The first time it does not find any blocks to release.
The second time it does find a block, and releasing it causes the crash.

fwyzard commented Jul 29, 2023

Running with GDB

gdb -ex r -args cmsRun HeterogeneousCore/AlpakaTest/test/writer.py
b cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached()

shows that the first time it is called by the destructor of the corresponding AlpakaService:

run
...
Thread 1 "cmsRun" hit Breakpoint 1, 0x00007fffec81ea30 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
gdb$ bt
#0  0x00007fffec81ea30 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
#1  0x00007fffec822d60 in alpaka_cuda_async::AlpakaService::~AlpakaService() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so
#2  0x00007fffec82e055 in edm::serviceregistry::ServiceWrapper<alpaka_cuda_async::AlpakaService>::~ServiceWrapper() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaServicesPluginsCudaAsync.so
#3  0x00007ffff7f64eaa in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libFWCoreServiceRegistry.so
#4  0x00007ffff7f6b19a in edm::serviceregistry::ServicesManager::~ServicesManager() () from /data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libFWCoreServiceRegistry.so
#5  0x00007ffff7f64eaa in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libFWCoreServiceRegistry.so
#6  0x00007ffff7f64eaa in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libFWCoreServiceRegistry.so
#7  0x00007ffff7bc860a in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() () from /data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#8  0x00007ffff7be2a29 in edm::EventProcessor::~EventProcessor() () from /data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/libFWCoreFramework.so
#9  0x000000000040b731 in (anonymous namespace)::EventProcessorWithSentry::~EventProcessorWithSentry() ()
#10 0x0000000000407d89 in main ()

The second time it's called by the destructor of the CachingAllocator itself:

continue
...
Thread 1 "cmsRun" hit Breakpoint 1, 0x00007fffc42f5ff0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
gdb$ bt
#0  0x00007fffc42f5ff0 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#1  0x00007fffc42f6de2 in cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::~CachingAllocator() () from /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so
#2  0x00007ffff521229c in __run_exit_handlers () from /lib64/libc.so.6
#3  0x00007ffff52123d0 in exit () from /lib64/libc.so.6
#4  0x00007ffff51fbd8c in __libc_start_main () from /lib64/libc.so.6
#5  0x000000000040803e in _start ()

Both calls are expected.

What is unexpected is that the first call does not find any blocks to release, and that the second one does.

fwyzard commented Jul 29, 2023

Going through the logs and adding more debugging information shows that there are in fact two instances of the CachingAllocator.
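
The address prefix in the logs is what tells the instances apart. A minimal sketch of that kind of instrumentation is given here; `TaggedAllocator`, `logLine` and `prefixesDiffer` are hypothetical names used for illustration, not part of the real `CachingAllocator`:

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for CachingAllocator; only the logging idea is shown.
class TaggedAllocator {
public:
  // Build a log line prefixed with the address of this instance, so that
  // messages coming from different instances can be told apart.
  std::string logLine(const std::string& message) const {
    std::ostringstream out;
    out << '[' << static_cast<const void*>(this) << "] " << message;
    return out.str();
  }
};

// Returns true if two distinct, simultaneously alive instances
// produce different prefixes for the same message.
inline bool prefixesDiffer() {
  TaggedAllocator first;
  TaggedAllocator second;
  return first.logLine("settings") != second.logLine("settings");
}
```

Printing `this` is cheap and does not require a debugger, which makes it a convenient first check when a supposedly unique object appears to behave like two.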

The first one is initialised by the AlpakaService at the beginning of the job:

%MSG-i AlpakaService:  (NoModuleName) 29-Jul-2023 08:20:54 CEST pre-events
AlpakaServiceCudaAsync succesfully initialised.
Found 1 device:
  - Tesla T4
%MSG
[0x7f45705cd4c0] cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void> settings:
  bin growth 2
  min bin    8
  max bin    30
...

A second one is initialised by the call to cms::alpakatools::make_host_buffer<int>(queue) while processing the first event:

Begin processing the 1st record. Run 1, Event 1, LumiSection 1 on stream 0 at 29-Jul-2023 08:20:54.397 CEST
############################################################################
[0x7f4548d68b40] cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void> settings:
  bin growth 2
  min bin    8
  max bin    30
...
        alpaka::DevCpu AMD EPYC 7763 64-Core Processor                 allocated new block at 0x7f4545600000 (256 bytes associated with queue 0x7f45498780f0, event 0x7f4549d197d0.

unused buffer at 0x7f4545600000
############################################################################

At the end of the job, the instance known to the AlpakaService does not have any blocks to release:

[0x7f45705cd4c0] cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() - start
        alpaka::DevCpu AMD EPYC 7763 64-Core Processor                 freed 16384 bytes.
                  0 available blocks cached (0 bytes), 0 live blocks (0 bytes) outstanding.

[0x7f45705cd4c0] cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() - done

Instead, it is the second instance that crashes while trying to release its memory blocks:

[0x7f4548d68b40] cms::alpakatools::CachingAllocator<alpaka::DevCpu, alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>::freeAllCached() - start
        alpaka::DevCpu AMD EPYC 7763 64-Core Processor                 freed 256 bytes.
                  0 available blocks cached (0 bytes), 0 live blocks (0 bytes) outstanding.

Fatal system signal has occurred during exit

This starts to make sense: the CUDAService will reset all CUDA devices in its destructor.
The destructor of the AlpakaService is called before the destructor of the CUDAService, but the destructor of the global CachingAllocator is called only after that, from the exit handlers, when the CUDA objects are no longer valid.
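
The ordering problem can be replayed with a small, self-contained simulation. This is only a sketch under stated assumptions: `FakeRuntime`, `FakeAllocator` and `replayShutdown` are hypothetical stand-ins, and an ordinary scope exit plays the role of the static destructors that run at exit.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the CUDA state that is reset at the end of the job.
struct FakeRuntime {
  bool valid = true;
  void reset() { valid = false; }
};

// Hypothetical stand-in for a caching allocator holding pinned host blocks.
class FakeAllocator {
public:
  FakeAllocator(FakeRuntime& runtime, std::vector<std::string>& log) : runtime_(runtime), log_(log) {}
  ~FakeAllocator() { freeAllCached(); }

  void allocate() { ++cachedBlocks_; }

  // Releasing a block is only safe while the runtime is still valid.
  void freeAllCached() {
    if (cachedBlocks_ == 0) {
      log_.push_back("nothing to free");
    } else if (runtime_.valid) {
      log_.push_back("freed safely");
    } else {
      log_.push_back("freed after reset");
    }
    cachedBlocks_ = 0;
  }

private:
  FakeRuntime& runtime_;
  std::vector<std::string>& log_;
  int cachedBlocks_ = 0;
};

// Replays the order of events described above and returns what happened.
inline std::vector<std::string> replayShutdown() {
  std::vector<std::string> log;
  FakeRuntime runtime;
  {
    FakeAllocator serviceInstance(runtime, log);  // the instance the service knows about
    FakeAllocator pluginInstance(runtime, log);   // the unexpected second instance
    pluginInstance.allocate();                    // the buffer from the reproducer
    serviceInstance.freeAllCached();              // what ~AlpakaService does
    runtime.reset();                              // what ~CUDAService does
  }  // both destructors run here, after the reset
  return log;
}
```

In this model the service instance never has anything to release, while the second instance releases its block only after the reset, which is the step that crashes in the real job.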

fwyzard commented Jul 29, 2023

Instances of a CachingAllocator for pinned host memory should only be created by the call to cms::alpakatools::getHostCachingAllocator(...):

namespace cms::alpakatools {

  template <typename TQueue, typename = std::enable_if_t<alpaka::isQueue<TQueue>>>
  inline CachingAllocator<alpaka_common::DevHost, TQueue>& getHostCachingAllocator() {
    // thread safe initialisation of the host allocator
    CMS_THREAD_SAFE static CachingAllocator<alpaka_common::DevHost, TQueue> allocator(
        host(),
        config::binGrowth,
        config::minBin,
        config::maxBin,
        config::maxCachedBytes,
        config::maxCachedFraction,
        false,   // reuseSameQueueAllocations
        false);  // debug

    // the public interface is thread safe
    return allocator;
  }

}  // namespace cms::alpakatools

The allocator object is a function static, so there should always be only one instance per queue type in a running program.
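
The expectation can be checked in isolation with a reduced version of the same pattern. This is a sketch only: `DemoAllocator`, `getDemoAllocator`, `QueueA` and `QueueB` are hypothetical names, and the example lives in a single translation unit, so it says nothing about what happens across shared libraries.

```cpp
// Hypothetical stand-in for CachingAllocator<Device, TQueue>.
template <typename TQueue>
struct DemoAllocator {
  int cachedBlocks = 0;
};

// Same pattern as getHostCachingAllocator(): a function-local static,
// initialised once, in a thread-safe way, on the first call.
template <typename TQueue>
inline DemoAllocator<TQueue>& getDemoAllocator() {
  static DemoAllocator<TQueue> allocator;
  return allocator;
}

// Hypothetical queue tags, used only to obtain two different specialisations.
struct QueueA {};
struct QueueB {};
```

Every call to `getDemoAllocator<QueueA>()` returns a reference to the same object, while `getDemoAllocator<QueueB>()` refers to a separate one. Across object files and libraries, the same guarantee depends on the static being emitted as a global symbol that the linker and loader can resolve to a single definition.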

fwyzard commented Jul 29, 2023

Looking for those instances in the shared libraries shows something unexpected:

gcc-nm -A -C -l lib/el8_amd64_gcc11/*.so | grep 'cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator'
lib/el8_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so:00000000000175a0 B guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
lib/el8_amd64_gcc11/libHeterogeneousCoreAlpakaServicesCudaAsync.so:00000000000174c0 B cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so:0000000000ccbb20 b guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so:0000000000d01260 B guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so:0000000000ccbb40 b cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
lib/el8_amd64_gcc11/pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so:0000000000d01180 B cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
  • libHeterogeneousCoreAlpakaServicesCudaAsync.so has a reference to a global (B) object
  • pluginHeterogeneousCoreAlpakaTestPluginsPortableCudaAsync.so has a reference to a global (B) object and a reference to a local (b) object!

fwyzard commented Jul 29, 2023

Looking inside the individual .o files shows something similar:

find tmp/ -name '*.o' | xargs gcc-nm -A -C -l | grep 'cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator' | sed -e's#cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()#...#'
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestHelperClass.cc.o:00000000                 W guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestHelperClass.cc.o:00000000                 W ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaGlobalProducer.cc.o:00000000        W guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaGlobalProducer.cc.o:00000000        W ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.o:00000000              W guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.o:00000000              W ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaGlobalProducerOffset.cc.o:00000000  W guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaGlobalProducerOffset.cc.o:00000000  W ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.o:0000000000000000            b guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.o:0000000000000020            b ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaStreamProducer.cc.o:00000000        W guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaStreamProducer.cc.o:00000000        W ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaServices/src/alpaka/HeterogeneousCoreAlpakaServicesCudaAsync/AlpakaService.cc.o:00000000                              W guard variable for ...::allocator
tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaServices/src/alpaka/HeterogeneousCoreAlpakaServicesCudaAsync/AlpakaService.cc.o:00000000                              W ...::allocator

All .cc.o files have a global weak symbol (W) for the allocator.
The .dev.cc.o file has a local symbol (b).
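
For reference, the symbol type letters that appear in this issue can be summarised in a small helper. The meanings follow the GNU nm documentation; the function names `describeNmType` and `isVisibleToOtherObjects` are hypothetical and exist only for this sketch.

```cpp
#include <string>

// Describe the subset of nm symbol type letters that appear in this issue.
inline std::string describeNmType(char type) {
  switch (type) {
    case 'B':
      return "global symbol in the BSS (zero-initialised) section";
    case 'b':
      return "local symbol in the BSS (zero-initialised) section";
    case 'W':
      return "weak symbol";
    case 'u':
      return "unique global symbol (GNU extension)";
    default:
      return "not covered by this sketch";
  }
}

// Of the letters above, only 'b' denotes a symbol that is private to its
// object file; the others are visible to the linker across object files.
inline bool isVisibleToOtherObjects(char type) {
  return type == 'B' || type == 'W' || type == 'u';
}
```

A local (b) allocator cannot be resolved to the same definition as the global ones, which is consistent with the plugin ending up with its own private instance.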

fwyzard commented Jul 29, 2023

The .cc.o files are compiled directly by GCC, with LTO enabled, and produce implicit weak global symbols (W):

$ /data/cmssw/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/bin/c++ -c -DGNU_GCC -D_GNU_SOURCE -DEIGEN_DONT_PARALLELIZE -DTBB_USE_GLIBCXX_VERSION=110401 -DTBB_SUPPRESS_DEPRECATED_MESSAGES -DTBB_PREVIEW_RESUMABLE_TASKS=1 -DTBB_PREVIEW_TASK_GROUP_EXTENSIONS=1 -DBOOST_SPIRIT_THREADSAFE -DPHOENIX_THREADSAFE -DBOOST_MATH_DISABLE_STD_FPCLASSIFY -DBOOST_UUID_RANDOM_PROVIDER_FORCE_POSIX -DCMSSW_GIT_HASH='CMSSW_13_2_0_pre3' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_13_2_0_pre3' -I/data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/external/alpaka/develop-20230621-9e2225ac6c979464a40749ef9d1e0331/include -I/data/cmssw/el8_amd64_gcc11/external/pcre/8.43-bd2b09f5d686f0f36e748ce001d315ad/include -isystem/data/cmssw/el8_amd64_gcc11/external/boost/1.80.0-5305613b2f750cf1a05dcadf0d672647/include -I/data/cmssw/el8_amd64_gcc11/external/bz2lib/1.0.6-24b287d9981341b8441eb85733326b1a/include -I/data/cmssw/el8_amd64_gcc11/external/cuda/11.8.0-9f0af0f4206be7b705fe550319c49a11/include -I/data/cmssw/el8_amd64_gcc11/external/libuuid/2.34-f7577986509a353c203144983884d697/include -isystem/data/cmssw/el8_amd64_gcc11/lcg/root/6.26.11-50eed3272fcfa103ebe9cf3182b98eb9/include -isystem/data/cmssw/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/include -I/data/cmssw/el8_amd64_gcc11/external/xz/5.2.5-56c8544f64e9d56c1108fbe00c3ecb67/include -I/data/cmssw/el8_amd64_gcc11/external/zlib/1.2.11-a365170a889b785ec23815da2b99d7d1/include -I/data/cmssw/el8_amd64_gcc11/external/eigen/82dd3710dac619448f50331c1d6a35da673f764a-f9c27fce684e89466e2ef07869cd264d/include/eigen3 -I/data/cmssw/el8_amd64_gcc11/external/fmt/8.0.1-89199f97a8c166a965017c69137de0d0/include -I/data/cmssw/el8_amd64_gcc11/external/md5/1.0.0-6bede1cf43db82355b3835c81f384d05/include -I/data/cmssw/el8_amd64_gcc11/external/tinyxml2/6.2.0-f05bc085db13b8b4b752c87703ff413d/include -O2 -pthread -pipe 
-Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++17 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Wno-error=unused-variable -DALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT=128 -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_HOST_ONLY -DBOOST_DISABLE_ASSERTS -flto -fipa-icf -flto-odr-type-merging -fno-fat-lto-objects -Wodr  -fPIC  -MMD -MF tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.d /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlpakaProducer.cc -o tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.o
$ gcc-nm -C tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.o | grep 'cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator'
00000000 W guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
00000000 W cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

The same source file compiled without LTO produces unique global symbols (u):

$ /data/cmssw/el8_amd64_gcc11/external/gcc/11.4.1-30ebdc301ebd200f2ae0e3d880258e65/bin/c++ -c -DGNU_GCC -D_GNU_SOURCE -DEIGEN_DONT_PARALLELIZE -DTBB_USE_GLIBCXX_VERSION=110401 -DTBB_SUPPRESS_DEPRECATED_MESSAGES -DTBB_PREVIEW_RESUMABLE_TASKS=1 -DTBB_PREVIEW_TASK_GROUP_EXTENSIONS=1 -DBOOST_SPIRIT_THREADSAFE -DPHOENIX_THREADSAFE -DBOOST_MATH_DISABLE_STD_FPCLASSIFY -DBOOST_UUID_RANDOM_PROVIDER_FORCE_POSIX -DCMSSW_GIT_HASH='CMSSW_13_2_0_pre3' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_13_2_0_pre3' -I/data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/external/alpaka/develop-20230621-9e2225ac6c979464a40749ef9d1e0331/include -I/data/cmssw/el8_amd64_gcc11/external/pcre/8.43-bd2b09f5d686f0f36e748ce001d315ad/include -isystem/data/cmssw/el8_amd64_gcc11/external/boost/1.80.0-5305613b2f750cf1a05dcadf0d672647/include -I/data/cmssw/el8_amd64_gcc11/external/bz2lib/1.0.6-24b287d9981341b8441eb85733326b1a/include -I/data/cmssw/el8_amd64_gcc11/external/cuda/11.8.0-9f0af0f4206be7b705fe550319c49a11/include -I/data/cmssw/el8_amd64_gcc11/external/libuuid/2.34-f7577986509a353c203144983884d697/include -isystem/data/cmssw/el8_amd64_gcc11/lcg/root/6.26.11-50eed3272fcfa103ebe9cf3182b98eb9/include -isystem/data/cmssw/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/include -I/data/cmssw/el8_amd64_gcc11/external/xz/5.2.5-56c8544f64e9d56c1108fbe00c3ecb67/include -I/data/cmssw/el8_amd64_gcc11/external/zlib/1.2.11-a365170a889b785ec23815da2b99d7d1/include -I/data/cmssw/el8_amd64_gcc11/external/eigen/82dd3710dac619448f50331c1d6a35da673f764a-f9c27fce684e89466e2ef07869cd264d/include/eigen3 -I/data/cmssw/el8_amd64_gcc11/external/fmt/8.0.1-89199f97a8c166a965017c69137de0d0/include -I/data/cmssw/el8_amd64_gcc11/external/md5/1.0.0-6bede1cf43db82355b3835c81f384d05/include -I/data/cmssw/el8_amd64_gcc11/external/tinyxml2/6.2.0-f05bc085db13b8b4b752c87703ff413d/include -O2 -pthread -pipe 
-Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -std=c++17 -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Wno-error=unused-variable -DALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT=128 -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_HOST_ONLY -DBOOST_DISABLE_ASSERTS -fPIC  -MMD -MF tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.d /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlpakaProducer.cc -o tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.o
$ gcc-nm -C tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlpakaProducer.cc.o | grep 'cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator'
0000000000000000 u guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000000000 u cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

fwyzard commented Jul 29, 2023

The .dev.cc.o files are compiled by nvcc, with LTO disabled, and produce local zero-initialised symbols (b):

$ /data/cmssw/el8_amd64_gcc11/external/cuda/11.8.0-9f0af0f4206be7b705fe550319c49a11/bin/nvcc -x cu -MMD -MF tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.d -dc -DGNU_GCC -D_GNU_SOURCE -DEIGEN_DONT_PARALLELIZE -DTBB_USE_GLIBCXX_VERSION=110401 -DTBB_SUPPRESS_DEPRECATED_MESSAGES -DTBB_PREVIEW_RESUMABLE_TASKS=1 -DTBB_PREVIEW_TASK_GROUP_EXTENSIONS=1 -DBOOST_SPIRIT_THREADSAFE -DPHOENIX_THREADSAFE -DBOOST_MATH_DISABLE_STD_FPCLASSIFY -DBOOST_UUID_RANDOM_PROVIDER_FORCE_POSIX -DCMSSW_GIT_HASH='CMSSW_13_2_0_pre3' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_13_2_0_pre3' -I/data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/external/alpaka/develop-20230621-9e2225ac6c979464a40749ef9d1e0331/include -I/data/cmssw/el8_amd64_gcc11/external/pcre/8.43-bd2b09f5d686f0f36e748ce001d315ad/include -I/data/cmssw/el8_amd64_gcc11/external/boost/1.80.0-5305613b2f750cf1a05dcadf0d672647/include -I/data/cmssw/el8_amd64_gcc11/external/bz2lib/1.0.6-24b287d9981341b8441eb85733326b1a/include -I/data/cmssw/el8_amd64_gcc11/external/cuda/11.8.0-9f0af0f4206be7b705fe550319c49a11/include -I/data/cmssw/el8_amd64_gcc11/external/libuuid/2.34-f7577986509a353c203144983884d697/include -I/data/cmssw/el8_amd64_gcc11/lcg/root/6.26.11-50eed3272fcfa103ebe9cf3182b98eb9/include -I/data/cmssw/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/include -I/data/cmssw/el8_amd64_gcc11/external/xz/5.2.5-56c8544f64e9d56c1108fbe00c3ecb67/include -I/data/cmssw/el8_amd64_gcc11/external/zlib/1.2.11-a365170a889b785ec23815da2b99d7d1/include -I/data/cmssw/el8_amd64_gcc11/external/eigen/82dd3710dac619448f50331c1d6a35da673f764a-f9c27fce684e89466e2ef07869cd264d/include/eigen3 -I/data/cmssw/el8_amd64_gcc11/external/fmt/8.0.1-89199f97a8c166a965017c69137de0d0/include 
-I/data/cmssw/el8_amd64_gcc11/external/md5/1.0.0-6bede1cf43db82355b3835c81f384d05/include -I/data/cmssw/el8_amd64_gcc11/external/tinyxml2/6.2.0-f05bc085db13b8b4b752c87703ff413d/include --diag-suppress 20014 -std=c++17 -O3 --generate-line-info --source-in-ptx --display-error-number --expt-relaxed-constexpr --extended-lambda -gencode arch=compute_60,code=[sm_60,compute_60] -gencode arch=compute_70,code=[sm_70,compute_70] -gencode arch=compute_75,code=[sm_75,compute_75] -Wno-deprecated-gpu-targets -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --cudart shared -DALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT=128 -DALPAKA_ACC_GPU_CUDA_ENABLED -UALPAKA_HOST_ONLY --compiler-options '-O2 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Wno-error=unused-variable -DALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT=128 -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_HOST_ONLY -DBOOST_DISABLE_ASSERTS  -std=c++17 -fPIC '  
/data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlgo.dev.cc -o tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.o
$ gcc-nm -C tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.o | grep 'cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator'
0000000000000000 b guard variable for cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator
0000000000000020 b cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator

Enabling host-side LTO produces non-working CUDA programs, but for the sake of argument it can be tested here:

$ /data/cmssw/el8_amd64_gcc11/external/cuda/11.8.0-9f0af0f4206be7b705fe550319c49a11/bin/nvcc -x cu -MMD -MF tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.d -dc -DGNU_GCC -D_GNU_SOURCE -DEIGEN_DONT_PARALLELIZE -DTBB_USE_GLIBCXX_VERSION=110401 -DTBB_SUPPRESS_DEPRECATED_MESSAGES -DTBB_PREVIEW_RESUMABLE_TASKS=1 -DTBB_PREVIEW_TASK_GROUP_EXTENSIONS=1 -DBOOST_SPIRIT_THREADSAFE -DPHOENIX_THREADSAFE -DBOOST_MATH_DISABLE_STD_FPCLASSIFY -DBOOST_UUID_RANDOM_PROVIDER_FORCE_POSIX -DCMSSW_GIT_HASH='CMSSW_13_2_0_pre3' -DPROJECT_NAME='CMSSW' -DPROJECT_VERSION='CMSSW_13_2_0_pre3' -I/data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/cms/cmssw/CMSSW_13_2_0_pre3/src -I/data/cmssw/el8_amd64_gcc11/external/alpaka/develop-20230621-9e2225ac6c979464a40749ef9d1e0331/include -I/data/cmssw/el8_amd64_gcc11/external/pcre/8.43-bd2b09f5d686f0f36e748ce001d315ad/include -I/data/cmssw/el8_amd64_gcc11/external/boost/1.80.0-5305613b2f750cf1a05dcadf0d672647/include -I/data/cmssw/el8_amd64_gcc11/external/bz2lib/1.0.6-24b287d9981341b8441eb85733326b1a/include -I/data/cmssw/el8_amd64_gcc11/external/cuda/11.8.0-9f0af0f4206be7b705fe550319c49a11/include -I/data/cmssw/el8_amd64_gcc11/external/libuuid/2.34-f7577986509a353c203144983884d697/include -I/data/cmssw/el8_amd64_gcc11/lcg/root/6.26.11-50eed3272fcfa103ebe9cf3182b98eb9/include -I/data/cmssw/el8_amd64_gcc11/external/tbb/v2021.8.0-7e31093a7b4a477d01bc3946dd0bf612/include -I/data/cmssw/el8_amd64_gcc11/external/xz/5.2.5-56c8544f64e9d56c1108fbe00c3ecb67/include -I/data/cmssw/el8_amd64_gcc11/external/zlib/1.2.11-a365170a889b785ec23815da2b99d7d1/include -I/data/cmssw/el8_amd64_gcc11/external/eigen/82dd3710dac619448f50331c1d6a35da673f764a-f9c27fce684e89466e2ef07869cd264d/include/eigen3 -I/data/cmssw/el8_amd64_gcc11/external/fmt/8.0.1-89199f97a8c166a965017c69137de0d0/include 
-I/data/cmssw/el8_amd64_gcc11/external/md5/1.0.0-6bede1cf43db82355b3835c81f384d05/include -I/data/cmssw/el8_amd64_gcc11/external/tinyxml2/6.2.0-f05bc085db13b8b4b752c87703ff413d/include --diag-suppress 20014 -std=c++17 -O3 --generate-line-info --source-in-ptx --display-error-number --expt-relaxed-constexpr --extended-lambda -gencode arch=compute_60,code=[sm_60,compute_60] -gencode arch=compute_70,code=[sm_70,compute_70] -gencode arch=compute_75,code=[sm_75,compute_75] -Wno-deprecated-gpu-targets -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --cudart shared -DALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT=128 -DALPAKA_ACC_GPU_CUDA_ENABLED -UALPAKA_HOST_ONLY --compiler-options '-O2 -pthread -pipe -Werror=main -Werror=pointer-arith -Werror=overlength-strings -Wno-vla -Werror=overflow -ftree-vectorize -Werror=array-bounds -Werror=format-contains-nul -Werror=type-limits -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -Wall -Wno-non-template-friend -Wno-long-long -Wreturn-type -Wextra -Wpessimizing-move -Wclass-memaccess -Wno-cast-function-type -Wno-unused-but-set-parameter -Wno-ignored-qualifiers -Wno-deprecated-copy -Wno-unused-parameter -Wunused -Wparentheses -Wno-deprecated -Werror=return-type -Werror=missing-braces -Werror=unused-value -Werror=unused-label -Werror=address -Werror=format -Werror=sign-compare -Werror=write-strings -Werror=delete-non-virtual-dtor -Werror=strict-aliasing -Werror=narrowing -Werror=unused-but-set-variable -Werror=reorder -Werror=unused-variable -Werror=conversion-null -Werror=return-local-addr -Wnon-virtual-dtor -Werror=switch -fdiagnostics-show-option -Wno-unused-local-typedefs -Wno-attributes -Wno-psabi -Wno-error=unused-variable -DALPAKA_DEFAULT_HOST_MEMORY_ALIGNMENT=128 -DALPAKA_ACC_GPU_CUDA_ENABLED -DALPAKA_HOST_ONLY -DBOOST_DISABLE_ASSERTS  -std=c++17 -fPIC -flto -fipa-icf 
-flto-odr-type-merging -fno-fat-lto-objects -Wodr'  /data/user/fwyzard/repro/CMSSW_13_2_0_pre3/src/HeterogeneousCore/AlpakaTest/plugins/alpaka/TestAlgo.dev.cc -o tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.o
$ gcc-nm -C tmp/el8_amd64_gcc11/src/HeterogeneousCore/AlpakaTest/plugins/HeterogeneousCoreAlpakaTestPluginsPortableCudaAsync/alpaka/TestAlgo.dev.cc.o | grep 'cms::alpakatools::getHostCachingAllocator<alpaka::uniform_cuda_hip::detail::QueueUniformCudaHipRt<alpaka::ApiCudaRt, false>, void>()::allocator'

This produces no symbols at all for the static allocator variable :-/

@fwyzard
Contributor Author

fwyzard commented Jul 29, 2023

Trying to write a much simpler reproducer:

symbol.h

#ifndef SYMBOL_H
#define SYMBOL_H

inline int& get_symbol() {
  static int counter = 0;
  return counter;
}

#endif  // SYMBOL_H

test.cc

#include <iostream>
#include "symbol.h"

void test() {
  int& symbol = get_symbol();
  ++symbol;
  ++symbol;
  ++symbol;
  std::cout << "test(): symbol is " << symbol << std::endl;
}

gives the same behaviour for gcc (with and without LTO), but does not reproduce the problem with nvcc.
Instead, nvcc behaves like gcc without LTO, as expected:

$ /usr/local/cuda-11.5/bin/nvcc -std=c++17 -O3 -g -DGNU_GCC -D_GNU_SOURCE -ccbin g++-10 -Xcompiler '-Wall -O2 -pthread -pipe -ftree-vectorize -fvisibility-inlines-hidden -fno-math-errno --param vect-max-version-for-alias-checks=50 -Xassembler --compress-debug-sections -fuse-ld=bfd -msse3 -felide-constructors -fmessage-length=0 -fdiagnostics-show-option -std=c++17 -fPIC' --diag-suppress 20014 --generate-line-info --source-in-ptx --display-error-number --expt-relaxed-constexpr --extended-lambda -gencode arch=compute_60,code=[sm_60,compute_60] -gencode arch=compute_70,code=[sm_70,compute_70] -gencode arch=compute_75,code=[sm_75,compute_75] -Wno-deprecated-gpu-targets -Xcudafe --diag_suppress=esa_on_defaulted_function_ignored --cudart shared -dc -MMD -x cu test.cc -o test.o
$ gcc-nm -A -C test.o | grep counter
test.o:0000000000000000 u get_symbol()::counter
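
For reading these listings: the column after the address is the nm symbol type. `u` is a unique global symbol (a GNU extension to the ELF bindings), while a lowercase `b` is a local symbol in the BSS section. A minimal sketch of a helper that pulls that letter out of one nm line; `nmSymbolType` and `describe` are hypothetical names, and the parsing assumes the default BSD output format with no spaces in the file name:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: extract the symbol-type letter from one line of nm
// output in the default format "<address> <type> <name>", optionally
// prefixed by "<file>:" (nm -A). Returns '\0' if no type column is found.
inline char nmSymbolType(std::string_view line) {
  // The type is the first single character surrounded by spaces.
  for (std::size_t i = 0; i + 2 < line.size(); ++i) {
    if (line[i] == ' ' and line[i + 1] != ' ' and line[i + 2] == ' ') {
      return line[i + 1];
    }
  }
  return '\0';
}

// Hypothetical helper: describe the type letters that appear in this issue.
inline std::string describe(char type) {
  switch (type) {
    case 'u':
      return "unique global symbol (GNU extension)";
    case 'b':
      return "local symbol in the BSS section";
    case 'B':
      return "global symbol in the BSS section";
    default:
      return "other";
  }
}

// Usage on the line printed above.
inline void example() {
  assert(nmSymbolType("test.o:0000000000000000 u get_symbol()::counter") == 'u');
}
```

A local `b` symbol is private to the object that defines it, so a function-local static with that binding is not merged with copies emitted elsewhere.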

@fwyzard
Contributor Author

fwyzard commented Jul 30, 2023

I think this is a minimal reproducer for the underlying problem.

test.cc

#include <type_traits>

class Type {};

// the SFINAE condition is a possible cause of the problem
template <typename T1, typename T2, typename = std::enable_if_t<std::is_class_v<T1> and std::is_class_v<T2>>>
class Resource {
public:
  explicit Resource(int value) : value_{value} {}

  int value_;
};

template <typename T>
inline Resource<Type, T>& getResource() {
  static Resource<Type, T> resource(42);
  return resource;
}

void call() {
  getResource<Type>();
}

Compiled with GCC gives:

$ g++ -O2 -std=c++17 -c test.cc -o test.gcc.o
$ gcc-nm -C test.gcc.o | grep '::resource$' | grep --color '\<\w\>\|::resource$'
0000000000000000 u guard variable for getResource<Type>()::resource
0000000000000000 u getResource<Type>()::resource

Compiled with NVCC gives:

$ nvcc -O2 -std=c++17 -x cu -dc test.cc -o test.nvcc.o
$ gcc-nm -C test.nvcc.o | grep '::resource$' | grep --color '\<\w\>\|::resource$'
0000000000000000 b guard variable for getResource<Type>()::resource
0000000000000008 b getResource<Type>()::resource
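
One detail of test.cc worth spelling out: the SFINAE condition adds a defaulted third template parameter, so `Resource<Type, T>` is really `Resource<Type, T, void>`. The sketch below only checks that type identity; it does not by itself explain why NVCC ends up with a local symbol. The `check` function is a hypothetical helper added for illustration.

```cpp
#include <cassert>
#include <type_traits>

class Type {};

// Same declaration as in test.cc: the third parameter defaults to
// std::enable_if_t<true>, which is void, when both T1 and T2 are classes.
template <typename T1, typename T2, typename = std::enable_if_t<std::is_class_v<T1> and std::is_class_v<T2>>>
class Resource {
public:
  explicit Resource(int value) : value_{value} {}

  int value_;
};

// The two spellings name the same type: the defaulted parameter is part of it.
static_assert(std::is_same_v<Resource<Type, Type>, Resource<Type, Type, void>>);

// Hypothetical helper: the class itself behaves normally on the host.
inline void check() {
  Resource<Type, Type> resource(42);
  assert(resource.value_ == 42);
}
```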

@fwyzard
Contributor Author

fwyzard commented Jul 30, 2023

Looking at the intermediate files produced by NVCC and compiled by GCC, the code from test.cc is basically unchanged, and the code coming after it has no impact on this issue.

@fwyzard
Contributor Author

fwyzard commented Jul 30, 2023

Looks like this could be a workaround for the issue:

diff --git a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
index dfda1ee3d7e2..1a9a7d8fe070 100644
--- a/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
+++ b/HeterogeneousCore/AlpakaInterface/interface/CachingAllocator.h
@@ -83,9 +83,11 @@ namespace cms::alpakatools {
    */
 
   template <typename TDev,
-            typename TQueue,
-            typename = std::enable_if_t<alpaka::isDevice<TDev> and alpaka::isQueue<TQueue>>>
+            typename TQueue>
   class CachingAllocator {
+    static_assert(alpaka::isDevice<TDev>, "");
+    static_assert(alpaka::isQueue<TQueue>, "");
+
   public:
 #ifdef ALPAKA_ACC_GPU_CUDA_ENABLED
     friend class alpaka_cuda_async::AlpakaService;
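
Applied to the minimal test.cc from the earlier comment, the same change would look roughly like the sketch below: the defaulted SFINAE parameter is dropped and the constraints move into `static_assert`s in the class body. This is an illustration only. Whether NVCC then emits the `u` symbol for `resource` is an assumption based on the workaround above and has not been checked here; `sameInstance` is a hypothetical helper.

```cpp
#include <cassert>
#include <type_traits>

class Type {};

// No defaulted SFINAE template parameter: the constraints are enforced by
// static_assert inside the class body instead.
template <typename T1, typename T2>
class Resource {
  static_assert(std::is_class_v<T1>, "T1 must be a class type");
  static_assert(std::is_class_v<T2>, "T2 must be a class type");

public:
  explicit Resource(int value) : value_{value} {}

  int value_;
};

template <typename T>
inline Resource<Type, T>& getResource() {
  static Resource<Type, T> resource(42);
  return resource;
}

// Hypothetical helper: within one translation unit, repeated calls must
// return the same object.
inline bool sameInstance() {
  assert(getResource<Type>().value_ == 42);
  return &getResource<Type>() == &getResource<Type>();
}
```

The trade-off is that a violated constraint now gives a hard error at instantiation instead of removing the template from consideration, which is fine as long as nothing relies on SFINAE to select between alternatives.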

@fwyzard
Contributor Author

fwyzard commented Jul 31, 2023

I've submitted a bug report to NVIDIA: https://developer.nvidia.com/nvidia_bug/4216808 .
A reproducer with all the details can be found at https://github.com/fwyzard/nvidia_bug_4216808 .

@makortel
Contributor

+heterogeneous

This issue was fixed by the PRs listed above that link to it.

@makortel
Contributor

@cmsbuild, please close

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.
