Enable std::variant for libcu++ #1076
Conversation
Force-pushed from 4304f1e to 3ac76d2.
I had a look at the clang-cuda fails and it looks spicy 😬
@griwes, @miscco per our offline discussion, attaching a CPU benchmark:

```cpp
#include <benchmark/benchmark.h>
#include <cuda/std/variant>
#include <cstdint>
#include <cstdlib>
#include "variant.h"

using target = std::uint32_t;

template <template <class... Vs> class Variant>
using variant_t = Variant<std::int8_t, std::uint8_t, std::int16_t, std::uint16_t,
                          std::int32_t, target, std::int64_t, std::uint64_t,
                          float, double>;

void table_bench(benchmark::State &state) {
  variant_t<cuda::std::variant> v;
  v.emplace<target>(rand());
  for (auto _ : state) {
    cuda::std::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
  }
}
BENCHMARK(table_bench);

void recursion_bench(benchmark::State &state) {
  variant_t<nvexec::variant_t> v;
  v.emplace<target>(rand());
  for (auto _ : state) {
    nvexec::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
  }
}
BENCHMARK(recursion_bench);

BENCHMARK_MAIN();
```

Recursion is ~8 times faster on CPU. The table-based approach slows down as elements are added, but at a much slower rate than the recursive one; on a Threadripper PRO 5975WX, recursion becomes more expensive at 42 elements.

I think it makes sense to optimize for variants with fewer than 42 elements, but I'm open to discussion. In the worst case, we could use the recursive implementation for up to a few dozen elements and then fall back to the table-based approach.
I forgot to turn on optimizations. With optimizations (g++-12), I see no perf regression on variants with a large number of elements. I updated the benchmark to eliminate possible interference on the branch-predictor side:

```cpp
#include <benchmark/benchmark.h>
#include <cuda/std/variant>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>
#include <vector>
#include "variant.h"

using target = std::uint32_t;

template <template <class... Vs> class Variant>
using variant_t = Variant<
    std::int8_t, std::uint8_t, std::int16_t, std::uint16_t, std::int32_t,
    target, std::int64_t, std::uint64_t, float, double,
    std::integral_constant<int, 1>,  std::integral_constant<int, 2>,
    std::integral_constant<int, 3>,  std::integral_constant<int, 4>,
    std::integral_constant<int, 5>,  std::integral_constant<int, 6>,
    std::integral_constant<int, 7>,  std::integral_constant<int, 8>,
    std::integral_constant<int, 9>,  std::integral_constant<int, 10>,
    std::integral_constant<int, 11>, std::integral_constant<int, 12>,
    std::integral_constant<int, 13>, std::integral_constant<int, 14>,
    std::integral_constant<int, 15>, std::integral_constant<int, 16>,
    std::integral_constant<int, 17>, std::integral_constant<int, 18>,
    std::integral_constant<int, 19>, std::integral_constant<int, 20>,
    std::integral_constant<int, 21>, std::integral_constant<int, 22>,
    std::integral_constant<int, 23>, std::integral_constant<int, 24>,
    std::integral_constant<int, 25>, std::integral_constant<int, 26>,
    std::integral_constant<int, 27>, std::integral_constant<int, 28>,
    std::integral_constant<int, 29>, std::integral_constant<int, 32>>;

template <template <class... Vs> class Variant>
std::vector<variant_t<Variant>> random() {
  std::vector<variant_t<Variant>> v(10000);
  for (std::size_t i = 0; i < v.size(); ++i) {
    int index = rand() % 30;
    switch (index) {
      case 0: v[i].template emplace<std::int8_t>(rand()); break;
      case 1: v[i].template emplace<std::uint8_t>(rand()); break;
      case 2: v[i].template emplace<std::int16_t>(rand()); break;
      case 3: v[i].template emplace<std::uint16_t>(rand()); break;
      case 4: v[i].template emplace<std::int32_t>(rand()); break;
      case 5: v[i].template emplace<target>(rand()); break;
      case 6: v[i].template emplace<std::uint64_t>(rand()); break;
      case 7: v[i].template emplace<std::int64_t>(rand()); break;
      default: v[i].template emplace<float>(rand()); break;
    }
  }
  return v;
}

void table_bench(benchmark::State &state) {
  auto vs = random<cuda::std::variant>();
  for (auto _ : state) {
    for (auto &v : vs) {
      cuda::std::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
    }
  }
}
BENCHMARK(table_bench);

void recursion_bench(benchmark::State &state) {
  auto vs = random<nvexec::variant_t>();
  for (auto _ : state) {
    for (auto &v : vs) {
      nvexec::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
    }
  }
}
BENCHMARK(recursion_bench);

BENCHMARK_MAIN();
```

Results:

In this case, the speedup of the recursive version is ~3x for both small and large (40-element) variants.
@Artem-B this one might be interesting for you. It seems clang-cuda has issues with the test visit_call_operator_forwarding.pass.cpp, where it times out in CI and also takes about 20 minutes to compile locally on a beefy machine.
Can you post the complete clang command line and
There are no performance issues now.
Force-pushed from f82bc52 to 078dcc2.
@Artem-B sorry for the delay; I was working on replacing our visit implementation, and it seems that made the issue worse :( Attached are two files with the lit runtimes of clang and nvcc, plus the command line of one of the tests. You can substitute the name of any test into that command line, as all tests are compiled with the same flags. I observed ptxas eating up all of my 256 GB of RAM and more. timings_with_clang.txt

Generally, if you want to try and test this locally:

This will run the whole test suite against clang, which might be excessive. After the configure has run through, you can go into the build folder and run a selection of tests like:

That will only build the tests. Depending on your configuration, the initial folder name may vary.
Thank you for the instructions; they are very helpful. I'll take a look next week. Meanwhile, if you're curious, you can try running the slow clang instance with
Force-pushed from 8bc0f9e to e095d2a.
I gave it a try. For some reason the script produced suspiciously few tests, and I do not see the test in question. Here's the script output: https://gist.github.com/Artem-B/62041fdcf3baa3c4e0ad83eaed38b030
Oh dammit, sorry for wasting your time. This has not been merged yet, so you would need to pull in the branch from miscco:variant. Also, to get CI green, I have marked those tests as unsupported with clang-cuda. You can see this from the following comment, which you would need to remove:
Not a big deal. I'll give it another go tomorrow.
My goal is to merge this in the immediate future, so maybe we can ignore the branch
Just an FYI: on your branch, I see a lot of compilations failing due to using C++17 features while compiling with
Do you set clang optimization options somewhere? W/o optimizations, Unlike
That's interesting; I will discuss with @wmaxey what we can do.
Co-authored-by: Michał Dominiak <[email protected]>
Dispatching via an array of function pointers is bad for performance on device, as the array will spill into registers. So rather than the function pointers, utilize a recursive approach.
Thanks a lot @Artem-B, I added
* Enable std::variant for libcu++

Modified krr to include `std::variant`.
This is work from @griwes that was stuck on the old libcudacxx repo.
Windows is soo much fun, especially because I nuked all the fixes I had in place 🤡