
Enable std::variant for libcu++ #1076

Merged 10 commits into NVIDIA:main from miscco:variant on Dec 12, 2023
Conversation

@miscco (Collaborator) commented Nov 9, 2023

This is work from @griwes that was stuck on the old libcudacxx repo.

Windows is so much fun, especially because I nuked all the fixes I had in place 🤡

@miscco miscco requested review from a team as code owners November 9, 2023 12:37
@miscco miscco requested review from alliepiper, elstehle and wmaxey and removed request for a team November 9, 2023 12:37
@miscco miscco force-pushed the variant branch 5 times, most recently from 4304f1e to 3ac76d2 Compare November 9, 2023 20:08
@miscco (Collaborator, Author) commented Nov 10, 2023

I had a look at the clang-cuda failures and it looks spicy 😬

Slowest Tests:
--------------------------------------------------------------------------
1219.56s: libcu++ :: std/utilities/variant/variant.visit_return/visit_call_operator_forwarding.pass.cpp
504.29s: libcu++ :: std/utilities/variant/variant.visit_return/visit_argument_forwarding.pass.cpp
463.56s: libcu++ :: std/utilities/variant/variant.visit/visit_call_operator_forwarding.pass.cpp
248.87s: libcu++ :: std/utilities/variant/variant.visit/visit_argument_forwarding.pass.cpp
15.63s: libcu++ :: std/utilities/variant/variant.visit/visit_return_type.pass.cpp
7.91s: libcu++ :: std/utilities/variant/variant.visit_return/visit_constexpr.pass.cpp
5.86s: libcu++ :: std/utilities/variant/variant.visit/visit_constexpr.pass.cpp
5.15s: libcu++ :: std/utilities/variant/variant.variant/variant.swap/swap.pass.cpp
4.50s: libcu++ :: std/utilities/variant/variant.visit/visit_derived.pass.cpp
4.48s: libcu++ :: std/utilities/variant/variant.variant/variant.assign/move.pass.cpp
4.45s: libcu++ :: std/utilities/variant/variant.variant/variant.mod/emplace_index_args.pass.cpp
4.44s: libcu++ :: std/utilities/variant/variant.visit_return/visit_return_type.pass.cpp
4.42s: libcu++ :: std/utilities/variant/variant.variant/variant.ctor/move.pass.cpp
4.41s: libcu++ :: std/utilities/variant/variant.visit/visit_sfinae.pass.cpp
4.39s: libcu++ :: std/utilities/variant/variant.variant/variant.ctor/T.pass.cpp
4.38s: libcu++ :: std/utilities/variant/variant.visit_return/visit_derived.pass.cpp
4.36s: libcu++ :: std/utilities/variant/variant.variant/variant.status/valueless_by_exception.pass.cpp
4.33s: libcu++ :: std/utilities/variant/variant.variant/variant.ctor/default.pass.cpp
4.31s: libcu++ :: std/utilities/variant/variant.variant/variant.assign/T.pass.cpp
4.30s: libcu++ :: std/utilities/variant/variant.helpers/variant_alternative.pass.cpp

Tests Times:
--------------------------------------------------------------------------
[    Range    ] :: [               Percentage               ] :: [Count]
--------------------------------------------------------------------------
[1200s,1300s) :: [                                        ] :: [ 1/60]
[1100s,1200s) :: [                                        ] :: [ 0/60]
[1000s,1100s) :: [                                        ] :: [ 0/60]
[ 900s,1000s) :: [                                        ] :: [ 0/60]
[ 800s, 900s) :: [                                        ] :: [ 0/60]
[ 700s, 800s) :: [                                        ] :: [ 0/60]
[ 600s, 700s) :: [                                        ] :: [ 0/60]
[ 500s, 600s) :: [                                        ] :: [ 1/60]
[ 400s, 500s) :: [                                        ] :: [ 1/60]
[ 300s, 400s) :: [                                        ] :: [ 0/60]
[ 200s, 300s) :: [                                        ] :: [ 1/60]
[ 100s, 200s) :: [                                        ] :: [ 0/60]
[   0s, 100s) :: [*************************************   ] :: [56/60]
--------------------------------------------------------------------------

@gevtushenko (Collaborator) commented:

@griwes, @miscco per our offline discussion, attaching a CPU benchmark:

#include <benchmark/benchmark.h>
#include <cuda/std/variant>
#include <cstdint>
#include <cstdlib>
#include "variant.h" // provides nvexec::variant_t / nvexec::visit

using target = std::uint32_t;

template <template <class... Vs> class Variant>
using variant_t = Variant<std::int8_t, std::uint8_t, std::int16_t, std::uint16_t,
                          std::int32_t, target, std::int64_t, std::uint64_t,
                          float, double>;

void table_bench(benchmark::State &state) {
  variant_t<cuda::std::variant> v;
  v.emplace<target>(rand());
  for (auto _ : state) {
    cuda::std::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
  }
}
BENCHMARK(table_bench);

void recursion_bench(benchmark::State &state) {
  variant_t<nvexec::variant_t> v;
  v.emplace<target>(rand());
  for (auto _ : state) {
    nvexec::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
  }
}
BENCHMARK(recursion_bench);

BENCHMARK_MAIN();

Recursion is ~8x faster on CPU:

[image: benchmark results chart]

The table-based approach slows down as elements are added, but at a much slower rate than the recursive one. On a Threadripper PRO 5975WX, recursion becomes more expensive at 42 elements:

----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
table_bench           77.4 ns         77.4 ns      9266148
recursion_bench       96.5 ns         96.5 ns      7076926

I think it makes sense to optimize for variants with fewer than 42 elements, but I'm open to discussion. In the worst case, we could use the recursive implementation for up to dozens of elements and then fall back to the table-based approach.

@gevtushenko (Collaborator) commented:

I forgot to turn on optimizations. With optimizations (g++-12), I see no perf regression on variants with a large number of elements. I updated the benchmark to eliminate possible interference from the branch predictor:

#include <benchmark/benchmark.h>
#include <cuda/std/variant>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>
#include <vector>
#include "variant.h" // provides nvexec::variant_t / nvexec::visit

using target = std::uint32_t;

template <template <class... Vs> class Variant>
using variant_t = Variant<
    std::int8_t, std::uint8_t, std::int16_t, std::uint16_t,
    std::int32_t, target, std::int64_t, std::uint64_t, float, double,
    std::integral_constant<int, 1>,  std::integral_constant<int, 2>,
    std::integral_constant<int, 3>,  std::integral_constant<int, 4>,
    std::integral_constant<int, 5>,  std::integral_constant<int, 6>,
    std::integral_constant<int, 7>,  std::integral_constant<int, 8>,
    std::integral_constant<int, 9>,  std::integral_constant<int, 10>,
    std::integral_constant<int, 11>, std::integral_constant<int, 12>,
    std::integral_constant<int, 13>, std::integral_constant<int, 14>,
    std::integral_constant<int, 15>, std::integral_constant<int, 16>,
    std::integral_constant<int, 17>, std::integral_constant<int, 18>,
    std::integral_constant<int, 19>, std::integral_constant<int, 20>,
    std::integral_constant<int, 21>, std::integral_constant<int, 22>,
    std::integral_constant<int, 23>, std::integral_constant<int, 24>,
    std::integral_constant<int, 25>, std::integral_constant<int, 26>,
    std::integral_constant<int, 27>, std::integral_constant<int, 28>,
    std::integral_constant<int, 29>, std::integral_constant<int, 32>>;

template <template <class... Vs> class Variant>
std::vector<variant_t<Variant>> random() {
  std::vector<variant_t<Variant>> v(10000);
  for (std::size_t i = 0; i < v.size(); ++i) {
    switch (rand() % 30) {
      case 0: v[i].template emplace<std::int8_t>(rand()); break;
      case 1: v[i].template emplace<std::uint8_t>(rand()); break;
      case 2: v[i].template emplace<std::int16_t>(rand()); break;
      case 3: v[i].template emplace<std::uint16_t>(rand()); break;
      case 4: v[i].template emplace<std::int32_t>(rand()); break;
      case 5: v[i].template emplace<target>(rand()); break;
      case 6: v[i].template emplace<std::uint64_t>(rand()); break;
      case 7: v[i].template emplace<std::int64_t>(rand()); break;
      default: v[i].template emplace<float>(rand());
    }
  }
  return v;
}

void table_bench(benchmark::State &state) {
  auto vs = random<cuda::std::variant>();
  for (auto _ : state) {
    for (auto &v : vs) {
      cuda::std::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
    }
  }
}
BENCHMARK(table_bench);

void recursion_bench(benchmark::State &state) {
  auto vs = random<nvexec::variant_t>();
  for (auto _ : state) {
    for (auto &v : vs) {
      nvexec::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
    }
  }
}
BENCHMARK(recursion_bench);

BENCHMARK_MAIN();

Results:

----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
table_bench         147860 ns       147845 ns         4460
recursion_bench      45125 ns        45122 ns        15274

In this case, the speedup of the recursive version is ~3x for both small and large (40-element) variants.

@miscco (Collaborator, Author) commented Nov 16, 2023

@Artem-B this one might be interesting for you. It seems clang-cuda has issues with the test visit_call_operator_forwarding.pass.cpp, where it times out in CI and also takes about 20 minutes to compile locally on a beefy machine.

@Artem-B (Contributor) commented Nov 16, 2023

> @Artem-B this one might be interesting for you. It seems clang-cuda has issues with the test visit_call_operator_forwarding.pass.cpp, where it times out in CI and also takes about 20 minutes to compile locally on a beefy machine.

Can you post the complete clang command line and clang --version output?
If you have a corresponding NVCC command line, that would be useful too, as a baseline for the code the test is expected to produce.

@gevtushenko (Collaborator) left a review comment:

There are no performance issues now.

@miscco miscco force-pushed the variant branch 3 times, most recently from f82bc52 to 078dcc2 Compare November 23, 2023 13:38
@miscco (Collaborator, Author) commented Nov 24, 2023

@Artem-B sorry for the delay, I was working on replacing our visit implementation and it seems it made the issue worse :(

Attached are two files with the lit runtimes of clang and nvcc and the command line of one of the tests. You can replace the name of the test in that command line, as all tests are compiled with the same flags. I observed ptxas eating up all of my 256 GB of RAM and then some.

timings_with_clang.txt
timings_with_nvcc.txt

Generally, if you want to try and test this locally:

git clone https://github.com/NVIDIA/cccl.git
cd cccl
./ci/test_libcudacxx.sh -cxx PATH_TO_CLANG -cuda PATH_TO_CLANG -std 14 -arch "86"

This will run the whole test suite against clang, which might be excessive. After the configure step has run through, you can go into the build folder and run a selection of tests like:

cd build
lit libcudacxx-cpp14/libcudacxx/test/libcudacxx/std/utilities/variant -sv -Dexecutor="NoopExecutor()"

That will only build the tests. Depending on your configuration, the initial folder name may vary.

@Artem-B (Contributor) commented Nov 24, 2023

Thank you for the instructions. They are very helpful. I'll take a look next week.

Meanwhile, if you're curious, you can try running the slow clang instance with -ftime-trace (see https://aras-p.info/blog/2019/01/16/time-trace-timeline-flame-chart-profiler-for-Clang/) to see if a particular header/class/function is the source of the problem. There's also https://github.com/aras-p/ClangBuildAnalyzer, which can analyze the traces for multiple compilations.

@miscco miscco requested a review from griwes November 27, 2023 15:48
@miscco miscco force-pushed the variant branch 2 times, most recently from 8bc0f9e to e095d2a Compare November 28, 2023 15:50
@Artem-B (Contributor) commented Dec 8, 2023

> Generally, if you want to try and test this locally:
>
> git clone https://github.com/NVIDIA/cccl.git
> cd cccl
> ./ci/test_libcudacxx.sh -cxx PATH_TO_CLANG -cuda PATH_TO_CLANG -std 14 -arch "86"

I gave it a try. For some reason the script produced suspiciously few tests, and I do not see the variant tests anywhere under build/libcudacxx-cpp14/.

Here's the script output: https://gist.github.com/Artem-B/62041fdcf3baa3c4e0ad83eaed38b030

@miscco (Collaborator, Author) commented Dec 8, 2023

> I gave it a try. For some reason the script produced suspiciously few tests, and I do not see the variant tests anywhere under build/libcudacxx-cpp14/.

Oh dammit, sorry for wasting your time. This has not been merged yet, so you would need to pull in the branch from miscco:variant.

Also, to get CI green, I have marked those tests as unsupported with clang-cuda. You can see this from the following comment, which you would need to remove:

// clang-cuda takes too much time to compile those tests
// UNSUPPORTED: clang && (!nvcc)

@Artem-B (Contributor) commented Dec 8, 2023

Not a big deal. I'll give it another go tomorrow.

@miscco (Collaborator, Author) commented Dec 8, 2023

My goal is to merge this in the immediate future, so maybe we can ignore the branch.

@Artem-B (Contributor) commented Dec 8, 2023

Just an FYI: on your branch, I see a lot of compilations failing due to use of C++17 features while compiling with -std=c++14:

5: In file included from /usr/local/google/home/tra/work/cccl/libcudacxx/test/libcudacxx/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_and_explicit.pass.cpp:27:
5: In file included from /usr/local/google/home/tra/work/cccl/libcudacxx/test/libcudacxx/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_helpers.h:17:
5: /usr/local/google/home/tra/work/cccl/libcudacxx/test/support/cuda_space_selector.h:130:43: error: template template parameter using 'typename' is a C++17 extension [-Werror,-Wc++17-extensions]
5:   130 |     template<typename, cuda::std::size_t> typename Provider,
5:       |                                           ^~~~~~~~
5:       |                                           class
5: /usr/local/google/home/tra/work/cccl/libcudacxx/test/libcudacxx/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_and_explicit.pass.cpp:30:49: error: template template parameter using 'typename' is a C++17 extension [-Werror,-Wc++17-extensions]
5:    30 | template <class T, template<typename, typename> typename Selector, cuda::thread_scope>
5:       |                                                 ^~~~~~~~
5:       |                                                 class
5: 2 errors generated when compiling for sm_70.
5: --
5:
5: Compilation failed unexpectedly!
5: ********************

@Artem-B (Contributor) commented Dec 8, 2023

Do you set clang optimization options somewhere?
AFAICT, lit currently runs the tests with clang without passing it any -O options.

Without optimizations, variant.visit/visit_call_operator_forwarding.pass.cpp ends up producing a 180 MB PTX file, which ptxas struggles with. If I compile with -O2, the produced PTX is only about 8 MB and it gets compiled in reasonable time.

Unlike clang, nvcc compiles with optimizations enabled by default. For clang you need to enable them explicitly.

@miscco (Collaborator, Author) commented Dec 8, 2023

That's interesting, I will discuss with @wmaxey what we can do.

@miscco miscco linked an issue Dec 11, 2023 that may be closed by this pull request
@miscco (Collaborator, Author) commented Dec 12, 2023

Thanks a lot @Artem-B! I added -O1 to the clang-cuda compile flags and it is running smoothly now.

@miscco miscco merged commit 988800d into NVIDIA:main Dec 12, 2023
@miscco miscco deleted the variant branch December 12, 2023 11:10
kririae pushed a commit to kririae/cccl that referenced this pull request Dec 20, 2023
* Enable std::variant for libcu++
---------

Co-authored-by: Michał Dominiak <[email protected]>

Modified krr to include `std::variant`.

Successfully merging this pull request may close these issues.

Add cuda::std::variant, cuda::std::visit, and cuda::std::monostate
4 participants