Enable std::variant for libcu++ #1076
Conversation
Force-pushed from 4304f1e to 3ac76d2.
I had a look at the clang-cuda fails and it looks spicy 😬
@griwes, @miscco per our offline discussion, attaching a CPU benchmark:

```cpp
#include <benchmark/benchmark.h>
#include <cuda/std/variant>
#include <cstdint>
#include <cstdlib>
#include "variant.h"

using target = std::uint32_t;

template <template <class... Vs> class Variant>
using variant_t = Variant<std::int8_t, std::uint8_t, std::int16_t, std::uint16_t,
                          std::int32_t, target, std::int64_t, std::uint64_t,
                          float, double>;

void table_bench(benchmark::State &state) {
  variant_t<cuda::std::variant> v;
  v.emplace<target>(rand());
  for (auto _ : state) {
    cuda::std::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
  }
}
BENCHMARK(table_bench);

void recursion_bench(benchmark::State &state) {
  variant_t<nvexec::variant_t> v;
  v.emplace<target>(rand());
  for (auto _ : state) {
    nvexec::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
  }
}
BENCHMARK(recursion_bench);

BENCHMARK_MAIN();
```

Recursion is ~8 times faster on CPU. The table-based approach slows down as elements are added, but at a much slower rate than the recursive one; on a Threadripper PRO 5975WX, recursion becomes more expensive at 42 elements.

I think it makes sense to optimize for variants with fewer than 42 elements, but I'm open to discussion. In the worst case, we could use the recursive implementation for up to a few dozen elements and then fall back to the table-based approach.
I forgot to turn on optimizations. With optimizations (g++-12), I see no perf regression on variants with a large number of elements. I updated the benchmark to eliminate possible interference on the branch-predictor side:

```cpp
#include <benchmark/benchmark.h>
#include <cuda/std/variant>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <type_traits>
#include <vector>
#include "variant.h"

using target = std::uint32_t;

template <template <class... Vs> class Variant>
using variant_t = Variant<
    std::int8_t, std::uint8_t, std::int16_t, std::uint16_t, std::int32_t,
    target, std::int64_t, std::uint64_t, float, double,
    std::integral_constant<int, 1>,  std::integral_constant<int, 2>,
    std::integral_constant<int, 3>,  std::integral_constant<int, 4>,
    std::integral_constant<int, 5>,  std::integral_constant<int, 6>,
    std::integral_constant<int, 7>,  std::integral_constant<int, 8>,
    std::integral_constant<int, 9>,  std::integral_constant<int, 10>,
    std::integral_constant<int, 11>, std::integral_constant<int, 12>,
    std::integral_constant<int, 13>, std::integral_constant<int, 14>,
    std::integral_constant<int, 15>, std::integral_constant<int, 16>,
    std::integral_constant<int, 17>, std::integral_constant<int, 18>,
    std::integral_constant<int, 19>, std::integral_constant<int, 20>,
    std::integral_constant<int, 21>, std::integral_constant<int, 22>,
    std::integral_constant<int, 23>, std::integral_constant<int, 24>,
    std::integral_constant<int, 25>, std::integral_constant<int, 26>,
    std::integral_constant<int, 27>, std::integral_constant<int, 28>,
    std::integral_constant<int, 29>, std::integral_constant<int, 32>>;

template <template <class... Vs> class Variant>
std::vector<variant_t<Variant>> random() {
  std::vector<variant_t<Variant>> v(10000);
  for (std::size_t i = 0; i < v.size(); ++i) {
    int index = rand() % 30;
    switch (index) {
      case 0: v[i].template emplace<std::int8_t>(rand()); break;
      case 1: v[i].template emplace<std::uint8_t>(rand()); break;
      case 2: v[i].template emplace<std::int16_t>(rand()); break;
      case 3: v[i].template emplace<std::uint16_t>(rand()); break;
      case 4: v[i].template emplace<std::int32_t>(rand()); break;
      case 5: v[i].template emplace<target>(rand()); break;
      case 6: v[i].template emplace<std::uint64_t>(rand()); break;
      case 7: v[i].template emplace<std::int64_t>(rand()); break;
      default: v[i].template emplace<float>(rand()); break;
    }
  }
  return v;
}

void table_bench(benchmark::State &state) {
  auto vs = random<cuda::std::variant>();
  for (auto _ : state) {
    for (auto &v : vs) {
      cuda::std::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
    }
  }
}
BENCHMARK(table_bench);

void recursion_bench(benchmark::State &state) {
  auto vs = random<nvexec::variant_t>();
  for (auto _ : state) {
    for (auto &v : vs) {
      nvexec::visit([](auto val) { benchmark::DoNotOptimize(val); }, v);
    }
  }
}
BENCHMARK(recursion_bench);

BENCHMARK_MAIN();
```

Results:

In this case, the speedup of the recursive version is ~3x for both small and large (40-element) variants.
@Artem-B this one might be interesting for you. It seems clang-cuda has issues with the test visit_call_operator_forwarding.pass.cpp, where it times out in CI and also takes about 20 minutes to compile locally on a beefy machine.
Can you post the complete clang command line and
There are no performance issues now.
Force-pushed from f82bc52 to 078dcc2.
@Artem-B sorry for the delay; I was working on replacing our visit implementation, and it seems that made the issue worse :( Attached are two files with the lit runtimes of clang and nvcc, plus the command line of one of the tests. You can substitute the name of any test into that command line, as all tests are compiled with the same flags. I observed ptxas eating up all of my 256 GB of RAM and more. timings_with_clang.txt

Generally, if you want to try and test this locally:

This will run the whole test suite against clang, which might be excessive. After the configure has run through, you can go into the build folder and run a selection of tests like:

That will only build the tests. Depending on your configuration, the initial folder name may vary.
Thank you for the instructions; they are very helpful. I'll take a look next week. Meanwhile, if you're curious, you can try running the slow clang instance with
Force-pushed from 8bc0f9e to e095d2a.
I gave it a try. For some reason the script produced suspiciously few tests, and I do not see the test in question. Here's the script output: https://gist.github.com/Artem-B/62041fdcf3baa3c4e0ad83eaed38b030
Oh dammit, sorry for wasting your time. This has not been merged yet, so you would need to pull in the branch from miscco:variant. Also, to get CI green, I have marked those tests as unsupported with clang-cuda. You can see this from the following comment, which you would need to remove:
Not a big deal. I'll give it another go tomorrow.
My goal is to merge this in the immediate future, so maybe we can ignore the branch
Just an FYI: on your branch, I see a lot of compilations failing due to using C++17 features while compiling with
Do you set clang optimization options somewhere? W/o optimizations, Unlike
That's interesting; I will discuss with @wmaxey what we can do.
Co-authored-by: Michał Dominiak <[email protected]>
Dispatching via an array of function pointers is bad for performance on device, as the array will spill into registers. So rather than the function pointers, utilize a recursive approach.
Thanks a lot @Artem-B, I added
* Enable std::variant for libcu++

Modified krr to include `std::variant`.
This is work from @griwes that was stuck on the old libcudacxx repo.
Windows is soo much fun, especially because I nuked all the fixes I had in place 🤡