Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KOKKOS_FUNCTION-annotated for_each and transform_reduce #708

Draft
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

blegouix
Copy link
Collaborator

@blegouix blegouix commented Dec 15, 2024

Closes #172

@blegouix blegouix marked this pull request as draft December 15, 2024 14:13
tests/for_each.cpp Outdated Show resolved Hide resolved
tests/transform_reduce.cpp Outdated Show resolved Hide resolved
tests/transform_reduce.cpp Outdated Show resolved Hide resolved
include/ddc/for_each.hpp Outdated Show resolved Hide resolved
include/ddc/for_each.hpp Outdated Show resolved Hide resolved
include/ddc/transform_reduce.hpp Outdated Show resolved Hide resolved
include/ddc/transform_reduce.hpp Outdated Show resolved Hide resolved
tests/for_each.cpp Outdated Show resolved Hide resolved
tests/transform_reduce.cpp Outdated Show resolved Hide resolved
tests/transform_reduce.cpp Outdated Show resolved Hide resolved
@blegouix blegouix marked this pull request as ready for review December 16, 2024 13:50
@blegouix
Copy link
Collaborator Author

blegouix commented Dec 16, 2024

@tpadioleau can you take a look at the two issues I mention in the first message ? The second one just seems to be an annoying bug with the support of [[maybe_unused]] in hipcc but the first one is more worrying, it seems the loop https://github.com/CExA-project/ddc/pull/708/files#diff-46765d278566dca059db5ef8ac1fc5e486964e46a1f7303302c7b7d80b9b4ab0R47 does not iterate correctly with nvcc

@tpadioleau
Copy link
Member

I don't know if i will have time to have a closer look this week. You could look for the differences with the implementation in the first PR, i think the tests were passing

@blegouix
Copy link
Collaborator Author

I think I have the same implementation as the closed branch. But you don't have 2D test for nested for_each, and I think the bug with nvcc appears only for 2D+ cases.

@tpadioleau
Copy link
Member

and I think the bug with nvcc appears only for 2D+ cases.

Did you test ?

@blegouix
Copy link
Collaborator Author

I just tested it and you were right to ask because the 1D case does not work to. I tried you branch and I don't get the bug. This looks very weird to me, either I am missing something obvious, either something broke somewhere since March. If you have any idea I'd take it, otherwise I will take my time for this because it does not look easy to debug.

@blegouix blegouix marked this pull request as draft December 17, 2024 10:54
tests/for_each.cpp Outdated Show resolved Hide resolved
tests/for_each.cpp Outdated Show resolved Hide resolved
if constexpr (I == N) {
f(RetType(is...));
} else {
for (Element ii = static_cast<int>(begin[I]); ii < end[I]; ++ii) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This static_cast<int> is a nonsense workaround since Element aliases std::size_t. Without it the loop does not iterate with nvcc in the new tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it print if you remove the cast ?

Copy link
Collaborator Author

@blegouix blegouix Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

begin[I] is 0 as expected 🤷 but still it does not iterate.

image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if I print something inside the loop it appears only once)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it can print something inside the loop it means it can iterate once ?

What is the value of end[I] ?

Have you tried to run in debug ?

Copy link
Collaborator Author

@blegouix blegouix Dec 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry what do you want me to print inside the loop ? In my tests i was just printing a string like "test", not "%i" involved

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could print ii ?

Copy link
Collaborator Author

@blegouix blegouix Dec 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kokkos::printf("%llu\n", ii); prints 0
image

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do reproduce with g++-12, but not with -G, so it seems to be an optimization issue indeed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see anything with compute-sanitizer (tried every --tool)

@blegouix blegouix marked this pull request as ready for review December 30, 2024 12:49
@blegouix
Copy link
Collaborator Author

blegouix commented Jan 1, 2025

I have found another bug. This is not visible in the tests (should I add a test which reproduces it ?). I have a use-case where the annotated_transform_reduce returns 0 with CUDA (it should not). Here is a fix:

result = reduce(
        result,
        annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));

Must be replaced with:

T tmp;
Kokkos::atomic_store(
      &tmp,
      annotated_transform_reduce_serial(
                domain,
                neutral,
                reduce,
                transform,
                dcoords...,
                ii));
Kokkos::atomic_store(&result, reduce(result, tmp));

(I don't understand what is the problem as result is a local per-thread variable though)

result = reduce(

@tpadioleau
Copy link
Member

All of that is weird, can you provide more information about your environment ?

  • the hardware, cpu and gpu
  • the version of your compiler
  • the version of the libraries, cmake, cuda, kokkos...
  • the complete cmake command line
  • anything you think relevant

Also do all other DDC tests pass ?
Do kokkos tests pass ?

It feels like the compiler optimized away some function calls

@blegouix
Copy link
Collaborator Author

blegouix commented Jan 1, 2025

  • Ubuntu 24
  • Intel(R) Xeon(R) Gold 6226R + V100S
  • nvcc 12.6 + g++13.3
  • cmake 3.28, cuda 12.6, the Kokkos is the DDC submodule up-to-date with the main DDC branch.
  • cmake -DCMAKE_CXX_COMPILER=$KOKKOS_BIN/nvcc_wrapper -DCMAKE_BUILD_TYPE=Debug -DKokkos_ENABLE_CUDA=ON -DKokkos_ARCH_VOLTA70=ON -DCMAKE_CXX_FLAGS="-Wfatal-errors" ..

All DDC tests pass. I don't know how to compile the Kokkos tests.

I will try quite soon on a 1070Ti with mostly the same software configuration. I will try to dig in compute-sanitizer also. Maybe if you can see if my first problem reproduces on Ruche or Persée by going in this branch and compiling with or without the static_cast<int> it may be helpful ? I agree all this is weird and i'm feeling nvcc is capricious, but it is possible that I am missing something.

@tpadioleau
Copy link
Member

From the table https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#system-requirements you are one minor version above the maximum tested host compiler version, i.e. 13.2.

Can you try with a different compiler ?

@blegouix blegouix marked this pull request as draft January 2, 2025 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ensure serial algorithms can be used on devices
2 participants