Skip to content
This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Add atomics for floating point types. #286

Merged
merged 19 commits into from
Jul 21, 2022

Conversation

sleeepyjack
Copy link
Contributor

@sleeepyjack sleeepyjack commented Jun 22, 2022

This PR is a draft to add support for float/double atomics.

Please review and let me know what is missing.
Unfortunately, the diff between the old and new codegen output is a mess due to the reordering of operations.

Also rolls back #282 and fixes #279

@wmaxey wmaxey self-assigned this Jun 22, 2022
@wmaxey wmaxey added the bug: functional Does not work as intended. label Jun 22, 2022
@wmaxey wmaxey added this to the 1.9.0 milestone Jun 22, 2022
codegen/codegen.cpp Outdated Show resolved Hide resolved
@wmaxey
Copy link
Member

wmaxey commented Jun 22, 2022

I'm overall happy with these changes. I'll start up CI, and since Windows is not having issues I don't expect issues elsewhere either.

@@ -1239,7 +1239,7 @@ _LIBCUDACXX_INLINE_VISIBILITY void __cxx_atomic_wait(__cxx_atomic_impl<_Tp, _Sco
}

// general atomic<T>/atomic_ref<T>
template <class _Tp, int _Sco = 0, bool = is_integral<_Tp>::value && !is_same<_Tp, bool>::value>
template <class _Tp, int _Sco = 0, bool = (is_integral<_Tp>::value || is_floating_point<_Tp>::value) && !is_same<_Tp, bool>::value>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we are enabling the arithmetic operators in cuda::std::atomics as well we would need to extend the tests to exercise floating point.

These tests live in tests/std/atomics/ rather than tests/cuda/atomics

Copy link
Contributor Author

@sleeepyjack sleeepyjack Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.arith is this the right directory to extend the fp arithmetic tests?

Copy link
Contributor Author

@sleeepyjack sleeepyjack Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I figure I need to add some more tests under the std/atomics directory. Basically I need to mirror all tests over integral types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the integral tests would be a good start. Replicating the tests and doing the dispatch yourself for float/double would probably be sufficient. I don't think figuring out how to fit bitwise/arithmetic dispatches into the atomics_helpers there would be worth the effort.

@sleeepyjack
Copy link
Contributor Author

sleeepyjack commented Jun 24, 2022

Whups, while writing some more tests, I just stumbled over the fp min/max problem, i.e. not having specific instructions for them. I'm gonna adjust codegen.cpp so it emits CAS loop specializations.

@wmaxey
Copy link
Member

wmaxey commented Jun 24, 2022

Whups, while writing some more tests, I just stumbled over the fp min/max problem, i.e. not having specific instructions for them. I'm gonna adjust codegen.cpp so it emits CAS loop specializations.

Hmm, yeah, I'll take a second pass over this component when I make fixes for the current issues as well. There's some overlap here with another problem. #279

@sleeepyjack
Copy link
Contributor Author

After extending the tests, I get some weird errors that I am yet unable to track down:

********************
FAIL: libcu++ :: std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp (137 of 1215)
******************** TEST 'libcu++ :: std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp' FAILED ********************
Command: ['/usr/local/cuda/bin/nvcc', '-o', '/workspaces/libcudacxx/build/test/std/atomics/atomics.types.operations/atomics.types.operations.req/Output/atomic_fetch_sub_explicit.pass.cpp.o', '-x', 'cu', '/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp', '-c', '-v', '-ftemplate-depth=270', '-std=c++17', '-include', '/workspaces/libcudacxx/test/support/nasty_macros.h', '-I/workspaces/libcudacxx/include', '-D__STDC_FORMAT_MACROS', '-D__STDC_LIMIT_MACROS', '-D__STDC_CONSTANT_MACROS', '-Xcompiler', '-fno-exceptions', '-Xcompiler', '-fno-rtti', '-D_LIBCUDACXX_NO_RTTI', '-I/workspaces/libcudacxx/test/support', '-include', '/workspaces/libcudacxx/test/force_include.h', '-I/workspaces/libcudacxx/include', '--extended-lambda', '-gencode=arch=compute_61,code=sm_61', '-Xcudafe', '--display_error_number', '-Werror', 'all-warnings', '-Xcompiler', '-Wall', '-Xcompiler', '-Wextra', '-Xcompiler', '-Werror', '-Xcompiler', '-Wno-literal-suffix', '-Xcompiler', '-Wno-unused-parameter', '-Xcompiler', '-Wno-deprecated-declarations', '-Xcompiler', '-Wno-noexcept-type', '-Xcompiler', '-Wno-unused-function', '-D_LIBCUDACXX_DISABLE_PRAGMA_GCC_SYSTEM_HEADER', '-c']
Exit Code: 1
Standard Error:
--
#$ _NVVM_BRANCH_=nvvm
#$ _SPACE_= 
#$ _CUDART_=cudart
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ _TARGET_SIZE_=
#$ _TARGET_DIR_=
#$ _TARGET_DIR_=targets/x86_64-linux
#$ TOP=/usr/local/cuda/bin/..
#$ NVVMIR_LIBRARY_DIR=/usr/local/cuda/bin/../nvvm/libdevice
#$ LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
#$ PATH=/usr/local/cuda/bin/../nvvm/bin:/usr/local/cuda/bin:/vscode/vscode-server/bin/linux-x64/30d9c6cd9483b2cc586687151bcbcd635f373630/bin/remote-cli:/conda/bin:/conda/condabin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../targets/x86_64-linux/include"  
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib/stubs" "-L/usr/local/cuda/bin/../targets/x86_64-linux/lib"
#$ CUDAFE_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -std=c++17 -D__CUDA_ARCH__=610 -D__CUDA_ARCH_LIST__=610 -E -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -D__CUDACC__ -D__NVCC__ -D__CUDACC_EXTENDED_LAMBDA__  -fno-exceptions -fno-rtti -Wall -Wextra -Werror -Wno-literal-suffix -Wno-unused-parameter -Wno-deprecated-declarations -Wno-noexcept-type -Wno-unused-function -I"/workspaces/libcudacxx/include" -I"/workspaces/libcudacxx/test/support" -I"/workspaces/libcudacxx/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D "__STDC_FORMAT_MACROS" -D "__STDC_LIMIT_MACROS" -D "__STDC_CONSTANT_MACROS" -D "_LIBCUDACXX_NO_RTTI" -D "_LIBCUDACXX_DISABLE_PRAGMA_GCC_SYSTEM_HEADER" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -include "/workspaces/libcudacxx/test/support/nasty_macros.h" -include "/workspaces/libcudacxx/test/force_include.h" -Werror -m64 "/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp" -o "/tmp/tmpxft_000013bf_00000000-7_atomic_fetch_sub_explicit.pass.cpp1.ii" 
#$ cicc --c++17 --gnu_version=110200 --promote_warnings --display_error_number --orig_src_file_name "/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp" --orig_src_path_name "/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp" --allow_managed --pending_instantiations=270 --extended-lambda  --display_error_number  -arch compute_61 -m64 --no-version-ident -ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 --include_file_name "tmpxft_000013bf_00000000-3_atomic_fetch_sub_explicit.pass.fatbin.c" -tused --gen_module_id_file --module_id_file_name "/tmp/tmpxft_000013bf_00000000-4_atomic_fetch_sub_explicit.pass.module_id" --gen_c_file_name "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.cudafe1.c" --stub_file_name "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.cudafe1.stub.c" --gen_device_file_name "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.cudafe1.gpu"  "/tmp/tmpxft_000013bf_00000000-7_atomic_fetch_sub_explicit.pass.cpp1.ii" -o "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.ptx"
#$ ptxas --warning-as-error -arch=sm_61 -m64  "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.ptx"  -o "/tmp/tmpxft_000013bf_00000000-8_atomic_fetch_sub_explicit.pass.cubin" 
#$ fatbinary -64 --cicc-cmdline="-ftz=0 -prec_div=1 -prec_sqrt=1 -fmad=1 " "--image3=kind=elf,sm=61,file=/tmp/tmpxft_000013bf_00000000-8_atomic_fetch_sub_explicit.pass.cubin" --embedded-fatbin="/tmp/tmpxft_000013bf_00000000-3_atomic_fetch_sub_explicit.pass.fatbin.c" 
#$ rm /tmp/tmpxft_000013bf_00000000-3_atomic_fetch_sub_explicit.pass.fatbin
#$ gcc -std=c++17 -D__CUDA_ARCH_LIST__=610 -E -x c++ -D__CUDACC__ -D__NVCC__ -D__CUDACC_EXTENDED_LAMBDA__  -fno-exceptions -fno-rtti -Wall -Wextra -Werror -Wno-literal-suffix -Wno-unused-parameter -Wno-deprecated-declarations -Wno-noexcept-type -Wno-unused-function -I"/workspaces/libcudacxx/include" -I"/workspaces/libcudacxx/test/support" -I"/workspaces/libcudacxx/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"    -D "__STDC_FORMAT_MACROS" -D "__STDC_LIMIT_MACROS" -D "__STDC_CONSTANT_MACROS" -D "_LIBCUDACXX_NO_RTTI" -D "_LIBCUDACXX_DISABLE_PRAGMA_GCC_SYSTEM_HEADER" -D__CUDACC_VER_MAJOR__=11 -D__CUDACC_VER_MINOR__=7 -D__CUDACC_VER_BUILD__=64 -D__CUDA_API_VER_MAJOR__=11 -D__CUDA_API_VER_MINOR__=7 -D__NVCC_DIAG_PRAGMA_SUPPORT__=1 -include "cuda_runtime.h" -include "/workspaces/libcudacxx/test/support/nasty_macros.h" -include "/workspaces/libcudacxx/test/force_include.h" -Werror -m64 "/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp" -o "/tmp/tmpxft_000013bf_00000000-5_atomic_fetch_sub_explicit.pass.cpp4.ii" 
#$ cudafe++ --c++17 --gnu_version=110200 --promote_warnings --display_error_number --orig_src_file_name "/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp" --orig_src_path_name "/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp" --allow_managed --pending_instantiations=270 --extended-lambda  --display_error_number --m64 --parse_templates --gen_c_file_name "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.cudafe1.cpp" --stub_file_name "tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.cudafe1.stub.c" --module_id_file_name "/tmp/tmpxft_000013bf_00000000-4_atomic_fetch_sub_explicit.pass.module_id" "/tmp/tmpxft_000013bf_00000000-5_atomic_fetch_sub_explicit.pass.cpp4.ii" 
#$ gcc -std=c++17 -D__CUDA_ARCH__=610 -D__CUDA_ARCH_LIST__=610 -c -x c++  -DCUDA_DOUBLE_MATH_FUNCTIONS -fno-exceptions -fno-rtti -Wall -Wextra -Werror -Wno-literal-suffix -Wno-unused-parameter -Wno-deprecated-declarations -Wno-noexcept-type -Wno-unused-function -I"/workspaces/libcudacxx/include" -I"/workspaces/libcudacxx/test/support" -I"/workspaces/libcudacxx/include" "-I/usr/local/cuda/bin/../targets/x86_64-linux/include"   -Werror  -ftemplate-depth-270 -m64 "/tmp/tmpxft_000013bf_00000000-6_atomic_fetch_sub_explicit.pass.cudafe1.cpp" -o "/workspaces/libcudacxx/build/test/std/atomics/atomics.types.operations/atomics.types.operations.req/Output/atomic_fetch_sub_explicit.pass.cpp.o" 
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/support/atomic/atomic_base.h: In instantiation of 'cuda::std::__4::__detail::__host::__cxx_atomic_underlying_t<_Tp> cuda::std::__4::__detail::__host::__cxx_atomic_fetch_sub(_Tp*, _Td, cuda::std::__4::memory_order) [with _Tp = volatile cuda::std::__4::__detail::__host::__cxx_atomic_base_impl<float, 0>; _Td = float; cuda::std::__4::__detail::__host::__cxx_atomic_underlying_t<_Tp> = float; cuda::std::__4::memory_order = cuda::std::__4::memory_order]':
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/support/atomic/atomic_cuda.h:284:38:   required from '_Tp cuda::std::__4::__detail::__cxx_atomic_fetch_sub(volatile cuda::std::__4::__detail::__cxx_atomic_base_heterogeneous_impl<_Tp, _Sco, _Ref>*, _Tp, cuda::std::__4::memory_order) [with _Tp = float; int _Sco = 0; bool _Ref = false; cuda::std::__4::memory_order = cuda::std::__4::memory_order]'
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/atomic:1516:32:   required from '_Tp cuda::std::__4::__atomic_base<_Tp, _Sco, true>::fetch_sub(_Tp, cuda::std::__4::memory_order) [with _Tp = float; int _Sco = 0; cuda::std::__4::memory_order = cuda::std::__4::memory_order]'
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/atomic:2288:22:   required from 'typename cuda::std::__4::enable_if<((cuda::std::__4::is_integral<_Tp>::value && (! cuda::std::__4::is_same<_Tp, bool>::value)) || cuda::std::__4::is_floating_point<_Tp>::value), _Tp>::type cuda::std::__4::atomic_fetch_sub_explicit(cuda::std::__4::atomic<_Tp>*, _Tp, cuda::std::__4::memory_order) [with _Tp = float; typename cuda::std::__4::enable_if<((cuda::std::__4::is_integral<_Tp>::value && (! cuda::std::__4::is_same<_Tp, bool>::value)) || cuda::std::__4::is_floating_point<_Tp>::value), _Tp>::type = float; cuda::std::__4::memory_order = cuda::std::__4::memory_order]'
/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp:49:38:   required from 'void TestFn<T, Selector, <anonymous> >::operator()() const [with T = float; Selector = local_memory_selector; cuda::std::__4::__detail::thread_scope <anonymous> = cuda::std::__4::__detail::thread_scope_system]'
/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_helpers.h:74:40:   required from 'void TestEachFloatingPointType<TestFunctor, Selector, Scope>::operator()() const [with TestFunctor = TestFn; Selector = local_memory_selector; cuda::std::__4::__detail::thread_scope Scope = cuda::std::__4::__detail::thread_scope_system]'
/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp:95:61:   required from here
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/support/atomic/atomic_base.h:136:26: error: operand type 'volatile float*' is incompatible with argument 1 of '__atomic_fetch_sub'
  136 |   return __atomic_fetch_sub(__a_tmp, __delta * __skip_v,
      |        ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~                                  
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/support/atomic/atomic_base.h: In instantiation of 'cuda::std::__4::__detail::__host::__cxx_atomic_underlying_t<_Tp> cuda::std::__4::__detail::__host::__cxx_atomic_fetch_sub(_Tp*, _Td, cuda::std::__4::memory_order) [with _Tp = volatile cuda::std::__4::__detail::__host::__cxx_atomic_base_impl<double, 0>; _Td = double; cuda::std::__4::__detail::__host::__cxx_atomic_underlying_t<_Tp> = double; cuda::std::__4::memory_order = cuda::std::__4::memory_order]':
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/support/atomic/atomic_cuda.h:284:38:   required from '_Tp cuda::std::__4::__detail::__cxx_atomic_fetch_sub(volatile cuda::std::__4::__detail::__cxx_atomic_base_heterogeneous_impl<_Tp, _Sco, _Ref>*, _Tp, cuda::std::__4::memory_order) [with _Tp = double; int _Sco = 0; bool _Ref = false; cuda::std::__4::memory_order = cuda::std::__4::memory_order]'
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/atomic:1516:32:   required from '_Tp cuda::std::__4::__atomic_base<_Tp, _Sco, true>::fetch_sub(_Tp, cuda::std::__4::memory_order) [with _Tp = double; int _Sco = 0; cuda::std::__4::memory_order = cuda::std::__4::memory_order]'
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/atomic:2288:22:   required from 'typename cuda::std::__4::enable_if<((cuda::std::__4::is_integral<_Tp>::value && (! cuda::std::__4::is_same<_Tp, bool>::value)) || cuda::std::__4::is_floating_point<_Tp>::value), _Tp>::type cuda::std::__4::atomic_fetch_sub_explicit(cuda::std::__4::atomic<_Tp>*, _Tp, cuda::std::__4::memory_order) [with _Tp = double; typename cuda::std::__4::enable_if<((cuda::std::__4::is_integral<_Tp>::value && (! cuda::std::__4::is_same<_Tp, bool>::value)) || cuda::std::__4::is_floating_point<_Tp>::value), _Tp>::type = double; cuda::std::__4::memory_order = cuda::std::__4::memory_order]'
/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp:49:38:   required from 'void TestFn<T, Selector, <anonymous> >::operator()() const [with T = double; Selector = local_memory_selector; cuda::std::__4::__detail::thread_scope <anonymous> = cuda::std::__4::__detail::thread_scope_system]'
/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_helpers.h:75:41:   required from 'void TestEachFloatingPointType<TestFunctor, Selector, Scope>::operator()() const [with TestFunctor = TestFn; Selector = local_memory_selector; cuda::std::__4::__detail::thread_scope Scope = cuda::std::__4::__detail::thread_scope_system]'
/workspaces/libcudacxx/.upstream-tests/test/std/atomics/atomics.types.operations/atomics.types.operations.req/atomic_fetch_sub_explicit.pass.cpp:95:61:   required from here
/workspaces/libcudacxx/include/cuda/std/detail/libcxx/include/support/atomic/atomic_base.h:136:26: error: operand type 'volatile double*' is incompatible with argument 1 of '__atomic_fetch_sub'
# --error 0x1 --
--

Compilation failed unexpectedly!
********************

@wmaxey do you know what this could be?

@wmaxey
Copy link
Member

wmaxey commented Jun 29, 2022

After extending the tests, I get some weird errors that I am yet unable to track down:

...

@wmaxey do you know what this could be?

I am not encountering this on your latest, let me try with some other configs. Maybe some changes I had made on top of your branch fixed this.

@sleeepyjack
Copy link
Contributor Author

I am not encountering this on your latest, let me try with some other configs. Maybe some changes I had made on top of your branch fixed this.

For reference: I am running CUDA CTK 11.7.0 with gcc 11.2 in an ubuntu20.04 container. I only have a Pascal card available (sm_61).

@wmaxey
Copy link
Member

wmaxey commented Jul 6, 2022

We'll need to break out float add/sub into CAS loops. On MSVC we brutally cast to long*/int* which causes invalid results.

I can make these changes on top of your changes.

@wmaxey
Copy link
Member

wmaxey commented Jul 7, 2022

I've added a patch that fixes a few issues, let me know if this resolves your problems as well.

@sleeepyjack
Copy link
Contributor Author

We'll need to break out float add/sub into CAS loops.

Ah, I didn't think of the host side. Good catch!

I've added a patch that fixes a few issues,

Thanks a lot! This fixes most of the previously failing tests.

********************
********************
Failed Tests (4):
  libcu++ :: cuda/bad_atomic_alignment.pass.cpp
  libcu++ :: std/atomics/atomics.types.generic/atomic_copyable.pass.cpp
  libcu++ :: std/atomics/atomics.types.generic/integral_ref.pass.cpp
  libcu++ :: std/atomics/atomics.types.generic/integral_ref_constness.pass.cpp


Testing Time: 419.02s
  Unsupported      :  101
  Passed           : 1091
  Expectedly Failed:   19
  Failed           :    4

real    7m1.338s
user    46m8.690s
sys     8m27.502s
################################################################################
Score: 99.63%

The last four tests that still fail throw the exact same error: cudaErrorInvalidAddressSpace: operation not supported on global/shared address space.

@wmaxey
Copy link
Member

wmaxey commented Jul 7, 2022

The last four tests that still fail throw the exact same error: cudaErrorInvalidAddressSpace: operation not supported on global/shared address space.

That's a known issue for Pascal. I do not know the cause, but believe it may have something to do with an unsupported size operand.

I think the only remaining thing would be tests similar to atomics.types.generic/integral.pass.cpp. We don't have CAS/ld/st coverage for floating point types. Which I'm positive works, but we should be complete. :)

@sleeepyjack
Copy link
Contributor Author

I think the only remaining thing would be tests similar to atomics.types.generic/integral.pass.cpp.

/done

Had to break them out into separate files although this introduces a lot of duplicate code.
We could do something like if constexpr (std::is_integral<T>::value) to mask out those operations in the integral_* tests that aren't available for fp types but this would require C++17.

@wmaxey
Copy link
Member

wmaxey commented Jul 8, 2022

Thanks for all the effort Daniel! I'll see if there's any issues with the changes again and if not I think it's okay to merge. Though I might remove some comments that documented the 'unsigned' nature of min/max.

@jrhemstad jrhemstad requested a review from griwes July 11, 2022 14:54
@jrhemstad
Copy link
Collaborator

Would be nice to get @griwes to review this as well.

@wmaxey
Copy link
Member

wmaxey commented Jul 11, 2022

https://builds4u.nvidia.com/dvs/#/change/3154615863024800.1?eventType=Virtual&dvs_showStaging=on
https://scbuilds4u/dvs/#/change/3154615539432407.1?eventType=Virtual

DVS is clean, but SC-DVS found ICEs that I am able to repro on VC129, will try to figure out what's going on there.

@wmaxey
Copy link
Member

wmaxey commented Jul 11, 2022

C:\sbf\libcudacxx\.upstream-tests\test\std\atomics\atomics.types.operations\atomics.types.operations.wait\../atomics.types.operations.req/atomic_helpers.h(87): note: while compiling class template member function 'void TestEachAtomicType<TestFn,shared_memory_selector,cuda::std::__4::__detail::thread_scope_system>::operator ()(void) const'
C:\sbf\libcudacxx\.upstream-tests\test\std\atomics\atomics.types.operations\atomics.types.operations.wait\atomic_wait.pass.cpp(91): note: see reference to function template instantiation 'void TestEachAtomicType<TestFn,shared_memory_selector,cuda::std::__4::__detail::thread_scope_system>::operator ()(void) const' being compiled
C:\sbf\libcudacxx\.upstream-tests\test\std\atomics\atomics.types.operations\atomics.types.operations.wait\atomic_wait.pass.cpp(91): note: see reference to class template instantiation 'TestEachAtomicType<TestFn,shared_memory_selector,cuda::std::__4::__detail::thread_scope_system>' being compiled
INTERNAL COMPILER ERROR in 'C:\msbuild\2019\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64\cl.exe'
    Please choose the Technical Support command on the Visual C++
    Help menu, or open the Technical Support help file for more information
nvcc error   : 'cl' died with status 0xC0000005 (ACCESS_VIOLATION)
# --error 0xc0000005 --

Copy link
Collaborator

@griwes griwes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general this looks good to me; most comments I have are in tests, so I'm going to give this a 👍 and leave it up to @wmaxey as to whether these should gate landing this PR or not. The one comment I have in the actual change is rather minor.

@@ -60,14 +60,58 @@ struct TestFn {
}
};

template <template<typename, typename> typename Selector, cuda::thread_scope ThreadScope>
struct TestFn<int, Selector, ThreadScope> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This specializes specifically for int - shouldn't it, instead, be specialized for any signed integral type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifically, I'm attempting to get a guarantee that signed math is working as expected. It would be completely fair to split it into unsigned and signed specializations. Perhaps more tests for this API are needed. ;)

@@ -60,14 +60,59 @@ struct TestFn {
}
};

template <template<typename, typename> typename Selector, cuda::thread_scope ThreadScope>
struct TestFn<int, Selector, ThreadScope> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the comment above.

__host__ __device__
void operator()() const {
TestFunctor<float, Selector, Scope>()();
TestFunctor<double, Selector, Scope>()();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also have a host-only call to TestFunctor<long double, Selector, Scope>()() here?


int main(int, char**)
{
// this test would instantiate more cases than just the ones below
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integral tests here instantiate the test functions for all integer types. Here we only instantiate for two floating point types. It should be fine to remove this comment and have all combinations of scopes and memory selectors actually tested below.


int main(int, char**)
{
// this test would instantiate more cases than just the ones below
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in the non-ref version of this test.


int main(int, char**)
{
// this test would instantiate more cases than just the ones below
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And once again here.

__host__ __device__
void operator()() const {
TestFunctor<float, Selector, Scope>()();
TestFunctor<double, Selector, Scope>()();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as earlier - on the host here should also be a call for long double.

__host__ __device__
void operator()() const {
TestFunctor<float, Selector, Scope>()();
TestFunctor<double, Selector, Scope>()();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And one more time here.

Comment on lines 2715 to 2716
typedef atomic<float> atomic_float;
typedef atomic<double> atomic_double;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these two are in the standard. The other ones are there mainly for C interop; do we want these two here? If we do, we should also have a (possibly host-only) atomic_long_double.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to say we probably don't want these. We'll need to open the door eventually for a half and just diverge further with what's available on H/D.

Selector<A, constructor_initializer> sel;
A & t = *sel.construct();
t = int(-1);
assert(t.fetch_max(4) == int(-1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not connect with the comment.

We are still testing a smaller int versus a larger threshold. Why is this changed and could we update the comment?

If this does some horrible conversion to unsigned magic we should test that explicitly and keep the basic 3 vs 2 test

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no horrible conversion, it's just specifically testing int types. The cast is, in truth, unnecessary.

Selector<volatile A, constructor_initializer> sel;
volatile A & t = *sel.construct();
t = int(5);
assert(t.fetch_min(-1) == int(5));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is changing the values of the test, when we actually want to only add volatile. I would either keep them the same of change them consistently throughout the file.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did want to change the test in this case, but there is greater value in making a proper unsigned/signed split test so that we can guarantee that add/sub/max/min are behaving correctly.

assert(obj == T(1));
assert(obj.load() == T(1));
assert(obj.load(cuda::std::memory_order_acquire) == T(1));
assert(obj.exchange(T(2)) == T(1));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the libcxx style, but could we at least add some newlines between the different functionality that is tested.

Reading a gazillion consecutive lines drains a lot of brain power

…appening on MSVC that seems to cause on internal compiler error
template <typename _Tp, int _Sco,
typename _Base = typename conditional<__cxx_is_always_lock_free<_Tp>::__value,
template <typename _Tp, int _Sco>
struct __cxx_atomic_impl_conditional {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something about using is cursed. This is very reminiscent of several tuple fixes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DVS results will be posted soon, but this fixed builds on all the compilers I was able to get repros on.

@wmaxey wmaxey requested a review from griwes July 16, 2022 01:48
@wmaxey
Copy link
Member

wmaxey commented Jul 16, 2022

@griwes I've made the tests more straightforward, all signed types now just get extra testing and there isn't some strange int only overload.

@sleeepyjack I'd like to see the atomic_float/double removed. That just opens up questions about long double I guess.

@wmaxey
Copy link
Member

wmaxey commented Jul 20, 2022

Latest changes made sure that bitwise types are a superset of arithmetic types. I did some work to refactor the ref/non-ref classes as well, they should be easier to maintain in the future.

@wmaxey wmaxey added the testing: internal ci passed Passed internal NVIDIA CI (DVS). label Jul 20, 2022
__atomic_base_storage(_Storage&& __a) _NOEXCEPT : __a_(forward<_Storage>(__a)) {}
};

template <class _Tp, bool _Cq, typename _Storage>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we get a more descriptive name instead of _Cq?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll push an update with _Cq->_ConstQualified


_LIBCUDACXX_INLINE_VISIBILITY _LIBCUDACXX_CONSTEXPR
__atomic_base_ref(_Tp& __a) _NOEXCEPT : __a_(__a) {}
__atomic_base_core(_Storage&& __a) _NOEXCEPT : __atomic_base_storage<_Tp, _Storage>(forward<_Storage>(__a)) {}
Copy link
Collaborator

@miscco miscco Jul 20, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically we are required to qualify forward as it is a non ugly function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.

__atomic_base(const __atomic_base&) = delete;
__atomic_base(__atomic_base&&) = delete;
__atomic_base_storage() = default;
__atomic_base_storage(const __atomic_base_storage&) = default;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am having some troubles correctly parsing the difference between the various classes. AFAICT the only difference is whether the special member functions are deleted / defaulted.

Given that the implementation of the classes is considerable, would it make sense to just derive from a single base class that does this for us, like optional does

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you referring to all the __atomic_base classes? The main difference is const qualifiers. atomic_ref has to allow value updates through const.

@wmaxey wmaxey added the enhancement New feature or request. label Jul 20, 2022
@wmaxey wmaxey merged commit e489e9b into NVIDIA:main Jul 21, 2022
@sleeepyjack
Copy link
Contributor Author

Wooohoo, first contribution merged ☑️

@sleeepyjack sleeepyjack deleted the feature/fp_atomics branch July 22, 2022 00:45
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug: functional Does not work as intended. enhancement New feature or request. testing: internal ci passed Passed internal NVIDIA CI (DVS).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cuda::atomic_ref<float>::fetch_min and fetch_max incorrect results on values of different signs
5 participants