
Deprecate cub::{min, max} and replace internal uses with those from libcu++ #3419

Merged · 2 commits · Jan 18, 2025

Conversation

@miscco
Collaborator

commented Jan 16, 2025

Fixes #3404

@miscco requested review from a team as code owners on Jan 16, 2025 09:42
@miscco added the labels cub (For all items related to CUB) and backport branch/2.8.x on Jan 16, 2025
@miscco requested review from elstehle and wmaxey on Jan 16, 2025 09:43
Contributor

🟩 CI finished in 2h 03m: Pass: 100%/78 | Total: 2d 07h | Avg: 43m 01s | Max: 1h 14m | Hits: 104%/12760
  • 🟩 cub: Pass: 100%/38 | Total: 1d 09h | Avg: 53m 35s | Max: 1h 11m | Hits: 36%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 08h | Avg: 53m 22s | Max:  1h 11m | Hits:  36%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 30s | Max: 58m 16s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 01m | Avg:  1h 00m | Max:  1h 07m | Hits:  36%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
      🟩 12.6               Pass: 100%/31  | Total:  1d 02h | Avg: 51m 33s | Max:  1h 11m | Hits:  36%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 01m | Avg:  1h 00m | Max:  1h 07m | Hits:  36%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
      🟩 nvcc12.6           Pass: 100%/29  | Total:  1d 00h | Avg: 50m 47s | Max:  1h 11m | Hits:  36%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc               Pass: 100%/36  | Total:  1d 07h | Avg: 53m 05s | Max:  1h 11m | Hits:  36%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  4h 00m | Avg:  1h 00m | Max:  1h 01m
      🟩 Clang15            Pass: 100%/1   | Total: 58m 18s | Avg: 58m 18s | Max: 58m 18s
      🟩 Clang16            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 Clang17            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 Clang18            Pass: 100%/7   | Total:  5h 53m | Avg: 50m 33s | Max:  1h 03m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 51m | Avg: 55m 31s | Max: 55m 36s
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC9               Pass: 100%/2   | Total:  1h 58m | Avg: 59m 15s | Max:  1h 00m
      🟩 GCC10              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC11              Pass: 100%/1   | Total: 58m 15s | Avg: 58m 15s | Max: 58m 15s
      🟩 GCC12              Pass: 100%/3   | Total:  2h 11m | Avg: 43m 47s | Max:  1h 04m
      🟩 GCC13              Pass: 100%/8   | Total:  5h 10m | Avg: 38m 47s | Max:  1h 00m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 18m | Avg:  1h 09m | Max:  1h 11m | Hits:  36%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  36%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 12h 55m | Avg: 55m 24s | Max:  1h 03m
      🟩 GCC                Pass: 100%/18  | Total: 14h 12m | Avg: 47m 21s | Max:  1h 04m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 31m | Avg:  1h 07m | Max:  1h 11m | Hits:  36%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 28s | Max: 41m 35s
      🟩 v100               Pass: 100%/36  | Total:  1d 08h | Avg: 54m 42s | Max:  1h 11m | Hits:  36%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 06h | Avg: 59m 18s | Max:  1h 11m | Hits:  36%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 31m 54s | Avg: 31m 54s | Max: 31m 54s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 45s | Avg: 14m 45s | Max: 14m 45s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 22m | Avg: 27m 23s | Max: 41m 35s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 09m | Avg: 34m 40s | Max: 39m 48s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total:  1h 06m | Avg: 33m 28s | Max: 41m 35s
      🟩 90a                Pass: 100%/1   | Total: 24m 21s | Avg: 24m 21s | Max: 24m 21s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 14h 30m | Avg:  1h 02m | Max:  1h 11m | Hits:  36%/2655  
      🟩 20                 Pass: 100%/24  | Total: 19h 26m | Avg: 48m 36s | Max:  1h 07m | Hits:  36%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 21h 13m | Avg: 34m 25s | Max: 1h 14m | Hits: 130%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 37s | Avg: 18m 48s | Max: 25m 51s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 20h 12m | Avg: 34m 39s | Max:  1h 14m | Hits: 130%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 27s | Max: 31m 50s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 05m | Avg: 37m 04s | Max:  1h 02m | Hits:  58%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
      🟩 12.6               Pass: 100%/30  | Total: 15h 43m | Avg: 31m 27s | Max:  1h 13m | Hits: 148%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 53m 13s | Avg: 26m 36s | Max: 27m 51s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 05m | Avg: 37m 04s | Max:  1h 02m | Hits:  58%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 50m | Avg: 31m 48s | Max:  1h 13m | Hits: 148%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 53m 13s | Avg: 26m 36s | Max: 27m 51s
      🟩 nvcc               Pass: 100%/35  | Total: 20h 20m | Avg: 34m 52s | Max:  1h 14m | Hits: 130%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 58m | Avg: 29m 43s | Max: 31m 03s
      🟩 Clang15            Pass: 100%/1   | Total: 31m 27s | Avg: 31m 27s | Max: 31m 27s
      🟩 Clang16            Pass: 100%/1   | Total: 30m 18s | Avg: 30m 18s | Max: 30m 18s
      🟩 Clang17            Pass: 100%/1   | Total: 32m 47s | Avg: 32m 47s | Max: 32m 47s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 53s | Max: 32m 44s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 00m | Avg: 30m 23s | Max: 30m 57s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 19s | Avg: 32m 19s | Max: 32m 19s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 09s | Max: 34m 58s
      🟩 GCC10              Pass: 100%/1   | Total: 32m 10s | Avg: 32m 10s | Max: 32m 10s
      🟩 GCC11              Pass: 100%/1   | Total: 36m 14s | Avg: 36m 14s | Max: 36m 14s
      🟩 GCC12              Pass: 100%/1   | Total: 33m 58s | Avg: 33m 58s | Max: 33m 58s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 58m | Avg: 22m 16s | Max: 37m 16s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m | Hits:  82%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  3h 00m | Avg:  1h 00m | Max:  1h 13m | Hits: 162%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 20m | Avg: 27m 11s | Max: 32m 47s
      🟩 GCC                Pass: 100%/16  | Total:  7h 19m | Avg: 27m 29s | Max: 37m 16s
      🟩 MSVC               Pass: 100%/5   | Total:  5h 08m | Avg:  1h 01m | Max:  1h 13m | Hits: 130%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 21h 13m | Avg: 34m 25s | Max:  1h 14m | Hits: 130%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 19h 45m | Avg: 38m 14s | Max:  1h 14m | Hits:  71%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 50m 01s | Avg: 16m 40s | Max: 34m 17s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 38m 38s | Avg: 12m 52s | Max: 14m 20s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 17m 49s | Avg: 17m 49s | Max: 17m 49s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  9h 43m | Avg: 41m 42s | Max:  1h 14m | Hits:  75%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 52m | Avg: 31m 04s | Max:  1h 13m | Hits: 212%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 51s | Avg: 5m 55s | Max: 9m 18s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 33s | Avg:  2m 33s | Max:  2m 33s
      🟩 Test               Pass: 100%/1   | Total:  9m 18s | Avg:  9m 18s | Max:  9m 18s
    
  • 🟩 python: Pass: 100%/1 | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@bernhardmgruber
Contributor

pre-commit.ci autofix

copy-pr-bot bot commented Jan 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bernhardmgruber
Contributor

/ok to test

@elstehle
Collaborator

I did a spot check for SASS changes in our merge_sort benchmarks and noticed a difference in the generated device code. I will run the benchmarks to verify that we do not see a performance regression.

@elstehle
Collaborator

elstehle commented Jan 16, 2025

Performance for merge_sort.keys looks fine:

Benchmark results for merge_sort.keys on H100:
T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 49.463 us 0.52% 49.389 us 0.54% -0.074 us -0.15% SAME
I8 I32 2^20 1 121.824 us 0.34% 123.032 us 0.36% 1.208 us 0.99% SLOW
I8 I32 2^24 1 841.153 us 0.21% 828.317 us 0.27% -12.836 us -1.53% FAST
I8 I32 2^28 1 14.761 ms 0.03% 14.635 ms 0.06% -126.264 us -0.86% FAST
I8 I32 2^16 0.201 49.157 us 0.54% 49.198 us 1.17% 0.041 us 0.08% SAME
I8 I32 2^20 0.201 119.114 us 0.27% 120.865 us 0.40% 1.751 us 1.47% SLOW
I8 I32 2^24 0.201 791.923 us 0.15% 780.104 us 0.25% -11.819 us -1.49% FAST
I8 I32 2^28 0.201 13.724 ms 0.04% 13.624 ms 0.06% -100.034 us -0.73% FAST
I8 U32 2^16 1 51.402 us 0.66% 51.218 us 0.52% -0.185 us -0.36% SAME
I8 U32 2^20 1 124.559 us 0.37% 125.102 us 0.36% 0.543 us 0.44% SLOW
I8 U32 2^24 1 847.858 us 0.19% 834.785 us 0.26% -13.073 us -1.54% FAST
I8 U32 2^28 1 14.752 ms 0.03% 14.649 ms 0.05% -102.680 us -0.70% FAST
I8 U32 2^16 0.201 50.447 us 0.54% 50.398 us 0.55% -0.048 us -0.10% SAME
I8 U32 2^20 0.201 120.788 us 0.29% 122.083 us 0.31% 1.295 us 1.07% SLOW
I8 U32 2^24 0.201 794.923 us 0.12% 782.146 us 0.21% -12.777 us -1.61% FAST
I8 U32 2^28 0.201 13.737 ms 0.02% 13.655 ms 0.10% -82.490 us -0.60% FAST
I8 I64 2^16 1 52.552 us 0.64% 52.256 us 0.49% -0.296 us -0.56% FAST
I8 I64 2^20 1 127.572 us 0.36% 127.600 us 0.30% 0.027 us 0.02% SAME
I8 I64 2^24 1 854.442 us 0.20% 841.724 us 0.24% -12.719 us -1.49% FAST
I8 I64 2^28 1 14.929 ms 0.02% 14.831 ms 0.10% -97.301 us -0.65% FAST
I8 I64 2^16 0.201 51.386 us 0.60% 51.269 us 0.59% -0.117 us -0.23% SAME
I8 I64 2^20 0.201 124.622 us 0.24% 124.501 us 0.29% -0.121 us -0.10% SAME
I8 I64 2^24 0.201 804.059 us 0.13% 792.843 us 0.21% -11.216 us -1.39% FAST
I8 I64 2^28 0.201 13.926 ms 0.03% 13.830 ms 0.06% -95.251 us -0.68% FAST
I8 U64 2^16 1 52.276 us 0.69% 51.994 us 0.49% -0.281 us -0.54% FAST
I8 U64 2^20 1 127.315 us 0.36% 126.928 us 0.33% -0.387 us -0.30% SAME
I8 U64 2^24 1 853.614 us 0.21% 841.241 us 0.23% -12.374 us -1.45% FAST
I8 U64 2^28 1 14.932 ms 0.03% 14.831 ms 0.06% -101.315 us -0.68% FAST
I8 U64 2^16 0.201 51.197 us 0.50% 51.036 us 0.51% -0.161 us -0.31% SAME
I8 U64 2^20 0.201 124.022 us 0.27% 123.637 us 0.30% -0.386 us -0.31% FAST
I8 U64 2^24 0.201 803.008 us 0.13% 798.753 us 0.19% -4.255 us -0.53% FAST
I8 U64 2^28 0.201 13.927 ms 0.03% 13.832 ms 0.05% -94.224 us -0.68% FAST
I16 I32 2^16 1 54.006 us 0.50% 53.878 us 0.60% -0.128 us -0.24% SAME
I16 I32 2^20 1 132.508 us 0.31% 132.298 us 0.38% -0.210 us -0.16% SAME
I16 I32 2^24 1 970.299 us 0.11% 966.232 us 0.20% -4.068 us -0.42% FAST
I16 I32 2^28 1 17.404 ms 1.47% 17.443 ms 1.42% 38.204 us 0.22% SAME
I16 I32 2^16 0.201 53.999 us 0.59% 54.077 us 0.67% 0.078 us 0.14% SAME
I16 I32 2^20 0.201 130.158 us 0.31% 130.709 us 0.31% 0.551 us 0.42% SLOW
I16 I32 2^24 0.201 914.123 us 0.13% 914.716 us 0.13% 0.593 us 0.06% SAME
I16 I32 2^28 0.201 15.533 ms 0.04% 15.545 ms 0.03% 12.455 us 0.08% SLOW
I16 U32 2^16 1 53.990 us 0.55% 54.036 us 0.57% 0.046 us 0.09% SAME
I16 U32 2^20 1 132.654 us 0.37% 132.973 us 0.42% 0.319 us 0.24% SAME
I16 U32 2^24 1 972.231 us 0.11% 967.598 us 0.11% -4.633 us -0.48% FAST
I16 U32 2^28 1 17.589 ms 1.32% 17.552 ms 1.37% -37.448 us -0.21% SAME
I16 U32 2^16 0.201 53.487 us 0.81% 53.691 us 0.80% 0.203 us 0.38% SAME
I16 U32 2^20 0.201 128.467 us 0.32% 129.950 us 0.44% 1.483 us 1.15% SLOW
I16 U32 2^24 0.201 912.183 us 0.14% 911.460 us 0.14% -0.722 us -0.08% SAME
I16 U32 2^28 0.201 15.515 ms 0.07% 15.525 ms 0.04% 9.427 us 0.06% SLOW
I16 I64 2^16 1 54.699 us 0.51% 54.935 us 0.55% 0.236 us 0.43% SAME
I16 I64 2^20 1 135.015 us 0.34% 135.977 us 0.42% 0.962 us 0.71% SLOW
I16 I64 2^24 1 975.570 us 0.11% 977.688 us 0.10% 2.117 us 0.22% SLOW
I16 I64 2^28 1 17.829 ms 1.29% 17.808 ms 1.28% -20.584 us -0.12% SAME
I16 I64 2^16 0.201 54.468 us 0.85% 54.696 us 0.69% 0.229 us 0.42% SAME
I16 I64 2^20 0.201 132.462 us 0.34% 133.119 us 0.34% 0.658 us 0.50% SLOW
I16 I64 2^24 0.201 921.121 us 0.13% 922.413 us 0.13% 1.293 us 0.14% SLOW
I16 I64 2^28 0.201 15.606 ms 0.03% 15.610 ms 0.04% 4.031 us 0.03% SAME
I16 U64 2^16 1 54.520 us 0.53% 54.799 us 0.49% 0.279 us 0.51% SLOW
I16 U64 2^20 1 134.027 us 0.34% 135.091 us 0.35% 1.063 us 0.79% SLOW
I16 U64 2^24 1 977.635 us 0.11% 980.469 us 0.10% 2.834 us 0.29% SLOW
I16 U64 2^28 1 17.893 ms 1.29% 17.852 ms 1.33% -40.364 us -0.23% SAME
I16 U64 2^16 0.201 54.355 us 0.76% 54.564 us 1.06% 0.209 us 0.39% SAME
I16 U64 2^20 0.201 131.227 us 0.30% 132.250 us 0.31% 1.023 us 0.78% SLOW
I16 U64 2^24 0.201 921.965 us 0.12% 920.162 us 0.12% -1.803 us -0.20% FAST
I16 U64 2^28 0.201 15.618 ms 0.06% 15.603 ms 0.03% -15.155 us -0.10% FAST
I32 I32 2^16 1 51.814 us 0.61% 52.088 us 0.61% 0.274 us 0.53% SAME
I32 I32 2^20 1 127.786 us 0.33% 128.857 us 0.45% 1.071 us 0.84% SLOW
I32 I32 2^24 1 1.279 ms 0.23% 1.279 ms 0.22% 0.023 us 0.00% SAME
I32 I32 2^28 1 23.779 ms 0.50% 23.762 ms 0.50% -16.417 us -0.07% SAME
I32 I32 2^16 0.201 52.087 us 1.39% 52.200 us 1.12% 0.113 us 0.22% SAME
I32 I32 2^20 0.201 125.105 us 0.32% 126.689 us 0.46% 1.583 us 1.27% SLOW
I32 I32 2^24 0.201 1.265 ms 0.21% 1.266 ms 0.21% 0.588 us 0.05% SAME
I32 I32 2^28 0.201 22.964 ms 0.02% 22.959 ms 0.02% -4.834 us -0.02% SAME
I32 U32 2^16 1 51.310 us 0.71% 51.673 us 0.59% 0.363 us 0.71% SLOW
I32 U32 2^20 1 126.995 us 0.32% 127.608 us 0.43% 0.613 us 0.48% SLOW
I32 U32 2^24 1 1.278 ms 0.23% 1.277 ms 0.23% -0.734 us -0.06% SAME
I32 U32 2^28 1 23.784 ms 0.50% 23.771 ms 0.50% -13.230 us -0.06% SAME
I32 U32 2^16 0.201 51.571 us 1.60% 51.835 us 1.66% 0.264 us 0.51% SAME
I32 U32 2^20 0.201 123.752 us 0.36% 124.536 us 0.42% 0.783 us 0.63% SLOW
I32 U32 2^24 0.201 1.265 ms 0.20% 1.263 ms 0.21% -2.211 us -0.17% SAME
I32 U32 2^28 0.201 22.967 ms 0.02% 22.960 ms 0.02% -6.992 us -0.03% FAST
I32 I64 2^16 1 52.580 us 0.59% 52.453 us 0.61% -0.128 us -0.24% SAME
I32 I64 2^20 1 130.837 us 0.51% 129.731 us 0.38% -1.106 us -0.85% FAST
I32 I64 2^24 1 1.281 ms 0.23% 1.281 ms 0.24% 0.425 us 0.03% SAME
I32 I64 2^28 1 23.801 ms 0.50% 23.792 ms 0.50% -8.630 us -0.04% SAME
I32 I64 2^16 0.201 52.714 us 1.69% 52.662 us 1.18% -0.052 us -0.10% SAME
I32 I64 2^20 0.201 128.234 us 0.36% 127.459 us 0.29% -0.775 us -0.60% FAST
I32 I64 2^24 0.201 1.267 ms 0.21% 1.267 ms 0.20% -0.283 us -0.02% SAME
I32 I64 2^28 0.201 22.964 ms 0.02% 22.965 ms 0.02% 0.817 us 0.00% SAME
I32 U64 2^16 1 52.284 us 0.64% 52.160 us 0.64% -0.124 us -0.24% SAME
I32 U64 2^20 1 129.375 us 0.45% 128.636 us 0.38% -0.740 us -0.57% FAST
I32 U64 2^24 1 1.280 ms 0.22% 1.280 ms 0.21% -0.053 us -0.00% SAME
I32 U64 2^28 1 23.809 ms 0.50% 23.802 ms 0.50% -7.001 us -0.03% SAME
I32 U64 2^16 0.201 52.437 us 1.29% 52.420 us 1.35% -0.017 us -0.03% SAME
I32 U64 2^20 0.201 127.115 us 0.44% 126.071 us 0.32% -1.044 us -0.82% FAST
I32 U64 2^24 0.201 1.267 ms 0.21% 1.266 ms 0.19% -0.901 us -0.07% SAME
I32 U64 2^28 0.201 22.964 ms 0.02% 22.964 ms 0.02% 0.301 us 0.00% SAME
I64 I32 2^16 1 60.495 us 0.43% 60.007 us 0.50% -0.488 us -0.81% FAST
I64 I32 2^20 1 185.400 us 0.31% 185.140 us 0.29% -0.260 us -0.14% SAME
I64 I32 2^24 1 2.764 ms 0.14% 2.762 ms 0.13% -2.301 us -0.08% SAME
I64 I32 2^28 1 53.141 ms 0.50% 53.095 ms 0.50% -46.386 us -0.09% SAME
I64 I32 2^16 0.201 61.127 us 0.57% 60.656 us 0.53% -0.471 us -0.77% FAST
I64 I32 2^20 0.201 195.889 us 0.33% 194.767 us 0.35% -1.122 us -0.57% FAST
I64 I32 2^24 0.201 2.778 ms 0.11% 2.776 ms 0.11% -1.581 us -0.06% SAME
I64 I32 2^28 0.201 52.697 ms 0.01% 52.694 ms 0.02% -2.473 us -0.00% SAME
I64 U32 2^16 1 59.634 us 0.48% 60.215 us 0.52% 0.581 us 0.97% SLOW
I64 U32 2^20 1 184.376 us 0.33% 184.530 us 0.30% 0.154 us 0.08% SAME
I64 U32 2^24 1 2.763 ms 0.13% 2.762 ms 0.14% -0.775 us -0.03% SAME
I64 U32 2^28 1 52.817 ms 0.36% 52.842 ms 0.44% 25.201 us 0.05% SAME
I64 U32 2^16 0.201 60.195 us 0.63% 60.716 us 0.60% 0.521 us 0.86% SLOW
I64 U32 2^20 0.201 194.821 us 0.38% 194.750 us 0.34% -0.071 us -0.04% SAME
I64 U32 2^24 0.201 2.777 ms 0.11% 2.776 ms 0.12% -0.673 us -0.02% SAME
I64 U32 2^28 0.201 52.698 ms 0.01% 52.688 ms 0.01% -10.225 us -0.02% FAST
I64 I64 2^16 1 61.323 us 0.43% 61.280 us 0.46% -0.042 us -0.07% SAME
I64 I64 2^20 1 186.691 us 0.34% 186.530 us 0.34% -0.161 us -0.09% SAME
I64 I64 2^24 1 2.767 ms 0.13% 2.766 ms 0.14% -1.289 us -0.05% SAME
I64 I64 2^28 1 52.829 ms 0.44% 52.810 ms 0.36% -19.020 us -0.04% SAME
I64 I64 2^16 0.201 61.953 us 0.68% 61.625 us 0.66% -0.328 us -0.53% SAME
I64 I64 2^20 0.201 197.460 us 0.31% 197.068 us 0.37% -0.392 us -0.20% SAME
I64 I64 2^24 0.201 2.780 ms 0.11% 2.777 ms 0.12% -2.925 us -0.11% SAME
I64 I64 2^28 0.201 52.706 ms 0.02% 52.663 ms 0.01% -42.700 us -0.08% FAST
I64 U64 2^16 1 60.934 us 0.45% 60.641 us 0.45% -0.294 us -0.48% FAST
I64 U64 2^20 1 186.067 us 0.41% 185.806 us 0.45% -0.262 us -0.14% SAME
I64 U64 2^24 1 2.767 ms 0.14% 2.765 ms 0.13% -1.605 us -0.06% SAME
I64 U64 2^28 1 52.759 ms 0.18% 52.775 ms 0.29% 16.576 us 0.03% SAME
I64 U64 2^16 0.201 61.636 us 0.56% 61.390 us 0.70% -0.247 us -0.40% SAME
I64 U64 2^20 0.201 196.865 us 0.37% 196.585 us 0.36% -0.280 us -0.14% SAME
I64 U64 2^24 0.201 2.780 ms 0.12% 2.776 ms 0.12% -3.732 us -0.13% FAST
I64 U64 2^28 0.201 52.704 ms 0.01% 52.664 ms 0.01% -39.657 us -0.08% FAST
I128 I32 2^16 1 72.446 us 0.57% 71.836 us 0.42% -0.611 us -0.84% FAST
I128 I32 2^20 1 335.160 us 1.34% 332.492 us 1.36% -2.668 us -0.80% SAME
I128 I32 2^24 1 5.734 ms 0.07% 5.728 ms 0.05% -6.101 us -0.11% FAST
I128 I32 2^28 1 110.950 ms 0.01% 110.918 ms 0.02% -32.141 us -0.03% FAST
I128 I32 2^16 0.201 72.556 us 0.47% 72.046 us 0.54% -0.510 us -0.70% FAST
I128 I32 2^20 0.201 325.343 us 0.87% 323.338 us 0.95% -2.006 us -0.62% SAME
I128 I32 2^24 0.201 5.715 ms 0.06% 5.706 ms 0.06% -8.360 us -0.15% FAST
I128 I32 2^28 0.201 110.225 ms 0.02% 110.129 ms 0.01% -95.265 us -0.09% FAST
I128 U32 2^16 1 71.821 us 0.42% 71.876 us 0.42% 0.055 us 0.08% SAME
I128 U32 2^20 1 336.999 us 1.39% 334.955 us 1.43% -2.044 us -0.61% SAME
I128 U32 2^24 1 5.732 ms 0.07% 5.729 ms 0.06% -3.320 us -0.06% FAST
I128 U32 2^28 1 110.933 ms 0.02% 110.900 ms 0.02% -33.301 us -0.03% FAST
I128 U32 2^16 0.201 71.973 us 0.47% 72.125 us 0.47% 0.152 us 0.21% SAME
I128 U32 2^20 0.201 327.640 us 1.05% 326.253 us 1.08% -1.386 us -0.42% SAME
I128 U32 2^24 0.201 5.713 ms 0.06% 5.706 ms 0.06% -6.616 us -0.12% FAST
I128 U32 2^28 0.201 110.193 ms 0.02% 110.121 ms 0.02% -72.510 us -0.07% FAST
I128 I64 2^16 1 72.978 us 0.45% 72.808 us 0.39% -0.170 us -0.23% SAME
I128 I64 2^20 1 337.441 us 1.36% 336.383 us 1.39% -1.058 us -0.31% SAME
I128 I64 2^24 1 5.731 ms 0.07% 5.727 ms 0.05% -4.749 us -0.08% FAST
I128 I64 2^28 1 110.901 ms 0.01% 110.884 ms 0.03% -16.693 us -0.02% FAST
I128 I64 2^16 0.201 72.930 us 0.40% 72.888 us 0.39% -0.042 us -0.06% SAME
I128 I64 2^20 0.201 328.445 us 1.10% 327.929 us 1.12% -0.516 us -0.16% SAME
I128 I64 2^24 0.201 5.713 ms 0.06% 5.705 ms 0.06% -8.065 us -0.14% FAST
I128 I64 2^28 0.201 110.188 ms 0.01% 110.099 ms 0.02% -88.670 us -0.08% FAST
I128 U64 2^16 1 72.663 us 0.44% 72.545 us 0.46% -0.118 us -0.16% SAME
I128 U64 2^20 1 336.947 us 1.45% 335.660 us 1.41% -1.287 us -0.38% SAME
I128 U64 2^24 1 5.732 ms 0.07% 5.727 ms 0.06% -4.177 us -0.07% FAST
I128 U64 2^28 1 110.915 ms 0.02% 110.876 ms 0.02% -38.708 us -0.03% FAST
I128 U64 2^16 0.201 72.656 us 0.53% 72.620 us 0.48% -0.036 us -0.05% SAME
I128 U64 2^20 0.201 327.896 us 1.12% 327.699 us 1.09% -0.197 us -0.06% SAME
I128 U64 2^24 0.201 5.712 ms 0.06% 5.705 ms 0.06% -7.235 us -0.13% FAST
I128 U64 2^28 0.201 110.176 ms 0.02% 110.090 ms 0.01% -85.872 us -0.08% FAST
F32 I32 2^16 1 52.345 us 0.67% 52.118 us 0.65% -0.227 us -0.43% SAME
F32 I32 2^20 1 129.412 us 0.43% 128.640 us 0.36% -0.771 us -0.60% FAST
F32 I32 2^24 1 1.280 ms 0.24% 1.280 ms 0.23% -0.878 us -0.07% SAME
F32 I32 2^28 1 23.604 ms 0.50% 23.604 ms 0.50% -0.176 us -0.00% SAME
F32 I32 2^16 0.201 52.372 us 0.75% 52.443 us 1.15% 0.071 us 0.13% SAME
F32 I32 2^20 0.201 126.933 us 0.48% 126.520 us 0.44% -0.414 us -0.33% SAME
F32 I32 2^24 0.201 1.268 ms 0.20% 1.268 ms 0.23% 0.434 us 0.03% SAME
F32 I32 2^28 0.201 23.001 ms 0.02% 23.000 ms 0.02% -0.999 us -0.00% SAME
F32 U32 2^16 1 51.780 us 0.62% 52.034 us 0.61% 0.254 us 0.49% SAME
F32 U32 2^20 1 128.815 us 0.47% 128.124 us 0.42% -0.692 us -0.54% FAST
F32 U32 2^24 1 1.280 ms 0.23% 1.279 ms 0.24% -1.866 us -0.15% SAME
F32 U32 2^28 1 23.608 ms 0.50% 23.608 ms 0.50% 0.136 us 0.00% SAME
F32 U32 2^16 0.201 51.788 us 0.72% 51.985 us 0.95% 0.198 us 0.38% SAME
F32 U32 2^20 0.201 125.591 us 0.39% 124.759 us 0.40% -0.832 us -0.66% FAST
F32 U32 2^24 0.201 1.267 ms 0.22% 1.265 ms 0.21% -2.122 us -0.17% SAME
F32 U32 2^28 0.201 22.997 ms 0.02% 22.995 ms 0.02% -2.512 us -0.01% SAME
F32 I64 2^16 1 52.887 us 0.63% 52.960 us 0.61% 0.073 us 0.14% SAME
F32 I64 2^20 1 131.997 us 0.51% 130.698 us 0.30% -1.298 us -0.98% FAST
F32 I64 2^24 1 1.283 ms 0.24% 1.283 ms 0.23% -0.214 us -0.02% SAME
F32 I64 2^28 1 23.614 ms 0.50% 23.627 ms 0.50% 12.826 us 0.05% SAME
F32 I64 2^16 0.201 52.899 us 0.96% 53.021 us 1.04% 0.122 us 0.23% SAME
F32 I64 2^20 0.201 129.387 us 0.37% 128.519 us 0.28% -0.868 us -0.67% FAST
F32 I64 2^24 0.201 1.270 ms 0.21% 1.270 ms 0.21% -0.374 us -0.03% SAME
F32 I64 2^28 0.201 23.000 ms 0.01% 23.003 ms 0.02% 2.527 us 0.01% SAME
F32 U64 2^16 1 52.604 us 0.58% 52.682 us 0.61% 0.079 us 0.15% SAME
F32 U64 2^20 1 129.902 us 0.43% 130.013 us 0.40% 0.111 us 0.09% SAME
F32 U64 2^24 1 1.282 ms 0.22% 1.282 ms 0.24% 0.197 us 0.02% SAME
F32 U64 2^28 1 23.631 ms 0.50% 23.640 ms 0.50% 9.257 us 0.04% SAME
F32 U64 2^16 0.201 52.633 us 0.91% 52.770 us 0.90% 0.137 us 0.26% SAME
F32 U64 2^20 0.201 128.060 us 0.57% 128.016 us 0.49% -0.044 us -0.03% SAME
F32 U64 2^24 0.201 1.269 ms 0.21% 1.269 ms 0.21% -0.131 us -0.01% SAME
F32 U64 2^28 0.201 22.999 ms 0.02% 23.001 ms 0.01% 1.865 us 0.01% SAME
F64 I32 2^16 1 60.024 us 0.55% 59.565 us 0.58% -0.459 us -0.76% FAST
F64 I32 2^20 1 184.603 us 0.25% 183.754 us 0.28% -0.849 us -0.46% FAST
F64 I32 2^24 1 2.762 ms 0.12% 2.761 ms 0.14% -1.359 us -0.05% SAME
F64 I32 2^28 1 52.873 ms 0.48% 52.882 ms 0.50% 8.943 us 0.02% SAME
F64 I32 2^16 0.201 60.552 us 0.75% 60.091 us 0.88% -0.461 us -0.76% FAST
F64 I32 2^20 0.201 192.480 us 0.35% 191.912 us 0.36% -0.568 us -0.29% SAME
F64 I32 2^24 0.201 2.767 ms 0.13% 2.766 ms 0.12% -1.145 us -0.04% SAME
F64 I32 2^28 0.201 52.606 ms 0.01% 52.602 ms 0.01% -3.604 us -0.01% SAME
F64 U32 2^16 1 59.004 us 0.46% 59.245 us 0.54% 0.241 us 0.41% SAME
F64 U32 2^20 1 184.120 us 0.30% 184.523 us 0.34% 0.403 us 0.22% SAME
F64 U32 2^24 1 2.762 ms 0.14% 2.761 ms 0.13% -0.609 us -0.02% SAME
F64 U32 2^28 1 52.742 ms 0.18% 53.076 ms 0.50% 334.739 us 0.63% SLOW
F64 U32 2^16 0.201 59.574 us 0.60% 59.871 us 0.66% 0.298 us 0.50% SAME
F64 U32 2^20 0.201 191.985 us 0.36% 192.162 us 0.41% 0.177 us 0.09% SAME
F64 U32 2^24 0.201 2.766 ms 0.12% 2.766 ms 0.12% -0.801 us -0.03% SAME
F64 U32 2^28 0.201 52.613 ms 0.01% 52.603 ms 0.02% -9.463 us -0.02% FAST
F64 I64 2^16 1 60.671 us 0.62% 60.774 us 0.48% 0.104 us 0.17% SAME
F64 I64 2^20 1 187.345 us 0.30% 186.353 us 0.28% -0.992 us -0.53% FAST
F64 I64 2^24 1 2.764 ms 0.13% 2.764 ms 0.12% -0.332 us -0.01% SAME
F64 I64 2^28 1 52.863 ms 0.48% 52.848 ms 0.42% -14.382 us -0.03% SAME
F64 I64 2^16 0.201 61.110 us 0.73% 61.197 us 0.79% 0.086 us 0.14% SAME
F64 I64 2^20 0.201 194.839 us 0.34% 194.243 us 0.38% -0.596 us -0.31% SAME
F64 I64 2^24 0.201 2.768 ms 0.13% 2.768 ms 0.12% 0.174 us 0.01% SAME
F64 I64 2^28 0.201 52.586 ms 0.01% 52.582 ms 0.01% -3.797 us -0.01% SAME
F64 U64 2^16 1 60.073 us 0.43% 60.206 us 0.46% 0.133 us 0.22% SAME
F64 U64 2^20 1 186.214 us 0.38% 185.539 us 0.41% -0.675 us -0.36% SAME
F64 U64 2^24 1 2.763 ms 0.13% 2.764 ms 0.12% 1.205 us 0.04% SAME
F64 U64 2^28 1 52.852 ms 0.47% 52.843 ms 0.39% -9.376 us -0.02% SAME
F64 U64 2^16 0.201 60.541 us 0.80% 60.662 us 0.73% 0.120 us 0.20% SAME
F64 U64 2^20 0.201 194.178 us 0.38% 193.416 us 0.40% -0.762 us -0.39% FAST
F64 U64 2^24 0.201 2.767 ms 0.13% 2.767 ms 0.12% 0.291 us 0.01% SAME
F64 U64 2^28 0.201 52.581 ms 0.01% 52.579 ms 0.01% -2.507 us -0.00% SAME
C64 I32 2^16 1 200.329 us 0.43% 201.152 us 0.45% 0.823 us 0.41% SAME
C64 I32 2^20 1 472.413 us 0.36% 471.564 us 0.32% -0.849 us -0.18% SAME
C64 I32 2^24 1 6.350 ms 0.14% 6.354 ms 0.13% 3.842 us 0.06% SAME
C64 I32 2^28 1 128.020 ms 0.52% 127.579 ms 0.44% -441.791 us -0.35% SAME
C64 I32 2^16 0.201 317.884 us 0.48% 318.706 us 0.49% 0.822 us 0.26% SAME
C64 I32 2^20 0.201 793.874 us 0.66% 793.814 us 0.71% -0.060 us -0.01% SAME
C64 I32 2^24 0.201 12.760 ms 0.26% 12.574 ms 0.24% -185.835 us -1.46% FAST
C64 I32 2^28 0.201 215.159 ms 0.06% 212.199 ms 0.06% -2960.179 us -1.38% FAST
C64 U32 2^16 1 200.423 us 0.43% 200.626 us 0.45% 0.203 us 0.10% SAME
C64 U32 2^20 1 471.598 us 0.32% 471.826 us 0.31% 0.228 us 0.05% SAME
C64 U32 2^24 1 6.366 ms 0.15% 6.364 ms 0.11% -2.155 us -0.03% SAME
C64 U32 2^28 1 128.758 ms 0.54% 127.890 ms 0.49% -867.447 us -0.67% FAST
C64 U32 2^16 0.201 317.681 us 0.51% 317.670 us 0.50% -0.011 us -0.00% SAME
C64 U32 2^20 0.201 795.349 us 0.74% 796.398 us 0.68% 1.049 us 0.13% SAME
C64 U32 2^24 0.201 12.813 ms 0.25% 12.806 ms 0.25% -6.934 us -0.05% SAME
C64 U32 2^28 0.201 216.113 ms 0.04% 216.083 ms 0.05% -30.040 us -0.01% SAME
C64 I64 2^16 1 201.727 us 0.47% 201.636 us 0.45% -0.091 us -0.04% SAME
C64 I64 2^20 1 475.004 us 0.33% 476.260 us 0.35% 1.257 us 0.26% SAME
C64 I64 2^24 1 6.343 ms 0.12% 6.357 ms 0.16% 13.675 us 0.22% SLOW
C64 I64 2^28 1 128.902 ms 0.57% 127.369 ms 0.39% -1532.931 us -1.19% FAST
C64 I64 2^16 0.201 318.934 us 0.54% 320.012 us 0.44% 1.077 us 0.34% SAME
C64 I64 2^20 0.201 798.287 us 0.71% 801.935 us 0.73% 3.647 us 0.46% SAME
C64 I64 2^24 0.201 12.782 ms 0.19% 12.838 ms 0.22% 56.532 us 0.44% SLOW
C64 I64 2^28 0.201 215.607 ms 0.04% 216.589 ms 0.04% 981.892 us 0.46% SLOW
C64 U64 2^16 1 201.418 us 0.50% 202.143 us 0.51% 0.724 us 0.36% SAME
C64 U64 2^20 1 475.012 us 0.30% 476.798 us 0.30% 1.787 us 0.38% SLOW
C64 U64 2^24 1 6.383 ms 0.15% 6.393 ms 0.18% 10.415 us 0.16% SLOW
C64 U64 2^28 1 129.860 ms 0.51% 127.940 ms 0.35% -1919.541 us -1.48% FAST
C64 U64 2^16 0.201 318.325 us 0.54% 319.237 us 0.52% 0.912 us 0.29% SAME
C64 U64 2^20 0.201 797.370 us 0.73% 799.869 us 0.75% 2.499 us 0.31% SAME
C64 U64 2^24 0.201 12.821 ms 0.26% 12.830 ms 0.20% 8.816 us 0.07% SAME
C64 U64 2^28 0.201 216.320 ms 0.05% 216.655 ms 0.06% 335.448 us 0.16% SLOW
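
The Status column above follows the usual benchmark-comparison convention: a timing difference is only flagged FAST or SLOW when it exceeds the measurement noise, otherwise it is reported as SAME. A minimal sketch of that classification rule follows; the exact threshold logic is an assumption for illustration, not nvbench's actual implementation.

```python
def classify(ref_time, ref_noise_pct, cmp_time, cmp_noise_pct, min_noise_pct=0.5):
    """Classify a benchmark comparison as SAME, FAST, or SLOW.

    A relative difference counts as significant only when it exceeds the
    larger of the two measured noise levels, floored at min_noise_pct.
    These thresholds are illustrative, not nvbench's actual defaults.
    """
    diff_pct = (cmp_time - ref_time) / ref_time * 100.0
    threshold = max(ref_noise_pct, cmp_noise_pct, min_noise_pct)
    if abs(diff_pct) <= threshold:
        return "SAME"
    return "SLOW" if diff_pct > 0 else "FAST"

# Example rows from the table above (times in us, noise in %):
print(classify(841.153, 0.21, 828.317, 0.27))  # I8 I32 2^24: -1.53% -> FAST
print(classify(49.463, 0.52, 49.389, 0.54))    # I8 I32 2^16: -0.15% -> SAME
```

Under this rule the mixed FAST/SLOW entries above are all within roughly one noise band of each other, which supports the conclusion that there is no real regression.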

Contributor

🟩 CI finished in 1h 50m: Pass: 100%/78 | Total: 2d 04h | Avg: 40m 19s | Max: 1h 11m | Hits: 261%/12760
  • 🟩 cub: Pass: 100%/38 | Total: 1d 07h | Avg: 49m 55s | Max: 1h 11m | Hits: 371%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 05h | Avg: 49m 11s | Max:  1h 11m | Hits: 371%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 53m | Avg: 58m 45s | Max:  1h 06m | Hits: 371%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
      🟩 12.6               Pass: 100%/31  | Total:  1d 00h | Avg: 47m 19s | Max:  1h 09m | Hits: 371%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 56m 59s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 53m | Avg: 58m 45s | Max:  1h 06m | Hits: 371%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
      🟩 nvcc12.6           Pass: 100%/29  | Total: 22h 33m | Avg: 46m 40s | Max:  1h 09m | Hits: 371%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 56m 59s
      🟩 nvcc               Pass: 100%/36  | Total:  1d 05h | Avg: 49m 32s | Max:  1h 11m | Hits: 371%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 41m | Avg: 55m 25s | Max: 58m 13s
      🟩 Clang15            Pass: 100%/1   | Total: 55m 56s | Avg: 55m 56s | Max: 55m 56s
      🟩 Clang16            Pass: 100%/1   | Total: 54m 52s | Avg: 54m 52s | Max: 54m 52s
      🟩 Clang17            Pass: 100%/1   | Total: 53m 30s | Avg: 53m 30s | Max: 53m 30s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 40m | Avg: 48m 36s | Max:  1h 02m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 29s | Max: 57m 38s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 02s | Avg: 52m 02s | Max: 52m 02s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 08s | Max: 55m 38s
      🟩 GCC10              Pass: 100%/1   | Total: 53m 48s | Avg: 53m 48s | Max: 53m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 53m 05s | Avg: 53m 05s | Max: 53m 05s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 43m | Avg: 34m 35s | Max: 58m 00s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 44m | Avg: 35m 33s | Max:  1h 03m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m | Hits: 371%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 09m | Hits: 371%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 12h 06m | Avg: 51m 52s | Max:  1h 02m
      🟩 GCC                Pass: 100%/18  | Total: 12h 44m | Avg: 42m 27s | Max:  1h 03m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 30m | Avg:  1h 07m | Max:  1h 09m | Hits: 371%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 45m 46s | Avg: 22m 53s | Max: 26m 03s
      🟩 v100               Pass: 100%/36  | Total:  1d 06h | Avg: 51m 25s | Max:  1h 11m | Hits: 371%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 04h | Avg: 55m 52s | Max:  1h 11m | Hits: 371%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 31s | Avg: 22m 31s | Max: 22m 31s
      🟩 GraphCapture       Pass: 100%/1   | Total: 19m 07s | Avg: 19m 07s | Max: 19m 07s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 06m | Avg: 22m 04s | Max: 25m 36s
      🟩 TestGPU            Pass: 100%/2   | Total: 56m 49s | Avg: 28m 24s | Max: 30m 51s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 45m 46s | Avg: 22m 53s | Max: 26m 03s
      🟩 90a                Pass: 100%/1   | Total: 23m 37s | Avg: 23m 37s | Max: 23m 37s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 13h 38m | Avg: 58m 29s | Max:  1h 11m | Hits: 371%/2655  
      🟩 20                 Pass: 100%/24  | Total: 17h 58m | Avg: 44m 55s | Max:  1h 08m | Hits: 370%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 20h 13m | Avg: 32m 48s | Max: 1h 08m | Hits: 219%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 36m 24s | Avg: 18m 12s | Max: 25m 09s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 19h 16m | Avg: 33m 01s | Max:  1h 08m | Hits: 219%/9220  
      🟩 arm64              Pass: 100%/2   | Total: 57m 35s | Avg: 28m 47s | Max: 30m 29s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 03m | Avg: 36m 41s | Max: 56m 00s | Hits: 182%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
      🟩 12.6               Pass: 100%/30  | Total: 15h 22m | Avg: 30m 44s | Max:  1h 08m | Hits: 228%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 52m 23s | Avg: 26m 11s | Max: 27m 07s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 03m | Avg: 36m 41s | Max: 56m 00s | Hits: 182%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 30m | Avg: 31m 04s | Max:  1h 08m | Hits: 228%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 52m 23s | Avg: 26m 11s | Max: 27m 07s
      🟩 nvcc               Pass: 100%/35  | Total: 19h 21m | Avg: 33m 10s | Max:  1h 08m | Hits: 219%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 04m | Avg: 31m 00s | Max: 32m 36s
      🟩 Clang15            Pass: 100%/1   | Total: 33m 36s | Avg: 33m 36s | Max: 33m 36s
      🟩 Clang16            Pass: 100%/1   | Total: 33m 51s | Avg: 33m 51s | Max: 33m 51s
      🟩 Clang17            Pass: 100%/1   | Total: 32m 35s | Avg: 32m 35s | Max: 32m 35s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 54s | Max: 33m 40s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 02m | Avg: 31m 20s | Max: 32m 19s
      🟩 GCC8               Pass: 100%/1   | Total: 31m 01s | Avg: 31m 01s | Max: 31m 01s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 04m | Avg: 32m 03s | Max: 32m 50s
      🟩 GCC10              Pass: 100%/1   | Total: 30m 32s | Avg: 30m 32s | Max: 30m 32s
      🟩 GCC11              Pass: 100%/1   | Total: 34m 23s | Avg: 34m 23s | Max: 34m 23s
      🟩 GCC12              Pass: 100%/1   | Total: 34m 51s | Avg: 34m 51s | Max: 34m 51s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 54m | Avg: 21m 47s | Max: 36m 10s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 42s | Max:  1h 01m | Hits: 182%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 45m | Avg: 55m 04s | Max:  1h 08m | Hits: 243%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 31m | Avg: 27m 57s | Max: 33m 51s
      🟩 GCC                Pass: 100%/16  | Total:  7h 11m | Avg: 26m 59s | Max: 36m 10s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 42m | Avg: 56m 31s | Max:  1h 08m | Hits: 219%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 20h 13m | Avg: 32m 48s | Max:  1h 08m | Hits: 219%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 18h 41m | Avg: 36m 10s | Max:  1h 08m | Hits: 182%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 53m 33s | Avg: 17m 51s | Max: 37m 37s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 38m 27s | Avg: 12m 49s | Max: 16m 24s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 18m 21s | Avg: 18m 21s | Max: 18m 21s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  9h 00m | Avg: 38m 36s | Max:  1h 01m | Hits: 182%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 36m | Avg: 30m 19s | Max:  1h 08m | Hits: 274%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 22s | Avg: 4m 41s | Max: 7m 14s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 08s | Avg:  2m 08s | Max:  2m 08s
      🟩 Test               Pass: 100%/1   | Total:  7m 14s | Avg:  7m 14s | Max:  7m 14s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@miscco miscco merged commit cac3738 into NVIDIA:main Jan 18, 2025
92 of 95 checks passed
@miscco miscco deleted the drop_cub_min_max branch January 18, 2025 08:46

Backport failed for branch/2.8.x, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin branch/2.8.x
git worktree add -d .worktree/backport-3419-to-branch/2.8.x origin/branch/2.8.x
cd .worktree/backport-3419-to-branch/2.8.x
git checkout -b backport-3419-to-branch/2.8.x
ancref=$(git merge-base 3267f42592a4b048aaf5d1bc8756859854e4a93f a0b0250fec1bfe3af1ccb9575fba53a1e8687445)
git cherry-pick -x $ancref..a0b0250fec1bfe3af1ccb9575fba53a1e8687445

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
… libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404
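
For reference, the change this commit describes amounts to swapping call sites, since the libcu++ functions share the semantics of `std::min`/`std::max`. A minimal host-side sketch — `std::min`/`std::max` below stand in for `cuda::std::min`/`cuda::std::max`, which have the same shape but are additionally device-callable:

```cpp
#include <algorithm>

// Before: cub::min(a, b) / cub::max(a, b)   (now deprecated).
// After:  cuda::std::min(a, b) / cuda::std::max(a, b) from libcu++.
// std::min/std::max here model the cuda::std versions on the host.
inline int clamp_to_range(int v, int lo, int hi)
{
  return std::max(lo, std::min(v, hi));
}
```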
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
… libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404
miscco added a commit to miscco/cccl that referenced this pull request Jan 19, 2025
… libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup
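
The interface shape introduced above — a single pass that writes the extremum and its position through two separate output iterators — can be sketched on the host as follows. Names and signature are illustrative only, not the actual `DeviceReduce::ArgMin` API:

```cpp
#include <cstddef>

// Illustrative host-side arg-min writing the extremum value and its
// index through two separate output iterators, mirroring the shape of
// the new two-output interface (names are hypothetical).
template <class It, class ValOut, class IdxOut>
void arg_min_two_outputs(It first, It last, ValOut val_out, IdxOut idx_out)
{
  std::size_t best_idx = 0;
  std::size_t i        = 0;
  auto best_val        = *first;
  for (It it = first; it != last; ++it, ++i)
  {
    if (*it < best_val)
    {
      best_val = *it;
      best_idx = i;
    }
  }
  *val_out = best_val; // extremum value
  *idx_out = best_idx; // its position
}
```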

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface
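
The streaming approach these commits refer to boils down to reducing a huge input in chunks whose per-pass size fits a narrower offset type, then combining the partial results. A hedged host-side sketch of the chunking idea only (the real CUB dispatch is considerably more involved):

```cpp
#include <cstddef>

// Reduce `n` items in chunks of at most `chunk` elements so that each
// partial pass only needs offsets bounded by the chunk size (sketch of
// the streaming idea; the actual CUB dispatch differs).
inline long long streaming_sum(const int* data, std::size_t n, std::size_t chunk)
{
  long long total = 0;
  for (std::size_t off = 0; off < n; off += chunk)
  {
    std::size_t end   = (off + chunk < n) ? off + chunk : n;
    long long partial = 0; // one "launch" worth of work
    for (std::size_t i = off; i < end; ++i)
    {
      partial += data[i];
    }
    total += partial; // combine partials across passes
  }
  return total;
}
```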

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fx thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abtract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel, ...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add excape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>
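
The workaround above exists because MSVC keeps __cplusplus at 199711L unless /Zc:__cplusplus is passed, so a dialect check must also consult _MSC_VER. A sketch of that detection pattern (the macro name is illustrative, not the one used in CCCL):

```cpp
// A pure __cplusplus check wrongly concludes "no C++11" on MSVC unless
// /Zc:__cplusplus is given; also checking _MSC_VER (the approach CTK
// 12.3 adopted) avoids that.
#if __cplusplus >= 201103L || (defined(_MSC_VER) && _MSC_VER >= 1900)
#  define SKETCH_HAS_CXX11 1
#else
#  define SKETCH_HAS_CXX11 0
#endif

inline bool has_cxx11()
{
  return SKETCH_HAS_CXX11 != 0;
}
```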

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
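
The idea of an empty view is small enough to sketch directly; this is a minimal C++17-friendly illustration in the spirit of `std::ranges::views::empty`, not the actual libcu++ implementation:

```cpp
#include <cstddef>

// Minimal sketch of an empty view: a range of T that never contains
// any elements, so begin() == end() always holds.
template <class T>
struct empty_view_sketch
{
  T* begin() const
  {
    return nullptr;
  }
  T* end() const
  {
    return nullptr;
  }
  bool empty() const
  {
    return true;
  }
  std::size_t size() const
  {
    return 0;
  }
};
```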

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC
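
The constants involved have semantics that are easy to state as checks; any reimplementation (e.g. for NVRTC) has to agree with the standard `<cmath>`/`<limits>` facilities used here:

```cpp
#include <cmath>
#include <limits>

// INFINITY must compare as infinite and NAN must be a quiet NaN;
// these checks pin down the behavior a builtin-based
// implementation has to reproduce.
inline bool inf_is_inf()
{
  return std::isinf(std::numeric_limits<double>::infinity());
}
inline bool nan_is_nan()
{
  return std::isnan(std::numeric_limits<double>::quiet_NaN());
}
```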

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
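
This vacuous-truth behavior is easy to demonstrate with the standard algorithm:

```cpp
#include <algorithm>
#include <vector>

// all_of over an empty range is vacuously true: there is no element
// for which the predicate could fail.
inline bool all_positive(const std::vector<int>& v)
{
  return std::all_of(v.begin(), v.end(), [](int x) {
    return x > 0;
  });
}
```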

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>
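
As a loose illustration of the decorator idea above: derive a flat, packable field layout from a class's annotations so an algorithm could consume instances as raw bytes. Every name here (`gpu_struct`, `Pixel`, the type-to-format map) is a sketch for this changelog entry, not the actual cuda.parallel implementation, which works with numpy dtypes.

```python
import struct

# Assumed mapping for this sketch only; the real library maps annotations
# to numpy dtypes rather than struct format characters.
_FMT = {int: "q", float: "d"}


def gpu_struct(cls):
    """Illustrative decorator: derive a flat binary layout from the
    class annotations so a reduction could consume it as raw bytes."""
    fields = list(cls.__annotations__.items())
    fmt = "".join(_FMT[t] for _, t in fields)

    def __init__(self, *values):
        for (name, _), value in zip(fields, values):
            setattr(self, name, value)

    def pack(self):
        # Flatten the field values into the packed binary layout.
        return struct.pack(fmt, *(getattr(self, name) for name, _ in fields))

    cls.__init__ = __init__
    cls.pack = pack
    cls._fields = fields
    return cls


@gpu_struct
class Pixel:
    r: int
    g: int
    b: int


p = Pixel(255, 128, 0)
```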

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>
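
A hedged sketch of what the `include_paths.py` helper factored out above might look like: resolve headers bundled under the package's `_include` directory with pathlib. The shapes and names are illustrative only, not the actual cuda.cccl API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class IncludePaths:
    """Illustrative result type; the real module's shape may differ."""
    cccl: Optional[Path]


def get_include_paths(package_root: Path) -> IncludePaths:
    # Headers are copied next to the package under "_include" at build
    # time; resolve them relative to the installed package directory.
    include_dir = package_root / "_include"
    return IncludePaths(cccl=include_dir if include_dir.is_dir() else None)
```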

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
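
The bullets above amount to a validation routine for the `__cuda_stream__` protocol: unpack the returned `(version, handle)` pair instead of indexing, and turn an unpackable return value into a clear `TypeError`. A sketch under those assumptions (function and class names are illustrative, not cuda.parallel's actual API):

```python
def validate_stream(stream):
    """Illustrative stream validation via the __cuda_stream__ protocol."""
    try:
        cuda_stream = stream.__cuda_stream__
    except AttributeError:
        raise TypeError("stream object must implement __cuda_stream__")
    try:
        # Unpack instead of indexing so a malformed return value fails loudly.
        version, handle = cuda_stream()
    except (TypeError, ValueError):
        raise TypeError("__cuda_stream__ must return a (version, handle) pair")
    return handle


class FakeStream:
    def __cuda_stream__(self):
        return (0, 0xDEADBEEF)  # (protocol version, driver stream handle)


class BadStream:
    def __cuda_stream__(self):
        return 42  # not unpackable into a pair
```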

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer the host's `div_t`-like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
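
For reference, `div` from `<cstdlib>` (which `cuda::std::div` mirrors) uses C semantics: the quotient truncates toward zero and `x == quot * y + rem` always holds, whereas Python's `divmod` floors. A small sketch of the C behavior for contrast:

```python
def c_div(x: int, y: int):
    """C-style integer division as in div(): the quotient truncates
    toward zero and x == quot * y + rem holds for all inputs."""
    quot = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        quot = -quot
    rem = x - quot * y
    return quot, rem
```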

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf; this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>
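
The caching described above boils down to memoizing a factory on a caller-supplied key (so unhashable or expensive-to-compare arguments can still participate). A hedged sketch of such a `cache_with_key`-style utility; the names are illustrative, not the cuda.parallel internals:

```python
import functools


def cache_with_key(key):
    """Memoize func on key(*args, **kwargs) instead of the raw arguments."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            k = key(*args, **kwargs)
            if k not in cache:
                cache[k] = func(*args, **kwargs)
            return cache[k]
        return wrapper
    return decorator


build_calls = []


@cache_with_key(key=lambda dtype: dtype)
def build_reducer(dtype):
    build_calls.append(dtype)  # stands in for an expensive build step
    return f"reducer<{dtype}>"
```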

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>
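
One robust way to check contiguity (in the spirit of the bullets above) is to compare the array's strides against the dense row-major strides implied by its shape. A pure-Python sketch, so it runs without NumPy; zero-element arrays are treated as contiguous here, matching NumPy's convention. The function name is illustrative:

```python
def is_c_contiguous(shape, strides, itemsize):
    """True when each stride equals itemsize times the product of all
    trailing extents, i.e. the layout is dense row-major."""
    if 0 in shape:
        return True  # empty arrays are conventionally contiguous
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent != 1 and stride != expected:
            return False
        expected *= extent
    return True
```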

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination for which rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and avoid dispatching based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
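
This is the standard vacuous-truth convention: with zero elements, no element can falsify the predicate. Python's built-in `all()`/`any()` follow the same rule, which makes for a quick sanity check of the semantics `thrust::all_of` documents:

```python
# A predicate over an empty range is vacuously true; its dual is false.
assert all([]) is True
assert any([]) is False
assert all(x > 0 for x in []) is True
```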

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events !

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8 (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so `_CCCL_ASSERT` within a constexpr function breaks it.

Fix this by always using the host definition, which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`