
Deprecate cub::{min, max} and replace internal uses with those from libcu++ #3419

Merged · 2 commits · Jan 18, 2025

Conversation

@miscco
Collaborator

commented Jan 16, 2025

Fixes #3404

@miscco requested review from a team as code owners on Jan 16, 2025 09:42
@miscco added the labels cub (For all items related to CUB) and backport branch/2.8.x on Jan 16, 2025
@miscco requested review from elstehle and wmaxey on Jan 16, 2025 09:43
Contributor

🟩 CI finished in 2h 03m: Pass: 100%/78 | Total: 2d 07h | Avg: 43m 01s | Max: 1h 14m | Hits: 104%/12760
  • 🟩 cub: Pass: 100%/38 | Total: 1d 09h | Avg: 53m 35s | Max: 1h 11m | Hits: 36%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 08h | Avg: 53m 22s | Max:  1h 11m | Hits:  36%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  1h 55m | Avg: 57m 30s | Max: 58m 16s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  5h 01m | Avg:  1h 00m | Max:  1h 07m | Hits:  36%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
      🟩 12.6               Pass: 100%/31  | Total:  1d 02h | Avg: 51m 33s | Max:  1h 11m | Hits:  36%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc12.0           Pass: 100%/5   | Total:  5h 01m | Avg:  1h 00m | Max:  1h 07m | Hits:  36%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
      🟩 nvcc12.6           Pass: 100%/29  | Total:  1d 00h | Avg: 50m 47s | Max:  1h 11m | Hits:  36%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  2h 05m | Avg:  1h 02m | Max:  1h 03m
      🟩 nvcc               Pass: 100%/36  | Total:  1d 07h | Avg: 53m 05s | Max:  1h 11m | Hits:  36%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  4h 00m | Avg:  1h 00m | Max:  1h 01m
      🟩 Clang15            Pass: 100%/1   | Total: 58m 18s | Avg: 58m 18s | Max: 58m 18s
      🟩 Clang16            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 Clang17            Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 Clang18            Pass: 100%/7   | Total:  5h 53m | Avg: 50m 33s | Max:  1h 03m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 51m | Avg: 55m 31s | Max: 55m 36s
      🟩 GCC8               Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC9               Pass: 100%/2   | Total:  1h 58m | Avg: 59m 15s | Max:  1h 00m
      🟩 GCC10              Pass: 100%/1   | Total:  1h 01m | Avg:  1h 01m | Max:  1h 01m
      🟩 GCC11              Pass: 100%/1   | Total: 58m 15s | Avg: 58m 15s | Max: 58m 15s
      🟩 GCC12              Pass: 100%/3   | Total:  2h 11m | Avg: 43m 47s | Max:  1h 04m
      🟩 GCC13              Pass: 100%/8   | Total:  5h 10m | Avg: 38m 47s | Max:  1h 00m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 18m | Avg:  1h 09m | Max:  1h 11m | Hits:  36%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 13m | Avg:  1h 06m | Max:  1h 07m | Hits:  36%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 12h 55m | Avg: 55m 24s | Max:  1h 03m
      🟩 GCC                Pass: 100%/18  | Total: 14h 12m | Avg: 47m 21s | Max:  1h 04m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 31m | Avg:  1h 07m | Max:  1h 11m | Hits:  36%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 28s | Max: 41m 35s
      🟩 v100               Pass: 100%/36  | Total:  1d 08h | Avg: 54m 42s | Max:  1h 11m | Hits:  36%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 06h | Avg: 59m 18s | Max:  1h 11m | Hits:  36%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 31m 54s | Avg: 31m 54s | Max: 31m 54s
      🟩 GraphCapture       Pass: 100%/1   | Total: 14m 45s | Avg: 14m 45s | Max: 14m 45s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 22m | Avg: 27m 23s | Max: 41m 35s
      🟩 TestGPU            Pass: 100%/2   | Total:  1h 09m | Avg: 34m 40s | Max: 39m 48s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total:  1h 06m | Avg: 33m 28s | Max: 41m 35s
      🟩 90a                Pass: 100%/1   | Total: 24m 21s | Avg: 24m 21s | Max: 24m 21s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 14h 30m | Avg:  1h 02m | Max:  1h 11m | Hits:  36%/2655  
      🟩 20                 Pass: 100%/24  | Total: 19h 26m | Avg: 48m 36s | Max:  1h 07m | Hits:  36%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 21h 13m | Avg: 34m 25s | Max: 1h 14m | Hits: 130%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 37m 37s | Avg: 18m 48s | Max: 25m 51s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 20h 12m | Avg: 34m 39s | Max:  1h 14m | Hits: 130%/9220  
      🟩 arm64              Pass: 100%/2   | Total:  1h 00m | Avg: 30m 27s | Max: 31m 50s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 05m | Avg: 37m 04s | Max:  1h 02m | Hits:  58%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
      🟩 12.6               Pass: 100%/30  | Total: 15h 43m | Avg: 31m 27s | Max:  1h 13m | Hits: 148%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 53m 13s | Avg: 26m 36s | Max: 27m 51s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 05m | Avg: 37m 04s | Max:  1h 02m | Hits:  58%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 50m | Avg: 31m 48s | Max:  1h 13m | Hits: 148%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 53m 13s | Avg: 26m 36s | Max: 27m 51s
      🟩 nvcc               Pass: 100%/35  | Total: 20h 20m | Avg: 34m 52s | Max:  1h 14m | Hits: 130%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  1h 58m | Avg: 29m 43s | Max: 31m 03s
      🟩 Clang15            Pass: 100%/1   | Total: 31m 27s | Avg: 31m 27s | Max: 31m 27s
      🟩 Clang16            Pass: 100%/1   | Total: 30m 18s | Avg: 30m 18s | Max: 30m 18s
      🟩 Clang17            Pass: 100%/1   | Total: 32m 47s | Avg: 32m 47s | Max: 32m 47s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 53s | Max: 32m 44s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 00m | Avg: 30m 23s | Max: 30m 57s
      🟩 GCC8               Pass: 100%/1   | Total: 32m 19s | Avg: 32m 19s | Max: 32m 19s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 06m | Avg: 33m 09s | Max: 34m 58s
      🟩 GCC10              Pass: 100%/1   | Total: 32m 10s | Avg: 32m 10s | Max: 32m 10s
      🟩 GCC11              Pass: 100%/1   | Total: 36m 14s | Avg: 36m 14s | Max: 36m 14s
      🟩 GCC12              Pass: 100%/1   | Total: 33m 58s | Avg: 33m 58s | Max: 33m 58s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 58m | Avg: 22m 16s | Max: 37m 16s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 07m | Avg:  1h 03m | Max:  1h 04m | Hits:  82%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  3h 00m | Avg:  1h 00m | Max:  1h 13m | Hits: 162%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 20m | Avg: 27m 11s | Max: 32m 47s
      🟩 GCC                Pass: 100%/16  | Total:  7h 19m | Avg: 27m 29s | Max: 37m 16s
      🟩 MSVC               Pass: 100%/5   | Total:  5h 08m | Avg:  1h 01m | Max:  1h 13m | Hits: 130%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 24m | Avg:  1h 12m | Max:  1h 14m
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 21h 13m | Avg: 34m 25s | Max:  1h 14m | Hits: 130%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 19h 45m | Avg: 38m 14s | Max:  1h 14m | Hits:  71%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 50m 01s | Avg: 16m 40s | Max: 34m 17s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 38m 38s | Avg: 12m 52s | Max: 14m 20s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 17m 49s | Avg: 17m 49s | Max: 17m 49s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  9h 43m | Avg: 41m 42s | Max:  1h 14m | Hits:  75%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 52m | Avg: 31m 04s | Max:  1h 13m | Hits: 212%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 51s | Avg: 5m 55s | Max: 9m 18s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 11m 51s | Avg:  5m 55s | Max:  9m 18s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 33s | Avg:  2m 33s | Max:  2m 33s
      🟩 Test               Pass: 100%/1   | Total:  9m 18s | Avg:  9m 18s | Max:  9m 18s
    
  • 🟩 python: Pass: 100%/1 | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 33m 03s | Avg: 33m 03s | Max: 33m 03s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@bernhardmgruber
Contributor

pre-commit.ci autofix

copy-pr-bot bot commented Jan 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@bernhardmgruber
Contributor

/ok to test

@elstehle
Collaborator

I did a spot check for SASS changes in our merge_sort benchmarks and noticed a difference in the generated device code. I will run the benchmarks to verify that we do not see a performance regression.

@elstehle
Collaborator

elstehle commented Jan 16, 2025

Performance for merge_sort.keys looks fine:

Benchmark results for merge_sort.keys on H100:
T{ct} OffsetT{ct} Elements{io} Entropy Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I8 I32 2^16 1 49.463 us 0.52% 49.389 us 0.54% -0.074 us -0.15% SAME
I8 I32 2^20 1 121.824 us 0.34% 123.032 us 0.36% 1.208 us 0.99% SLOW
I8 I32 2^24 1 841.153 us 0.21% 828.317 us 0.27% -12.836 us -1.53% FAST
I8 I32 2^28 1 14.761 ms 0.03% 14.635 ms 0.06% -126.264 us -0.86% FAST
I8 I32 2^16 0.201 49.157 us 0.54% 49.198 us 1.17% 0.041 us 0.08% SAME
I8 I32 2^20 0.201 119.114 us 0.27% 120.865 us 0.40% 1.751 us 1.47% SLOW
I8 I32 2^24 0.201 791.923 us 0.15% 780.104 us 0.25% -11.819 us -1.49% FAST
I8 I32 2^28 0.201 13.724 ms 0.04% 13.624 ms 0.06% -100.034 us -0.73% FAST
I8 U32 2^16 1 51.402 us 0.66% 51.218 us 0.52% -0.185 us -0.36% SAME
I8 U32 2^20 1 124.559 us 0.37% 125.102 us 0.36% 0.543 us 0.44% SLOW
I8 U32 2^24 1 847.858 us 0.19% 834.785 us 0.26% -13.073 us -1.54% FAST
I8 U32 2^28 1 14.752 ms 0.03% 14.649 ms 0.05% -102.680 us -0.70% FAST
I8 U32 2^16 0.201 50.447 us 0.54% 50.398 us 0.55% -0.048 us -0.10% SAME
I8 U32 2^20 0.201 120.788 us 0.29% 122.083 us 0.31% 1.295 us 1.07% SLOW
I8 U32 2^24 0.201 794.923 us 0.12% 782.146 us 0.21% -12.777 us -1.61% FAST
I8 U32 2^28 0.201 13.737 ms 0.02% 13.655 ms 0.10% -82.490 us -0.60% FAST
I8 I64 2^16 1 52.552 us 0.64% 52.256 us 0.49% -0.296 us -0.56% FAST
I8 I64 2^20 1 127.572 us 0.36% 127.600 us 0.30% 0.027 us 0.02% SAME
I8 I64 2^24 1 854.442 us 0.20% 841.724 us 0.24% -12.719 us -1.49% FAST
I8 I64 2^28 1 14.929 ms 0.02% 14.831 ms 0.10% -97.301 us -0.65% FAST
I8 I64 2^16 0.201 51.386 us 0.60% 51.269 us 0.59% -0.117 us -0.23% SAME
I8 I64 2^20 0.201 124.622 us 0.24% 124.501 us 0.29% -0.121 us -0.10% SAME
I8 I64 2^24 0.201 804.059 us 0.13% 792.843 us 0.21% -11.216 us -1.39% FAST
I8 I64 2^28 0.201 13.926 ms 0.03% 13.830 ms 0.06% -95.251 us -0.68% FAST
I8 U64 2^16 1 52.276 us 0.69% 51.994 us 0.49% -0.281 us -0.54% FAST
I8 U64 2^20 1 127.315 us 0.36% 126.928 us 0.33% -0.387 us -0.30% SAME
I8 U64 2^24 1 853.614 us 0.21% 841.241 us 0.23% -12.374 us -1.45% FAST
I8 U64 2^28 1 14.932 ms 0.03% 14.831 ms 0.06% -101.315 us -0.68% FAST
I8 U64 2^16 0.201 51.197 us 0.50% 51.036 us 0.51% -0.161 us -0.31% SAME
I8 U64 2^20 0.201 124.022 us 0.27% 123.637 us 0.30% -0.386 us -0.31% FAST
I8 U64 2^24 0.201 803.008 us 0.13% 798.753 us 0.19% -4.255 us -0.53% FAST
I8 U64 2^28 0.201 13.927 ms 0.03% 13.832 ms 0.05% -94.224 us -0.68% FAST
I16 I32 2^16 1 54.006 us 0.50% 53.878 us 0.60% -0.128 us -0.24% SAME
I16 I32 2^20 1 132.508 us 0.31% 132.298 us 0.38% -0.210 us -0.16% SAME
I16 I32 2^24 1 970.299 us 0.11% 966.232 us 0.20% -4.068 us -0.42% FAST
I16 I32 2^28 1 17.404 ms 1.47% 17.443 ms 1.42% 38.204 us 0.22% SAME
I16 I32 2^16 0.201 53.999 us 0.59% 54.077 us 0.67% 0.078 us 0.14% SAME
I16 I32 2^20 0.201 130.158 us 0.31% 130.709 us 0.31% 0.551 us 0.42% SLOW
I16 I32 2^24 0.201 914.123 us 0.13% 914.716 us 0.13% 0.593 us 0.06% SAME
I16 I32 2^28 0.201 15.533 ms 0.04% 15.545 ms 0.03% 12.455 us 0.08% SLOW
I16 U32 2^16 1 53.990 us 0.55% 54.036 us 0.57% 0.046 us 0.09% SAME
I16 U32 2^20 1 132.654 us 0.37% 132.973 us 0.42% 0.319 us 0.24% SAME
I16 U32 2^24 1 972.231 us 0.11% 967.598 us 0.11% -4.633 us -0.48% FAST
I16 U32 2^28 1 17.589 ms 1.32% 17.552 ms 1.37% -37.448 us -0.21% SAME
I16 U32 2^16 0.201 53.487 us 0.81% 53.691 us 0.80% 0.203 us 0.38% SAME
I16 U32 2^20 0.201 128.467 us 0.32% 129.950 us 0.44% 1.483 us 1.15% SLOW
I16 U32 2^24 0.201 912.183 us 0.14% 911.460 us 0.14% -0.722 us -0.08% SAME
I16 U32 2^28 0.201 15.515 ms 0.07% 15.525 ms 0.04% 9.427 us 0.06% SLOW
I16 I64 2^16 1 54.699 us 0.51% 54.935 us 0.55% 0.236 us 0.43% SAME
I16 I64 2^20 1 135.015 us 0.34% 135.977 us 0.42% 0.962 us 0.71% SLOW
I16 I64 2^24 1 975.570 us 0.11% 977.688 us 0.10% 2.117 us 0.22% SLOW
I16 I64 2^28 1 17.829 ms 1.29% 17.808 ms 1.28% -20.584 us -0.12% SAME
I16 I64 2^16 0.201 54.468 us 0.85% 54.696 us 0.69% 0.229 us 0.42% SAME
I16 I64 2^20 0.201 132.462 us 0.34% 133.119 us 0.34% 0.658 us 0.50% SLOW
I16 I64 2^24 0.201 921.121 us 0.13% 922.413 us 0.13% 1.293 us 0.14% SLOW
I16 I64 2^28 0.201 15.606 ms 0.03% 15.610 ms 0.04% 4.031 us 0.03% SAME
I16 U64 2^16 1 54.520 us 0.53% 54.799 us 0.49% 0.279 us 0.51% SLOW
I16 U64 2^20 1 134.027 us 0.34% 135.091 us 0.35% 1.063 us 0.79% SLOW
I16 U64 2^24 1 977.635 us 0.11% 980.469 us 0.10% 2.834 us 0.29% SLOW
I16 U64 2^28 1 17.893 ms 1.29% 17.852 ms 1.33% -40.364 us -0.23% SAME
I16 U64 2^16 0.201 54.355 us 0.76% 54.564 us 1.06% 0.209 us 0.39% SAME
I16 U64 2^20 0.201 131.227 us 0.30% 132.250 us 0.31% 1.023 us 0.78% SLOW
I16 U64 2^24 0.201 921.965 us 0.12% 920.162 us 0.12% -1.803 us -0.20% FAST
I16 U64 2^28 0.201 15.618 ms 0.06% 15.603 ms 0.03% -15.155 us -0.10% FAST
I32 I32 2^16 1 51.814 us 0.61% 52.088 us 0.61% 0.274 us 0.53% SAME
I32 I32 2^20 1 127.786 us 0.33% 128.857 us 0.45% 1.071 us 0.84% SLOW
I32 I32 2^24 1 1.279 ms 0.23% 1.279 ms 0.22% 0.023 us 0.00% SAME
I32 I32 2^28 1 23.779 ms 0.50% 23.762 ms 0.50% -16.417 us -0.07% SAME
I32 I32 2^16 0.201 52.087 us 1.39% 52.200 us 1.12% 0.113 us 0.22% SAME
I32 I32 2^20 0.201 125.105 us 0.32% 126.689 us 0.46% 1.583 us 1.27% SLOW
I32 I32 2^24 0.201 1.265 ms 0.21% 1.266 ms 0.21% 0.588 us 0.05% SAME
I32 I32 2^28 0.201 22.964 ms 0.02% 22.959 ms 0.02% -4.834 us -0.02% SAME
I32 U32 2^16 1 51.310 us 0.71% 51.673 us 0.59% 0.363 us 0.71% SLOW
I32 U32 2^20 1 126.995 us 0.32% 127.608 us 0.43% 0.613 us 0.48% SLOW
I32 U32 2^24 1 1.278 ms 0.23% 1.277 ms 0.23% -0.734 us -0.06% SAME
I32 U32 2^28 1 23.784 ms 0.50% 23.771 ms 0.50% -13.230 us -0.06% SAME
I32 U32 2^16 0.201 51.571 us 1.60% 51.835 us 1.66% 0.264 us 0.51% SAME
I32 U32 2^20 0.201 123.752 us 0.36% 124.536 us 0.42% 0.783 us 0.63% SLOW
I32 U32 2^24 0.201 1.265 ms 0.20% 1.263 ms 0.21% -2.211 us -0.17% SAME
I32 U32 2^28 0.201 22.967 ms 0.02% 22.960 ms 0.02% -6.992 us -0.03% FAST
I32 I64 2^16 1 52.580 us 0.59% 52.453 us 0.61% -0.128 us -0.24% SAME
I32 I64 2^20 1 130.837 us 0.51% 129.731 us 0.38% -1.106 us -0.85% FAST
I32 I64 2^24 1 1.281 ms 0.23% 1.281 ms 0.24% 0.425 us 0.03% SAME
I32 I64 2^28 1 23.801 ms 0.50% 23.792 ms 0.50% -8.630 us -0.04% SAME
I32 I64 2^16 0.201 52.714 us 1.69% 52.662 us 1.18% -0.052 us -0.10% SAME
I32 I64 2^20 0.201 128.234 us 0.36% 127.459 us 0.29% -0.775 us -0.60% FAST
I32 I64 2^24 0.201 1.267 ms 0.21% 1.267 ms 0.20% -0.283 us -0.02% SAME
I32 I64 2^28 0.201 22.964 ms 0.02% 22.965 ms 0.02% 0.817 us 0.00% SAME
I32 U64 2^16 1 52.284 us 0.64% 52.160 us 0.64% -0.124 us -0.24% SAME
I32 U64 2^20 1 129.375 us 0.45% 128.636 us 0.38% -0.740 us -0.57% FAST
I32 U64 2^24 1 1.280 ms 0.22% 1.280 ms 0.21% -0.053 us -0.00% SAME
I32 U64 2^28 1 23.809 ms 0.50% 23.802 ms 0.50% -7.001 us -0.03% SAME
I32 U64 2^16 0.201 52.437 us 1.29% 52.420 us 1.35% -0.017 us -0.03% SAME
I32 U64 2^20 0.201 127.115 us 0.44% 126.071 us 0.32% -1.044 us -0.82% FAST
I32 U64 2^24 0.201 1.267 ms 0.21% 1.266 ms 0.19% -0.901 us -0.07% SAME
I32 U64 2^28 0.201 22.964 ms 0.02% 22.964 ms 0.02% 0.301 us 0.00% SAME
I64 I32 2^16 1 60.495 us 0.43% 60.007 us 0.50% -0.488 us -0.81% FAST
I64 I32 2^20 1 185.400 us 0.31% 185.140 us 0.29% -0.260 us -0.14% SAME
I64 I32 2^24 1 2.764 ms 0.14% 2.762 ms 0.13% -2.301 us -0.08% SAME
I64 I32 2^28 1 53.141 ms 0.50% 53.095 ms 0.50% -46.386 us -0.09% SAME
I64 I32 2^16 0.201 61.127 us 0.57% 60.656 us 0.53% -0.471 us -0.77% FAST
I64 I32 2^20 0.201 195.889 us 0.33% 194.767 us 0.35% -1.122 us -0.57% FAST
I64 I32 2^24 0.201 2.778 ms 0.11% 2.776 ms 0.11% -1.581 us -0.06% SAME
I64 I32 2^28 0.201 52.697 ms 0.01% 52.694 ms 0.02% -2.473 us -0.00% SAME
I64 U32 2^16 1 59.634 us 0.48% 60.215 us 0.52% 0.581 us 0.97% SLOW
I64 U32 2^20 1 184.376 us 0.33% 184.530 us 0.30% 0.154 us 0.08% SAME
I64 U32 2^24 1 2.763 ms 0.13% 2.762 ms 0.14% -0.775 us -0.03% SAME
I64 U32 2^28 1 52.817 ms 0.36% 52.842 ms 0.44% 25.201 us 0.05% SAME
I64 U32 2^16 0.201 60.195 us 0.63% 60.716 us 0.60% 0.521 us 0.86% SLOW
I64 U32 2^20 0.201 194.821 us 0.38% 194.750 us 0.34% -0.071 us -0.04% SAME
I64 U32 2^24 0.201 2.777 ms 0.11% 2.776 ms 0.12% -0.673 us -0.02% SAME
I64 U32 2^28 0.201 52.698 ms 0.01% 52.688 ms 0.01% -10.225 us -0.02% FAST
I64 I64 2^16 1 61.323 us 0.43% 61.280 us 0.46% -0.042 us -0.07% SAME
I64 I64 2^20 1 186.691 us 0.34% 186.530 us 0.34% -0.161 us -0.09% SAME
I64 I64 2^24 1 2.767 ms 0.13% 2.766 ms 0.14% -1.289 us -0.05% SAME
I64 I64 2^28 1 52.829 ms 0.44% 52.810 ms 0.36% -19.020 us -0.04% SAME
I64 I64 2^16 0.201 61.953 us 0.68% 61.625 us 0.66% -0.328 us -0.53% SAME
I64 I64 2^20 0.201 197.460 us 0.31% 197.068 us 0.37% -0.392 us -0.20% SAME
I64 I64 2^24 0.201 2.780 ms 0.11% 2.777 ms 0.12% -2.925 us -0.11% SAME
I64 I64 2^28 0.201 52.706 ms 0.02% 52.663 ms 0.01% -42.700 us -0.08% FAST
I64 U64 2^16 1 60.934 us 0.45% 60.641 us 0.45% -0.294 us -0.48% FAST
I64 U64 2^20 1 186.067 us 0.41% 185.806 us 0.45% -0.262 us -0.14% SAME
I64 U64 2^24 1 2.767 ms 0.14% 2.765 ms 0.13% -1.605 us -0.06% SAME
I64 U64 2^28 1 52.759 ms 0.18% 52.775 ms 0.29% 16.576 us 0.03% SAME
I64 U64 2^16 0.201 61.636 us 0.56% 61.390 us 0.70% -0.247 us -0.40% SAME
I64 U64 2^20 0.201 196.865 us 0.37% 196.585 us 0.36% -0.280 us -0.14% SAME
I64 U64 2^24 0.201 2.780 ms 0.12% 2.776 ms 0.12% -3.732 us -0.13% FAST
I64 U64 2^28 0.201 52.704 ms 0.01% 52.664 ms 0.01% -39.657 us -0.08% FAST
I128 I32 2^16 1 72.446 us 0.57% 71.836 us 0.42% -0.611 us -0.84% FAST
I128 I32 2^20 1 335.160 us 1.34% 332.492 us 1.36% -2.668 us -0.80% SAME
I128 I32 2^24 1 5.734 ms 0.07% 5.728 ms 0.05% -6.101 us -0.11% FAST
I128 I32 2^28 1 110.950 ms 0.01% 110.918 ms 0.02% -32.141 us -0.03% FAST
I128 I32 2^16 0.201 72.556 us 0.47% 72.046 us 0.54% -0.510 us -0.70% FAST
I128 I32 2^20 0.201 325.343 us 0.87% 323.338 us 0.95% -2.006 us -0.62% SAME
I128 I32 2^24 0.201 5.715 ms 0.06% 5.706 ms 0.06% -8.360 us -0.15% FAST
I128 I32 2^28 0.201 110.225 ms 0.02% 110.129 ms 0.01% -95.265 us -0.09% FAST
I128 U32 2^16 1 71.821 us 0.42% 71.876 us 0.42% 0.055 us 0.08% SAME
I128 U32 2^20 1 336.999 us 1.39% 334.955 us 1.43% -2.044 us -0.61% SAME
I128 U32 2^24 1 5.732 ms 0.07% 5.729 ms 0.06% -3.320 us -0.06% FAST
I128 U32 2^28 1 110.933 ms 0.02% 110.900 ms 0.02% -33.301 us -0.03% FAST
I128 U32 2^16 0.201 71.973 us 0.47% 72.125 us 0.47% 0.152 us 0.21% SAME
I128 U32 2^20 0.201 327.640 us 1.05% 326.253 us 1.08% -1.386 us -0.42% SAME
I128 U32 2^24 0.201 5.713 ms 0.06% 5.706 ms 0.06% -6.616 us -0.12% FAST
I128 U32 2^28 0.201 110.193 ms 0.02% 110.121 ms 0.02% -72.510 us -0.07% FAST
I128 I64 2^16 1 72.978 us 0.45% 72.808 us 0.39% -0.170 us -0.23% SAME
I128 I64 2^20 1 337.441 us 1.36% 336.383 us 1.39% -1.058 us -0.31% SAME
I128 I64 2^24 1 5.731 ms 0.07% 5.727 ms 0.05% -4.749 us -0.08% FAST
I128 I64 2^28 1 110.901 ms 0.01% 110.884 ms 0.03% -16.693 us -0.02% FAST
I128 I64 2^16 0.201 72.930 us 0.40% 72.888 us 0.39% -0.042 us -0.06% SAME
I128 I64 2^20 0.201 328.445 us 1.10% 327.929 us 1.12% -0.516 us -0.16% SAME
I128 I64 2^24 0.201 5.713 ms 0.06% 5.705 ms 0.06% -8.065 us -0.14% FAST
I128 I64 2^28 0.201 110.188 ms 0.01% 110.099 ms 0.02% -88.670 us -0.08% FAST
I128 U64 2^16 1 72.663 us 0.44% 72.545 us 0.46% -0.118 us -0.16% SAME
I128 U64 2^20 1 336.947 us 1.45% 335.660 us 1.41% -1.287 us -0.38% SAME
I128 U64 2^24 1 5.732 ms 0.07% 5.727 ms 0.06% -4.177 us -0.07% FAST
I128 U64 2^28 1 110.915 ms 0.02% 110.876 ms 0.02% -38.708 us -0.03% FAST
I128 U64 2^16 0.201 72.656 us 0.53% 72.620 us 0.48% -0.036 us -0.05% SAME
I128 U64 2^20 0.201 327.896 us 1.12% 327.699 us 1.09% -0.197 us -0.06% SAME
I128 U64 2^24 0.201 5.712 ms 0.06% 5.705 ms 0.06% -7.235 us -0.13% FAST
I128 U64 2^28 0.201 110.176 ms 0.02% 110.090 ms 0.01% -85.872 us -0.08% FAST
F32 I32 2^16 1 52.345 us 0.67% 52.118 us 0.65% -0.227 us -0.43% SAME
F32 I32 2^20 1 129.412 us 0.43% 128.640 us 0.36% -0.771 us -0.60% FAST
F32 I32 2^24 1 1.280 ms 0.24% 1.280 ms 0.23% -0.878 us -0.07% SAME
F32 I32 2^28 1 23.604 ms 0.50% 23.604 ms 0.50% -0.176 us -0.00% SAME
F32 I32 2^16 0.201 52.372 us 0.75% 52.443 us 1.15% 0.071 us 0.13% SAME
F32 I32 2^20 0.201 126.933 us 0.48% 126.520 us 0.44% -0.414 us -0.33% SAME
F32 I32 2^24 0.201 1.268 ms 0.20% 1.268 ms 0.23% 0.434 us 0.03% SAME
F32 I32 2^28 0.201 23.001 ms 0.02% 23.000 ms 0.02% -0.999 us -0.00% SAME
F32 U32 2^16 1 51.780 us 0.62% 52.034 us 0.61% 0.254 us 0.49% SAME
F32 U32 2^20 1 128.815 us 0.47% 128.124 us 0.42% -0.692 us -0.54% FAST
F32 U32 2^24 1 1.280 ms 0.23% 1.279 ms 0.24% -1.866 us -0.15% SAME
F32 U32 2^28 1 23.608 ms 0.50% 23.608 ms 0.50% 0.136 us 0.00% SAME
F32 U32 2^16 0.201 51.788 us 0.72% 51.985 us 0.95% 0.198 us 0.38% SAME
F32 U32 2^20 0.201 125.591 us 0.39% 124.759 us 0.40% -0.832 us -0.66% FAST
F32 U32 2^24 0.201 1.267 ms 0.22% 1.265 ms 0.21% -2.122 us -0.17% SAME
F32 U32 2^28 0.201 22.997 ms 0.02% 22.995 ms 0.02% -2.512 us -0.01% SAME
F32 I64 2^16 1 52.887 us 0.63% 52.960 us 0.61% 0.073 us 0.14% SAME
F32 I64 2^20 1 131.997 us 0.51% 130.698 us 0.30% -1.298 us -0.98% FAST
F32 I64 2^24 1 1.283 ms 0.24% 1.283 ms 0.23% -0.214 us -0.02% SAME
F32 I64 2^28 1 23.614 ms 0.50% 23.627 ms 0.50% 12.826 us 0.05% SAME
F32 I64 2^16 0.201 52.899 us 0.96% 53.021 us 1.04% 0.122 us 0.23% SAME
F32 I64 2^20 0.201 129.387 us 0.37% 128.519 us 0.28% -0.868 us -0.67% FAST
F32 I64 2^24 0.201 1.270 ms 0.21% 1.270 ms 0.21% -0.374 us -0.03% SAME
F32 I64 2^28 0.201 23.000 ms 0.01% 23.003 ms 0.02% 2.527 us 0.01% SAME
F32 U64 2^16 1 52.604 us 0.58% 52.682 us 0.61% 0.079 us 0.15% SAME
F32 U64 2^20 1 129.902 us 0.43% 130.013 us 0.40% 0.111 us 0.09% SAME
F32 U64 2^24 1 1.282 ms 0.22% 1.282 ms 0.24% 0.197 us 0.02% SAME
F32 U64 2^28 1 23.631 ms 0.50% 23.640 ms 0.50% 9.257 us 0.04% SAME
F32 U64 2^16 0.201 52.633 us 0.91% 52.770 us 0.90% 0.137 us 0.26% SAME
F32 U64 2^20 0.201 128.060 us 0.57% 128.016 us 0.49% -0.044 us -0.03% SAME
F32 U64 2^24 0.201 1.269 ms 0.21% 1.269 ms 0.21% -0.131 us -0.01% SAME
F32 U64 2^28 0.201 22.999 ms 0.02% 23.001 ms 0.01% 1.865 us 0.01% SAME
F64 I32 2^16 1 60.024 us 0.55% 59.565 us 0.58% -0.459 us -0.76% FAST
F64 I32 2^20 1 184.603 us 0.25% 183.754 us 0.28% -0.849 us -0.46% FAST
F64 I32 2^24 1 2.762 ms 0.12% 2.761 ms 0.14% -1.359 us -0.05% SAME
F64 I32 2^28 1 52.873 ms 0.48% 52.882 ms 0.50% 8.943 us 0.02% SAME
F64 I32 2^16 0.201 60.552 us 0.75% 60.091 us 0.88% -0.461 us -0.76% FAST
F64 I32 2^20 0.201 192.480 us 0.35% 191.912 us 0.36% -0.568 us -0.29% SAME
F64 I32 2^24 0.201 2.767 ms 0.13% 2.766 ms 0.12% -1.145 us -0.04% SAME
F64 I32 2^28 0.201 52.606 ms 0.01% 52.602 ms 0.01% -3.604 us -0.01% SAME
F64 U32 2^16 1 59.004 us 0.46% 59.245 us 0.54% 0.241 us 0.41% SAME
F64 U32 2^20 1 184.120 us 0.30% 184.523 us 0.34% 0.403 us 0.22% SAME
F64 U32 2^24 1 2.762 ms 0.14% 2.761 ms 0.13% -0.609 us -0.02% SAME
F64 U32 2^28 1 52.742 ms 0.18% 53.076 ms 0.50% 334.739 us 0.63% SLOW
F64 U32 2^16 0.201 59.574 us 0.60% 59.871 us 0.66% 0.298 us 0.50% SAME
F64 U32 2^20 0.201 191.985 us 0.36% 192.162 us 0.41% 0.177 us 0.09% SAME
F64 U32 2^24 0.201 2.766 ms 0.12% 2.766 ms 0.12% -0.801 us -0.03% SAME
F64 U32 2^28 0.201 52.613 ms 0.01% 52.603 ms 0.02% -9.463 us -0.02% FAST
F64 I64 2^16 1 60.671 us 0.62% 60.774 us 0.48% 0.104 us 0.17% SAME
F64 I64 2^20 1 187.345 us 0.30% 186.353 us 0.28% -0.992 us -0.53% FAST
F64 I64 2^24 1 2.764 ms 0.13% 2.764 ms 0.12% -0.332 us -0.01% SAME
F64 I64 2^28 1 52.863 ms 0.48% 52.848 ms 0.42% -14.382 us -0.03% SAME
F64 I64 2^16 0.201 61.110 us 0.73% 61.197 us 0.79% 0.086 us 0.14% SAME
F64 I64 2^20 0.201 194.839 us 0.34% 194.243 us 0.38% -0.596 us -0.31% SAME
F64 I64 2^24 0.201 2.768 ms 0.13% 2.768 ms 0.12% 0.174 us 0.01% SAME
F64 I64 2^28 0.201 52.586 ms 0.01% 52.582 ms 0.01% -3.797 us -0.01% SAME
F64 U64 2^16 1 60.073 us 0.43% 60.206 us 0.46% 0.133 us 0.22% SAME
F64 U64 2^20 1 186.214 us 0.38% 185.539 us 0.41% -0.675 us -0.36% SAME
F64 U64 2^24 1 2.763 ms 0.13% 2.764 ms 0.12% 1.205 us 0.04% SAME
F64 U64 2^28 1 52.852 ms 0.47% 52.843 ms 0.39% -9.376 us -0.02% SAME
F64 U64 2^16 0.201 60.541 us 0.80% 60.662 us 0.73% 0.120 us 0.20% SAME
F64 U64 2^20 0.201 194.178 us 0.38% 193.416 us 0.40% -0.762 us -0.39% FAST
F64 U64 2^24 0.201 2.767 ms 0.13% 2.767 ms 0.12% 0.291 us 0.01% SAME
F64 U64 2^28 0.201 52.581 ms 0.01% 52.579 ms 0.01% -2.507 us -0.00% SAME
C64 I32 2^16 1 200.329 us 0.43% 201.152 us 0.45% 0.823 us 0.41% SAME
C64 I32 2^20 1 472.413 us 0.36% 471.564 us 0.32% -0.849 us -0.18% SAME
C64 I32 2^24 1 6.350 ms 0.14% 6.354 ms 0.13% 3.842 us 0.06% SAME
C64 I32 2^28 1 128.020 ms 0.52% 127.579 ms 0.44% -441.791 us -0.35% SAME
C64 I32 2^16 0.201 317.884 us 0.48% 318.706 us 0.49% 0.822 us 0.26% SAME
C64 I32 2^20 0.201 793.874 us 0.66% 793.814 us 0.71% -0.060 us -0.01% SAME
C64 I32 2^24 0.201 12.760 ms 0.26% 12.574 ms 0.24% -185.835 us -1.46% FAST
C64 I32 2^28 0.201 215.159 ms 0.06% 212.199 ms 0.06% -2960.179 us -1.38% FAST
C64 U32 2^16 1 200.423 us 0.43% 200.626 us 0.45% 0.203 us 0.10% SAME
C64 U32 2^20 1 471.598 us 0.32% 471.826 us 0.31% 0.228 us 0.05% SAME
C64 U32 2^24 1 6.366 ms 0.15% 6.364 ms 0.11% -2.155 us -0.03% SAME
C64 U32 2^28 1 128.758 ms 0.54% 127.890 ms 0.49% -867.447 us -0.67% FAST
C64 U32 2^16 0.201 317.681 us 0.51% 317.670 us 0.50% -0.011 us -0.00% SAME
C64 U32 2^20 0.201 795.349 us 0.74% 796.398 us 0.68% 1.049 us 0.13% SAME
C64 U32 2^24 0.201 12.813 ms 0.25% 12.806 ms 0.25% -6.934 us -0.05% SAME
C64 U32 2^28 0.201 216.113 ms 0.04% 216.083 ms 0.05% -30.040 us -0.01% SAME
C64 I64 2^16 1 201.727 us 0.47% 201.636 us 0.45% -0.091 us -0.04% SAME
C64 I64 2^20 1 475.004 us 0.33% 476.260 us 0.35% 1.257 us 0.26% SAME
C64 I64 2^24 1 6.343 ms 0.12% 6.357 ms 0.16% 13.675 us 0.22% SLOW
C64 I64 2^28 1 128.902 ms 0.57% 127.369 ms 0.39% -1532.931 us -1.19% FAST
C64 I64 2^16 0.201 318.934 us 0.54% 320.012 us 0.44% 1.077 us 0.34% SAME
C64 I64 2^20 0.201 798.287 us 0.71% 801.935 us 0.73% 3.647 us 0.46% SAME
C64 I64 2^24 0.201 12.782 ms 0.19% 12.838 ms 0.22% 56.532 us 0.44% SLOW
C64 I64 2^28 0.201 215.607 ms 0.04% 216.589 ms 0.04% 981.892 us 0.46% SLOW
C64 U64 2^16 1 201.418 us 0.50% 202.143 us 0.51% 0.724 us 0.36% SAME
C64 U64 2^20 1 475.012 us 0.30% 476.798 us 0.30% 1.787 us 0.38% SLOW
C64 U64 2^24 1 6.383 ms 0.15% 6.393 ms 0.18% 10.415 us 0.16% SLOW
C64 U64 2^28 1 129.860 ms 0.51% 127.940 ms 0.35% -1919.541 us -1.48% FAST
C64 U64 2^16 0.201 318.325 us 0.54% 319.237 us 0.52% 0.912 us 0.29% SAME
C64 U64 2^20 0.201 797.370 us 0.73% 799.869 us 0.75% 2.499 us 0.31% SAME
C64 U64 2^24 0.201 12.821 ms 0.26% 12.830 ms 0.20% 8.816 us 0.07% SAME
C64 U64 2^28 0.201 216.320 ms 0.05% 216.655 ms 0.06% 335.448 us 0.16% SLOW
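
The Status column above follows the usual benchmark-comparison convention: a timing difference is only flagged FAST or SLOW when it exceeds the measurement noise, otherwise it is reported as SAME. A minimal sketch of that classification rule follows; the exact threshold logic is an assumption for illustration, not nvbench's actual implementation.

```python
def classify(ref_time, ref_noise_pct, cmp_time, cmp_noise_pct, min_noise_pct=0.5):
    """Classify a benchmark comparison as SAME, FAST, or SLOW.

    A relative difference counts as significant only when it exceeds the
    larger of the two measured noise levels, floored at min_noise_pct.
    These thresholds are illustrative, not nvbench's actual defaults.
    """
    diff_pct = (cmp_time - ref_time) / ref_time * 100.0
    threshold = max(ref_noise_pct, cmp_noise_pct, min_noise_pct)
    if abs(diff_pct) <= threshold:
        return "SAME"
    return "SLOW" if diff_pct > 0 else "FAST"

# Example rows from the table above (times in us, noise in %):
print(classify(841.153, 0.21, 828.317, 0.27))  # I8 I32 2^24: -1.53% -> FAST
print(classify(49.463, 0.52, 49.389, 0.54))    # I8 I32 2^16: -0.15% -> SAME
```

Under this rule the mixed FAST/SLOW entries above are all within roughly one noise band of each other, which supports the conclusion that there is no real regression.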

Contributor

🟩 CI finished in 1h 50m: Pass: 100%/78 | Total: 2d 04h | Avg: 40m 19s | Max: 1h 11m | Hits: 261%/12760
  • 🟩 cub: Pass: 100%/38 | Total: 1d 07h | Avg: 49m 55s | Max: 1h 11m | Hits: 371%/3540

    🟩 cpu
      🟩 amd64              Pass: 100%/36  | Total:  1d 05h | Avg: 49m 11s | Max:  1h 11m | Hits: 371%/3540  
      🟩 arm64              Pass: 100%/2   | Total:  2h 06m | Avg:  1h 03m | Max:  1h 03m
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  4h 53m | Avg: 58m 45s | Max:  1h 06m | Hits: 371%/885   
      🟩 12.5               Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
      🟩 12.6               Pass: 100%/31  | Total:  1d 00h | Avg: 47m 19s | Max:  1h 09m | Hits: 371%/2655  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 56m 59s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  4h 53m | Avg: 58m 45s | Max:  1h 06m | Hits: 371%/885   
      🟩 nvcc12.5           Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
      🟩 nvcc12.6           Pass: 100%/29  | Total: 22h 33m | Avg: 46m 40s | Max:  1h 09m | Hits: 371%/2655  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 37s | Max: 56m 59s
      🟩 nvcc               Pass: 100%/36  | Total:  1d 05h | Avg: 49m 32s | Max:  1h 11m | Hits: 371%/3540  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  3h 41m | Avg: 55m 25s | Max: 58m 13s
      🟩 Clang15            Pass: 100%/1   | Total: 55m 56s | Avg: 55m 56s | Max: 55m 56s
      🟩 Clang16            Pass: 100%/1   | Total: 54m 52s | Avg: 54m 52s | Max: 54m 52s
      🟩 Clang17            Pass: 100%/1   | Total: 53m 30s | Avg: 53m 30s | Max: 53m 30s
      🟩 Clang18            Pass: 100%/7   | Total:  5h 40m | Avg: 48m 36s | Max:  1h 02m
      🟩 GCC7               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 29s | Max: 57m 38s
      🟩 GCC8               Pass: 100%/1   | Total: 52m 02s | Avg: 52m 02s | Max: 52m 02s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 48m | Avg: 54m 08s | Max: 55m 38s
      🟩 GCC10              Pass: 100%/1   | Total: 53m 48s | Avg: 53m 48s | Max: 53m 48s
      🟩 GCC11              Pass: 100%/1   | Total: 53m 05s | Avg: 53m 05s | Max: 53m 05s
      🟩 GCC12              Pass: 100%/3   | Total:  1h 43m | Avg: 34m 35s | Max: 58m 00s
      🟩 GCC13              Pass: 100%/8   | Total:  4h 44m | Avg: 35m 33s | Max:  1h 03m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 12m | Avg:  1h 06m | Max:  1h 06m | Hits: 371%/1770  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  2h 17m | Avg:  1h 08m | Max:  1h 09m | Hits: 371%/1770  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total: 12h 06m | Avg: 51m 52s | Max:  1h 02m
      🟩 GCC                Pass: 100%/18  | Total: 12h 44m | Avg: 42m 27s | Max:  1h 03m
      🟩 MSVC               Pass: 100%/4   | Total:  4h 30m | Avg:  1h 07m | Max:  1h 09m | Hits: 371%/3540  
      🟩 NVHPC              Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 11m
    🟩 gpu
      🟩 h100               Pass: 100%/2   | Total: 45m 46s | Avg: 22m 53s | Max: 26m 03s
      🟩 v100               Pass: 100%/36  | Total:  1d 06h | Avg: 51m 25s | Max:  1h 11m | Hits: 371%/3540  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total:  1d 04h | Avg: 55m 52s | Max:  1h 11m | Hits: 371%/3540  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 31s | Avg: 22m 31s | Max: 22m 31s
      🟩 GraphCapture       Pass: 100%/1   | Total: 19m 07s | Avg: 19m 07s | Max: 19m 07s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 06m | Avg: 22m 04s | Max: 25m 36s
      🟩 TestGPU            Pass: 100%/2   | Total: 56m 49s | Avg: 28m 24s | Max: 30m 51s
    🟩 sm
      🟩 90                 Pass: 100%/2   | Total: 45m 46s | Avg: 22m 53s | Max: 26m 03s
      🟩 90a                Pass: 100%/1   | Total: 23m 37s | Avg: 23m 37s | Max: 23m 37s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total: 13h 38m | Avg: 58m 29s | Max:  1h 11m | Hits: 371%/2655  
      🟩 20                 Pass: 100%/24  | Total: 17h 58m | Avg: 44m 55s | Max:  1h 08m | Hits: 370%/885   
    
  • 🟩 thrust: Pass: 100%/37 | Total: 20h 13m | Avg: 32m 48s | Max: 1h 08m | Hits: 219%/9220

    🟩 cmake_options
      🟩 -DTHRUST_DISPATCH_TYPE=Force32bit Pass: 100%/2   | Total: 36m 24s | Avg: 18m 12s | Max: 25m 09s
    🟩 cpu
      🟩 amd64              Pass: 100%/35  | Total: 19h 16m | Avg: 33m 01s | Max:  1h 08m | Hits: 219%/9220  
      🟩 arm64              Pass: 100%/2   | Total: 57m 35s | Avg: 28m 47s | Max: 30m 29s
    🟩 ctk
      🟩 12.0               Pass: 100%/5   | Total:  3h 03m | Avg: 36m 41s | Max: 56m 00s | Hits: 182%/1844  
      🟩 12.5               Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
      🟩 12.6               Pass: 100%/30  | Total: 15h 22m | Avg: 30m 44s | Max:  1h 08m | Hits: 228%/7376  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 52m 23s | Avg: 26m 11s | Max: 27m 07s
      🟩 nvcc12.0           Pass: 100%/5   | Total:  3h 03m | Avg: 36m 41s | Max: 56m 00s | Hits: 182%/1844  
      🟩 nvcc12.5           Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
      🟩 nvcc12.6           Pass: 100%/28  | Total: 14h 30m | Avg: 31m 04s | Max:  1h 08m | Hits: 228%/7376  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 52m 23s | Avg: 26m 11s | Max: 27m 07s
      🟩 nvcc               Pass: 100%/35  | Total: 19h 21m | Avg: 33m 10s | Max:  1h 08m | Hits: 219%/9220  
    🟩 cxx
      🟩 Clang14            Pass: 100%/4   | Total:  2h 04m | Avg: 31m 00s | Max: 32m 36s
      🟩 Clang15            Pass: 100%/1   | Total: 33m 36s | Avg: 33m 36s | Max: 33m 36s
      🟩 Clang16            Pass: 100%/1   | Total: 33m 51s | Avg: 33m 51s | Max: 33m 51s
      🟩 Clang17            Pass: 100%/1   | Total: 32m 35s | Avg: 32m 35s | Max: 32m 35s
      🟩 Clang18            Pass: 100%/7   | Total:  2h 47m | Avg: 23m 54s | Max: 33m 40s
      🟩 GCC7               Pass: 100%/2   | Total:  1h 02m | Avg: 31m 20s | Max: 32m 19s
      🟩 GCC8               Pass: 100%/1   | Total: 31m 01s | Avg: 31m 01s | Max: 31m 01s
      🟩 GCC9               Pass: 100%/2   | Total:  1h 04m | Avg: 32m 03s | Max: 32m 50s
      🟩 GCC10              Pass: 100%/1   | Total: 30m 32s | Avg: 30m 32s | Max: 30m 32s
      🟩 GCC11              Pass: 100%/1   | Total: 34m 23s | Avg: 34m 23s | Max: 34m 23s
      🟩 GCC12              Pass: 100%/1   | Total: 34m 51s | Avg: 34m 51s | Max: 34m 51s
      🟩 GCC13              Pass: 100%/8   | Total:  2h 54m | Avg: 21m 47s | Max: 36m 10s
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 42s | Max:  1h 01m | Hits: 182%/3688  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  2h 45m | Avg: 55m 04s | Max:  1h 08m | Hits: 243%/5532  
      🟩 NVHPC24.7          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/14  | Total:  6h 31m | Avg: 27m 57s | Max: 33m 51s
      🟩 GCC                Pass: 100%/16  | Total:  7h 11m | Avg: 26m 59s | Max: 36m 10s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 42m | Avg: 56m 31s | Max:  1h 08m | Hits: 219%/9220  
      🟩 NVHPC              Pass: 100%/2   | Total:  1h 47m | Avg: 53m 54s | Max: 55m 06s
    🟩 gpu
      🟩 v100               Pass: 100%/37  | Total: 20h 13m | Avg: 32m 48s | Max:  1h 08m | Hits: 219%/9220  
    🟩 jobs
      🟩 Build              Pass: 100%/31  | Total: 18h 41m | Avg: 36m 10s | Max:  1h 08m | Hits: 182%/7376  
      🟩 TestCPU            Pass: 100%/3   | Total: 53m 33s | Avg: 17m 51s | Max: 37m 37s | Hits: 365%/1844  
      🟩 TestGPU            Pass: 100%/3   | Total: 38m 27s | Avg: 12m 49s | Max: 16m 24s
    🟩 sm
      🟩 90a                Pass: 100%/1   | Total: 18m 21s | Avg: 18m 21s | Max: 18m 21s
    🟩 std
      🟩 17                 Pass: 100%/14  | Total:  9h 00m | Avg: 38m 36s | Max:  1h 01m | Hits: 182%/5532  
      🟩 20                 Pass: 100%/21  | Total: 10h 36m | Avg: 30m 19s | Max:  1h 08m | Hits: 274%/3688  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 22s | Avg: 4m 41s | Max: 7m 14s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 22s | Avg:  4m 41s | Max:  7m 14s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 08s | Avg:  2m 08s | Max:  2m 08s
      🟩 Test               Pass: 100%/1   | Total:  7m 14s | Avg:  7m 14s | Max:  7m 14s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 30s | Avg: 25m 30s | Max: 25m 30s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
libcu++
+/- CUB
Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
libcu++
+/- CUB
+/- Thrust
CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 78)

# Runner
53 linux-amd64-cpu16
11 linux-amd64-gpu-v100-latest-1
9 windows-amd64-cpu16
4 linux-arm64-cpu16
1 linux-amd64-gpu-h100-latest-1-testing

@miscco miscco merged commit cac3738 into NVIDIA:main Jan 18, 2025
92 of 95 checks passed
@miscco miscco deleted the drop_cub_min_max branch January 18, 2025 08:46

Backport failed for branch/2.8.x, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally.

git fetch origin branch/2.8.x
git worktree add -d .worktree/backport-3419-to-branch/2.8.x origin/branch/2.8.x
cd .worktree/backport-3419-to-branch/2.8.x
git checkout -b backport-3419-to-branch/2.8.x
ancref=$(git merge-base 3267f42592a4b048aaf5d1bc8756859854e4a93f a0b0250fec1bfe3af1ccb9575fba53a1e8687445)
git cherry-pick -x $ancref..a0b0250fec1bfe3af1ccb9575fba53a1e8687445

davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
… libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404
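
For reference, the change this commit describes amounts to swapping call sites, since the libcu++ functions share the semantics of `std::min`/`std::max`. A minimal host-side sketch — `std::min`/`std::max` below stand in for `cuda::std::min`/`cuda::std::max`, which have the same shape but are additionally device-callable:

```cpp
#include <algorithm>

// Before: cub::min(a, b) / cub::max(a, b)   (now deprecated).
// After:  cuda::std::min(a, b) / cuda::std::max(a, b) from libcu++.
// std::min/std::max here model the cuda::std versions on the host.
inline int clamp_to_range(int v, int lo, int hi)
{
  return std::max(lo, std::min(v, hi));
}
```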
davebayer pushed a commit to davebayer/cccl that referenced this pull request Jan 18, 2025
… libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404
miscco added a commit to miscco/cccl that referenced this pull request Jan 19, 2025
… libcu++ (NVIDIA#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes NVIDIA#3404
davebayer added a commit to davebayer/cccl that referenced this pull request Jan 20, 2025
implement `add_sat`

split `signed`/`unsigned` implementation, improve implementation for MSVC

improve device `add_sat` implementation

add `add_sat` test

improve generic `add_sat` implementation for signed types

implement `sub_sat`

allow more msvc intrinsics on x86

add op tests

partially implement `mul_sat`

implement `div_sat` and `saturate_cast`

add `saturate_cast` test

simplify `div_sat` test

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use the our implementation
* implement `cuda::std::div`
* prefer host's `div_t` like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates supression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup
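
The interface shape introduced above — a single pass that writes the extremum and its position through two separate output iterators — can be sketched on the host as follows. Names and signature are illustrative only, not the actual `DeviceReduce::ArgMin` API:

```cpp
#include <cstddef>

// Illustrative host-side arg-min writing the extremum value and its
// index through two separate output iterators, mirroring the shape of
// the new two-output interface (names are hypothetical).
template <class It, class ValOut, class IdxOut>
void arg_min_two_outputs(It first, It last, ValOut val_out, IdxOut idx_out)
{
  std::size_t best_idx = 0;
  std::size_t i        = 0;
  auto best_val        = *first;
  for (It it = first; it != last; ++it, ++i)
  {
    if (*it < best_val)
    {
      best_val = *it;
      best_idx = i;
    }
  }
  *val_out = best_val; // extremum value
  *idx_out = best_idx; // its position
}
```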

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf, this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface
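
The streaming approach these commits refer to boils down to reducing a huge input in chunks whose per-pass size fits a narrower offset type, then combining the partial results. A hedged host-side sketch of the chunking idea only (the real CUB dispatch is considerably more involved):

```cpp
#include <cstddef>

// Reduce `n` items in chunks of at most `chunk` elements so that each
// partial pass only needs offsets bounded by the chunk size (sketch of
// the streaming idea; the actual CUB dispatch differs).
inline long long streaming_sum(const int* data, std::size_t n, std::size_t chunk)
{
  long long total = 0;
  for (std::size_t off = 0; off < n; off += chunk)
  {
    std::size_t end   = (off + chunk < n) ? off + chunk : n;
    long long partial = 0; // one "launch" worth of work
    for (std::size_t i = off; i < end; ++i)
    {
      partial += data[i];
    }
    total += partial; // combine partials across passes
  }
  return total;
}
```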

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fx thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abtract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel, ...) in the bug report template as in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination where rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add excape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>
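
The workaround above exists because MSVC keeps __cplusplus at 199711L unless /Zc:__cplusplus is passed, so a dialect check must also consult _MSC_VER. A sketch of that detection pattern (the macro name is illustrative, not the one used in CCCL):

```cpp
// A pure __cplusplus check wrongly concludes "no C++11" on MSVC unless
// /Zc:__cplusplus is given; also checking _MSC_VER (the approach CTK
// 12.3 adopted) avoids that.
#if __cplusplus >= 201103L || (defined(_MSC_VER) && _MSC_VER >= 1900)
#  define SKETCH_HAS_CXX11 1
#else
#  define SKETCH_HAS_CXX11 0
#endif

inline bool has_cxx11()
{
  return SKETCH_HAS_CXX11 != 0;
}
```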

Drop CUB's util_compiler.cuh (#3302)

All contained macros were deprecated

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view
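
The idea of an empty view is small enough to sketch directly; this is a minimal C++17-friendly illustration in the spirit of `std::ranges::views::empty`, not the actual libcu++ implementation:

```cpp
#include <cstddef>

// Minimal sketch of an empty view: a range of T that never contains
// any elements, so begin() == end() always holds.
template <class T>
struct empty_view_sketch
{
  T* begin() const
  {
    return nullptr;
  }
  T* end() const
  {
    return nullptr;
  }
  bool empty() const
  {
    return true;
  }
  std::size_t size() const
  {
    return 0;
  }
};
```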

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC
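
The constants involved have semantics that are easy to state as checks; any reimplementation (e.g. for NVRTC) has to agree with the standard `<cmath>`/`<limits>` facilities used here:

```cpp
#include <cmath>
#include <limits>

// INFINITY must compare as infinite and NAN must be a quiet NaN;
// these checks pin down the behavior a builtin-based
// implementation has to reproduce.
inline bool inf_is_inf()
{
  return std::isinf(std::numeric_limits<double>::infinity());
}
inline bool nan_is_nan()
{
  return std::isnan(std::numeric_limits<double>::quiet_NaN());
}
```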

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt: do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and not even do some dispatch based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
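
This vacuous-truth behavior is easy to demonstrate with the standard algorithm:

```cpp
#include <algorithm>
#include <vector>

// all_of over an empty range is vacuously true: there is no element
// for which the predicate could fail.
inline bool all_positive(const std::vector<int>& v)
{
  return std::all_of(v.begin(), v.end(), [](int x) {
    return x > 0;
  });
}
```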

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events!

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>
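
As a loose illustration of the decorator idea above: derive a flat, packable field layout from a class's annotations so an algorithm could consume instances as raw bytes. Every name here (`gpu_struct`, `Pixel`, the type-to-format map) is a sketch for this changelog entry, not the actual cuda.parallel implementation, which works with numpy dtypes.

```python
import struct

# Assumed mapping for this sketch only; the real library maps annotations
# to numpy dtypes rather than struct format characters.
_FMT = {int: "q", float: "d"}


def gpu_struct(cls):
    """Illustrative decorator: derive a flat binary layout from the
    class annotations so a reduction could consume it as raw bytes."""
    fields = list(cls.__annotations__.items())
    fmt = "".join(_FMT[t] for _, t in fields)

    def __init__(self, *values):
        for (name, _), value in zip(fields, values):
            setattr(self, name, value)

    def pack(self):
        # Flatten the field values into the packed binary layout.
        return struct.pack(fmt, *(getattr(self, name) for name, _ in fields))

    cls.__init__ = __init__
    cls.pack = pack
    cls._fields = fields
    return cls


@gpu_struct
class Pixel:
    r: int
    g: int
    b: int


p = Pixel(255, 128, 0)
```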

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8  (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run so _CCCL_ASSERT within a constexpr function breaks it.

Fix this by always using the host definition which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long that it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>
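
A hedged sketch of what the `include_paths.py` helper factored out above might look like: resolve headers bundled under the package's `_include` directory with pathlib. The shapes and names are illustrative only, not the actual cuda.cccl API.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class IncludePaths:
    """Illustrative result type; the real module's shape may differ."""
    cccl: Optional[Path]


def get_include_paths(package_root: Path) -> IncludePaths:
    # Headers are copied next to the package under "_include" at build
    # time; resolve them relative to the installed package directory.
    include_dir = package_root / "_include"
    return IncludePaths(cccl=include_dir if include_dir.is_dir() else None)
```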

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test
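
The bullets above amount to a validation routine for the `__cuda_stream__` protocol: unpack the returned `(version, handle)` pair instead of indexing, and turn an unpackable return value into a clear `TypeError`. A sketch under those assumptions (function and class names are illustrative, not cuda.parallel's actual API):

```python
def validate_stream(stream):
    """Illustrative stream validation via the __cuda_stream__ protocol."""
    try:
        cuda_stream = stream.__cuda_stream__
    except AttributeError:
        raise TypeError("stream object must implement __cuda_stream__")
    try:
        # Unpack instead of indexing so a malformed return value fails loudly.
        version, handle = cuda_stream()
    except (TypeError, ValueError):
        raise TypeError("__cuda_stream__ must return a (version, handle) pair")
    return handle


class FakeStream:
    def __cuda_stream__(self):
        return (0, 0xDEADBEEF)  # (protocol version, driver stream handle)


class BadStream:
    def __cuda_stream__(self):
        return 42  # not unpackable into a pair
```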

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

move to c++17, finalize device optimization

fix msvc compilation, update tests

Deprecate C++11 and C++14 for libcu++ (#3173)

* Deprecate C++11 and C++14 for libcu++

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Implement `abs` and `div` from `cstdlib` (#3153)

* implement integer abs functions
* improve tests, fix constexpr support
* just use our implementation
* implement `cuda::std::div`
* prefer the host's `div_t`-like types
* provide `cuda::std::abs` overloads for floats
* allow fp abs for NVRTC
* silence msvc's warning about conversion from floating point to integral
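
For reference, `div` from `<cstdlib>` (which `cuda::std::div` mirrors) uses C semantics: the quotient truncates toward zero and `x == quot * y + rem` always holds, whereas Python's `divmod` floors. A small sketch of the C behavior for contrast:

```python
def c_div(x: int, y: int):
    """C-style integer division as in div(): the quotient truncates
    toward zero and x == quot * y + rem holds for all inputs."""
    quot = abs(x) // abs(y)
    if (x < 0) != (y < 0):
        quot = -quot
    rem = x - quot * y
    return quot, rem
```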

Fix missing radix sort policies (#3174)

Fixes NVBug 5009941

Introduces new `DeviceReduce::Arg{Min,Max}` interface with two output iterators (#3148)

* introduces new arg{min,max} interface with two output iterators

* adds fp inf tests

* fixes docs

* improves code example

* fixes exec space specifier

* trying to fix deprecation warning for more compilers

* inlines unzip operator

* trying to fix deprecation warning for nvhpc

* integrates suppression fixes in diagnostics

* pre-ctk 11.5 deprecation suppression

* fixes icc

* fix for pre-ctk11.5

* cleans up deprecation suppression

* cleanup

Extend tuning documentation (#3179)

Add codespell pre-commit hook, fix typos in CCCL (#3168)

* Add codespell pre-commit hook
* Automatic changes from codespell.
* Manual changes.

Fix parameter space for TUNE_LOAD in scan benchmark (#3176)

fix various old compiler checks (#3178)

implement C++26 `std::projected` (#3175)

Fix pre-commit config for codespell and remaining typos (#3182)

Massive cleanup of our config (#3155)

Fix UB in atomics with automatic storage (#2586)

* Adds specialized local cuda atomics and injects them into most atomics paths.

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>

* Allow CUDA 12.2 to keep perf; this addresses earlier comments in #478

* Remove extraneous double brackets in unformatted code.

* Merge unsafe atomic logic into `__cuda_is_local`.

* Use `const_cast` for type conversions in cuda_local.h

* Fix build issues from interface changes

* Fix missing __nanosleep on sm70-

* Guard __isLocal from NVHPC

* Use PTX instead of running nothing from NVHPC

* fixup /s/nvrtc/nvhpc

* Fixup missing CUDA ifdef surrounding device code

* Fix codegen

* Bypass some sort of compiler bug on GCC7

* Apply suggestions from code review

* Use unsafe automatic storage atomics in codegen tests

---------

Co-authored-by: Georgy Evtushenko <[email protected]>
Co-authored-by: gonzalobg <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>

Refactor the source code layout for `cuda.parallel` (#3177)

* Refactor the source layout for cuda.parallel

* Add copyright

* Address review feedback

* Don't import anything into `experimental` namespace

* fix import

---------

Co-authored-by: Ashwin Srinath <[email protected]>

new type-erased memory resources (#2824)

s/_LIBCUDACXX_DECLSPEC_EMPTY_BASES/_CCCL_DECLSPEC_EMPTY_BASES/g (#3186)

Document address stability of `thrust::transform` (#3181)

* Do not document _LIBCUDACXX_MARK_CAN_COPY_ARGUMENTS
* Reformat and fix UnaryFunction/BinaryFunction in transform docs
* Mention transform can use proclaim_copyable_arguments
* Document cuda::proclaims_copyable_arguments better
* Deprecate depending on transform functor argument addresses

Fixes: #3053

turn off cuda version check for clangd (#3194)

[STF] jacobi example based on parallel_for (#3187)

* Simple jacobi example with parallel for and reductions

* clang-format

* remove useless capture list

fixes pre-nv_diag suppression issues (#3189)

Prefer c2h::type_name over c2h::demangle (#3195)

Fix memcpy_async* tests (#3197)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

Add type annotations and mypy checks for `cuda.parallel`  (#3180)

* Refactor the source layout for cuda.parallel

* Add initial type annotations

* Update pre-commit config

* More typing

* Fix bad merge

* Fix TYPE_CHECKING and numpy annotations

* typing bindings.py correctly

* Address review feedback

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Fix rendering of cuda.parallel docs (#3192)

* Fix pre-commit config for codespell and remaining typos

* Fix rendering of docs for cuda.parallel

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Enable PDL for DeviceMergeSortBlockSortKernel (#3199)

The kernel already contains a call to _CCCL_PDL_GRID_DEPENDENCY_SYNC.
This commit enables PDL when launching the kernel.

Adds support for large `num_items` to `DeviceReduce::{ArgMin,ArgMax}` (#2647)

* adds benchmarks for reduce::arg{min,max}

* preliminary streaming arg-extremum reduction

* fixes implicit conversion

* uses streaming dispatch class

* changes arg benches to use new streaming reduce

* streaming arg-extrema reduction

* fixes style

* fixes compilation failures

* cleanups

* adds rst style comments

* declare vars const and use clamp

* consolidates argmin argmax benchmarks

* fixes thrust usage

* drops offset type in arg-extrema benchmarks

* fixes clang cuda

* exec space macros

* switch to signed global offset type for slightly better perf

* clarifies documentation

* applies minor benchmark style changes from review comments

* fixes interface documentation and comments

* list-init accumulating output op

* improves style, comments, and tests

* cleans up aggregate init

* renames dispatch class usage in benchmarks

* fixes merge conflicts

* addresses review comments

* addresses review comments

* fixes assertion

* removes superseded implementation

* changes large problem tests to use new interface

* removes obsolete tests for deprecated interface

Fixes for Python 3.7 docs environment (#3206)

Co-authored-by: Ashwin Srinath <[email protected]>

Adds support for large number of items to `DeviceTransform` (#3172)

* moves large problem test helper to common file

* adds support for large num items to device transform

* adds tests for large number of items to device interface

* fixes format

* addresses review comments

cp_async_bulk: Fix test (#3198)

* memcpy_async_tx: Fix bug in test

Two bugs, one of which occurs in practice:

1. There is a missing fence.proxy.space::global between the writes to
   global memory and the memcpy_async_tx. (Occurs in practice)

2. The end of the kernel should be fenced with `__syncthreads()`,
   because the barrier is invalidated in the destructor. If other
   threads are still waiting on it, there will be UB. (Has not yet
   manifested itself)

* cp_async_bulk_tensor: Pre-emptively fence more in test

* cp_async_bulk: Fix test

The global memory pointer could be misaligned.

cudax fixes for msvc 14.41 (#3200)

avoid instantiating class templates in `is_same` implementation when possible (#3203)

Fix: make launchers a CUB detail; make kernel source functions hidden. (#3209)

* Fix: make launchers a CUB detail; make kernel source functions hidden.

* [pre-commit.ci] auto code formatting

* Address review comments, fix which macro gets fixed.

help the ranges concepts recognize standard contiguous iterators in c++14/17 (#3202)

unify macros and cmake options that control the suppression of deprecation warnings (#3220)

* unify macros and cmake options that control the suppression of deprecation warnings

* suppress nvcc warning #186 in thrust header tests

* suppress c++ dialect deprecation warnings in libcudacxx header tests

Fix thread-reduce performance regression (#3225)

cuda.parallel: In-memory caching of build objects (#3216)

* Define __eq__ and __hash__ for Iterators

* Define cache_with_key utility and use it to cache Reduce objects

* Add tests for caching Reduce objects

* Tighten up types

* Updates to support 3.7

* Address review feedback

* Introduce IteratorKind to hold iterator type information

* Use the .kind to generate an abi_name

* Remove __eq__ and __hash__ methods from IteratorBase

* Move helper function

* Formatting

* Don't unpack tuple in cache key

---------

Co-authored-by: Ashwin Srinath <[email protected]>
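
The caching described above boils down to memoizing a factory on a caller-supplied key (so unhashable or expensive-to-compare arguments can still participate). A hedged sketch of such a `cache_with_key`-style utility; the names are illustrative, not the cuda.parallel internals:

```python
import functools


def cache_with_key(key):
    """Memoize func on key(*args, **kwargs) instead of the raw arguments."""
    def decorator(func):
        cache = {}

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            k = key(*args, **kwargs)
            if k not in cache:
                cache[k] = func(*args, **kwargs)
            return cache[k]
        return wrapper
    return decorator


build_calls = []


@cache_with_key(key=lambda dtype: dtype)
def build_reducer(dtype):
    build_calls.append(dtype)  # stands in for an expensive build step
    return f"reducer<{dtype}>"
```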

Just enough ranges for c++14 `span` (#3211)

use generalized concepts portability macros to simplify the `range` concept (#3217)

fixes some issues in the concepts portability macros and then re-implements the `range` concept with `_CCCL_REQUIRES_EXPR`

Use Ruff to sort imports (#3230)

* Update pyproject.tomls for import sorting

* Update files after running pre-commit

* Move ruff config to pyproject.toml

---------

Co-authored-by: Ashwin Srinath <[email protected]>

fix tuning_scan sm90 config issue (#3236)

Co-authored-by: Shijie Chen <[email protected]>

[STF] Logical token (#3196)

* Split the implementation of the void interface into the definition of the interface, and its implementations on streams and graphs.

* Add missing files

* Check if a task implementation can match a prototype where the void_interface arguments are ignored

* Implement ctx.abstract_logical_data() which relies on a void data interface

* Illustrate how to use abstract handles in local contexts

* Introduce an is_void_interface() virtual method in the data interface to potentially optimize some stages

* Small improvements in the examples

* Do not try to allocate or move void data

* Do not use I as a variable

* fix linkage error

* rename abstract_logical_data into logical_token

* Document logical token

* fix spelling error

* fix sphinx error

* reflect name changes

* use meaningful variable names

* simplify logical_token implementation because writeback is already disabled

* add a unit test for token elision

* implement token elision in host_launch

* Remove unused type

* Implement helpers to check if a function can be invoked from a tuple, or from a tuple where we removed tokens

* Much simpler is_tuple_invocable_with_filtered implementation

* Fix buggy test

* Factorize code

* Document that we can ignore tokens for task and host_launch

* Documentation for logical data freeze

Fix ReduceByKey tuning (#3240)

Fix RLE tuning (#3239)

cuda.parallel: Forbid non-contiguous arrays as inputs (or outputs) (#3233)

* Forbid non-contiguous arrays as inputs (or outputs)

* Implement a more robust way to check for contiguity

* Don't bother if cublas unavailable

* Fix how we check for zero-element arrays

* sort imports

---------

Co-authored-by: Ashwin Srinath <[email protected]>
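
One robust way to check contiguity (in the spirit of the bullets above) is to compare the array's strides against the dense row-major strides implied by its shape. A pure-Python sketch, so it runs without NumPy; zero-element arrays are treated as contiguous here, matching NumPy's convention. The function name is illustrative:

```python
def is_c_contiguous(shape, strides, itemsize):
    """True when each stride equals itemsize times the product of all
    trailing extents, i.e. the layout is dense row-major."""
    if 0 in shape:
        return True  # empty arrays are conventionally contiguous
    expected = itemsize
    for extent, stride in zip(reversed(shape), reversed(strides)):
        if extent != 1 and stride != expected:
            return False
        expected *= extent
    return True
```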

expands support for more offset types in segmented benchmark (#3231)

Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects (#3253)

* Add escape hatches to the cmake configuration of the header tests so that we can tests deprecated compilers / dialects

* Do not add option twice

ptx: Add add_instruction.py (#3190)

This file helps create the necessary structure for new PTX instructions.

Co-authored-by: Allard Hendriksen <[email protected]>

Bump main to 2.9.0. (#3247)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop cub::Mutex (#3251)

Fixes: #3250

Remove legacy macros from CUB util_arch.cuh (#3257)

Fixes: #3256

Remove thrust::[unary|binary]_traits (#3260)

Fixes: #3259

Architecture and OS identification macros (#3237)

Bump main to 3.0.0. (#3265)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

Drop thrust not1 and not2 (#3264)

Fixes: #3263

CCCL Internal macro documentation (#3238)

Deprecate GridBarrier and GridBarrierLifetime (#3258)

Fixes: #1389

Require at least gcc7 (#3268)

Fixes: #3267

Drop thrust::[unary|binary]_function (#3274)

Fixes: #3273

Drop ICC from CI (#3277)

[STF] Corruption of the capture list of an extended lambda with a parallel_for construct on a host execution place (#3270)

* Add a test to reproduce a bug observed with parallel_for on a host place

* clang-format

* use _CCCL_ASSERT

* Attempt to debug

* do not create a tuple with a universal reference that is out of scope when we use it, use an lvalue instead

* fix lambda expression

* clang-format

Enable thrust::identity test for non-MSVC (#3281)

This seems to be an oversight when the test was added

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Enable PDL in triple chevron launch (#3282)

It seems PDL was disabled by accident when _THRUST_HAS_PDL was renamed
to _CCCL_HAS_PDL during the review introducing the feature.

Disambiguate line continuations and macro continuations in <nv/target> (#3244)

Drop VS 2017 from CI (#3287)

Fixes: #3286

Drop ICC support in code (#3279)

* Drop ICC from code

Fixes: #3278

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Make CUB NVRTC commandline arguments come from a cmake template (#3292)

Propose the same components (thrust, cub, libc++, cudax, cuda.parallel,...) in the bug report template than in the feature request template (#3295)

Use process isolation instead of default hyper-v for Windows. (#3294)

Try improving build times by using process isolation instead of hyper-v

Co-authored-by: Michael Schellenberger Costa <[email protected]>

[pre-commit.ci] pre-commit autoupdate (#3248)

* [pre-commit.ci] pre-commit autoupdate

updates:
- [github.com/pre-commit/mirrors-clang-format: v18.1.8 → v19.1.6](https://github.com/pre-commit/mirrors-clang-format/compare/v18.1.8...v19.1.6)
- [github.com/astral-sh/ruff-pre-commit: v0.8.3 → v0.8.6](https://github.com/astral-sh/ruff-pre-commit/compare/v0.8.3...v0.8.6)
- [github.com/pre-commit/mirrors-mypy: v1.13.0 → v1.14.1](https://github.com/pre-commit/mirrors-mypy/compare/v1.13.0...v1.14.1)

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Drop Thrust legacy arch macros (#3298)

Which were disabled and could be re-enabled using THRUST_PROVIDE_LEGACY_ARCH_MACROS

Drop Thrust's compiler_fence.h (#3300)

Drop CTK 11.x from CI (#3275)

* Add cuda12.0-gcc7 devcontainer
* Move MSVC2017 jobs to CTK 12.6
That is the only combination for which rapidsai has devcontainers
* Add /Zc:__cplusplus for the libcudacxx tests
* Only add escape hatch for affected CTKs
* Workaround missing cudaLaunchKernelEx on MSVC
cudaLaunchKernelEx requires C++11, but unfortunately <cuda_runtime.h> checks this using the __cplusplus macro, which is reported wrongly for MSVC. CTK 12.3 fixed this by additionally detecting _MSC_VER. As a workaround, we provide our own copy of cudaLaunchKernelEx when it is not available from the CTK.
* Workaround nvcc+MSVC issue
* Regenerate devcontainers

Fixes: #3249

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Update packman and repo_docs versions (#3293)

Co-authored-by: Ashwin Srinath <[email protected]>

Drop Thrust's deprecated compiler macros (#3301)

Drop CUB_RUNTIME_ENABLED and __THRUST_HAS_CUDART__ (#3305)

Adds support for large number of items to `DevicePartition::If` with the `ThreeWayPartition` overload (#2506)

* adds support for large number of items to three-way partition

* adapts interface to use choose_signed_offset_t

* integrates applicable feedback from device-select pr

* changes behavior for empty problems

* unifies grid constant macro

* fixes kernel template specialization mismatch

* integrates _CCCL_GRID_CONSTANT changes

* resolve merge conflicts

* fixes checks in test

* fixes test verification

* improves tests

* makes few improvements to streaming dispatch

* improves code comment on test

* fixes unrelated compiler error

* minor style improvements

Refactor scan tunings (#3262)

Require C++17 for compiling Thrust and CUB (#3255)

* Issue an unsuppressable warning when compiling with < C++17
* Remove C++11/14 presets
* Remove CCCL_IGNORE_DEPRECATED_CPP_DIALECT from headers
* Remove [CUB|THRUST|TCT]_IGNORE_DEPRECATED_CPP_[11|14]
* Remove CUB_ENABLE_DIALECT_CPP[11|14]
* Update CI runs
* Remove C++11/14 CI runs for CUB and Thrust
* Raise compiler minimum versions for C++17
* Update ReadMe
* Drop Thrust's cpp14_required.h
* Add escape hatch for C++17 removal

Fixes: #3252

Implement `views::empty` (#3254)

* Disable pair conversion of subrange with clang in C++17

* Fix namespace views

* Implement `views::empty`

This implements `std::ranges::views::empty`, see https://en.cppreference.com/w/cpp/ranges/empty_view

Refactor `limits` and `climits` (#3221)

* implement builtins for huge val, nan and nans

* change `INFINITY` and `NAN` implementation for NVRTC

cuda.parallel: Add documentation for the current iterators along with examples and tests (#3311)

* Add tests demonstrating usage of different iterators

* Update documentation of reduce_into by merging import code snippet with the rest of the example

* Add documentation for current iterators

* Run pre-commit checks and update accordingly

* Fix comments to refer to the proper lines in the code snippets in the docs

Drop clang<14 from CI, update devcontainers. (#3309)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

[STF] Cleanup task dependencies object constructors (#3291)

* Define tag types for access modes

* - Rework how we build task_dep objects based on access mode tags
- pack_state is now responsible for using a const_cast for read only data

* Greatly simplify the previous attempt : do not define new types, but use integral constants based on the enums

* It seems the const_cast was not necessary, so we can simplify it and avoid dispatching based on access modes

Disable test with a gcc-14 regression (#3297)

Deprecate Thrust's cpp_compatibility.h macros (#3299)

Remove dropped function objects from docs (#3319)

Document `NV_TARGET` macros (#3313)

[STF] Define ctx.pick_stream() which was missing for the unified context (#3326)

* Define ctx.pick_stream() which was missing for the unified context

* clang-format

Deprecate cub::IterateThreadStore (#3337)

Drop CUB's BinaryFlip operator (#3332)

Deprecate cub::Swap (#3333)

Clarify transform output can overlap input (#3323)

Drop CUB APIs with a debug_synchronous parameter (#3330)

Fixes: #3329

Drop CUB's util_compiler.cuh for real (#3340)

PR #3302 planned to drop the file, but only dropped its content. This
was an oversight. So let's drop the entire file.

Drop cub::ValueCache (#3346)

limits offset types for merge sort (#3328)

Drop CDPv1 (#3344)

Fixes: #3341

Drop thrust::void_t (#3362)

Use cuda::std::addressof in Thrust (#3363)

Fix all_of documentation for empty ranges (#3358)

all_of always returns true on an empty range.
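
This is the standard vacuous-truth convention: with zero elements, no element can falsify the predicate. Python's built-in `all()`/`any()` follow the same rule, which makes for a quick sanity check of the semantics `thrust::all_of` documents:

```python
# A predicate over an empty range is vacuously true; its dual is false.
assert all([]) is True
assert any([]) is False
assert all(x > 0 for x in []) is True
```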

[STF] Do not keep track of dangling events in a CUDA graph backend (#3327)

* Unlike the CUDA stream backend, nodes in a CUDA graph are necessarily done when
the CUDA graph completes. Therefore keeping track of "dangling events" is a
waste of time and resources.

* replace can_ignore_dangling_events by track_dangling_events which leads to more readable code

* When not storing the dangling events, we must still perform the deinit operations that were producing these events !

Extract scan kernels into NVRTC-compilable header (#3334)

* Extract scan kernels into NVRTC-compilable header

* Update cub/cub/device/dispatch/dispatch_scan.cuh

Co-authored-by: Georgii Evtushenko <[email protected]>

---------

Co-authored-by: Ashwin Srinath <[email protected]>
Co-authored-by: Georgii Evtushenko <[email protected]>

Drop deprecated aliases in Thrust functional (#3272)

Fixes: #3271

Drop cub::DivideAndRoundUp (#3347)

Use cuda::std::min/max in Thrust (#3364)

Implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16` (#3361)

* implement `cuda::std::numeric_limits` for `__half` and `__nv_bfloat16`

Cleanup util_arch (#2773)

Deprecate thrust::null_type (#3367)

Deprecate cub::DeviceSpmv (#3320)

Fixes: #896

Improves `DeviceSegmentedSort` test run time for large number of items and segments (#3246)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* fixes spelling

* adds tests for large number of segments

* fixes narrowing conversion in tests

* addresses review comments

* fixes includes

Compile basic infra test with C++17 (#3377)

Adds support for large number of items and large number of segments to `DeviceSegmentedSort` (#3308)

* fixes segment offset generation

* switches to analytical verification

* switches to analytical verification for pairs

* addresses review comments

* introduces segment offset type

* adds tests for large number of segments

* adds support for large number of segments

* drops segment offset type

* fixes thrust namespace

* removes about-to-be-deprecated cub iterators

* no exec specifier on defaulted ctor

* fixes gcc7 linker error

* uses local_segment_index_t throughout

* determine offset type based on type returned by segment iterator begin/end iterators

* minor style improvements

Exit with error when RAPIDS CI fails. (#3385)

cuda.parallel: Support structured types as algorithm inputs (#3218)

* Introduce gpu_struct decorator and typing

* Enable `reduce` to accept arrays of structs as inputs

* Add test for reducing arrays-of-struct

* Update documentation

* Use a numpy array rather than ctypes object

* Change zeros -> empty for output array and temp storage

* Add a TODO for typing GpuStruct

* Documentation updates

* Remove test_reduce_struct_type from test_reduce.py

* Revert to `to_cccl_value()` accepting ndarray + GpuStruct

* Bump copyrights

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Deprecate thrust::async (#3324)

Fixes: #100

Review/Deprecate CUB `util.ptx` for CCCL 2.x (#3342)

Fix broken `_CCCL_BUILTIN_ASSUME` macro (#3314)

* add compiler-specific path
* fix device code path
* add _CCCL_ASSUME

Deprecate thrust::numeric_limits (#3366)

Replace `typedef` with `using` in libcu++ (#3368)

Deprecate thrust::optional (#3307)

Fixes: #3306

Upgrade to Catch2 3.8 (#3310)

Fixes: #1724

refactor `<cuda/std/cstdint>` (#3325)

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

Update CODEOWNERS (#3331)

* Update CODEOWNERS

* Update CODEOWNERS

* Update CODEOWNERS

* [pre-commit.ci] auto code formatting

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Fix sign-compare warning (#3408)

Implement more cmath functions to be usable on host and device (#3382)

* Implement more cmath functions to be usable on host and device

* Implement math roots functions

* Implement exponential functions

Redefine and deprecate thrust::remove_cvref (#3394)

* Redefine and deprecate thrust::remove_cvref

Co-authored-by: Michael Schellenberger Costa <[email protected]>

Fix assert definition for NVHPC due to constexpr issues (#3418)

NVHPC cannot decide at compile time where the code would run, so `_CCCL_ASSERT` within a constexpr function breaks it.

Fix this by always using the host definition, which should also work on device.

Fixes #3411

Extend CUB reduce benchmarks (#3401)

* Rename max.cu to custom.cu, since it uses a custom operator
* Extend types covered by min.cu to all fundamental types
* Add some notes on how to collect tuning parameters

Fixes: #3283

Update upload-pages-artifact to v3 (#3423)

* Update upload-pages-artifact to v3

* Empty commit

---------

Co-authored-by: Ashwin Srinath <[email protected]>

Replace and deprecate thrust::cuda_cub::terminate (#3421)

`std::linalg` accessors and `transposed_layout` (#2962)

Add round up/down to multiple (#3234)

[FEA]: Introduce Python module with CCCL headers (#3201)

* Add cccl/python/cuda_cccl directory and use from cuda_parallel, cuda_cooperative

* Run `copy_cccl_headers_to_aude_include()` before `setup()`

* Create python/cuda_cccl/cuda/_include/__init__.py, then simply import cuda._include to find the include path.

* Add cuda.cccl._version exactly as for cuda.cooperative and cuda.parallel

* Bug fix: cuda/_include only exists after shutil.copytree() ran.

* Use `f"cuda-cccl @ file://{cccl_path}/python/cuda_cccl"` in setup.py

* Remove CustomBuildCommand, CustomWheelBuild in cuda_parallel/setup.py (they are equivalent to the default functions)

* Replace := operator (needs Python 3.8+)

* Fix oversights: remove `pip3 install ./cuda_cccl` lines from README.md

* Restore original README.md: `pip3 install -e` now works on first pass.

* cuda_cccl/README.md: FOR INTERNAL USE ONLY

* Remove `$pymajor.$pyminor.` prefix in cuda_cccl _version.py (as suggested under https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894035917)

Command used: ci/update_version.sh 2 8 0

* Modernize pyproject.toml, setup.py

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894043178

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894044996

* Install CCCL headers under cuda.cccl.include

Trigger for this change:

* https://github.com/NVIDIA/cccl/pull/3201#discussion_r1894048562

Unexpected accidental discovery: cuda.cooperative unit tests pass without CCCL headers entirely.

* Factor out cuda_cccl/cuda/cccl/include_paths.py

* Reuse cuda_cccl/cuda/cccl/include_paths.py from cuda_cooperative

* Add missing Copyright notice.

* Add missing __init__.py (cuda.cccl)

* Add `"cuda.cccl"` to `autodoc.mock_imports`

* Move cuda.cccl.include_paths into function where it is used. (Attempt to resolve Build and Verify Docs failure.)

* Add # TODO: move this to a module-level import

* Modernize cuda_cooperative/pyproject.toml, setup.py

* Convert cuda_cooperative to use hatchling as build backend.

* Revert "Convert cuda_cooperative to use hatchling as build backend."

This reverts commit 61637d608da06fcf6851ef6197f88b5e7dbc3bbe.

* Move numpy from [build-system] requires -> [project] dependencies

* Move pyproject.toml [project] dependencies -> setup.py install_requires, to be able to use CCCL_PATH

* Remove copy_license() and use license_files=["../../LICENSE"] instead.

* Further modernize cuda_cccl/setup.py to use pathlib

* Trivial simplifications in cuda_cccl/pyproject.toml

* Further simplify cuda_cccl/pyproject.toml, setup.py: remove inconsequential code

* Make cuda_cooperative/pyproject.toml more similar to cuda_cccl/pyproject.toml

* Add taplo-pre-commit to .pre-commit-config.yaml

* taplo-pre-commit auto-fixes

* Use pathlib in cuda_cooperative/setup.py

* CCCL_PYTHON_PATH in cuda_cooperative/setup.py

* Modernize cuda_parallel/pyproject.toml, setup.py

* Use pathlib in cuda_parallel/setup.py

* Add `# TOML lint & format` comment.

* Replace MANIFEST.in with `[tool.setuptools.package-data]` section in pyproject.toml

* Use pathlib in cuda/cccl/include_paths.py

* pre-commit autoupdate (EXCEPT clang-format, which was manually restored)

* Fixes after git merge main

* Resolve warning: AttributeError: '_Reduce' object has no attribute 'build_result'

```
=========================================================================== warnings summary ===========================================================================
tests/test_reduce.py::test_reduce_non_contiguous
  /home/coder/cccl/python/devenv/lib/python3.12/site-packages/_pytest/unraisableexception.py:85: PytestUnraisableExceptionWarning: Exception ignored in: <function _Reduce.__del__ at 0x7bf123139080>

  Traceback (most recent call last):
    File "/home/coder/cccl/python/cuda_parallel/cuda/parallel/experimental/algorithms/reduce.py", line 132, in __del__
      bindings.cccl_device_reduce_cleanup(ctypes.byref(self.build_result))
                                                       ^^^^^^^^^^^^^^^^^
  AttributeError: '_Reduce' object has no attribute 'build_result'

    warnings.warn(pytest.PytestUnraisableExceptionWarning(msg))

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================= 1 passed, 93 deselected, 1 warning in 0.44s ==============================================================
```

* Move `copy_cccl_headers_to_cuda_cccl_include()` functionality to `class CustomBuildPy`

* Introduce cuda_cooperative/constraints.txt

* Also add cuda_parallel/constraints.txt

* Add `--constraint constraints.txt` in ci/test_python.sh

* Update Copyright dates

* Switch to https://github.com/ComPWA/taplo-pre-commit (the other repo has been archived by the owner on Jul 1, 2024)

For completeness: The other repo took a long time to install into the pre-commit cache; so long it led to timeouts in the CCCL CI.

* Remove unused cuda_parallel jinja2 dependency (noticed by chance).

* Remove constraints.txt files, advertise running `pip install cuda-cccl` first instead.

* Make cuda_cooperative, cuda_parallel testing completely independent.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Fix sign-compare warning (#3408) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Try using another runner (because V100 runners seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]"

This reverts commit ea33a218ed77a075156cd1b332047202adb25aa2.

Error message: https://github.com/NVIDIA/cccl/pull/3201#issuecomment-2594012971

* Try using A100 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Also show cuda-cooperative site-packages, cuda-parallel site-packages (after pip install) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Try using l4 runner (because V100 runners still seem to be stuck) [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Restore original ci/matrix.yaml [skip-rapids]

* Use for loop in test_python.sh to avoid code duplication.

* Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]

* Comment out taplo-lint in pre-commit config [skip-rapids][skip-matx][skip-docs][skip-vdc]

* Revert "Run only test_python.sh [skip-rapids][skip-matx][skip-docs][skip-vdc][skip pre-commit.ci]"

This reverts commit ec206fd8b50a6a293e00a5825b579e125010b13d.

* Implement suggestion by @shwina (https://github.com/NVIDIA/cccl/pull/3201#pullrequestreview-2556918460)

* Address feedback by @leofang

---------

Co-authored-by: Bernhard Manfred Gruber <[email protected]>

cuda.parallel: Add optional stream argument to reduce_into() (#3348)

* Add optional stream argument to reduce_into()

* Add tests to check for reduce_into() stream behavior

* Move protocol related utils to separate file and rework __cuda_stream__ error messages

* Fix synchronization issue in stream test and add one more invalid stream test case

* Rename cuda stream validation function after removing leading underscore

* Unpack values from __cuda_stream__ instead of indexing

* Fix linting errors

* Handle TypeError when unpacking invalid __cuda_stream__ return

* Use stream to allocate cupy memory in new stream test

Upgrade to actions/deploy-pages@v4 (from v2), as suggested by @leofang (#3434)

Deprecate `cub::{min, max}` and replace internal uses with those from libcu++ (#3419)

* Deprecate `cub::{min, max}` and replace internal uses with those from libcu++

Fixes #3404

Fix CI issues (#3443)

update docs

fix review

restrict allowed types

replace constexpr implementations with generic

optimize `__is_arithmetic_integral`