Prune CUB's ChainedPolicy by __CUDA_ARCH_LIST__ #2154
Conversation
Force-pushed from a2d8c87 to dfbb1ae
Force-pushed from dfbb1ae to 7f36941
Force-pushed from 71a0f9a to 04e42c9
The unit tests now list all virtual architectures, since the list was shorter than I expected.
Force-pushed from 04e42c9 to 4809a7e
Force-pushed from 4809a7e to c5f9680
I reworked the feature so that it now only ever instantiates for the PTX versions that appear in `__CUDA_ARCH_LIST__`.
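For context, here is a minimal standalone sketch (not code from this PR) of the macro the rework keys on: recent nvcc versions define `__CUDA_ARCH_LIST__` as a comma-separated list of the compiled virtual architectures, so the values the runtime PTX version can take are known at compile time. The sample `-gencode` flags in the comment are only an example.

```cpp
#include <cstdio>

// Recent nvcc versions define __CUDA_ARCH_LIST__ as a comma-separated list of
// the targeted virtual architectures, e.g. 750,860 for a build with
//   -gencode arch=compute_75,code=sm_75 -gencode arch=compute_86,code=sm_86
#ifdef __CUDA_ARCH_LIST__
constexpr int compiled_ptx_versions[] = {__CUDA_ARCH_LIST__};
#else
constexpr int compiled_ptx_versions[] = {0}; // unknown toolchain: prune nothing
#endif

int main()
{
  // The PTX version observed at run time can only be one of these values,
  // so tuning policies tied to any other version are dead code.
  for (int v : compiled_ptx_versions)
  {
    std::printf("compiled PTX version: %d\n", v);
  }
}
```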
Force-pushed from f6522ab to 75d0f10
🟨 CI finished in 6h 48m: Pass: 80%/250 | Total: 6d 04h | Avg: 35m 38s | Max: 1h 27m | Hits: 64%/17277
Modified projects: CUB
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 250): 178 × linux-amd64-cpu16, 41 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
🟨 CI finished in 1d 15h: Pass: 98%/250 | Total: 4d 21h | Avg: 28m 17s | Max: 1h 11m | Hits: 64%/17277
Modified projects: CUB
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 250): 178 × linux-amd64-cpu16, 41 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
Thanks for the great work, I love the idea! 💚
Since the mechanism is at the core of CUB, I want to make sure it all works as expected. I left a few comments that I hope will further improve test coverage.
Force-pushed from 9c627c6 to 8766281
🟨 CI finished in 9h 27m: Pass: 83%/250 | Total: 1d 23h | Avg: 11m 19s | Max: 48m 33s | Hits: 98%/16565
Modified projects: CUB
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 250): 178 × linux-amd64-cpu16, 41 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
Force-pushed from bd97bb1 to bedf081
Force-pushed from 44e810f to 5b3ef92
🟨 CI finished in 4h 01m: Pass: 98%/250 | Total: 5d 21h | Avg: 33m 57s | Max: 1h 17m | Hits: 65%/16657
Modified projects: CUB, Thrust
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 250): 178 × linux-amd64-cpu16, 41 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
Force-pushed from 5b3ef92 to 524f81c
🟨 CI finished in 7h 42m: Pass: 99%/251 | Total: 5d 23h | Avg: 34m 12s | Max: 1h 05m | Hits: 64%/17373
Modified projects: CUB, Thrust
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 251): 178 × linux-amd64-cpu16, 42 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
Co-authored-by: Elias Stehle <[email protected]>
```
/home/coder/cccl/thrust/thrust/cmake/../../thrust/iterator/detail/transform_input_output_iterator.inl:68:9: error: writing 1 byte into a region of size 0 [-Werror=stringop-overflow=]
   68 |     *io = output_function(x);
      |     ~~~~^~~~~~~~~~~~~~~~~~~~~
```
Force-pushed from 524f81c to c2b6385
🟨 CI finished in 4h 12m: Pass: 99%/251 | Total: 3d 15h | Avg: 21m 01s | Max: 1h 40m | Hits: 72%/17373
Modified projects: CUB, Thrust
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 251): 178 × linux-amd64-cpu16, 42 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
🟨 CI finished in 4h 49m: Pass: 99%/251 | Total: 3d 16h | Avg: 21m 03s | Max: 1h 40m | Hits: 72%/17373
Modified projects: CUB, Thrust
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 251): 178 × linux-amd64-cpu16, 42 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
🟩 CI finished in 5h 24m: Pass: 100%/251 | Total: 3d 16h | Avg: 21m 08s | Max: 1h 40m | Hits: 72%/17373
Modified projects: CUB, Thrust
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 251): 178 × linux-amd64-cpu16, 42 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
@gevtushenko ping.
Per @elstehle's request, I tested this successfully with libcudf 24.10 without our current scan-tuning patch, and it worked well for us.
@gevtushenko ping.
🟩 CI finished in 2h 05m: Pass: 100%/251 | Total: 6d 03h | Avg: 35m 22s | Max: 1h 42m | Hits: 64%/17373
Modified projects: CUB, Thrust
Modified projects or dependencies: CUB, Thrust, pycuda
🏃 Runner counts (total jobs: 251): 178 × linux-amd64-cpu16, 42 × linux-amd64-gpu-v100-latest-1, 16 × linux-arm64-cpu16, 15 × windows-amd64-cpu16
Motivated by @gevtushenko and @elstehle explaining to me why CUB instantiates so many kernels and why having many tuning policies is bad, here is a mitigation: when the macro `__CUDA_ARCH_LIST__` is available, we know at compile time which values the runtime PTX version can take, so we can prune the number of dispatches CUB generates from the tuning policies down to only those versions. This should give us faster compilation and allow us to use tuning policies more liberally.
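As a rough illustration, here is a simplified sketch of the idea, not CUB's actual `ChainedPolicy` implementation: a chained dispatcher that uses `if constexpr` together with `__CUDA_ARCH_LIST__` to discard chain links that no compiled architecture could ever select, so their kernels are never instantiated. The `policy_reachable` helper, the `policy500`/`policy800` tunings, and the `dispatch_op` functor are made-up names for illustration, and CUB's real pruning logic is more precise.

```cpp
#include <cstdio>

// __CUDA_ARCH_LIST__ (recent nvcc) is a comma-separated list of the targeted
// virtual architectures, e.g. 750,860. Without it (plain host compiler),
// keep every policy.
#ifdef __CUDA_ARCH_LIST__
constexpr int arch_list[] = {__CUDA_ARCH_LIST__};
#else
constexpr int arch_list[] = {0};
#endif

// A policy tuned for PTX version `version` can only win the dispatch if some
// compiled architecture is at least that new; otherwise it can be skipped.
// (Illustrative rule only.)
constexpr bool policy_reachable(int version)
{
  for (int arch : arch_list)
  {
    if (arch == 0 || arch >= version)
    {
      return true;
    }
  }
  return false;
}

// Simplified ChainedPolicy: walk from the newest policy down to the oldest,
// discarding unreachable links with `if constexpr` so the functor (and the
// kernels it would instantiate) is never stamped out for them.
template <int Version, class Policy, class PrevPolicy>
struct ChainedPolicy
{
  template <class Functor>
  static void Invoke(int runtime_ptx, Functor& f)
  {
    if constexpr (!policy_reachable(Version))
    {
      PrevPolicy::Invoke(runtime_ptx, f); // pruned: Policy is never used
    }
    else
    {
      if (runtime_ptx < Version)
      {
        PrevPolicy::Invoke(runtime_ptx, f);
      }
      else
      {
        f.template Invoke<Policy>();
      }
    }
  }
};

// Terminal link: the oldest policy acts as its own predecessor.
template <int Version, class Policy>
struct ChainedPolicy<Version, Policy, Policy>
{
  template <class Functor>
  static void Invoke(int, Functor& f)
  {
    f.template Invoke<Policy>();
  }
};

// Hypothetical tuning policies chained in the CUB style.
struct policy500 : ChainedPolicy<500, policy500, policy500>
{
  static constexpr int items_per_thread = 4;
};
struct policy800 : ChainedPolicy<800, policy800, policy500>
{
  static constexpr int items_per_thread = 8;
};

// Stand-in for CUB's dispatch functor; Invoke<ActivePolicy>() is where a real
// dispatcher would instantiate and launch the kernel template.
struct dispatch_op
{
  template <class ActivePolicy>
  void Invoke()
  {
    std::printf("items_per_thread = %d\n", ActivePolicy::items_per_thread);
  }
};

int main()
{
  int runtime_ptx = 750; // in CUB this would come from cub::PtxVersion()
  dispatch_op op;
  policy800::Invoke(runtime_ptx, op); // only reachable policies instantiate op
}
```

Building such a sketch with, say, only `-gencode arch=compute_75,code=sm_75` discards the `policy800` branch at compile time, while a build without `__CUDA_ARCH_LIST__` keeps the full chain; that is the source of the compile-time and binary-size savings measured below.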
Compile time and binary size of `cub.example.device.radix_sort` before and after: