-
Notifications
You must be signed in to change notification settings - Fork 756
fix > intmax num inputs for scan_by_key #1424
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM aside from one minor change to avoid introducing a build warning.
@@ -512,7 +512,7 @@ namespace __scan_by_key { | |||
inequality_op(equality_op_), | |||
scan_op(scan_op_) | |||
{ | |||
int tile_idx = blockIdx.x; | |||
Size tile_idx = blockIdx.x; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you switch this back to using a static_cast<Size>
in the tile_base
calculation and leave tile_idx
as an int
?
tile_idx
is passed to consume_tile
, which expects it to be an int
, and this change will introduce truncation warnings for 8-byte Size
types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah this is my fault. We should actually we use Size
for tile_idx
all the way down, I think. It's conceivable tile_idx
can rise higher than intmax. In my experiments ITEMS_PER_TILE
was 9 * 256
, meaning 1 >> 43
number of inputs would overflow.
Do you think having tile_idx
as Size
would be problematic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized tile_idx
goes all the way down to TilePrefixCallbackOp
, which accepts int
for tile_idx
in the constructor, so making a change to tile_idx
's type would require changes to all single-pass scan algorithms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A block's x dimension must always fit in an int
, so it's best to leave tile_idx
as-is. If we needed more tiles, it'd need to be handled at a higher level of the implementation.
09c2dcc
to
62e1891
Compare
62e1891
to
33a33d3
Compare
This LGTM, I'll run it through our tests. I should be able to land it before the next release. Thanks for the patch! |
DVS CL: 29947024 run tests |
I make the mistake of thinking |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, that's unfortunate. This patch just got a bit more complicated -- see my inline comment.
@@ -734,7 +734,7 @@ namespace __scan_by_key { | |||
ScanOp scan_op, | |||
AddInitToScan add_init_to_scan) | |||
{ | |||
int num_items = static_cast<int>(thrust::distance(keys_first, keys_last)); | |||
size_t num_items = static_cast<size_t>(thrust::distance(keys_first, keys_last)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will likely introduce performance regressions -- using size_t
unconditionally here will instantiate the scan_by_key implementation with Size=size_t
, increasing register pressure and generating less efficient code for inputs that can be indexed by int
.
Take a look at the macros in thrust/system/cuda/detail/dispatch.h -- these will conditionally switch between using int
or size_t
depending on the actual runtime value.
) same fix seen here, but via patch: NVIDIA/thrust#1424 Also fixes rapidsai/cuspatial#393 Alternatively, we could wait and update our thrust version, rather than patching the existing one. Authors: - Christopher Harris (https://github.com/cwharris) Approvers: - Mark Harris (https://github.com/harrism) - Paul Taylor (https://github.com/trxcllnt) URL: #8199
Closing because this isn't a viable solution without a major overhaul of the single-pass scan utilities, and/or adding conditional dispatched based on the size of input. |
…_key" (#8263) Reverts #8199 According to @allisonvacanti (NVIDIA/thrust#1424 (comment)) this patch will likely have adverse effect on performance. We should revert it until a better solution can be found. Authors: - Christopher Harris (https://github.com/cwharris) Approvers: - David Wendt (https://github.com/davidwendt) - Keith Kraus (https://github.com/kkraus14) - Elias Stehle (https://github.com/elstehle) URL: #8263
We should address this in the long term through NVIDIA/cub#212 and moving Thrust kernels into CUB. |
Fixes NVIDIA/cccl#766. With these updates
scan_by_key
supports a higher number of inputs. The number of inputs is now capped bytile_idx
, which is typeint
. The actual number of supported inputs isintmax * ITEMS_PER_TILE
, whereITEMS_PER_TILE
is determined via cub/thrust PtxPolicy.