[WIP] Changes to expect_columns_equivalent() for supporting lists #6091
Conversation
Some failing struct tests:
1. ListStructList is failing because of corruption.
2. A couple of other tests are failing because the tests themselves are wrong. Will fix later.
Except ListStructList. :/
Please update the changelog in order to start CI tests. View the gpuCI docs here.
Recursion was causing threads to run out of stack space.
Ah, shoot. The changes seem to affect the slice/split tests for lists.
This is not a change we can make lightly. Adding list comparison to the row_equality_comparator is likely going to cause significant performance regressions across operations like join/groupby.
```cpp
template <typename Element,
          std::enable_if_t<std::is_same<Element, cudf::list_view>::value>* = nullptr>
__device__ bool operator()(size_type lhs_element_index, size_type rhs_element_index);
```
This has the potential for massive performance implications across all features that use row_equality_comparator
. I'm strongly opposed to merging this change until a thorough performance analysis has been completed of things like join/groupby that look at the change in register/stack frame usage as result of this addition.
massive performance implications across all features that use row_equality_comparator.
Pardon me, but shouldn't this only affect schemas containing lists/structs, where they're compared for equivalence? Would this affect existing operators working on primitive columns?
Edit: I think I see the concern now. The device-side type dispatch has the potential to complicate the row_equality_comparator, to the point that performance is affected.
I will examine how to benchmark this.
shouldn't this only affect schemas containing lists/structs, where they're compared for equivalence?
Not exactly. row_equality_comparator is an example of device-side dispatch, i.e., we iterate over the columns in a row and invoke the type_dispatcher for each column. Since the type_dispatcher is just a big switch, all cases of the switch are instantiated even if they are never executed. If one of those code paths is significantly more complex than the others (e.g., list_view), that can impact things like register and stack frame usage. The compiler has no idea that the list_view code path is never executed, so it has to make resource allocation assumptions based on the worst-case code path, which can then negatively impact performance of the other code paths.
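A host-side sketch may help illustrate this point: a miniature type dispatcher whose switch instantiates every case of a templated comparator, including a deliberately heavier list_view path that gets compiled in even when only the int32 path ever runs. All names here (element_comparator, dispatch, the list_view tag) are simplified stand-ins for the cudf types, not the actual implementation.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for cudf::list_view (a tag type only).
struct list_view {};

// Simplified element comparator: the list_view overload is deliberately
// heavier (it walks the whole buffer), mimicking a more complex code path.
struct element_comparator {
  int const* lhs;
  int const* rhs;
  int n;

  template <typename Element,
            std::enable_if_t<!std::is_same<Element, list_view>::value>* = nullptr>
  bool operator()(int i) const
  {
    return lhs[i] == rhs[i];
  }

  template <typename Element,
            std::enable_if_t<std::is_same<Element, list_view>::value>* = nullptr>
  bool operator()(int) const
  {
    // Deeper, costlier path standing in for recursive list comparison.
    for (int j = 0; j < n; ++j)
      if (lhs[j] != rhs[j]) return false;
    return true;
  }
};

enum class type_id { INT32, LIST };

// Miniature type dispatcher: a switch whose every case instantiates the
// functor template. All branches are compiled, even if only INT32 is ever
// dispatched at run time -- which is why adding a list_view case can raise
// the resource requirements of kernels that never touch lists.
template <typename Functor>
bool dispatch(type_id id, Functor f, int i)
{
  switch (id) {
    case type_id::INT32: return f.template operator()<std::int32_t>(i);
    case type_id::LIST: return f.template operator()<list_view>(i);
  }
  return false;
}
```

On the GPU the analogous cost shows up at compile time: the kernel's register and stack budget is sized for the worst case among all instantiated branches.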
I will examine how to benchmark this.

I would start with using cuobjdump to inspect register/stack frame usage for operations like the join/groupby kernels before and after adding the list_view code path.
The compiler has no idea that the list_view code path is never executed, so it has to make resource allocation assumptions based on the worst-case code path, which can then negatively impact performance of the other code paths.
Thank you for the explanation, @jrhemstad. I understand this better now.
Apologies for the delayed response. I am poring over the output from cuobjdump --dumpsass on hash_join.o, groupby.o, etc. There's a fair bit to peruse there, and I'm not completely clear on what I'm looking for. (Should I be comparing the outputs from --dump-ptx instead?) I would appreciate advice.

In the meantime, I have run the gbenchmarks/ tests for GROUPBY_BENCH, JOIN_BENCH, and SORT_BENCH, before and after this change to row_equality_comparator. I'm not sure the change we're seeing here (after the change) is as large or as detrimental as we expected. (It is entirely possible that I'm not running these correctly.)
Here are some results:
GROUPBY_BENCH
Running ./GROUPBY_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.12, 0.19, 0.30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark ORIG Time AFTER CHANGE Time CPU Iterations
-------------------------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time 0.329 ms 0.330 ms 0.364 ms 2098
Groupby/Basic/10000000/manual_time 3.24 ms 3.31 ms 3.34 ms 211
Groupby/PreSorted/10000000/manual_time 6.32 ms 6.33 ms 6.36 ms 108
Groupby/PreSortedNth/1000000/manual_time 0.361 ms 0.362 ms 0.396 ms 1950
Groupby/PreSortedNth/10000000/manual_time 0.876 ms 0.873 ms 0.908 ms 785
Groupby/PreSortedNth/100000000/manual_time 6.19 ms 6.23 ms 6.26 ms 91
JOIN_BENCH
2020-09-10 13:50:12
Running ./JOIN_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.39, 0.36, 0.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------
Benchmark ORIG Time AFTER CHANGE Time CPU Iterations
-----------------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/join_32bit/100000/100000/manual_time 0.508 ms 0.500 ms 0.532 ms 1393
Join<int32_t, int32_t>/join_32bit/100000/400000/manual_time 0.679 ms 0.670 ms 0.702 ms 1003
Join<int32_t, int32_t>/join_32bit/100000/1000000/manual_time 1.16 ms 1.21 ms 1.24 ms 581
Join<int32_t, int32_t>/join_32bit/10000000/10000000/manual_time 28.8 ms 28.5 ms 28.5 ms 25
Join<int32_t, int32_t>/join_32bit/10000000/40000000/manual_time 77.7 ms 76.5 ms 76.6 ms 9
Join<int32_t, int32_t>/join_32bit/10000000/100000000/manual_time 110 ms 108 ms 108 ms 6
Join<int32_t, int32_t>/join_32bit/100000000/100000000/manual_time 290 ms 287 ms 287 ms 2
Join<int32_t, int32_t>/join_32bit/80000000/240000000/manual_time 500 ms 491 ms 491 ms 2
Join<int64_t, int64_t>/join_64bit/50000000/50000000/manual_time 147 ms 146 ms 146 ms 5
Join<int64_t, int64_t>/join_64bit/40000000/120000000/manual_time 255 ms 251 ms 251 ms 3
SORT_BENCH
2020-09-10 13:51:34
Running ./SORT_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x4)
L1 Instruction 32 KiB (x4)
L2 Unified 256 KiB (x4)
L3 Unified 8192 KiB (x1)
Load Average: 0.73, 0.55, 0.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------------------------------------------
Benchmark ORIG Time AFTER CHANGE Time CPU Iterations
----------------------------------------------------------------------------------------------------------------------
Sort<true>/sort_stable/1024/1/manual_time 0.201 ms 0.198 ms 0.228 ms 3523
Sort<true>/sort_stable/4096/1/manual_time 0.265 ms 0.263 ms 0.292 ms 2651
Sort<true>/sort_stable/32768/1/manual_time 0.350 ms 0.345 ms 0.375 ms 2029
Sort<true>/sort_stable/262144/1/manual_time 0.660 ms 0.656 ms 0.684 ms 1069
Sort<true>/sort_stable/2097152/1/manual_time 5.07 ms 5.11 ms 5.14 ms 137
Sort<true>/sort_stable/16777216/1/manual_time 41.4 ms 41.7 ms 41.7 ms 17
Sort<true>/sort_stable/67108864/1/manual_time 171 ms 172 ms 172 ms 4
Sort<true>/sort_stable/1024/8/manual_time 0.570 ms 0.567 ms 0.598 ms 1217
Sort<true>/sort_stable/4096/8/manual_time 0.745 ms 0.728 ms 0.760 ms 961
Sort<true>/sort_stable/32768/8/manual_time 1.02 ms 1.03 ms 1.06 ms 679
Sort<true>/sort_stable/262144/8/manual_time 2.19 ms 2.19 ms 2.21 ms 319
Sort<true>/sort_stable/2097152/8/manual_time 37.8 ms 38.4 ms 38.4 ms 18
Sort<true>/sort_stable/16777216/8/manual_time 677 ms 673 ms 673 ms 1
Sort<true>/sort_stable/67108864/8/manual_time 3895 ms 3867 ms 3853 ms 1
Sort<false>/sort_unstable/1024/1/manual_time 0.204 ms 0.204 ms 0.235 ms 3413
Sort<false>/sort_unstable/4096/1/manual_time 0.268 ms 0.267 ms 0.298 ms 2621
Sort<false>/sort_unstable/32768/1/manual_time 0.351 ms 0.350 ms 0.381 ms 1988
Sort<false>/sort_unstable/262144/1/manual_time 0.659 ms 0.660 ms 0.690 ms 1060
Sort<false>/sort_unstable/2097152/1/manual_time 5.08 ms 5.13 ms 5.17 ms 137
Sort<false>/sort_unstable/16777216/1/manual_time 41.5 ms 41.7 ms 41.4 ms 17
Sort<false>/sort_unstable/67108864/1/manual_time 171 ms 172 ms 172 ms 4
Sort<false>/sort_unstable/1024/8/manual_time 0.577 ms 0.566 ms 0.598 ms 1231
Sort<false>/sort_unstable/4096/8/manual_time 0.747 ms 0.733 ms 0.765 ms 948
Sort<false>/sort_unstable/32768/8/manual_time 1.02 ms 1.03 ms 1.05 ms 678
Sort<false>/sort_unstable/262144/8/manual_time 2.19 ms 2.19 ms 2.20 ms 318
Sort<false>/sort_unstable/2097152/8/manual_time 37.9 ms 38.4 ms 38.4 ms 18
Sort<false>/sort_unstable/16777216/8/manual_time 676 ms 675 ms 674 ms 1
Sort<false>/sort_unstable/67108864/8/manual_time 3881 ms 3879 ms 3878 ms 1
I am poring over the output from cuobjdump --dumpsass on hash_join.o, groupby.o, etc. There's a fair bit to peruse there, and I'm not completely clear on what I'm looking for.
You want to use cuobjdump --dump-resource-usage and look at the number of registers and the stack frame used per thread.
Hey, @jrhemstad. There are a few object files that indicate the following kind of changes (what's below is from hash_join.o):
Before:
Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
REG:62 STACK:0 SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:144 TEXTURE:0 SURFACE:0 SAMPLER:0
Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
REG:64 STACK:0 SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:144 TEXTURE:0 SURFACE:0 SAMPLER:0
...
After:
Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
REG:72 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
REG:80 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
...
Function _ZN4cudf6detail24compute_join_output_sizeILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEiPi:
REG:74 STACK:UNKNOWN SHARED:32 LOCAL:0 CONSTANT[0]:584 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
Function _ZN4cudf6detail24compute_join_output_sizeILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEiPi:
REG:80 STACK:UNKNOWN SHARED:32 LOCAL:0 CONSTANT[0]:584 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
There were (effectively) 4 functions in hash_join.cu whose STACK went from 0 to UNKNOWN. This also seems to be the case in hash/groupby.cu and semi_join.cu; join.cu remains unaffected.

GROUPBY_BENCH and JOIN_BENCH times don't appear to be affected, to my eye. It is more than likely that these aren't the benchmarks I'm looking for, to demonstrate the effect of STACK:UNKNOWN. :/
The register usage also went up significantly. 62 -> 72, 64 -> 80 in the probe_hash_table kernels. This is the kind of thing I was worried about and makes me very wary. The fact that benchmarks don't show any regression doesn't provide me much solace as that's likely due to a lack of thorough benchmarking that would expose this as problematic.
Thanks for the advice and analysis, @jrhemstad.
I will close this PR, and try to revisit this afterwards.
I will close this PR, and try to revisit this afterwards.
You don't need to close the PR. I'm not saying this will never be merged. I just want to be very careful and make sure we do due diligence before adding this in.
```cpp
bool const lhs_is_null{lhs.nullable() and lhs.is_null(lhs_element_index)};
bool const rhs_is_null{rhs.nullable() and rhs.is_null(rhs_element_index)};
```
While we're at it, might as well clean this up. This nullability check is redundant.
Suggested change:
```cpp
bool const lhs_is_null{lhs.is_null(lhs_element_index)};
bool const rhs_is_null{rhs.is_null(rhs_element_index)};
```
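For illustration, here is a minimal host-side sketch of why the explicit nullable() guard is redundant. The column_view_sketch type below is hypothetical (not cudf's column_device_view); it simply assumes, as the suggestion implies, that is_null() already returns false for non-nullable columns.

```cpp
#include <cstdint>

// Hypothetical miniature of a column view with an optional null mask.
struct column_view_sketch {
  bool has_null_mask;         // corresponds to nullable()
  std::uint64_t null_mask;    // bit i set => element i is valid

  bool nullable() const { return has_null_mask; }

  // A non-nullable column has no nulls, so is_valid() is true without
  // consulting the mask -- the caller never needs its own nullable() check.
  bool is_valid(int i) const
  {
    return !nullable() || ((null_mask >> i) & 1ull);
  }

  bool is_null(int i) const { return !is_valid(i); }
};
```

Under this assumption, `col.nullable() and col.is_null(i)` and plain `col.is_null(i)` always agree, which is what makes the guard removable.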
I'd better close this PR for the moment. It is stale, and will require a different approach. I'll open a new one when I get a chance to work on this again.

P.S. Thank you for the guidance, @jrhemstad.
Closes #5867.

With #5807, struct columns gained the ability to modify child-column null masks without collapsing the underlying data. This poses challenges when comparing structs' list members against exemplars constructed with lists_column_wrapper, because there might be differences in the underlying lists' child members. The equivalence checks added here allow us to check that lists contain the same values, regardless of the shape of their internal child columns.