Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Changes to expect_columns_equivalent() for supporting lists #6091

Closed
wants to merge 12 commits into from

Conversation

mythrocks
Copy link
Contributor

Closes #5867.

With #5807, struct columns added the ability to modify child-column null masks, without collapsing the underlying data. This poses challenges when comparing structs' list members against exemplars constructed with lists_column_wrapper, because there might be differences in the underlying lists' child members.

The equivalence checks added here allow us to check that lists contain the same values, regardless of the shape of their internal child columns.

Some failing struct tests:
1. ListStructList is failing because of corruptions.
2. A couple of other failing tests, because the tests
   are wrong. Will fix later.
@mythrocks mythrocks requested a review from a team as a code owner August 25, 2020 22:37
@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@GPUtester
Copy link
Collaborator

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@mythrocks mythrocks self-assigned this Aug 27, 2020
@mythrocks mythrocks added 2 - In Progress Currently a work in progress 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 27, 2020
@mythrocks mythrocks changed the title [WIP] Changes for expect_columns_equivalent() for lists Changes for expect_columns_equivalent() to support lists Aug 27, 2020
@mythrocks mythrocks changed the title Changes for expect_columns_equivalent() to support lists Changes to expect_columns_equivalent() for supporting lists Aug 27, 2020
@harrism harrism changed the title Changes to expect_columns_equivalent() for supporting lists [REVIEW] Changes to expect_columns_equivalent() for supporting lists Aug 28, 2020
@mythrocks
Copy link
Contributor Author

Ah, shoot. The changes seem to affect the slice/split tests for lists.

@mythrocks mythrocks added 2 - In Progress Currently a work in progress and removed 3 - Ready for Review Ready for review by team labels Aug 28, 2020
@mythrocks mythrocks changed the title [REVIEW] Changes to expect_columns_equivalent() for supporting lists [WIP] Changes to expect_columns_equivalent() for supporting lists Aug 28, 2020
Copy link
Contributor

@jrhemstad jrhemstad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a change we can make lightly. Adding list comparison to the row_equality_comparator is likely going to cause significant performance regressions across operations like join/groupby.

cpp/include/cudf/table/row_operators.cuh Show resolved Hide resolved
Comment on lines +102 to +104
template <typename Element,
std::enable_if_t<std::is_same<Element, cudf::list_view>::value>* = nullptr>
__device__ bool operator()(size_type lhs_element_index, size_type rhs_element_index);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has the potential for massive performance implications across all features that use row_equality_comparator. I'm strongly opposed to merging this change until a thorough performance analysis has been completed of things like join/groupby that look at the change in register/stack frame usage as result of this addition.

Copy link
Contributor Author

@mythrocks mythrocks Aug 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

massive performance implications across all features that use row_equality_comparator.

Pardon me, but shouldn't this only affect schemas containing lists/structs, where they're compared for equivalence? Would this affect existing operators working on primitive columns?

Edit: I think I see the concern now. The device-side type-dispatch has potential to complicate the row_equality_comparator, to the point that performance is affected.
I will examine how to benchmark this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this only affect schemas containing lists/structs, where they're compared for equivalence?

Not exactly. row_equality_comparator is an example of device side dispatch, i.e., we iterate over columns in a row and invoke the type_dispatcher for each column. Since the type_dispatcher is just a big switch, all cases of the switch are instantiated even if they are never executed. If one of those code paths is significantly more complex than the others (e.g., list_view), that can impact things like register and stack frame usage. The compiler has no idea that the list_view code path is never executed, so it has to make resource allocation assumptions based on the worst case code path, which can then negatively impact performance of the other code paths.

I will examine how to benchmark this.

I would start with using cuobjdump to inspect register/stack frame usage for operations like join/groupby kernels before/after adding the list_view code path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The compiler has no idea that the list_view code path is never executed, so it has to make resource allocation assumptions based on the worst case code path, which can then negatively impact performance of the other code paths.

Thank you for the explanation, @jrhemstad. I understand this better now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delayed response. I am poring over the output from cuobjdump --dumpsass on hash_join.o, groupby.o, etc. There's a fair bit to peruse there, and I'm not completely clear on what I'm looking for. (Should I be comparing the outputs from --dump-ptx instead?) I would appreciate advice.

In the meantime, I have run the gbenchmarks/ tests for GROUPBY_BENCH, JOIN_BENCH, and SORT_BENCH, before and after this change to row_equality_operator. I'm not sure if the change we're seeing here (after the change) is as large/detrimental as we expected. (It is entirely possible that I'm not running these correctly.)
Here are some results:

  1. GROUPBY_BENCH
Running ./GROUPBY_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.12, 0.19, 0.30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark                                         ORIG Time    AFTER CHANGE Time            CPU   Iterations
-------------------------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                   0.329 ms         0.330 ms             0.364 ms         2098
Groupby/Basic/10000000/manual_time                 3.24 ms          3.31 ms              3.34 ms          211
Groupby/PreSorted/10000000/manual_time             6.32 ms          6.33 ms              6.36 ms          108
Groupby/PreSortedNth/1000000/manual_time          0.361 ms         0.362 ms             0.396 ms         1950
Groupby/PreSortedNth/10000000/manual_time         0.876 ms         0.873 ms             0.908 ms          785
Groupby/PreSortedNth/100000000/manual_time         6.19 ms          6.23 ms              6.26 ms           91
  1. JOIN_BENCH
2020-09-10 13:50:12
Running ./JOIN_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.39, 0.36, 0.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                            ORIG Time     AFTER CHANGE Time         CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/join_32bit/100000/100000/manual_time           0.508 ms         0.500 ms         0.532 ms         1393
Join<int32_t, int32_t>/join_32bit/100000/400000/manual_time           0.679 ms         0.670 ms         0.702 ms         1003
Join<int32_t, int32_t>/join_32bit/100000/1000000/manual_time           1.16 ms          1.21 ms          1.24 ms          581
Join<int32_t, int32_t>/join_32bit/10000000/10000000/manual_time        28.8 ms          28.5 ms          28.5 ms           25
Join<int32_t, int32_t>/join_32bit/10000000/40000000/manual_time        77.7 ms          76.5 ms          76.6 ms            9
Join<int32_t, int32_t>/join_32bit/10000000/100000000/manual_time        110 ms           108 ms           108 ms            6
Join<int32_t, int32_t>/join_32bit/100000000/100000000/manual_time       290 ms           287 ms           287 ms            2
Join<int32_t, int32_t>/join_32bit/80000000/240000000/manual_time        500 ms           491 ms           491 ms            2
Join<int64_t, int64_t>/join_64bit/50000000/50000000/manual_time         147 ms           146 ms           146 ms            5
Join<int64_t, int64_t>/join_64bit/40000000/120000000/manual_time        255 ms           251 ms           251 ms            3
  1. SORT_TEST
2020-09-10 13:51:34
Running ./SORT_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.73, 0.55, 0.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------------------------------------------
Benchmark                                               ORIG Time     AFTER CHANGE Time               CPU   Iterations
----------------------------------------------------------------------------------------------------------------------
Sort<true>/sort_stable/1024/1/manual_time               0.201 ms               0.198 ms          0.228 ms         3523
Sort<true>/sort_stable/4096/1/manual_time               0.265 ms               0.263 ms          0.292 ms         2651
Sort<true>/sort_stable/32768/1/manual_time              0.350 ms               0.345 ms          0.375 ms         2029
Sort<true>/sort_stable/262144/1/manual_time             0.660 ms               0.656 ms          0.684 ms         1069
Sort<true>/sort_stable/2097152/1/manual_time             5.07 ms                5.11 ms           5.14 ms          137
Sort<true>/sort_stable/16777216/1/manual_time            41.4 ms                41.7 ms           41.7 ms           17
Sort<true>/sort_stable/67108864/1/manual_time             171 ms                 172 ms            172 ms            4
Sort<true>/sort_stable/1024/8/manual_time               0.570 ms               0.567 ms          0.598 ms         1217
Sort<true>/sort_stable/4096/8/manual_time               0.745 ms               0.728 ms          0.760 ms          961
Sort<true>/sort_stable/32768/8/manual_time               1.02 ms                1.03 ms           1.06 ms          679
Sort<true>/sort_stable/262144/8/manual_time              2.19 ms                2.19 ms           2.21 ms          319
Sort<true>/sort_stable/2097152/8/manual_time             37.8 ms                38.4 ms           38.4 ms           18
Sort<true>/sort_stable/16777216/8/manual_time             677 ms                 673 ms            673 ms            1
Sort<true>/sort_stable/67108864/8/manual_time            3895 ms                3867 ms           3853 ms            1
Sort<false>/sort_unstable/1024/1/manual_time            0.204 ms               0.204 ms          0.235 ms         3413
Sort<false>/sort_unstable/4096/1/manual_time            0.268 ms               0.267 ms          0.298 ms         2621
Sort<false>/sort_unstable/32768/1/manual_time           0.351 ms               0.350 ms          0.381 ms         1988
Sort<false>/sort_unstable/262144/1/manual_time          0.659 ms               0.660 ms          0.690 ms         1060
Sort<false>/sort_unstable/2097152/1/manual_time          5.08 ms                5.13 ms           5.17 ms          137
Sort<false>/sort_unstable/16777216/1/manual_time         41.5 ms                41.7 ms           41.4 ms           17
Sort<false>/sort_unstable/67108864/1/manual_time          171 ms                 172 ms            172 ms            4
Sort<false>/sort_unstable/1024/8/manual_time            0.577 ms               0.566 ms          0.598 ms         1231
Sort<false>/sort_unstable/4096/8/manual_time            0.747 ms               0.733 ms          0.765 ms          948
Sort<false>/sort_unstable/32768/8/manual_time            1.02 ms                1.03 ms           1.05 ms          678
Sort<false>/sort_unstable/262144/8/manual_time           2.19 ms                2.19 ms           2.20 ms          318
Sort<false>/sort_unstable/2097152/8/manual_time          37.9 ms                38.4 ms           38.4 ms           18
Sort<false>/sort_unstable/16777216/8/manual_time          676 ms                 675 ms            674 ms            1
Sort<false>/sort_unstable/67108864/8/manual_time         3881 ms                3879 ms           3878 ms            1

Copy link
Contributor

@jrhemstad jrhemstad Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am poring over the output from cuobjdump --dumpsass on hash_join.o, groupby.o, etc. There's a fair bit to peruse there, and I'm not completely clear on what I'm looking for.

You want to use cubobjdump --dump-resource-usage and look at the number of registers and stack frame used per thread.

Copy link
Contributor Author

@mythrocks mythrocks Sep 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, @jrhemstad. There are a few object files that indicate the following kind of changes: (What's below is from hash_join.o.)
Before:

   Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
    REG:62 STACK:0 SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:144 TEXTURE:0 SURFACE:0 SAMPLER:0
   Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
    REG:64 STACK:0 SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:144 TEXTURE:0 SURFACE:0 SAMPLER:0
...
 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:72 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:80 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0

After:

 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:72 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:80 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
...
 Function _ZN4cudf6detail24compute_join_output_sizeILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEiPi:
  REG:74 STACK:UNKNOWN SHARED:32 LOCAL:0 CONSTANT[0]:584 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _ZN4cudf6detail24compute_join_output_sizeILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEiPi:
  REG:80 STACK:UNKNOWN SHARED:32 LOCAL:0 CONSTANT[0]:584 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0

There were (effectively) 4 functions in hash_join.cu whose STACK went from 0 to UNKNOWN. This also seems to be the case in hash/groupby.cu, semi_join.cu. join.cu remains unaffected.

GROUPBY_BENCH and JOIN_BENCH times don't appear to be affected, to my eye. It is more than likely that these aren't the benchmarks I'm looking for, to demonstrate the effect of STACK: UNKNOWN. :/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The register usage also went up significantly. 62 -> 72, 64 -> 80 in the probe_hash_table kernels. This is the kind of thing I was worried about and makes me very wary. The fact that benchmarks don't show any regression doesn't provide me much solace as that's likely due to a lack of thorough benchmarking that would expose this as problematic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the advice and analysis, @jrhemstad.
I will close this PR, and and try revisit this afterwards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will close this PR, and and try revisit this afterwards.

You don't need to close the PR. I'm not saying this will never be merged. I just want to be very careful and make sure we do due diligence before adding this in.

Comment on lines +88 to +89
bool const lhs_is_null{lhs.nullable() and lhs.is_null(lhs_element_index)};
bool const rhs_is_null{rhs.nullable() and rhs.is_null(rhs_element_index)};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're at it, might as well clean this up. This nullability check is redundant.

Suggested change
bool const lhs_is_null{lhs.nullable() and lhs.is_null(lhs_element_index)};
bool const rhs_is_null{rhs.nullable() and rhs.is_null(rhs_element_index)};
bool const lhs_is_null{lhs.is_null(lhs_element_index)};
bool const rhs_is_null{rhs.is_null(rhs_element_index)};

@rapidsai rapidsai deleted a comment from EpigenRice Sep 15, 2020
@harrism harrism added libcudf Affects libcudf (C++/CUDA) code. tests Unit testing for project labels Oct 6, 2020
@harrism harrism changed the base branch from branch-0.16 to branch-0.17 October 6, 2020 04:54
@mythrocks
Copy link
Contributor Author

I'd better close this PR for the moment. It is stale, and will require a different approach. I'll open a new one when I get a chance to work on this again.

@mythrocks mythrocks closed this Oct 12, 2020
@mythrocks
Copy link
Contributor Author

P.S. Thank you for the guidance, @jrhemstad.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. tests Unit testing for project
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QST] expect_columns_equal() vs expect_columns_equivalent()
4 participants