New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

[WIP] Changes to expect_columns_equivalent() for supporting lists #6091

Closed

mythrocks wants to merge 12 commits into rapidsai:branch-0.17 from mythrocks:lists-eq-0.16-rebase

Contributor

mythrocks commented Aug 25, 2020

Closes #5867.

With #5807, struct columns added the ability to modify child-column null masks, without collapsing the underlying data. This poses challenges when comparing structs' list members against exemplars constructed with lists_column_wrapper, because there might be differences in the underlying lists' child members.

The equivalence checks added here allow us to check that lists contain the same values, regardless of the shape of their internal child columns.

mythrocks added 4 commits

August 25, 2020 10:55


          [WIP] [list_eq] Initial commit:

a159f88

Some failing struct tests:
1. ListStructList is failing because of corruptions.
2. A couple of other failing tests, because the tests
   are wrong. Will fix later.


          [WIP] [list_eq] Fixed most structs tests.

b9f7899

Except ListStructList. :/


          [WIP] [list_eq] Moved list equivalence test to lists_col_wrapper_tests

18e9eb0


          [WIP] [list_eq] Removed debug prints, cruft.

e1b657c

mythrocks requested a review from a team as a code owner

August 25, 2020 22:37

mythrocks requested review from jrhemstad and codereport

August 25, 2020 22:37

Collaborator

GPUtester commented Aug 25, 2020

Please update the changelog in order to start CI tests.

View the gpuCI docs here.


          [WIP] [list_eq] Removed unnecessary header.

5ae2949

Collaborator

GPUtester commented Aug 26, 2020

Please update the changelog in order to start CI tests.

View the gpuCI docs here.


          [WIP] [list_eq] Fixed null-handling in ldv::op==()

149fac1

Collaborator

GPUtester commented Aug 26, 2020

Please update the changelog in order to start CI tests.

View the gpuCI docs here.


          [WIP] [list_eq] Documentation, + descriptive asserts.

Collaborator

GPUtester commented Aug 27, 2020

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

mythrocks added 2 commits

August 27, 2020 16:03


          [WIP] [list_eq] Fixed List<Struct<List>> test.

33b1cb4

Recursion was causing threads to run out of stack space.


          Merge remote-tracking branch 'origin/branch-0.16' into lists-eq-0.16-…

abdd4c0

…rebase

Collaborator

GPUtester commented Aug 27, 2020

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

mythrocks added 2 commits

August 27, 2020 16:07


          [list_eq] Updated CHANGELOG.md.

80c9512


          [list_eq] Code formatter.

456444e

mythrocks self-assigned this

mythrocks added 2 - In Progress 3 - Ready for Review and removed 2 - In Progress labels

mythrocks changed the title ~~[WIP] Changes for expect_columns_equivalent() for lists~~ Changes for expect_columns_equivalent() to support lists

mythrocks changed the title ~~Changes for expect_columns_equivalent() to support lists~~ Changes to expect_columns_equivalent() for supporting lists

harrism changed the title ~~Changes to expect_columns_equivalent() for supporting lists~~ [REVIEW] Changes to expect_columns_equivalent() for supporting lists

Contributor Author

mythrocks commented Aug 28, 2020

Ah, shoot. The changes seem to affect the slice/split tests for lists.

mythrocks added 2 - In Progress and removed 3 - Ready for Review labels

mythrocks changed the title ~~[REVIEW] Changes to expect_columns_equivalent() for supporting lists~~ [WIP] Changes to expect_columns_equivalent() for supporting lists

mythrocks mentioned this pull request

[REVIEW] Support gather() for CUDF struct columns #5984

Merged

jrhemstad requested changes

View reviewed changes

Contributor

jrhemstad left a comment

This is not a change we can make lightly. Adding list comparison to the row_equality_comparator is likely going to cause significant performance regressions across operations like join/groupby.

cpp/include/cudf/table/row_operators.cuh Show resolved Hide resolved

cpp/include/cudf/table/element_equality_comparator.cuh

Comment on lines +102 to +104

+                template <typename Element,
+                          std::enable_if_t<std::is_same<Element, cudf::list_view>::value>* = nullptr>
+                __device__ bool operator()(size_type lhs_element_index, size_type rhs_element_index);

Contributor

jrhemstad Aug 31, 2020

This has the potential for massive performance implications across all features that use row_equality_comparator. I'm strongly opposed to merging this change until a thorough performance analysis has been completed of things like join/groupby that look at the change in register/stack frame usage as result of this addition.

Contributor Author

mythrocks Aug 31, 2020 •

edited

Loading

massive performance implications across all features that use row_equality_comparator.

Pardon me, but shouldn't this only affect schemas containing lists/structs, where they're compared for equivalence? Would this affect existing operators working on primitive columns?

Edit: I think I see the concern now. The device-side type-dispatch has potential to complicate the row_equality_comparator, to the point that performance is affected.
I will examine how to benchmark this.

Contributor

jrhemstad Aug 31, 2020

shouldn't this only affect schemas containing lists/structs, where they're compared for equivalence?

Not exactly. row_equality_comparator is an example of device side dispatch, i.e., we iterate over columns in a row and invoke the type_dispatcher for each column. Since the type_dispatcher is just a big switch, all cases of the switch are instantiated even if they are never executed. If one of those code paths is significantly more complex than the others (e.g., list_view), that can impact things like register and stack frame usage. The compiler has no idea that the list_view code path is never executed, so it has to make resource allocation assumptions based on the worst case code path, which can then negatively impact performance of the other code paths.

I will examine how to benchmark this.

I would start with using cuobjdump to inspect register/stack frame usage for operations like join/groupby kernels before/after adding the list_view code path.

Contributor Author

mythrocks Aug 31, 2020

The compiler has no idea that the list_view code path is never executed, so it has to make resource allocation assumptions based on the worst case code path, which can then negatively impact performance of the other code paths.

Thank you for the explanation, @jrhemstad. I understand this better now.

Contributor Author

mythrocks Sep 11, 2020

Apologies for the delayed response. I am poring over the output from cuobjdump --dumpsass on hash_join.o, groupby.o, etc. There's a fair bit to peruse there, and I'm not completely clear on what I'm looking for. (Should I be comparing the outputs from --dump-ptx instead?) I would appreciate advice.

In the meantime, I have run the gbenchmarks/ tests for GROUPBY_BENCH, JOIN_BENCH, and SORT_BENCH, before and after this change to row_equality_operator. I'm not sure if the change we're seeing here (after the change) is as large/detrimental as we expected. (It is entirely possible that I'm not running these correctly.)
Here are some results:

GROUPBY_BENCH

Running ./GROUPBY_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.12, 0.19, 0.30
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------------------------------------------
Benchmark                                         ORIG Time    AFTER CHANGE Time            CPU   Iterations
-------------------------------------------------------------------------------------------------------------
Groupby/Basic/10000/manual_time                   0.329 ms         0.330 ms             0.364 ms         2098
Groupby/Basic/10000000/manual_time                 3.24 ms          3.31 ms              3.34 ms          211
Groupby/PreSorted/10000000/manual_time             6.32 ms          6.33 ms              6.36 ms          108
Groupby/PreSortedNth/1000000/manual_time          0.361 ms         0.362 ms             0.396 ms         1950
Groupby/PreSortedNth/10000000/manual_time         0.876 ms         0.873 ms             0.908 ms          785
Groupby/PreSortedNth/100000000/manual_time         6.19 ms          6.23 ms              6.26 ms           91

JOIN_BENCH

2020-09-10 13:50:12
Running ./JOIN_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.39, 0.36, 0.35
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------------------------------------
Benchmark                                                            ORIG Time     AFTER CHANGE Time         CPU   Iterations
-----------------------------------------------------------------------------------------------------------------------------
Join<int32_t, int32_t>/join_32bit/100000/100000/manual_time           0.508 ms         0.500 ms         0.532 ms         1393
Join<int32_t, int32_t>/join_32bit/100000/400000/manual_time           0.679 ms         0.670 ms         0.702 ms         1003
Join<int32_t, int32_t>/join_32bit/100000/1000000/manual_time           1.16 ms          1.21 ms          1.24 ms          581
Join<int32_t, int32_t>/join_32bit/10000000/10000000/manual_time        28.8 ms          28.5 ms          28.5 ms           25
Join<int32_t, int32_t>/join_32bit/10000000/40000000/manual_time        77.7 ms          76.5 ms          76.6 ms            9
Join<int32_t, int32_t>/join_32bit/10000000/100000000/manual_time        110 ms           108 ms           108 ms            6
Join<int32_t, int32_t>/join_32bit/100000000/100000000/manual_time       290 ms           287 ms           287 ms            2
Join<int32_t, int32_t>/join_32bit/80000000/240000000/manual_time        500 ms           491 ms           491 ms            2
Join<int64_t, int64_t>/join_64bit/50000000/50000000/manual_time         147 ms           146 ms           146 ms            5
Join<int64_t, int64_t>/join_64bit/40000000/120000000/manual_time        255 ms           251 ms           251 ms            3

SORT_TEST

2020-09-10 13:51:34
Running ./SORT_BENCH
Run on (8 X 4200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 8192 KiB (x1)
Load Average: 0.73, 0.55, 0.42
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------------------------------------------
Benchmark                                               ORIG Time     AFTER CHANGE Time               CPU   Iterations
----------------------------------------------------------------------------------------------------------------------
Sort<true>/sort_stable/1024/1/manual_time               0.201 ms               0.198 ms          0.228 ms         3523
Sort<true>/sort_stable/4096/1/manual_time               0.265 ms               0.263 ms          0.292 ms         2651
Sort<true>/sort_stable/32768/1/manual_time              0.350 ms               0.345 ms          0.375 ms         2029
Sort<true>/sort_stable/262144/1/manual_time             0.660 ms               0.656 ms          0.684 ms         1069
Sort<true>/sort_stable/2097152/1/manual_time             5.07 ms                5.11 ms           5.14 ms          137
Sort<true>/sort_stable/16777216/1/manual_time            41.4 ms                41.7 ms           41.7 ms           17
Sort<true>/sort_stable/67108864/1/manual_time             171 ms                 172 ms            172 ms            4
Sort<true>/sort_stable/1024/8/manual_time               0.570 ms               0.567 ms          0.598 ms         1217
Sort<true>/sort_stable/4096/8/manual_time               0.745 ms               0.728 ms          0.760 ms          961
Sort<true>/sort_stable/32768/8/manual_time               1.02 ms                1.03 ms           1.06 ms          679
Sort<true>/sort_stable/262144/8/manual_time              2.19 ms                2.19 ms           2.21 ms          319
Sort<true>/sort_stable/2097152/8/manual_time             37.8 ms                38.4 ms           38.4 ms           18
Sort<true>/sort_stable/16777216/8/manual_time             677 ms                 673 ms            673 ms            1
Sort<true>/sort_stable/67108864/8/manual_time            3895 ms                3867 ms           3853 ms            1
Sort<false>/sort_unstable/1024/1/manual_time            0.204 ms               0.204 ms          0.235 ms         3413
Sort<false>/sort_unstable/4096/1/manual_time            0.268 ms               0.267 ms          0.298 ms         2621
Sort<false>/sort_unstable/32768/1/manual_time           0.351 ms               0.350 ms          0.381 ms         1988
Sort<false>/sort_unstable/262144/1/manual_time          0.659 ms               0.660 ms          0.690 ms         1060
Sort<false>/sort_unstable/2097152/1/manual_time          5.08 ms                5.13 ms           5.17 ms          137
Sort<false>/sort_unstable/16777216/1/manual_time         41.5 ms                41.7 ms           41.4 ms           17
Sort<false>/sort_unstable/67108864/1/manual_time          171 ms                 172 ms            172 ms            4
Sort<false>/sort_unstable/1024/8/manual_time            0.577 ms               0.566 ms          0.598 ms         1231
Sort<false>/sort_unstable/4096/8/manual_time            0.747 ms               0.733 ms          0.765 ms          948
Sort<false>/sort_unstable/32768/8/manual_time            1.02 ms                1.03 ms           1.05 ms          678
Sort<false>/sort_unstable/262144/8/manual_time           2.19 ms                2.19 ms           2.20 ms          318
Sort<false>/sort_unstable/2097152/8/manual_time          37.9 ms                38.4 ms           38.4 ms           18
Sort<false>/sort_unstable/16777216/8/manual_time          676 ms                 675 ms            674 ms            1
Sort<false>/sort_unstable/67108864/8/manual_time         3881 ms                3879 ms           3878 ms            1

Contributor

jrhemstad Sep 11, 2020 •

edited

Loading

I am poring over the output from cuobjdump --dumpsass on hash_join.o, groupby.o, etc. There's a fair bit to peruse there, and I'm not completely clear on what I'm looking for.

You want to use cubobjdump --dump-resource-usage and look at the number of registers and stack frame used per thread.

Contributor Author

mythrocks Sep 11, 2020 •

edited

Loading

Hey, @jrhemstad. There are a few object files that indicate the following kind of changes: (What's below is from hash_join.o.)
Before:

   Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
    REG:62 STACK:0 SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:144 TEXTURE:0 SURFACE:0 SAMPLER:0
   Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
    REG:64 STACK:0 SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:144 TEXTURE:0 SURFACE:0 SAMPLER:0
...
 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:72 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:80 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0

After:

 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:72 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _ZN4cudf6detail16probe_hash_tableILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128ELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEPiSL_SL_i:
  REG:80 STACK:UNKNOWN SHARED:4112 LOCAL:0 CONSTANT[0]:596 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
...
 Function _ZN4cudf6detail24compute_join_output_sizeILNS0_9join_kindE0E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEiPi:
  REG:74 STACK:UNKNOWN SHARED:32 LOCAL:0 CONSTANT[0]:584 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0
 Function _ZN4cudf6detail24compute_join_output_sizeILNS0_9join_kindE1E29concurrent_unordered_multimapIjimLj4294967295ELi2147483647E14MurmurHash3_32IjE8equal_toIjE17default_allocatorIN6thrust4pairIjiEEELb0EELi128EEEvT0_NS_17table_device_viewESF_NS_10row_hasherI12default_hashLb1EEENS_23row_equality_comparatorILb1EEEiPi:
  REG:80 STACK:UNKNOWN SHARED:32 LOCAL:0 CONSTANT[0]:584 CONSTANT[2]:404 TEXTURE:0 SURFACE:0 SAMPLER:0

There were (effectively) 4 functions in hash_join.cu whose STACK went from 0 to UNKNOWN. This also seems to be the case in hash/groupby.cu, semi_join.cu. join.cu remains unaffected.

GROUPBY_BENCH and JOIN_BENCH times don't appear to be affected, to my eye. It is more than likely that these aren't the benchmarks I'm looking for, to demonstrate the effect of STACK: UNKNOWN. :/

Contributor

jrhemstad Sep 16, 2020

The register usage also went up significantly. 62 -> 72, 64 -> 80 in the probe_hash_table kernels. This is the kind of thing I was worried about and makes me very wary. The fact that benchmarks don't show any regression doesn't provide me much solace as that's likely due to a lack of thorough benchmarking that would expose this as problematic.

Contributor Author

mythrocks Sep 21, 2020

Thanks for the advice and analysis, @jrhemstad.
I will close this PR, and and try revisit this afterwards.

Contributor

jrhemstad Sep 21, 2020

I will close this PR, and and try revisit this afterwards.

You don't need to close the PR. I'm not saying this will never be merged. I just want to be very careful and make sure we do due diligence before adding this in.


          [list_eq] Restore row_equality_comparator to row_comparators.cuh.

59a4cfb

jrhemstad reviewed

View reviewed changes

cpp/include/cudf/table/element_equality_comparator.cuh

Comment on lines +88 to +89

		bool const lhs_is_null{lhs.nullable() and lhs.is_null(lhs_element_index)};
		bool const rhs_is_null{rhs.nullable() and rhs.is_null(rhs_element_index)};

Contributor

jrhemstad Aug 31, 2020

While we're at it, might as well clean this up. This nullability check is redundant.

Suggested change

      
                  bool const lhs_is_null{lhs.nullable() and lhs.is_null(lhs_element_index)};
          
                  bool const rhs_is_null{rhs.nullable() and rhs.is_null(rhs_element_index)};
          
                  bool const lhs_is_null{lhs.is_null(lhs_element_index)};
          
                  bool const rhs_is_null{rhs.is_null(rhs_element_index)};

rapidsai deleted a comment from EpigenRice

harrism added libcudf tests labels

harrism changed the base branch from branch-0.16 to branch-0.17

October 6, 2020 04:54

Contributor Author

mythrocks commented Oct 12, 2020

I'd better close this PR for the moment. It is stale, and will require a different approach. I'll open a new one when I get a chance to work on this again.

mythrocks closed this

Contributor Author

mythrocks commented Oct 12, 2020

P.S. Thank you for the guidance, @jrhemstad.

davidwendt mentioned this pull request

Add libcudf lists column count_elements API #7173

Merged

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2 - In Progress libcudf tests