Use cuco::static_set in the hash-based groupby #14813

PointKernel · 2024-01-19T23:13:15Z

Description

Depends on #14849

Contributes to #12261

This PR migrates hash groupby to use the new cuco::static_set data structure. It doesn't change any existing libcudf behavior, but it uncovers the fact that the cuDF Python value_counts doesn't guarantee output order, so the PR becomes a breaking change.
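To illustrate why swapping the underlying hash structure can change result order, here is a minimal host-side sketch. It assumes a toy linear-probing counter (`value_counts_unordered` and `sorted_counts` are hypothetical names, not cuco or libcudf APIs) in which groups come out in slot order, so the order depends on the table capacity. Sorting the result is one way for callers and tests to get a deterministic order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Toy open-addressing counter with linear probing. This is a hypothetical
// stand-in used only for illustration; it is NOT cuco::static_set.
// Assumes non-negative keys and capacity > number of distinct keys.
std::vector<std::pair<int, int>> value_counts_unordered(std::vector<int> const& keys,
                                                        std::size_t capacity)
{
  assert(capacity > 0);
  std::vector<std::optional<std::pair<int, int>>> slots(capacity);
  for (int k : keys) {
    std::size_t i = static_cast<std::size_t>(k) % capacity;
    // Probe forward until we find this key or an empty slot.
    while (slots[i] && slots[i]->first != k) { i = (i + 1) % capacity; }
    if (slots[i]) {
      ++slots[i]->second;
    } else {
      slots[i] = std::make_pair(k, 1);
    }
  }
  // Emit groups in slot order: this order depends on the capacity and the
  // hash, not on the order in which keys first appeared in the input.
  std::vector<std::pair<int, int>> out;
  for (auto const& s : slots) {
    if (s) { out.push_back(*s); }
  }
  return out;
}

// Sorting by key gives callers a deterministic order to compare against.
std::vector<std::pair<int, int>> sorted_counts(std::vector<std::pair<int, int>> v)
{
  std::sort(v.begin(), v.end());
  return v;
}
```

With the keys `{7, 3, 7, 10, 3, 7}`, capacities 4 and 8 produce the same counts in different orders, while the sorted results match.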

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

…428) This PR fixes a bug in the OA base class where the erased key sentinel value should have been initialized by the empty key sentinel if not specified. Tests are updated to exercise this issue. Needed by rapidsai/cudf#14813

PointKernel · 2024-02-16T22:18:29Z

Question to reviewers: is it worth doing runtime dispatching based on whether nested types are involved or not?

Benchmark results are shown below. TL;DR:

The hash table is about half its previous size
The new set-based groupby brings 0% to 30% speedups for flat types
The improvement for nested types is noticeable but smaller than for flat types

Based on our past tuning experience, e.g.:

cudf/cpp/src/search/contains_table.cu

Lines 158 to 165 in 3ba63c3

    
    // Distinguish probing scheme CG sizes between nested and flat types for better performance
    auto const probing_scheme = [&]() {
      if constexpr (HasNested) {
        return cuco::linear_probing<4, Hasher>{d_hasher};
      } else {
        return cuco::linear_probing<1, Hasher>{d_hasher};
      }
    }();

using a larger CG size (like 2 or 4) for nested types can bring significant speedups compared to the current CGSize == 1 for all types.
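For discussion, the runtime dispatch asked about above could look roughly like the sketch below. It is host-only and uses a hypothetical `probing_stub<CGSize>` in place of `cuco::linear_probing<CGSize, Hasher>`; `groupby_cg_size` and `dispatch_groupby` are likewise made-up names, and the CG sizes 4 and 1 are taken from the contains_table.cu snippet rather than tuned for groupby.

```cpp
#include <cassert>

// Hypothetical stand-in for cuco::linear_probing<CGSize, Hasher>; it only
// records the cooperative-group size chosen at compile time.
template <int CGSize>
struct probing_stub {
  static constexpr int cg_size = CGSize;
};

// Compile-time selection, mirroring the `if constexpr (HasNested)` pattern.
// A real implementation would build the set and run the groupby here;
// this sketch just reports which CG size was selected.
template <bool HasNested>
int groupby_cg_size()
{
  if constexpr (HasNested) {
    return probing_stub<4>::cg_size;
  } else {
    return probing_stub<1>::cg_size;
  }
}

// Runtime dispatch: a flag computed from the input table (for example,
// whether any key column is nested) picks one of the two instantiations.
int dispatch_groupby(bool has_nested)
{
  return has_nested ? groupby_cg_size<true>() : groupby_cg_size<false>();
}
```

The cost of this approach is that every templated path is compiled twice, which is part of why it counts as a nontrivial change.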

groupby_max

[0] Quadro RTX 8000

T	num_rows	null_probability	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I32	2^12	0	83.292 us	7.14%	83.962 us	7.56%	0.670 us	0.80%	PASS
I32	2^18	0	123.161 us	21.91%	103.673 us	3.47%	-19.489 us	-15.82%	FAIL
I32	2^24	0	2.702 ms	5.78%	1.886 ms	4.61%	-815.711 us	-30.19%	FAIL
I32	2^12	0.1	106.195 us	4.47%	105.438 us	5.46%	-0.757 us	-0.71%	PASS
I32	2^18	0.1	144.568 us	4.03%	130.310 us	4.62%	-14.258 us	-9.86%	FAIL
I32	2^24	0.1	2.713 ms	2.97%	1.956 ms	3.43%	-756.833 us	-27.90%	FAIL
I32	2^12	0.9	106.164 us	5.38%	105.230 us	5.81%	-0.933 us	-0.88%	PASS
I32	2^18	0.9	134.932 us	3.71%	126.974 us	4.86%	-7.958 us	-5.90%	FAIL
I32	2^24	0.9	2.396 ms	0.49%	1.700 ms	0.94%	-695.351 us	-29.03%	FAIL
I64	2^12	0	79.217 us	4.74%	81.464 us	5.33%	2.247 us	2.84%	PASS
I64	2^18	0	124.065 us	3.35%	110.290 us	3.71%	-13.776 us	-11.10%	FAIL
I64	2^24	0	2.921 ms	5.13%	2.080 ms	3.67%	-841.509 us	-28.81%	FAIL
I64	2^12	0.1	107.544 us	5.30%	106.271 us	5.33%	-1.273 us	-1.18%	PASS
I64	2^18	0.1	158.161 us	12.67%	138.162 us	3.74%	-19.999 us	-12.64%	FAIL
I64	2^24	0.1	2.935 ms	3.03%	2.149 ms	2.33%	-786.224 us	-26.79%	FAIL
I64	2^12	0.9	106.698 us	5.45%	105.586 us	5.56%	-1.112 us	-1.04%	PASS
I64	2^18	0.9	141.798 us	3.56%	131.287 us	5.12%	-10.512 us	-7.41%	FAIL
I64	2^24	0.9	2.528 ms	0.41%	1.843 ms	0.67%	-684.869 us	-27.09%	FAIL
F32	2^12	0	87.432 us	4.34%	87.644 us	5.17%	0.212 us	0.24%	PASS
F32	2^18	0	139.932 us	5.17%	120.252 us	3.90%	-19.680 us	-14.06%	FAIL
F32	2^24	0	3.034 ms	8.05%	2.159 ms	8.12%	-875.115 us	-28.85%	FAIL
F32	2^12	0.1	118.647 us	4.18%	117.394 us	4.97%	-1.254 us	-1.06%	PASS
F32	2^18	0.1	167.872 us	5.11%	145.176 us	3.74%	-22.696 us	-13.52%	FAIL
F32	2^24	0.1	2.997 ms	5.81%	2.172 ms	6.50%	-825.041 us	-27.53%	FAIL
F32	2^12	0.9	121.641 us	6.73%	111.989 us	5.16%	-9.652 us	-7.93%	FAIL
F32	2^18	0.9	143.159 us	4.11%	131.017 us	3.83%	-12.142 us	-8.48%	FAIL
F32	2^24	0.9	2.489 ms	0.43%	1.787 ms	0.84%	-702.675 us	-28.23%	FAIL
F64	2^12	0	87.555 us	4.14%	88.347 us	4.57%	0.792 us	0.90%	PASS
F64	2^18	0	156.124 us	17.01%	127.645 us	4.27%	-28.480 us	-18.24%	FAIL
F64	2^24	0	3.300 ms	8.27%	2.337 ms	8.59%	-963.100 us	-29.19%	FAIL
F64	2^12	0.1	119.458 us	4.38%	117.437 us	5.24%	-2.021 us	-1.69%	PASS
F64	2^18	0.1	173.348 us	4.67%	152.983 us	4.03%	-20.365 us	-11.75%	FAIL
F64	2^24	0.1	3.249 ms	6.93%	2.349 ms	5.91%	-900.036 us	-27.70%	FAIL
F64	2^12	0.9	112.170 us	4.52%	113.566 us	5.73%	1.396 us	1.24%	PASS
F64	2^18	0.9	148.631 us	3.42%	135.485 us	3.78%	-13.146 us	-8.84%	FAIL
F64	2^24	0.9	2.612 ms	0.49%	1.906 ms	0.64%	-705.332 us	-27.01%	FAIL
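As a reading aid for the table above: the Diff and %Diff columns appear to be computed as Cmp minus Ref and that difference relative to Ref, so negative values mean the new code is faster. A small sketch of the arithmetic follows; `percent_diff` and `speedup` are hypothetical helper names, not part of any benchmark tool.

```cpp
#include <cassert>

// Relative change of the compared time versus the reference time, in percent.
// Negative means the compared (new) run is faster than the reference.
double percent_diff(double ref_time, double cmp_time)
{
  assert(ref_time != 0.0);
  return (cmp_time - ref_time) / ref_time * 100.0;
}

// Speedup factor of the compared run over the reference run.
double speedup(double ref_time, double cmp_time)
{
  assert(cmp_time != 0.0);
  return ref_time / cmp_time;
}
```

For the I32, 2^24 rows, null_probability 0 row, `percent_diff(2702.0, 1886.0)` (times in microseconds, using the rounded table values) gives roughly -30.2%, which is about a 1.43x speedup.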

groupby_struct_keys

[0] Quadro RTX 8000

NumRows	Depth	Nulls	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
2^10	0	0	71.830 us	4.79%	68.804 us	6.57%	-3.026 us	-4.21%	PASS
2^16	0	0	78.070 us	5.15%	81.142 us	5.82%	3.072 us	3.94%	PASS
2^20	0	0	267.730 us	8.76%	228.207 us	12.14%	-39.523 us	-14.76%	FAIL
2^10	1	0	192.670 us	4.20%	195.295 us	4.21%	2.625 us	1.36%	PASS
2^16	1	0	199.809 us	4.59%	199.813 us	4.16%	0.004 us	0.00%	PASS
2^20	1	0	403.536 us	4.08%	400.537 us	7.91%	-2.999 us	-0.74%	PASS
2^10	8	0	631.825 us	2.70%	628.752 us	3.03%	-3.073 us	-0.49%	PASS
2^16	8	0	651.027 us	2.83%	645.891 us	2.99%	-5.136 us	-0.79%	PASS
2^20	8	0	1.049 ms	2.41%	990.536 us	2.78%	-58.560 us	-5.58%	FAIL
2^10	0	1	127.173 us	5.66%	127.897 us	3.85%	0.724 us	0.57%	PASS
2^16	0	1	128.940 us	4.33%	132.152 us	4.29%	3.212 us	2.49%	PASS
2^20	0	1	304.425 us	5.95%	268.729 us	7.05%	-35.696 us	-11.73%	FAIL
2^10	1	1	193.054 us	4.19%	195.841 us	4.19%	2.787 us	1.44%	PASS
2^16	1	1	199.877 us	4.14%	201.166 us	4.07%	1.289 us	0.64%	PASS
2^20	1	1	429.069 us	5.37%	431.467 us	7.65%	2.398 us	0.56%	PASS
2^10	8	1	627.959 us	2.97%	625.391 us	2.84%	-2.568 us	-0.41%	PASS
2^16	8	1	654.510 us	3.21%	648.749 us	2.94%	-5.761 us	-0.88%	PASS
2^20	8	1	1.055 ms	2.13%	1.006 ms	2.87%	-49.153 us	-4.66%	FAIL

PointKernel · 2024-02-27T21:48:24Z

@vyasr Thanks for the review, have you reviewed both cpp and python or do I need another cpp/python approval to merge the PR?

vyasr · 2024-02-27T22:29:52Z

I reviewed both. I think the main question on the Python side would be whether we need to update the docs in https://github.com/rapidsai/cudf/blob/branch-24.04/docs/cudf/source/user_guide/pandas-comparison.md#result-ordering to also mention value_counts. @shwina WDYT? It's not immediately obvious that this would be implemented as a groupby under the hood.

shwina · 2024-02-28T18:42:07Z

I think we should do that, yes (this PR is then a breaking change)

bdice · 2024-02-28T19:36:41Z

cpp/src/groupby/hash/groupby.cu

+using probing_scheme_type = cuco::linear_probing<
+  1,  ///< Number of threads used to handle each input key


@PointKernel You mentioned:

Based on our past tuning experience, e.g.:
cudf/cpp/src/search/contains_table.cu
using a larger CG size (like 2 or 4) for nested types can bring significant speedups compared to the current CGSize == 1 for all types.

Do we need to adopt that here? Do we need an issue or TODO describing that optimization?

Do we need to adopt that here?

Probably not. That requires nontrivial changes to the current code and the benefit is unclear, i.e., no users have actually complained about groupby performance with nested types. I'm inclined not to look into it until we have the bandwidth or someone raises an issue about it. Does that sound good to you?

I’d like some kind of note in the code or an issue to make sure that we are aware of this optimization potential for the future. Otherwise, no action needed in terms of implementation.

Makes sense. Done in 56a2229.

vyasr · 2024-02-28T20:15:14Z

Also CC @mroeschke @galipremsagar we may have to introduce an extra sorting in pandas compatibility mode due to the ordering change in this PR for cudf.pandas to behave as expected. No need to make any changes yet, just something to keep on your radars.

PointKernel · 2024-02-29T01:08:15Z

@vyasr @shwina I've updated the doc as suggested and marked this work as a breaking change.

Thanks for your comments.

PointKernel · 2024-02-29T21:25:31Z

/merge

PointKernel added the 2 - In Progress, libcudf, Performance, tech debt, improvement, and non-breaking labels Jan 19, 2024

PointKernel added 2 commits January 19, 2024 15:14

Rewrite hash groupby with hash set

11b8126

Formatting

166ed49

PointKernel force-pushed the cuco-set-groupby branch from 6213d8e to 166ed49 January 19, 2024 23:15

PointKernel mentioned this pull request Jan 19, 2024

[FEA] Refactor hash-based algorithms with new cuco data structures #12261

Open

PointKernel added 2 commits January 22, 2024 13:18

Merge remote-tracking branch 'upstream/branch-24.04' into cuco-set-gr…

07313fe

…oupby

Minor cleanups

b1db243

PointKernel self-assigned this Jan 23, 2024

PointKernel mentioned this pull request Jan 23, 2024

Fix the default erase sentinel bug in the base open addressing ctor NVIDIA/cuCollections#428

Merged

rapidsai deleted a comment from copy-pr-bot bot Feb 16, 2024

PointKernel added 8 commits February 16, 2024 12:25

Merge remote-tracking branch 'upstream/branch-24.04' into cuco-set-gr…

ed8502d

…oupby

Update cuco code

ca6829d

Add CUCO_CUDF_SIZE_TYPE_SENTINEL

0c10a0b

Header cleanups

2470c68

Update docs

7da8c55

Minor doc updates

7dd59a6

Add peak memory usage metrics to groupby NV benchmarks

3cbdb7c

Revert some benchmark changes

82aa0ce

PointKernel added the 3 - Ready for Review label and removed the 2 - In Progress label Feb 16, 2024

PointKernel marked this pull request as ready for review February 16, 2024 22:21

PointKernel requested a review from a team as a code owner February 16, 2024 22:21

PointKernel requested a review from robertmaynard February 16, 2024 22:21

PointKernel requested a review from a team as a code owner February 22, 2024 20:38

PointKernel requested review from vyasr and shwina February 22, 2024 20:38

github-actions bot added the Python label Feb 22, 2024

PointKernel added 2 commits February 22, 2024 12:42

Renaming

574f628

Fix several docstring tests

75a8e64

vyasr removed the tech debt label Feb 23, 2024

PointKernel added 2 commits February 23, 2024 12:57

Make value_counts docstring test deterministic

85a47db

Merge remote-tracking branch 'upstream/branch-24.04' into cuco-set-gr…

241aca0

…oupby

github-actions bot added the cuDF (Python) label Feb 23, 2024

vyasr removed the cuDF (Python) label Feb 23, 2024

vyasr approved these changes Feb 27, 2024

Merge branch 'branch-24.04' into cuco-set-groupby

dbd9e6b

Merge branch 'branch-24.04' into cuco-set-groupby

0af1d13

bdice reviewed Feb 28, 2024

Update docs

f79f1d6

PointKernel added the breaking label and removed the non-breaking label Feb 29, 2024

PointKernel added 4 commits February 28, 2024 23:21

Merge branch 'branch-24.04' into cuco-set-groupby

8263b4f

Add TODO reminder for future performance tuning

56a2229

Merge remote-tracking branch 'origin/cuco-set-groupby' into cuco-set-…

8bade44

…groupby

Merge remote-tracking branch 'upstream/branch-24.04' into cuco-set-gr…

6e54cd9

…oupby

rapids-bot bot merged commit 200fc0b into rapidsai:branch-24.04 Feb 29, 2024
76 checks passed

PointKernel deleted the cuco-set-groupby branch February 29, 2024 21:25
