
Add groupby_max multi-threaded benchmark #16154

Merged

Conversation

srinivasyadav18
Contributor

Description

This PR adds a multi-threaded groupby_max benchmark. The benchmark runs multiple max groupby aggregations concurrently, using one CUDA stream per host thread; a rough sketch of the pattern is included below.

Closes #16134
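The pattern looks roughly like the following. This is a minimal sketch of the one-stream-per-thread idea, not the benchmark source itself; it assumes prebuilt `keys`/`vals` columns and the stream-accepting `groupby::aggregate` overload discussed later in this thread.

```cpp
// Minimal sketch (not the benchmark source): one host thread per CUDA stream,
// each issuing its own max-groupby aggregation concurrently.
#include <cudf/aggregation.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/groupby.hpp>
#include <cudf/table/table_view.hpp>
#include <rmm/cuda_stream_pool.hpp>

#include <thread>
#include <vector>

void concurrent_groupby_max(cudf::table_view keys, cudf::column_view vals, int num_threads)
{
  rmm::cuda_stream_pool stream_pool(num_threads);

  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      auto stream = stream_pool.get_stream();
      cudf::groupby::groupby gb_obj(keys);

      std::vector<cudf::groupby::aggregation_request> requests(1);
      requests[0].values = vals;
      requests[0].aggregations.push_back(
        cudf::make_max_aggregation<cudf::groupby_aggregation>());

      // Each host thread runs its aggregation on its own stream.
      auto result = gb_obj.aggregate(requests, stream);
      stream.synchronize();
    });
  }
  for (auto& w : workers) { w.join(); }
}
```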

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 requested a review from a team as a code owner July 1, 2024 22:44
@github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) and CMake (CMake build issue) labels Jul 1, 2024
@srinivasyadav18 added the feature request (New feature or request), Performance (Performance related issue), and non-breaking (Non-breaking change) labels Jul 1, 2024
@GregoryKimball
Contributor

GregoryKimball commented Jul 2, 2024

Thank you @srinivasyadav18 for constructing this!

I ran the benchmark and I believe it is working as expected. With an increased thread and stream count, we see higher throughput for smaller batch sizes (roughly 7 to 27 GB/s for 4M-row batches). For larger batches, throughput saturates around ~60 GB/s across the various thread counts.
(plot: throughput vs. batch size for various thread counts)

Two items I noticed:

  • Would you please enable I64 and F64 as optional values for T? We don't need to run them by default, but it would be nice to be able to choose them (see the sketch after this list).
  • This is the first time we are profiling a multi-threaded hash algorithm - congrats! I suppose we don't see a huge benefit because SM utilization is 100% while warp utilization is 34%. Adding more threads doesn't add more warps because the SMs are already active, but it does improve pipelining and gives some boost to throughput (as we were hoping).
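For the type axis, something along these lines is what I have in mind — a rough sketch using nvbench's type axes, not the PR's actual code; the function name, type list, and axis values are placeholders:

```cpp
// Hypothetical sketch: expose I32/I64/F64 through an nvbench type axis.
// Function name, type list, and axis values are placeholders, not the PR's code.
#include <nvbench/nvbench.cuh>

template <typename T>
void bench_groupby_max_multithreaded(nvbench::state& state, nvbench::type_list<T>)
{
  // ... build input columns of type T and run the multi-threaded aggregation ...
}

using Types = nvbench::type_list<int32_t, int64_t, double>;

NVBENCH_BENCH_TYPES(bench_groupby_max_multithreaded, NVBENCH_TYPE_AXES(Types))
  .set_name("groupby_max_multithreaded")
  .set_type_axes_names({"T"})
  .add_int64_power_of_two_axis("num_rows", {12, 18, 24})
  .add_int64_axis("num_threads", {1, 2, 4, 8});
```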

This is what an 8-thread groupby_max looks like with 100M row batches:
(profile screenshot: 8-thread groupby_max with 100M-row batches)

Some commands I was using:
/nfs/nsight-systems-2022.5.1/bin/nsys profile -t nvtx,cuda,osrt -f true --cuda-memory-usage=true --gpu-metrics-device=0 --output=/nfs/20240627_databricks/prof_multigroupby ./GROUPBY_NVBENCH -d 0 --profile -b 2 -a T=I32 -a num_rows[pow2]=[15,20,25] -a cardinality=1000000 --timeout 0.02 -a null_probability=0.1

./GROUPBY_NVBENCH -d 0 --profile -b 2 -a T=I32 -a num_rows[pow2]=[10:28:3] -a cardinality=1000000 --timeout 0.02 -a null_probability=0.1 -a num_threads=[1,2,4,8,16,32]

For the intermediate case of 4M rows per batch, you can see how 8 streams increase SM utilization.
(profile screenshot: SM utilization with 8 streams at 4M rows per batch)

Zooming in on the 4M-row batch case, I think we are seeing copy-engine contention even here in groupby_max: lots of 8-byte copies to and from pageable host memory, sometimes with no kernel running! Hopefully the work in support of #15620 will also improve pipelining here.

(profile screenshot: copy-engine activity for the 4M-row batch case)

@GregoryKimball
Contributor

GregoryKimball commented Jul 2, 2024

Something about the multithreaded benchmark is giving lower throughput. At first I thought it was the I64 versus I32 type, but now I think it's a different root cause. Are these two commands running the same thing? Why is the timing so much longer in the multithreaded case?

./GROUPBY_NVBENCH -d 0 -b 2 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1 -a num_threads=1

## groupby_max_multithreaded

### [0] NVIDIA H100 80GB HBM3

|  T  | cardinality |    num_rows    | null_probability | num_threads | Samples | CPU Time | Noise  | GPU Time | Noise  | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|-------------|---------|----------|--------|----------|--------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |           1 |    304x | 1.759 ms | 31.12% | 1.746 ms | 31.03% |    2401 |        72.014 MiB |
./GROUPBY_NVBENCH -d 0 -b 0 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1

## groupby_max

### [0] NVIDIA H100 80GB HBM3

|  T  | cardinality |    num_rows    | null_probability | Samples |  CPU Time  | Noise  |  GPU Time  | Noise  | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|---------|------------|--------|------------|--------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |   3232x | 534.603 us | 17.77% | 527.480 us | 17.69% |    7951 |        72.014 MiB |

Looking for more information... the original groupby_max profile shows tight grouping of the aggregation kernels:
(profile screenshot: original groupby_max)

Whereas the multithreaded groupby_max shows a nanosleep gap between each aggregate invocation. Is this timing difference just due to thread launching and synchronization overhead?
(profile screenshot: multithreaded groupby_max with nanosleep gaps)

Perhaps a single-thread, multi-stream benchmark would show higher throughput!!

@srinivasyadav18
Contributor Author

@GregoryKimball Surprisingly, I see the same results for groupby_max_multithreaded with num_threads=1 and groupby_max on a T4. But yes, it might be very different on newer GPUs (H100, etc.).

./GROUPBY_NVBENCH -d 0 -b 0 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1

## groupby_max

### [0] Tesla T4

|  T  | cardinality |    num_rows    | null_probability | Samples | CPU Time | Noise | GPU Time | Noise | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|---------|----------|-------|----------|-------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |   3168x | 3.528 ms | 5.77% | 3.519 ms | 5.09% |    1191 |        72.014 MiB |

./GROUPBY_NVBENCH -d 0 -b 2 -a T=I32 -a num_rows[pow2]=[22] -a cardinality=1000000 -a null_probability=0.1 -a num_threads=1

## groupby_max_multithreaded

### [0] Tesla T4

|  T  | cardinality |    num_rows    | null_probability | num_threads | Samples | CPU Time | Noise  | GPU Time | Noise  | Mrows/s | peak_memory_usage |
|-----|-------------|----------------|------------------|-------------|---------|----------|--------|----------|--------|---------|-------------------|
| I32 |     1000000 | 2^22 = 4194304 |              0.1 |           1 |    944x | 4.079 ms | 11.47% | 4.072 ms | 11.45% |    1030 |        72.014 MiB |

@PointKernel
Member

> Why is the timing so much longer in the multithreaded case?

We explicitly include `threads.wait_for_tasks();` inside the timed region, and this can be expensive. For the Parquet case it doesn't matter much since the kernels are large, but with groupby running in the low milliseconds, the synchronization cost becomes non-negligible IMO. Out of curiosity, what is the CPU in this case?
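To illustrate, the timed region has roughly this shape. This is a fragment for illustration only — `threads`, `perform_agg`, and `num_threads` stand in for the surrounding benchmark code, and the nvbench timer-tag usage here is my assumption about the pattern, not a quote of the source:

```cpp
// Fragment (illustrative only): the thread-pool join sits inside the timed region,
// so per-sample synchronization overhead is measured along with the kernels.
state.exec(nvbench::exec_tag::sync | nvbench::exec_tag::timer,
           [&](nvbench::launch& launch, auto& timer) {
             timer.start();
             for (int64_t i = 0; i < num_threads; ++i) {
               threads.submit(perform_agg, i);  // one aggregation task per host thread
             }
             threads.wait_for_tasks();  // join cost is included in the measurement
             timer.stop();
           });
```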

@GregoryKimball
Copy link
Contributor

Thanks guys, I ran the profiles and benchmarks above on NVIDIA H100 80GB HBM3 + Intel(R) Xeon(R) Platinum 8480CL

Review threads (resolved): cpp/benchmarks/groupby/group_max_multistream.cpp, cpp/benchmarks/groupby/group_max_multithreaded.cpp
Member

@PointKernel left a comment


LGTM

@GregoryKimball
Contributor

GregoryKimball commented Jul 4, 2024

Thanks @srinivasyadav18 for building these new benchmarks!

  1. As for groupby_max_multistream, I think you and @PointKernel were right to doubt this idea. After testing and profiling, I think we can drop the single-thread multi-stream benchmark. Would you please remove groupby_max_multistream?
  2. On the other hand, I also experimented with multiple batches on the same thread. My goal was to increase the amount of work per thread and better amortize per-thread overhead. I used the following pattern to perform 50 aggregations, each with input size `num_rows`:
auto perform_agg = [&](int64_t index) {
  for (int64_t i = 0; i < 50; i++) {
    gb_obj.aggregate(requests[index], streams[index]);
  }
};

The results show higher throughput and the profiles show clearer pipelining behavior:
(profile screenshot: pipelined aggregations with multiple batches per thread)

Would you please consider adding an axis that lets us control "num_batches" (or something similar) to groupby_max_multithreaded? A rough sketch of what I have in mind follows.
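Something like this is the shape I'm imagining — a hypothetical sketch only; the axis name, values, and surrounding registration are placeholders rather than a final implementation:

```cpp
// Hypothetical sketch: drive the per-thread batch count from an nvbench axis.
// Axis name "num_batches" and its values are placeholders.
NVBENCH_BENCH_TYPES(bench_groupby_max_multithreaded, NVBENCH_TYPE_AXES(Types))
  .set_name("groupby_max_multithreaded")
  .add_int64_axis("num_threads", {1, 2, 4, 8})
  .add_int64_axis("num_batches", {1, 16, 64});

// Inside the benchmark body, replacing the hard-coded 50 above:
auto const num_batches = state.get_int64("num_batches");
auto perform_agg = [&](int64_t index) {
  for (int64_t i = 0; i < num_batches; ++i) {
    gb_obj.aggregate(requests[index], streams[index]);
  }
};
```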

@GregoryKimball
Contributor

@vuule would you please take a look? I would like to merge this new benchmark as soon as it is ready.

Contributor

@vuule left a comment


Looks great, just a few nitpicks and questions

Review threads (resolved): cpp/benchmarks/groupby/group_max_multithreaded.cpp
@srinivasyadav18
Contributor Author

/merge

@rapids-bot merged commit f592e9c into rapidsai:branch-24.08 Jul 10, 2024
79 checks passed
rapids-bot pushed a commit that referenced this pull request on Aug 23, 2024:
…#16630)

This PR fixes a minor bug where the `num_aggregations` axis was missed when working on #16154.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - David Wendt (https://github.com/davidwendt)

URL: #16630
Labels
CMake (CMake build issue), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Performance (Performance related issue)
Development

Successfully merging this pull request may close these issues.

[FEA] Create a multi-threaded nvbenchmark for groupby_max
5 participants