
[BUG] low cardinality joins on the GPU are really bad. #7529

Open
revans2 opened this issue Jan 18, 2023 · 11 comments
Labels: performance (A performance related task/issue)

Comments

revans2 (Collaborator) commented Jan 18, 2023

Describe the bug
I was recently testing a patch and ended up creating a join that ran absolutely horribly on the GPU, but not too badly on the CPU. This is not common, so I spent a little time debugging it and found that it really comes down to the cardinality of the build side (the RHS in a left outer join, or the smaller table in an inner join).

Steps/Code to reproduce bug

For inner join:

```scala
val lhs = spark.range(200000000L).selectExpr("CAST(id DIV 100 as STRING) as something")
val rhs = spark.range(20000000L).selectExpr("CAST(id % 100 as STRING) as more")
spark.time(lhs.join(rhs, lhs("something") === rhs("more")).count())
```
| Partitions | GPU Time (ms) | CPU Time (ms) | Approximate GPU join op time (ms) |
|---:|---:|---:|---:|
| 4 | 151,469 | 31,828 | 594,000 |
| 8 | 74,058 | 25,810 | 288,000 |
| 16 | 39,506 | 22,395 | 150,000 |
| 32 | 23,240 | 20,167 | 84,000 |
| 64 | 14,685 | 19,776 | 52,300 |
| 128 | 9,941 | 19,037 | 32,800 |
| 256 | 6,507 | 21,247 | 20,300 |
| 512 | 4,565 | 21,288 | 12,500 |
| 1,024 | 3,755 | 21,450 | 8,300 |
| 2,048 | 3,586 | 23,769 | 5,700 |
| 4,096 | 4,721 | 24,560 | 4,200 |
| 8,192 | 8,295 | 30,102 | 3,700 |

(chart: GPU vs CPU inner join scaling)
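For reference, here is a minimal sketch of how the partition sweep in the table above could be run, assuming the partition counts were controlled through `spark.sql.shuffle.partitions` (the issue does not say exactly how they were varied):

```scala
// Hypothetical sweep over shuffle partition counts for the inner join repro above.
// Assumes a spark-shell style SparkSession named `spark`, as in the other snippets here.
val lhs = spark.range(200000000L).selectExpr("CAST(id DIV 100 as STRING) as something")
val rhs = spark.range(20000000L).selectExpr("CAST(id % 100 as STRING) as more")
for (parts <- Seq(4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192)) {
  spark.conf.set("spark.sql.shuffle.partitions", parts.toString)
  spark.time(lhs.join(rhs, lhs("something") === rhs("more")).count())
}
```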

For left outer join (which is an order of magnitude slower than inner in this case, so I made the input sizes 1/10th those of the inner join):

```scala
val lhs = spark.range(20000000L).selectExpr("CAST(id DIV 100 as STRING) as something")
val rhs = spark.range(2000000L).selectExpr("CAST(id % 100 as STRING) as more")
spark.time(lhs.join(rhs, lhs("something") === rhs("more"), "leftouter").count())
```
| Partitions | GPU Time (ms) | CPU Time (ms) | Approximate GPU join op time (ms) |
|---:|---:|---:|---:|
| 4 | 51,179 | 4,295 | 138,000 |
| 8 | 29,732 | 2,706 | 78,000 |
| 16 | 20,851 | 2,305 | 57,000 |
| 32 | 13,894 | 2,235 | 41,900 |
| 64 | 8,978 | 2,229 | 30,000 |
| 128 | 6,153 | 2,151 | 20,500 |
| 256 | 4,461 | 3,011 | 14,400 |
| 512 | 3,563 | 3,077 | 10,500 |
| 1,024 | 3,489 | 3,948 | 8,700 |
| 2,048 | 4,214 | 5,323 | 8,500 |
| 4,096 | 5,885 | 8,618 | 9,400 |
| 8,192 | 9,688 | 15,527 | 11,800 |

(chart: GPU vs CPU left outer join scaling)

I also tested with longs instead of strings as the join keys; it made a very small difference, not enough to worry about.
I verified that none of the operators were spilling. In fact, the input data is small enough that I could have set the concurrent GPU tasks to 12 and gotten a bit more speedup out of the GPU.

It ended up being very directly related to the cardinality of the build side.

```scala
val lhs = spark.range(20000000L).selectExpr("CAST(id DIV 200 as STRING) as something")
val rhs = spark.range(2000000L).selectExpr("CAST(id % X as STRING) as more")
spark.time(lhs.join(rhs, lhs("something") === rhs("more"), "leftouter").count())
```

where X is the cardinality. I set the concurrency level to 4; there are 12 cores and 12 threads. This was also on an A6000 GPU.

| Cardinality | GPU Time (ms) |
|---:|---:|
| 50 | 82,272 |
| 100 | 23,873 |
| 200 | 7,216 |
| 400 | 2,257 |
| 800 | 831 |
| 1,600 | 396 |
| 3,200 | 271 |

(chart: GPU left outer join scaling)
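A minimal sketch of how the cardinality sweep could be scripted, assuming X is simply substituted into the modulo expression for each run (concurrency and hardware as described above):

```scala
// Hypothetical sweep over the build-side cardinality X for the left outer join repro above.
// Assumes a SparkSession named `spark`, as in the other snippets here.
val lhs = spark.range(20000000L).selectExpr("CAST(id DIV 200 as STRING) as something")
for (x <- Seq(50, 100, 200, 400, 800, 1600, 3200)) {
  val rhs = spark.range(2000000L).selectExpr(s"CAST(id % $x as STRING) as more")
  spark.time(lhs.join(rhs, lhs("something") === rhs("more"), "leftouter").count())
}
```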

Increasing the cardinality gives a 300x speedup. The other interesting part is that we can also get a 14x speedup at a cardinality of 100 if we subdivide the input data into even smaller pieces. This is something we were looking into to deal with out-of-memory issues, but here it shows some big performance improvements. I am not sure exactly what is happening or whether this would impact a real-world use case, but I thought I should document it at the least. I think it is related to building the hash table and collisions within it, specifically the need to walk more of the chain on a collision, versus a hash aggregation that can use atomics to do the operations.

We are already doing an estimate of the output size with a group-by on the build side table, so we might be able to use that to figure out if we are in a bad case like this and possibly decide whether there is an alternative join path we could take.
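To make that concrete, here is an illustrative sketch (not the plugin's actual code) of what a build-side-only estimate looks like, using the `lhs`/`rhs` tables from the repro above: the average number of build rows per key is treated as a row-multiplication factor, which is why it lands near 20,000 here even though most stream-side keys never match anything.

```scala
import org.apache.spark.sql.functions.avg

// Illustration only: estimate the join output size from the build side alone.
// rhs has 2,000,000 rows and only ~100 distinct keys, so the factor comes out around 20,000.
val buildKeyCounts = rhs.groupBy("more").count()                  // rows per build-side key
val avgRowsPerKey = buildKeyCounts.agg(avg("count")).first().getDouble(0)
val estimatedOutputRows = (lhs.count() * avgRowsPerKey).toLong    // badly overestimates here
```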

revans2 added the "? - Needs Triage" and "performance" labels on Jan 18, 2023
jlowe (Member) commented Jan 18, 2023

@revans2 have you taken an Nsight Systems trace of this case? Curious if it's primarily a problem with building the hash table vs. probing it via the stream table (or both). My guess is we're getting killed by the thread collisions on the same key when trying to build the hash table, and it's not so much an issue when we probe it later.

revans2 (Collaborator, Author) commented Jan 18, 2023

I stopped once I had found the issue. I had already spent enough time on this that I wanted to pause and check on the priority before taking next steps.

revans2 (Collaborator, Author) commented Jan 19, 2023

Okay, I think this is really just us doing a very bad job at join output estimation. The join output estimation code looks at the build side of the join and guesses a row-multiplication factor based on the average count per key. In this case we have a build side with relatively low cardinality and lots of duplication. The estimate ends up being about 20,000 for anything that has a build table. But the stream side has a lot of values that are not in the build table, so the average size increase is actually 210, not 20,000.

So the more we partition the input, the more likely it is that we end up with an accurate estimate. In many cases the estimate is 0 because there is no build table. In the other cases there are fewer keys on the stream side that don't match anything.

I don't think this is likely to happen in reality, but it could. The only real way to fix it is better join output estimation that takes the stream table into account too. We could probably do that with some kind of bloom filter; that would be really nice to look into, and hopefully not too difficult to implement.
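A rough sketch of what the bloom-filter idea could look like, using Spark's built-in `DataFrame.stat.bloomFilter` purely as an illustration of a stream-aware estimate (column names and tables follow the repro above; this is not a proposed implementation):

```scala
import org.apache.spark.sql.functions.avg

// Illustration only: gate the build-side multiplication factor by how many stream rows
// can actually match, using a bloom filter built over the build-side keys.
val buildFilter = rhs.stat.bloomFilter("more", 1000000L, 0.01)
val bf = spark.sparkContext.broadcast(buildFilter)
val matchingStreamRows = lhs.filter(r => bf.value.mightContainString(r.getString(0))).count()

val avgRowsPerKey = rhs.groupBy("more").count().agg(avg("count")).first().getDouble(0)
val totalStreamRows = lhs.count()
// Matched stream rows fan out by the build-side duplication; for a left outer join the
// unmatched rows still pass through once.
val estimatedOutputRows =
  (matchingStreamRows * avgRowsPerKey).toLong + (totalStreamRows - matchingStreamRows)
```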

jlowe (Member) commented Jan 19, 2023

I assume another alternative is to finally solve #2440, where we can get the join output size "for free" in most cases, but find a way to mitigate the performance pitfalls we ran into there. IIRC the biggest issue with the last attempt was needing to throw away the built hash table when we decided to split the table we used to build it (e.g., an inner join where we can freely pick the build side). The join code doesn't currently support splitting what was declared the build side, but the libcudf join algorithm is slower if we don't always pick the smaller table as the build table. Thus, if the stream batch is smaller than the build batch, we should split the build batch, but currently the code isn't prepped to handle that.

revans2 (Collaborator, Author) commented Jan 19, 2023

Yes, but how much slower is it to always have the declared build side be the build side in cuDF? And can we use AQE or something to dynamically switch the build side so we know which one is truly smaller? I think there are ways to fix those issues too. We might want a JOIN optimization epic of some sort where we come up with a plan for how to avoid some of these pitfalls.

jlowe (Member) commented Jan 19, 2023

> Yes, but how much slower is it to always have the declared build side be the build side in cuDF?

Slow enough that we didn't check in the first PR attempt, because benchmarks showed the guesstimate approach was faster than the "for free" approach. The biggest issue there is that we don't have a way to make the built hash table spillable, so we can't save it across batches. If we could avoid rebuilding the hash table for every batch, that could be a big win.

revans2 (Collaborator, Author) commented Jan 19, 2023

> If we could avoid rebuilding the hash table for every batch, that could be a big win.

Exactly, that is what I thought the slowness was: picking the wrong side was slightly slower in most cases, but in a few it was rather bad, and we were not able to offset that slowness because we had to rebuild the hash table each time. I would love to see us resurrect that code and see what happens if we just keep the hash table around, simply as an experiment to see whether it is worth spending the time to make the hash table spillable. If it is a huge win, then it might be worth it.

PointKernel (Member) commented

We are about to close the related cudf issue (rapidsai/cudf#14948) and want to ensure we understand the performance regression described here properly.

Going through the discussion above, it seems that #2440 can help solve the problem and no action is needed on the libcudf side. Did I miss something?

jlowe (Member) commented May 29, 2024

@PointKernel rapidsai/cudf#14948 doesn't seem to be related? I'm assuming this is in reference to rapidsai/cudf#15262. If so, curious why the latter is going to be closed? It's not clear to me how optimizations around distinct joins apply to this problem where the issue is about low cardinality, highly duplicated build-side keys that have many collisions and potentially long chain walking when building the hash table.

PointKernel (Member) commented

I see, rapidsai/cudf#14948 is indeed unrelated then. rapidsai/cudf#15262 probably describes another problem as well since it's for low cardinality groupby IIUC.

> the issue is about low cardinality, highly duplicated build-side keys that have many collisions and potentially long chain walking when building the hash table

Tuning the cooperative group (CG) size can help in this case: https://github.com/rapidsai/cudf/blob/9192d259633c382c6f98f956dc7f43d754ebbf44/cpp/include/cudf/detail/join.hpp#L46. Using a larger CG size like 4, 8, or 16 instead of 2 can help with high-multiplicity cases. Is it easy for you to test the performance impact of larger CG sizes? If it proves effective, libcudf can expose the CG size in public APIs to make your tuning work easier.

jlowe (Member) commented May 29, 2024

@PointKernel thanks for the pointers. Can you comment on rapidsai/cudf#15262? That's the issue tracking the poor performance of highly duplicated build-side keys when building hash tables. I know it's talking about aggregations, but I believe the problem is common between that and joins -- both start with building a hash table, and that build is particularly slow when there are many key collisions and long chaining.
