
[BUG] low cardinality joins on the GPU are really bad. #7529

Open
revans2 opened this issue Jan 18, 2023 · 11 comments
Labels: performance (A performance related task/issue)

Comments

revans2 (Collaborator) commented Jan 18, 2023

Describe the bug
I was recently testing a patch and ended up creating a join that ran absolutely horribly on the GPU, but not too badly on the CPU. This is not common, so I spent a little time debugging it and found that it really comes down to the cardinality of the build side (the RHS in a left outer join, or the smaller table in an inner join).

Steps/Code to reproduce bug

For inner join:

```scala
val lhs = spark.range(200000000L).selectExpr("CAST(id DIV 100 as STRING) as something")
val rhs = spark.range(20000000L).selectExpr("CAST(id % 100 as STRING) as more")
spark.time(lhs.join(rhs, lhs("something") === rhs("more")).count())
```
| Partitions | GPU Time (ms) | CPU Time (ms) | Approximate GPU join op time (ms) |
|---:|---:|---:|---:|
| 4 | 151,469 | 31,828 | 594,000 |
| 8 | 74,058 | 25,810 | 288,000 |
| 16 | 39,506 | 22,395 | 150,000 |
| 32 | 23,240 | 20,167 | 84,000 |
| 64 | 14,685 | 19,776 | 52,300 |
| 128 | 9,941 | 19,037 | 32,800 |
| 256 | 6,507 | 21,247 | 20,300 |
| 512 | 4,565 | 21,288 | 12,500 |
| 1,024 | 3,755 | 21,450 | 8,300 |
| 2,048 | 3,586 | 23,769 | 5,700 |
| 4,096 | 4,721 | 24,560 | 4,200 |
| 8,192 | 8,295 | 30,102 | 3,700 |

(chart: GPU vs CPU inner join scaling)
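For reference, here is a minimal sketch of how the partition sweep in the table above could be run, assuming the partition counts were controlled through `spark.sql.shuffle.partitions` (the issue does not say exactly how they were varied):

```scala
// Hypothetical sweep over shuffle partition counts for the inner join repro above.
// Assumes a spark-shell style SparkSession named `spark`, as in the other snippets here.
val lhs = spark.range(200000000L).selectExpr("CAST(id DIV 100 as STRING) as something")
val rhs = spark.range(20000000L).selectExpr("CAST(id % 100 as STRING) as more")
for (parts <- Seq(4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192)) {
  spark.conf.set("spark.sql.shuffle.partitions", parts.toString)
  spark.time(lhs.join(rhs, lhs("something") === rhs("more")).count())
}
```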

For left outer join (which is an order of magnitude slower than inner in this case, so I made the input sizes 1/10th those of the inner join):

```scala
val lhs = spark.range(20000000L).selectExpr("CAST(id DIV 100 as STRING) as something")
val rhs = spark.range(2000000L).selectExpr("CAST(id % 100 as STRING) as more")
spark.time(lhs.join(rhs, lhs("something") === rhs("more"), "leftouter").count())
```
| Partitions | GPU Time (ms) | CPU Time (ms) | Approximate GPU join op time (ms) |
|---:|---:|---:|---:|
| 4 | 51,179 | 4,295 | 138,000 |
| 8 | 29,732 | 2,706 | 78,000 |
| 16 | 20,851 | 2,305 | 57,000 |
| 32 | 13,894 | 2,235 | 41,900 |
| 64 | 8,978 | 2,229 | 30,000 |
| 128 | 6,153 | 2,151 | 20,500 |
| 256 | 4,461 | 3,011 | 14,400 |
| 512 | 3,563 | 3,077 | 10,500 |
| 1,024 | 3,489 | 3,948 | 8,700 |
| 2,048 | 4,214 | 5,323 | 8,500 |
| 4,096 | 5,885 | 8,618 | 9,400 |
| 8,192 | 9,688 | 15,527 | 11,800 |

(chart: GPU vs CPU left outer join scaling)

I also tested with longs instead of strings as the join keys; it made a very small difference, not enough to worry about.
I verified that none of the operators were spilling. In fact, the input data is small enough that I could have set the concurrent GPU tasks to 12 and gotten a bit more speedup out of the GPU.

It ended up being very directly related to the cardinality of the build side.

```scala
val lhs = spark.range(20000000L).selectExpr("CAST(id DIV 200 as STRING) as something")
val rhs = spark.range(2000000L).selectExpr("CAST(id % X as STRING) as more")
spark.time(lhs.join(rhs, lhs("something") === rhs("more"), "leftouter").count())
```

where X is the cardinality. I set the concurrency level to 4; there are 12 cores and 12 threads. This was also on an A6000 GPU.

| Cardinality | GPU Time (ms) |
|---:|---:|
| 50 | 82,272 |
| 100 | 23,873 |
| 200 | 7,216 |
| 400 | 2,257 |
| 800 | 831 |
| 1,600 | 396 |
| 3,200 | 271 |

(chart: GPU left outer join scaling)
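A minimal sketch of how the cardinality sweep could be scripted, assuming X is simply substituted into the modulo expression for each run (concurrency and hardware as described above):

```scala
// Hypothetical sweep over the build-side cardinality X for the left outer join repro above.
// Assumes a SparkSession named `spark`, as in the other snippets here.
val lhs = spark.range(20000000L).selectExpr("CAST(id DIV 200 as STRING) as something")
for (x <- Seq(50, 100, 200, 400, 800, 1600, 3200)) {
  val rhs = spark.range(2000000L).selectExpr(s"CAST(id % $x as STRING) as more")
  spark.time(lhs.join(rhs, lhs("something") === rhs("more"), "leftouter").count())
}
```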

Increasing the cardinality gives a 300x speedup. The other interesting part is that we can also get a 14x speedup at a cardinality of 100 if we subdivide the input data into even smaller pieces. This is something we were looking into to deal with out-of-memory issues, but here it shows some big performance improvements. I am not sure exactly what is happening or whether this would impact a real-world use case, but I thought I should document it at the least. I think it is related to building the hash table and collisions within it, specifically the need to walk more of the chain on a collision, versus a hash aggregation that can use atomics to do the operations.

We are already doing an estimate of the output size with a group-by on the build side table, so we might be able to use that to figure out if we are in a bad case like this and possibly decide whether there is an alternative join path we could take.
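To make that concrete, here is an illustrative sketch (not the plugin's actual code) of what a build-side-only estimate looks like, using the `lhs`/`rhs` tables from the repro above: the average number of build rows per key is treated as a row-multiplication factor, which is why it lands near 20,000 here even though most stream-side keys never match anything.

```scala
import org.apache.spark.sql.functions.avg

// Illustration only: estimate the join output size from the build side alone.
// rhs has 2,000,000 rows and only ~100 distinct keys, so the factor comes out around 20,000.
val buildKeyCounts = rhs.groupBy("more").count()                  // rows per build-side key
val avgRowsPerKey = buildKeyCounts.agg(avg("count")).first().getDouble(0)
val estimatedOutputRows = (lhs.count() * avgRowsPerKey).toLong    // badly overestimates here
```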

revans2 added the "? - Needs Triage" and "performance" labels on Jan 18, 2023
jlowe (Member) commented Jan 18, 2023

@revans2 have you taken an Nsight Systems trace of this case? Curious if it's primarily a problem with building the hash table vs. probing it via the stream table (or both). My guess is we're getting killed by the thread collisions on the same key when trying to build the hash table, and it's not so much an issue when we probe it later.

revans2 (Collaborator, Author) commented Jan 18, 2023

I stopped once I had found the issue. I had already spent enough time on this that I wanted to pause and check on the priority before taking next steps.

revans2 (Collaborator, Author) commented Jan 19, 2023

Okay, I think this is really just us doing a very bad job at join output estimation. The join output estimation code looks at the build side of the join and guesses a row-multiplication factor based on the average count per key. In this case we have a build side with relatively low cardinality and lots of duplication. The estimate ends up being about 20,000 for anything that has a build table. But the stream side has a lot of values that are not in the build table, so the average size increase is actually 210, not 20,000.

So the more we partition the input, the more likely it is that we end up with an accurate estimate. In many cases the estimate is 0 because there is no build table. In the other cases there are fewer keys on the stream side that don't match anything.

I don't think this is likely to happen in reality, but it could. The only real way to fix it is better join output estimation that takes the stream table into account too. We could probably do that with some kind of bloom filter; that would be really nice to look into, and hopefully not too difficult to implement.
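A rough sketch of what the bloom-filter idea could look like, using Spark's built-in `DataFrame.stat.bloomFilter` purely as an illustration of a stream-aware estimate (column names and tables follow the repro above; this is not a proposed implementation):

```scala
import org.apache.spark.sql.functions.avg

// Illustration only: gate the build-side multiplication factor by how many stream rows
// can actually match, using a bloom filter built over the build-side keys.
val buildFilter = rhs.stat.bloomFilter("more", 1000000L, 0.01)
val bf = spark.sparkContext.broadcast(buildFilter)
val matchingStreamRows = lhs.filter(r => bf.value.mightContainString(r.getString(0))).count()

val avgRowsPerKey = rhs.groupBy("more").count().agg(avg("count")).first().getDouble(0)
val totalStreamRows = lhs.count()
// Matched stream rows fan out by the build-side duplication; for a left outer join the
// unmatched rows still pass through once.
val estimatedOutputRows =
  (matchingStreamRows * avgRowsPerKey).toLong + (totalStreamRows - matchingStreamRows)
```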

jlowe (Member) commented Jan 19, 2023

I assume another alternative is to finally solve #2440, where we can get the join output size "for free" in most cases, but find a way to mitigate the performance pitfalls we ran into there. IIRC the biggest issue with the last attempt was needing to throw away the built hash table when we decided to split the table we used to build it (e.g., an inner join where we can freely pick the build side). The join code doesn't currently support splitting what was declared the build side, but the libcudf join algorithm is slower if we don't always pick the smaller table as the build table. Thus, if the stream batch is smaller than the build batch, we should split the build batch, but currently the code isn't prepped to handle that.

revans2 (Collaborator, Author) commented Jan 19, 2023

Yes, but how much slower is it to always have the declared build side be the build side in cuDF? And can we use AQE or something to dynamically switch the build side so we know which one is truly smaller? I think there are ways to fix those issues too. We might want a JOIN optimization epic of some sort where we come up with a plan for how to avoid some of these pitfalls.

jlowe (Member) commented Jan 19, 2023

> Yes, but how much slower is it to always have the declared build side be the build side in cuDF?

Slow enough that we didn't check in the first PR attempt, because benchmarks showed the guesstimate approach was faster than the "for free" approach. The biggest issue there is that we don't have a way to make the built hash table spillable, so we can't save it across batches. If we could avoid rebuilding the hash table for every batch, that could be a big win.

revans2 (Collaborator, Author) commented Jan 19, 2023

> If we could avoid rebuilding the hash table for every batch, that could be a big win.

Exactly, that is what I thought the slowness was: picking the wrong side was slightly slower in most cases, but in a few it was rather bad, and we were not able to offset that slowness because we had to rebuild the hash table each time. I would love to see us resurrect that code and see what happens if we just keep the hash table around, simply as an experiment to see whether it is worth spending the time to make the hash table spillable. If it is a huge win, then it might be worth it.

PointKernel (Member) commented

We are about to close the related cudf issue (rapidsai/cudf#14948) and want to ensure we understand the performance regression described here properly.

Going through the discussion above, it seems that #2440 can help solve the problem and no action is needed on the libcudf side. Did I miss something?

jlowe (Member) commented May 29, 2024

@PointKernel rapidsai/cudf#14948 doesn't seem to be related? I'm assuming this is in reference to rapidsai/cudf#15262. If so, curious why the latter is going to be closed? It's not clear to me how optimizations around distinct joins apply to this problem where the issue is about low cardinality, highly duplicated build-side keys that have many collisions and potentially long chain walking when building the hash table.

PointKernel (Member) commented

I see, rapidsai/cudf#14948 is indeed unrelated then. rapidsai/cudf#15262 probably describes another problem as well since it's for low cardinality groupby IIUC.

> the issue is about low cardinality, highly duplicated build-side keys that have many collisions and potentially long chain walking when building the hash table

Tuning the cooperative group (CG) size can help in this case: https://github.com/rapidsai/cudf/blob/9192d259633c382c6f98f956dc7f43d754ebbf44/cpp/include/cudf/detail/join.hpp#L46. Using a larger CG size like 4, 8, or 16 instead of 2 can help with high-multiplicity cases. Is it easy for you to test the performance impact of larger CG sizes? If it proves effective, libcudf can expose the CG size in public APIs to make your tuning work easier.

jlowe (Member) commented May 29, 2024

@PointKernel thanks for the pointers. Can you comment on rapidsai/cudf#15262? That's the issue tracking the poor performance of highly duplicated build-side keys when building hash tables. I know it's talking about aggregations, but I believe the problem is common between that and joins -- both start with building a hash table, and that build is particularly slow when there are many key collisions and long chaining.
