[BUG] low cardinality joins on the GPU are really bad. #7529
Comments
@revans2 have you taken an Nsight Systems trace of this case? Curious if it's primarily a problem with building the hash table vs. probing it via the stream table (or both). My guess is we're getting killed by the thread collisions on the same key when trying to build the hash table, and it's not so much an issue when we probe it later.
I stopped at finding the issue. I had already spent enough time on this that I wanted to stop and check in on the priority for next steps.
Okay, I think this is really just us doing a very bad job at join output estimation. The join output estimation code looks at the build side of the table and guesses a row multiplication factor based on the average key count. In this case we have a build side with relatively low cardinality and lots of duplication. The estimate ends up being about 20,000 for anything that has a build table. The stream side has a lot of values that are not in the build table, so the average size increase is actually 210, not 20,000. The more we end up partitioning the input, the more likely it is that we end up with an accurate estimate. In many cases the estimation is 0 because there is no build table. In the other cases there are fewer keys on the stream side that don't match anything. I don't think that this is likely to happen in reality, but it could. The only real way to fix this is to have better join estimation that takes the stream table into account too. We could probably do that if we had some kind of bloom filter in place; that would be really nice to look into, and hopefully not too difficult to implement.
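A minimal, self-contained sketch of why a build-side-only multiplier overshoots when most stream keys miss the build table. This is purely illustrative, not the actual spark-rapids estimator; all names and numbers here are made up.

```scala
object JoinSizeEstimateSketch {
  // Build side: key -> number of duplicate rows for that key
  def estimateFromBuildOnly(buildKeyCounts: Map[Int, Long], streamRows: Long): Long = {
    if (buildKeyCounts.isEmpty) return 0L
    val avgDupsPerKey = buildKeyCounts.values.sum.toDouble / buildKeyCounts.size
    // Assumes every stream row matches an "average" build key, which overshoots badly
    // when most stream keys are absent from the build table.
    (streamRows * avgDupsPerKey).toLong
  }

  // Stream-aware estimate: only stream rows whose key exists in the build table
  // contribute a multiplication factor (inner-join semantics).
  def estimateWithStream(buildKeyCounts: Map[Int, Long], streamKeys: Seq[Int]): Long =
    streamKeys.iterator.map(k => buildKeyCounts.getOrElse(k, 0L)).sum

  def main(args: Array[String]): Unit = {
    // Low-cardinality build side: 100 keys, 20,000 duplicate rows each
    val build = (0 until 100).map(k => k -> 20000L).toMap
    // Stream side: 1M rows spread over 95,000 distinct keys, most not in the build table
    val stream = (0 until 1000000).map(_ % 95000)
    println(s"build-only estimate  : ${estimateFromBuildOnly(build, stream.size.toLong)}")
    println(s"stream-aware estimate: ${estimateWithStream(build, stream)}")
  }
}
```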
I assume another alternative is to finally solve #2440, where we can get the join output size "for free" in most cases, but find a way to mitigate the performance pitfalls we ran into there. IIRC the biggest issue with the last attempt was needing to throw away the built hash table when we decided to split the table we used to build that hash table (e.g.: an inner join where we can freely pick). The join code doesn't currently support splitting what was declared the build side, but the libcudf join algorithm is slower if we don't always pick the smaller table as the build table. Thus, if the stream batch is smaller than the build batch, we should split the build batch, but currently the code isn't prepped to handle that.
Yes, but how much slower is it to always use the declared build side as the build side in cuDF? And can we use AQE or something to dynamically switch the build side so we know which one is truly smaller? I think there are ways to fix those issues too. We might want to have a JOIN optimization epic of some sort where we come up with a plan on how to avoid some of these pitfalls.
Slow enough that we didn't check in the first PR attempt, because benchmarks showed the guesstimate was faster than the "for free" approach. The biggest issue there is that we don't have a way to make the built hash table spillable, so we can't save it across batches. If we could avoid rebuilding the hash table for every batch, that could be a big win.
Exactly, that is what I thought the slowness was: picking the wrong side was slightly slower in most cases, but in a few it was rather bad. We were not able to offset the slowness because we had to rebuild the hash table each time. I would love to see us resurrect that code and see what happens if we just keep the hash table around, just as an experiment to see if it is worth spending the time to make the hash table spillable. If it is a huge win, then it might be worth it.
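To make the experiment concrete, here is a hedged sketch of what reusing a built hash table across stream batches could look like. The types are entirely hypothetical and do not reflect real spark-rapids or libcudf APIs.

```scala
// Purely hypothetical sketch of the "keep the hash table around across stream batches"
// experiment described above. None of these types exist in spark-rapids or libcudf.
trait GpuHashTable extends AutoCloseable {
  def probe(streamBatchRows: Long): Long // stand-in for the real probe/gather step
}

class ReusableBuildTable(buildFn: () => GpuHashTable) extends AutoCloseable {
  private var table: Option[GpuHashTable] = None

  // Build lazily on first use; later stream batches reuse the same table instead of
  // rebuilding it per batch, which is the cost the comment above wants to measure.
  def withTable[T](f: GpuHashTable => T): T = {
    if (table.isEmpty) table = Some(buildFn())
    f(table.get)
  }

  // Drop the table (e.g. under memory pressure); the next batch rebuilds it. Making
  // this state spillable instead of simply dropped is the open question.
  def release(): Unit = { table.foreach(_.close()); table = None }

  override def close(): Unit = release()
}
```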
We are about to close the related cudf issue (rapidsai/cudf#14948) and want to ensure we understand the performance regression described here properly. Going through the above discussion, it seems that #2440 can help solve the problem and no action is needed on the libcudf side. Did I miss something?
@PointKernel rapidsai/cudf#14948 doesn't seem to be related? I'm assuming this is in reference to rapidsai/cudf#15262. If so, curious why the latter is going to be closed? It's not clear to me how optimizations around distinct joins apply to this problem, where the issue is about low cardinality, highly duplicated build-side keys that have many collisions and potentially long chain walking when building the hash table.
I see, rapidsai/cudf#14948 is indeed unrelated then. rapidsai/cudf#15262 probably describes another problem as well, since it's for low cardinality groupby IIUC.
Tuning the CG size can help in this case: https://github.com/rapidsai/cudf/blob/9192d259633c382c6f98f956dc7f43d754ebbf44/cpp/include/cudf/detail/join.hpp#L46. Using a larger CG size like 4, 8, or 16 instead of 2 can help with high-multiplicity cases. Is it easy for you to test the performance impact of larger CG sizes? If it proves to be effective, libcudf can expose the CG size in public APIs to make your tuning work easier.
@PointKernel thanks for the pointers. Can you comment on rapidsai/cudf#15262? That's the issue tracking the poor performance of highly duplicated build-side keys when building hash tables. I know it's talking about aggregations, but I believe the problem is common between that and joins -- both start with building a hash table, and that build is particularly slow when there are many key collisions and long chaining.
Describe the bug
I recently was playing around with testing a patch and ended up creating a join that ran absolutely horribly on the GPU, but not too badly on the CPU. This is not common, so I spent a little bit of time trying to debug it and found that it really comes down to the cardinality of the build side (RHS in a left outer join, or the smaller table in an inner join).
Steps/Code to reproduce bug
For Inner join
For left outer join (which is an order of magnitude slower than the inner join in this case, so I had to make the input sizes 1/10th that of the inner join)
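The original repro snippets were not captured above; the following is only a hedged sketch of the shape of the workload. Column names, row counts, and the cardinality X are illustrative, not the exact values from the report.

```scala
// Hypothetical spark-shell sketch, not the original repro from this issue.
// Build side (rhs) has only X distinct keys, each duplicated many times; the stream
// side (lhs) has far more distinct keys, most of which never match the build table.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()

val X = 100L // build-side key cardinality; raising this makes the GPU join much faster

val lhs = spark.range(0, 100000000L)
  .select((col("id") % 950000L).cast("string").as("key"))
val rhs = spark.range(0, 2000000L)
  .select((col("id") % X).cast("string").as("key"))

// Inner join; for the left outer case use "left_outer" and shrink the inputs ~10x
val joined = lhs.join(rhs, Seq("key"), "inner")
println(joined.count())
```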
I tested with longs instead of strings as the join keys and it made a very small difference, but not enough to worry about.
I verified that none of the operators were spilling. In fact the input data is small enough I could have set the concurrent GPU tasks to 12 and gotten a bit more speedup out of the GPU.
It ended up being very directly related to the cardinality of the build side.
Where X is the cardinality. I set the concurrency level to 4. There are 12 cores and 12 threads. This was also on an A6000 GPU. Increasing the cardinality results in a 300x speedup. The other interesting part is that we can also get a 14x speedup at a cardinality of 100 if we subdivide the input data into even smaller pieces. This is something we were looking into to deal with out of memory issues, but here it shows some big performance improvements. I am not sure exactly what is happening or if this is something that would impact a real world use case, but I thought I should document it at the least. I think it is related to building the hash table and collisions in that table, specifically the need to walk a longer chain on a collision, vs. a hash aggregation that can use atomics to do the operations.
We are already doing an estimate of the output size with a groupby on the build side table, so we might be able to use that to figure out if we are in a bad case like this, and possibly decide if there is an alternative join path we could take.
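As a sketch of how the existing build-side stats might feed such a decision (the threshold and names below are hypothetical, not real spark-rapids code or configuration):

```scala
// Illustrative heuristic only: use build-side key multiplicity (the kind of number the
// existing groupby-based estimate already computes) to flag the pathological case.
final case class BuildKeyStats(distinctKeys: Long, totalRows: Long) {
  def avgDuplicatesPerKey: Double =
    if (distinctKeys == 0L) 0.0 else totalRows.toDouble / distinctKeys
}

object JoinPathChooser {
  // If each build key is duplicated far more than some threshold, consider an
  // alternative path, e.g. sub-partitioning the input into smaller pieces as above.
  def shouldSubPartition(stats: BuildKeyStats, maxAvgDuplicates: Double = 1000.0): Boolean =
    stats.avgDuplicatesPerKey > maxAvgDuplicates
}
```

For the case in this report (cardinality around 100 with millions of build-side rows), a check like this would fire and could trigger the finer sub-partitioning that showed the 14x improvement.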