[FEA] Accelerate Bloom filtered joins #7803
Comments
Supporting GPU
I've filed a cudf issue asking for
Wouldn't this be a spark-rapids-jni kernel? Or is the thinking that this applies to a non-Spark use case as well?
Yes in the meantime I think
For reference, here are pointers to Spark code related to Bloom filter assisted join processing:
Just for your information, @sleeepyjack is working on a GPU-based bloom filter,
Can confirm. I was going to mention that in the XXHash issue. Yes, the Bloom filter overhaul (which depends on solving NVIDIA/cuCollections#290 and NVIDIA/cuCollections#284 first) will be my next work item, and I'm prioritizing it. I'm OOO this week, so please expect more updates by next week when I'm back at the office.
Could you elaborate on the Spark (CPU) Bloom filter implementation? Would it be viable to emulate the GPU filter design/layout on the CPU instead? Maybe related: Why exactly do you need a 64-bit hash function for this?
@sleeepyjack and @PointKernel I am not sure that @jlowe explained all of the requirements that we have perfectly. We really would like to match the Spark CPU implementation bit for bit if possible. This is because the data is produced and/or interpreted in multiple different operators. If we change the implementation, even in a very subtle way, we would then have to be 100% sure that we replaced all CPU operators with GPU versions or a compatible CPU version. This gets really complicated to do, especially when it is not an intermediate result. In theory a user could create a bloom filter, then write out the result to Parquet or ORC for later use. We would have no way to guarantee that every Spark job that reads in that column has our plugin and could replace how the serialized bloom filter is used. I know that this is a rather far-fetched example, but I would rather err on the safe side if possible.
See the bloom filter code link above, which points to the code the CPU uses, especially
Spark uses xxhash64 to hash arbitrary objects into a 64-bit long which is in turn fed to the bloom filter (either for creation or probing). I'm not exactly sure why they decided to hash-the-hash instead of hashing the object directly in the bloom filter. It might be related to the cost of hashing certain objects, and hashing-the-hash would be cheaper in that case. I agree with @revans2 that we really would like a Spark-compatible implementation of this bloom filter. The first reason is that it's far simpler for us to plan properly, as discussed above. The other reason to keep them in sync is that we get the exact same behavior as the CPU-only plan in terms of how large the filter is and the amount of data passing through the filter. It is possible to deal with a custom bloom filter solution, but it makes the planning side far trickier to get a proper plan that won't corrupt data. Note that any solution we use requires the bloom filter to be serializable, as we need to ship it across nodes in the cluster.
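To make the hash-the-hash flow concrete, here is a minimal, self-contained sketch of a Guava-style double-hashing Bloom filter of the kind described above: the 64-bit xxhash64 value of the join key is the item, which is re-hashed into two 32-bit values that are combined to pick bit positions. This is a sketch, not Spark's implementation: the `rehash32` mixer and the index derivation below are placeholders of my own, and a bit-for-bit compatible implementation has to reproduce the exact scheme from the linked BloomFilterImpl source.

```java
import java.util.BitSet;

public class BloomFilterSketch {
  private final BitSet bits;
  private final long bitSize;
  private final int numHashFunctions;

  public BloomFilterSketch(int numBits, int numHashFunctions) {
    this.bits = new BitSet(numBits);
    this.bitSize = numBits;
    this.numHashFunctions = numHashFunctions;
  }

  // Placeholder 64 -> 32 bit mixer; stands in for the Murmur-style re-hash step.
  private static int rehash32(long item, int seed) {
    long h = item ^ (seed * 0x9E3779B97F4A7C15L);
    h ^= h >>> 33;
    h *= 0xFF51AFD7ED558CCDL;
    h ^= h >>> 33;
    return (int) h;
  }

  /** Insert a pre-hashed 64-bit value, e.g. xxhash64(joinKey). */
  public void putLong(long item) {
    int h1 = rehash32(item, 0);
    int h2 = rehash32(item, h1);
    for (int i = 1; i <= numHashFunctions; i++) {
      int combined = h1 + i * h2;
      if (combined < 0) {
        combined = ~combined;          // keep the derived index non-negative
      }
      bits.set((int) (combined % bitSize));
    }
  }

  /** Probe side: false means the value was definitely never inserted. */
  public boolean mightContainLong(long item) {
    int h1 = rehash32(item, 0);
    int h2 = rehash32(item, h1);
    for (int i = 1; i <= numHashFunctions; i++) {
      int combined = h1 + i * h2;
      if (combined < 0) {
        combined = ~combined;
      }
      if (!bits.get((int) (combined % bitSize))) {
        return false;
      }
    }
    return true;
  }
}
```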
Yes, I made the mistake of thinking that the bloom filter operations were actually exposed publicly in Spark, but they are not. So we can make the assumption that if someone goes through the front door then we will not run into odd situations.
Spark supports an optimization that leverages a Bloom filter to try to reduce the amount of shuffled data before a join. See SPARK-32268 for details. Currently the RAPIDS Accelerator falls back to the CPU for the ObjectHashAggregates that are introduced to build the Bloom filter and then falls back to the CPU on the subsequent Filter expression that uses the generated Bloom filter.
The Bloom filter construction is technically an aggregation expression, but it appears to always be used in a reduction context. Therefore we should be able to write Bloom filter construction and merge kernels without needing Bloom filters to be supported by libcudf groupby aggregations. In addition to the Bloom filter kernels, we would also need to support the xxhash64 function, which is used to hash the input data into a 64-bit long that is then fed into the Bloom filter.
Note that it is important that we match the behavior of the Spark Bloom filter exactly, otherwise we risk data loss if the CPU ends up building the filter but the GPU ends up evaluating it or vice-versa.
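Since construction behaves like a reduction, the phases can be sketched as: build a partial filter per batch, merge the partials (for filters with identical size and hash count, a merge is just a bitwise OR of their bit arrays), then probe on the stream side. The sketch below is illustrative only; `xxhash64Stub` and `bitIndex` are stand-ins, not the Spark-compatible kernels this issue asks for, and the real hash must match Spark's xxhash64 exactly per the note above.

```java
import java.util.BitSet;

public class BloomFilterJoinFlow {

  // Stub hash only for illustration; the real kernel must match Spark's
  // xxhash64 bit for bit, per the compatibility note above.
  static long xxhash64Stub(long value) {
    long h = (value + 42L) * 0xC2B2AE3D4F4D61F5L;
    return h ^ (h >>> 29);
  }

  // Placeholder mapping from the 64-bit hash to the i-th bit index.
  static int bitIndex(long hash, int i, int numBits) {
    long mixed = hash + (long) i * 0x9E3779B97F4A7C15L;
    return (int) Math.floorMod(mixed, (long) numBits);
  }

  // Build phase: reduce one batch of join-key values into a partial filter.
  static BitSet buildPartial(long[] keys, int numBits, int numHashes) {
    BitSet bits = new BitSet(numBits);
    for (long key : keys) {
      long h = xxhash64Stub(key);
      for (int i = 1; i <= numHashes; i++) {
        bits.set(bitIndex(h, i, numBits));
      }
    }
    return bits;
  }

  // Merge phase: partial filters built with identical parameters combine by OR.
  static BitSet merge(BitSet a, BitSet b) {
    BitSet out = (BitSet) a.clone();
    out.or(b);
    return out;
  }

  // Probe phase: a row can be dropped only when mightContain returns false.
  static boolean mightContain(BitSet bits, long key, int numBits, int numHashes) {
    long h = xxhash64Stub(key);
    for (int i = 1; i <= numHashes; i++) {
      if (!bits.get(bitIndex(h, i, numBits))) {
        return false;
      }
    }
    return true;
  }
}
```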
Dependencies
For the plugin tagging logic, we need to make sure that we can replace both sides of the join fully on the GPU before committing to changing the plan.