-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement inequality joins #701
Comments
If we want inequality joins we may not be able to wait for these primitives to exist on the CUDF side. One of the issues above is closed and the other is just a request that has been open for a long time. In the Simplicity engine (very old blazingdb) we had an approach that required significant materialization but that scaled relatively well on a single node. We stable sorted the data according to the columns that were part of the inequality joins, created an RLE representation of the larger table, then performed bounded look ups from the smaller table into the larger table to get something like a series of ranges per row that you were joined to. So for each row you end up with something like 0-5, 100-112, 300-340. If there were any equality joins you perform those first to reduce the number of rows to be analyzed during the inequality phase. Another option that has been floated is using an AST to evaluate the join condition but this is somewhat akin to performing a cartesian join ( though not necessarily one that has to be materialized) followed by a filter step which seems like its begging to be improved. |
I think we will want to do a combination of both approaches. For inequality joins, we will want to do distribution based on a order as opposed to hashes |
Im not sure combination is the right way to go here. I think we need to pick hopefully one strategy for the first implemenation. I am currently reviewing literature on the subject to see what else I can find. |
This feature is waiting on this:
rapidsai/cudf#2792
rapidsai/cudf#3628
The text was updated successfully, but these errors were encountered: