Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement inequality joins #701

Open
wmalpica opened this issue May 26, 2020 · 3 comments
Open

Implement inequality joins #701

wmalpica opened this issue May 26, 2020 · 3 comments

Comments

@wmalpica
Copy link
Contributor

wmalpica commented May 26, 2020

This feature is waiting on this:
rapidsai/cudf#2792
rapidsai/cudf#3628

@felipeblazing
Copy link

If we want inequality joins we may not be able to wait for these primitives to exist on the CUDF side. One of the issues above is closed and the other is just a request that has been open for a long time.

In the Simplicity engine (very old blazingdb) we had an approach that required significant materialization but that scaled relatively well on a single node. We stable sorted the data according to the columns that were part of the inequality joins, created an RLE representation of the larger table, then performed bounded look ups from the smaller table into the larger table to get something like a series of ranges per row that you were joined to. So for each row you end up with something like 0-5, 100-112, 300-340. If there were any equality joins you perform those first to reduce the number of rows to be analyzed during the inequality phase.

Another option that has been floated is using an AST to evaluate the join condition but this is somewhat akin to performing a cartesian join ( though not necessarily one that has to be materialized) followed by a filter step which seems like its begging to be improved.

@wmalpica
Copy link
Contributor Author

I think we will want to do a combination of both approaches. For inequality joins, we will want to do distribution based on a order as opposed to hashes

@felipeblazing
Copy link

Im not sure combination is the right way to go here. I think we need to pick hopefully one strategy for the first implemenation. I am currently reviewing literature on the subject to see what else I can find.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants