
[FEA] Inequality Joins #2792

Closed
wmalpica opened this issue Sep 13, 2019 · 19 comments
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@wmalpica

Is your feature request related to a problem? Please describe.
cuDF currently does not support inequality joins. We (BlazingSQL) would like to support the queries in the TPC-H benchmark, some of which require inequality joins. Inequality joins would also let us expand our SQL coverage in general, and they are necessary for us to do partition-skipping processing for joins.

Describe the solution you'd like
We would like to be able to do inner and outer joins that have complex conditions including inequalities, for example:
A.col1 < B.col1
A.col1 < B.col1 AND A.col2 = B.col2
A.col1 = B.col1 AND A.col2 <> B.col2
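For reference, the semantics of these joins can be expressed as a filtered cross product over two toy tables. A minimal illustrative sketch (plain Python; the function and column names are hypothetical, not a cudf API):

```python
# Illustrative only: an inequality join as a filtered cross product over
# two toy "tables" (lists of row dicts). Not cudf code.

def inequality_join(left, right, predicate):
    """Return all (l, r) row pairs for which predicate(l, r) holds."""
    return [(l, r) for l in left for r in right if predicate(l, r)]

A = [{"col1": 1, "col2": 10}, {"col1": 5, "col2": 20}]
B = [{"col1": 3, "col2": 10}, {"col1": 4, "col2": 30}]

# A.col1 < B.col1
lt = inequality_join(A, B, lambda l, r: l["col1"] < r["col1"])

# A.col1 < B.col1 AND A.col2 = B.col2
lt_eq = inequality_join(
    A, B, lambda l, r: l["col1"] < r["col1"] and l["col2"] == r["col2"]
)
```

An efficient implementation would of course avoid materializing the full cross product; this only pins down what result the join should produce.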

Describe alternatives you've considered
We currently cannot think of a way to implement this with existing cuDF functionality. We have several ideas for how these algorithms could be implemented and would like to participate in design discussions.

Additional context

@wmalpica wmalpica added feature request New feature or request Needs Triage Need team to review and classify labels Sep 13, 2019
@harrism
Member

harrism commented Sep 13, 2019

@kkraus14
Collaborator

cc @jlowe @revans2 @tgravescs I believe Spark will need this as well

@kkraus14 kkraus14 added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 13, 2019
@jrhemstad
Contributor

What does A.col2 <> B.col2 mean?

@harrism
Member

harrism commented Sep 13, 2019

I assume that means !=.

@jlowe jlowe added the Spark Functionality that helps Spark RAPIDS label Sep 13, 2019
@jrhemstad
Contributor

@williamBlazing since you guys already have some implementation ideas, it would help to get the conversation started if you could type up a proposed API and a summary of the algorithm/implementation.

@nsakharnykh
Collaborator

Let me start with a basic idea for the API/implementation. At a high level, what we need is a cuDF join function where, in addition to the list of columns, you can pass a list of conditions: one condition for each column, each one of =, >, <, or <>. On the implementation side, we can build a hash table using only the columns with an = condition; then, if we find a match during the probe phase, we read the data from the inequality columns and test whether they pass the specified conditions. This should be fairly straightforward to implement, but it will only work for cases like A.col1 = B.col1 AND A.col2 <> B.col2, not for A.col1 < B.col1. I suppose in the latter case we would need a sort-based approach to the join.
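A rough sketch of this build-then-filter idea (illustrative Python; the function and parameter names are assumptions, not a cudf API):

```python
from collections import defaultdict

# Sketch of the proposal above: build a hash table on the equality-key
# column only, then during probing apply the inequality condition to each
# candidate match. Names here are hypothetical, not cudf code.

def hash_join_with_filter(build_rows, probe_rows, key, passes):
    # Build phase: hash table maps equality key -> row indices on the build side.
    table = defaultdict(list)
    for i, row in enumerate(build_rows):
        table[row[key]].append(i)

    # Probe phase: equality matches are further tested against `passes`,
    # e.g. the <> condition in  A.col1 = B.col1 AND A.col2 <> B.col2.
    out = []
    for j, row in enumerate(probe_rows):
        for i in table.get(row[key], []):
            if passes(build_rows[i], row):
                out.append((i, j))
    return out

A = [{"col1": 1, "col2": 7}, {"col1": 1, "col2": 8}]
B = [{"col1": 1, "col2": 7}]
# A.col1 = B.col1 AND A.col2 <> B.col2
pairs = hash_join_with_filter(A, B, "col1", lambda a, b: a["col2"] != b["col2"])
```

As noted above, this only helps when at least one = clause exists to hash on; a pure A.col1 < B.col1 join would need a different algorithm.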

@jlowe
Member

jlowe commented Sep 13, 2019

you can pass a list of conditions - one condition for each column, can be one of =, >, <, or <>.

I'd like the ability to pass something like an expression tree which can handle queries where a column could appear multiple times in the condition, e.g. A.col1 = B.col1 AND A.col2 < B.col2 AND A.col2 > B.col3

Parquet and ORC readers support predicate push-down filters where one can pass an expression tree describing the predicate condition. I was thinking it'd be nice to have a similar feature for pushing a filter condition down into the join. That would avoid manifesting the full equi-join result only to throw most of it out with a post-filter by the predicate condition.

@harrism
Member

harrism commented Sep 15, 2019

The most flexible way to do that would be a UDF that we JIT. A simpler approach is closer to what Nikolay suggests: rather than passing one condition per column, pass a vector of tuples: [op, lhs_col_index, rhs_col_index].
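A hypothetical sketch of evaluating such a tuple list against one row pair (illustrative Python; the names are assumptions, not a proposed cudf signature). Because columns are addressed by index, the same column can appear in several conditions:

```python
import operator

# Each condition is (op, lhs_col_index, rhs_col_index); the full predicate
# is the AND of all conditions. Illustrative only, not cudf code.

OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt, "<>": operator.ne}

def row_pair_matches(lhs_row, rhs_row, conditions):
    """True if every (op, l, r) condition holds for this row pair."""
    return all(OPS[op](lhs_row[l], rhs_row[r]) for op, l, r in conditions)

# A.col0 = B.col0 AND A.col1 < B.col1, columns addressed by index
conds = [("=", 0, 0), ("<", 1, 1)]
ok = row_pair_matches([1, 2], [1, 5], conds)
bad = row_pair_matches([1, 9], [1, 5], conds)
```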

@jlowe
Member

jlowe commented Sep 16, 2019

Does using a vector of tuples imply it would only support AND-ing the individual operations together to form the predicate? Curious how conditions using OR would be handled.

@revans2
Contributor

revans2 commented Sep 16, 2019

I agree with @jlowe that we want to support OR conditions as well. In TPC-H Q7 we do end up with an OR in a conditional join.

			and (
				(n1.n_name = 'FRANCE' and n2.n_name = 'GERMANY')
				or (n1.n_name = 'GERMANY' and n2.n_name = 'FRANCE')
			)

I am not an expert on whether there are ways to get around this with SQL join optimizations.

Longer-term I would prefer to see us JIT something instead of interpret it, but that is a much more difficult problem to solve initially.

@harrism
Member

harrism commented Sep 16, 2019

Great point... we really do need an expression tree, I guess. I suppose we could define a simple expression tree data structure for that and require the caller to build it up. @williamBlazing, would BlazingSQL be able to generate these expression trees if we had such an API?

@kkraus14 Thoughts about the Python interface?
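A minimal sketch of such an expression tree (illustrative Python; the node classes, eval interface, and row layout are all hypothetical, not a proposed cudf API), evaluating the Q7-style OR condition over one pair of joined rows:

```python
# Tiny hypothetical expression tree with column refs, literals, and
# AND/OR/EQ combinators, evaluated against a (left_row, right_row) pair.

class Col:  # column reference: which side ("l"/"r") and column name
    def __init__(self, side, name): self.side, self.name = side, name
    def eval(self, l, r): return (l if self.side == "l" else r)[self.name]

class Lit:  # literal constant
    def __init__(self, v): self.v = v
    def eval(self, l, r): return self.v

class Eq:
    def __init__(self, a, b): self.a, self.b = a, b
    def eval(self, l, r): return self.a.eval(l, r) == self.b.eval(l, r)

class And:
    def __init__(self, *kids): self.kids = kids
    def eval(self, l, r): return all(k.eval(l, r) for k in self.kids)

class Or:
    def __init__(self, *kids): self.kids = kids
    def eval(self, l, r): return any(k.eval(l, r) for k in self.kids)

# (n1.n_name = 'FRANCE' AND n2.n_name = 'GERMANY')
#   OR (n1.n_name = 'GERMANY' AND n2.n_name = 'FRANCE')
expr = Or(
    And(Eq(Col("l", "n_name"), Lit("FRANCE")), Eq(Col("r", "n_name"), Lit("GERMANY"))),
    And(Eq(Col("l", "n_name"), Lit("GERMANY")), Eq(Col("r", "n_name"), Lit("FRANCE"))),
)

hit = expr.eval({"n_name": "FRANCE"}, {"n_name": "GERMANY"})
miss = expr.eval({"n_name": "FRANCE"}, {"n_name": "FRANCE"})
```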

@nsakharnykh
Collaborator

In the Q7 example above, OR is used in the predicate applied to the join, but not in the join condition itself, and it could technically be done outside the join as a separate filter, albeit inefficiently. I think this discussion is about supporting new types of joins like inequality joins. For TPC-H, do we have expressions like A.col1 < B.col2 OR A.col3 < B.col4? If all the join relations are expressed with AND, then I think a list would work. At the very least, we could restrict ourselves to supporting join expressions with only column operands as a first step. Then for filters/predicates we might want to do a separate thing with UDFs that we pass into the join call. I am just trying to think how we can break this down without writing a full query optimizer yet.

@revans2
Contributor

revans2 commented Sep 18, 2019

I am fine with an 'AND'-only approach initially if it makes things simpler, because the majority of joins will likely involve just 'AND's. My point was really that other operations can be expressed in SQL, and it is a question of how far down that rabbit hole we want to go. I would personally prefer to see the API set up so that a simple expression tree is passed in, even if the only supported operator is AND, because then we don't have to change the API in the future for other operations. But cuDF appears to be happy to change APIs regularly, so I am not too concerned.

@harrism
Member

harrism commented Sep 19, 2019

CUDF appears to be happy to change APIs regularly so I am not too concerned.

Only until we reach 1.0!

I predict we are unlikely to get a new data structure like an expression tree just right the first time.

@wmalpica
Author

Sorry for the late reply.

BlazingSQL does have the ability to create, parse and modify expression trees, but we do it starting from the output of Apache Calcite, which already provides us something that looks a bit like an expression tree. We could extend this work for this purpose if we want to, but I am not sure we are there yet.

For the purposes of a cudf join that supports inequality joins, I think we want something that takes in a vector of tuples [op, lhs_col_index, rhs_col_index], as suggested by @harrism. Predicate pushdown should, for now, be handled by the user. In the long term, cudf joins could handle all sorts of different scenarios and internally select the type of join algorithm(s) or combination of filters and joins that are necessary, but trying to do that now could get too complicated.

@jtommaney

Q7
Possible AND filtering conditions:

  • n1.n_name in ( 'FRANCE', 'GERMANY')
  • n2.n_name in ( 'FRANCE', 'GERMANY')

A more complex example is Q19:
where (p_partkey = l_partkey and p_brand = 'Brand#12'
and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG') and l_quantity >= 1
and l_quantity <= 1 + 10 and p_size between 1 and 5
and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON')
or (p_partkey = l_partkey and p_brand ='Brand#23'
and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK') and l_quantity >=10
and l_quantity <=10 + 10 and p_size between 1 and 10
and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON')
or (p_partkey = l_partkey and p_brand = 'Brand#34'
and p_container in ( 'LG CASE', 'LG BOX', 'LG PACK', 'LG PKG') and l_quantity >=20
and l_quantity <= 20 + 10 and p_size between 1 and 15
and l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON');

AND filters implied by the above:

  • p_partkey = l_partkey
  • l_shipmode in ('AIR', 'AIR REG') and l_shipinstruct = 'DELIVER IN PERSON'
  • p_size between 1 and 15
  • l_quantity <= 30
  • p_brand in (....)
    etc.

These filters are applied to reduce the result set; however, the OR condition is still executed as a subsequent filter.

Calcite likely has some or all of the optimizations here, but performant behavior requires, at a minimum, that the part-key join expression can be used as a (join) filter.

It is definitely a common scenario for a post-join expression to be evaluated:
a join b where a.key = b.key and (a.col2 like b.col3), where like can be any SQL-supported expression.

@wmalpica
Author

@jtommaney

I think most of what you are referring to for Q7 and Q19 should be part of a filtering operation and not a join operation.

@ChuckHastings
Contributor

Summarizing some discussions on implementing inequality joins.

Fundamentally, there are two important aspects of efficiently implementing inequality joins, and they in fact apply even to the equijoins we already implement. The first aspect is the efficient mechanical execution of the joins; the second is determining which execution strategy is optimal.

The first aspect is something we'll be looking to address: creating an interface and implementation that allow for the efficient execution of joins (more on this below). The second aspect is something we currently consider beyond the scope of what we want to address in cuDF at this time. That is, we will leave it to the caller to control the join implementation (which specific implementation is used, the order in which clauses are applied, etc.). Perhaps at some point it will make sense to add functionality to cuDF to address this, but for now we'll just work on the mechanisms.

Efficient execution of joins is highly dependent on the exact data and join clauses that are involved. Traditionally there are three basic join algorithms:

  1. Nested loop join - iterate over the cross product of the two relations and filter the resulting rows down to the rows that match
  2. Sort-merge join - sort the two relations, merge them to identify the rows that match
  3. Hash join - create a hash table out of the keys of one relation, probe the hash table to identify rows that match
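As a sketch of algorithm 2 above, here is an illustrative sort-merge equi-join over two key lists (plain Python, not cudf code; it handles duplicate keys by emitting the cross product of each pair of equal-key runs):

```python
# Illustrative sort-merge equi-join: sort both sides, then advance two
# cursors, emitting matching row-index pairs. Not cudf code.

def sort_merge_join(left_keys, right_keys):
    L = sorted(range(len(left_keys)), key=lambda i: left_keys[i])
    R = sorted(range(len(right_keys)), key=lambda j: right_keys[j])
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        lk, rk = left_keys[L[i]], right_keys[R[j]]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Equal keys: emit the cross product of the two equal-key runs.
            i2 = i
            while i2 < len(L) and left_keys[L[i2]] == lk:
                j2 = j
                while j2 < len(R) and right_keys[R[j2]] == lk:
                    out.append((L[i2], R[j2]))
                    j2 += 1
                i2 += 1
            i, j = i2, j2
    return out

pairs = sort_merge_join([3, 1, 3], [3, 2])
```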

cuDF currently implements the Hash join. There are situations where a Sort-merge join or even a nested loop join would be more efficient than the implemented Hash join. There are also potential optimizations to the Hash join that we have discussed.

Note that only equijoins can be implemented with a hash join, since the hash only finds exact matches. So as a precursor to implementing inequality joins, we need implementations of nested loop joins, sort-merge joins, or both.
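For a pure inequality clause such as A.col1 < B.col1, one sort-based alternative to a full nested loop is to sort one side once and binary-search per probe key; every position past the search point matches. A hypothetical sketch (illustrative names, not cudf code):

```python
from bisect import bisect_right

# Sort-based A.col1 < B.col1 join: sort B's keys, then for each A key a
# binary search finds the first B value strictly greater than it; that
# suffix of the sorted B keys is the match set. Illustrative only.

def lt_join(a_keys, b_keys):
    order = sorted(range(len(b_keys)), key=lambda j: b_keys[j])
    b_sorted = [b_keys[j] for j in order]
    out = []
    for i, a in enumerate(a_keys):
        start = bisect_right(b_sorted, a)  # first position with b > a
        out.extend((i, order[j]) for j in range(start, len(b_sorted)))
    return out

pairs = lt_join([2, 9], [1, 5, 7])
```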

I have created issues to improve equijoins in anticipation of addressing this issue:

  1. [FEA] Implement nested loop join #3772
  2. [FEA] Implement sort-merge join #3773
  3. [FEA] Improve hash merge implementation #3774

@vyasr
Contributor

vyasr commented Aug 2, 2021

This issue was resolved by #8214.

@vyasr vyasr closed this as completed Aug 2, 2021