Support equi-joins #20
Here's my initial thinking on an equi-inner-join implementation.
For the implementation, at a high level, we'd want to support a new implementation of QueryPlan called JoinPlan. The JoinPlan would nest two QueryPlans, the LHS and RHS, with a HashCache being created and sent at execution time for the LHS. For example, A JOIN B JOIN C would cause HashCache(A) to be sent to all region servers, the scan for B to be done (where the GroupedAggregateRegionObserver or ScanRegionObserver would do the join), the HashCache(A joined to B) to be sent to all region servers, and the scan for C to be done (again with the region observer doing the join).

The key of the HashCache would be formed by the concatenated expression evaluation of the expressions in the ON clause for the table being cached, much the same as the GROUP BY key formed in GroupedAggregateRegionObserver#getKey. The region observer would be passed the parallel expressions for the other side of the ON clause. For each row scanned, we'd evaluate the expressions and look up the result (using an ImmutableBytesPtr key) in our HashCache to find the corresponding match. The List of KeyValues from the hash cache would be concatenated with the List of KeyValues from the scan, the row keys from each would be concatenated together, and the combined row would be returned as the result. For the client side changes:
|
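Roughly, the per-row lookup on the region server might look like the sketch below. This is only an illustration of the idea, not actual Phoenix code: the Expression and Row interfaces stand in for the Phoenix expression and Tuple types, and the helper names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;

// Sketch only: "Expression" and "Row" stand in for the Phoenix expression and Tuple types
// discussed above; none of this is actual Phoenix code.
class HashJoinLookupSketch {
    interface Expression { byte[] evaluate(Row row); }
    interface Row { /* a scanned RHS row or a cached LHS row */ }

    // Build the join key by evaluating the ON-clause expressions for one side and
    // concatenating the results, analogous to GroupedAggregateRegionObserver#getKey.
    static byte[] joinKey(List<Expression> onExprs, Row row) throws IOException {
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        for (Expression e : onExprs) {
            key.write(e.evaluate(row));
        }
        return key.toByteArray();
    }

    // For each scanned RHS row, probe the hash cache built from the LHS. A real
    // implementation would key the map on ImmutableBytesPtr (which caches its hash code);
    // a Base64 string is used here only to keep the sketch self-contained.
    static Row probe(Map<String, Row> lhsHashCache, List<Expression> rhsOnExprs, Row scanned)
            throws IOException {
        String key = java.util.Base64.getEncoder().encodeToString(joinKey(rhsOnExprs, scanned));
        return lhsHashCache.get(key); // null means no match; a match gets unioned with 'scanned'
    }
}
```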
For outer joins, it'd be similar, but we couldn't push the where clauses down to the scans (not sure if you can sometimes push them or not). I think we should start with inner joins and that'll be the bulk of the work. |
Updated the above write-up slightly. I think we'll be better off if the JoinPlan nests two plans, the LHS plan and the RHS plan. The nested plan may itself be another JoinPlan or a simple ScanPlan or AggregatePlan. |
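As a rough illustration of that nesting (not the actual Phoenix QueryPlan API; the interfaces and method names below are stand-ins):

```java
// Sketch of the proposed JoinPlan nesting; the interfaces and method names here are
// illustrative stand-ins, not the actual Phoenix QueryPlan API.
interface QueryPlan {
    ResultIterator iterator() throws Exception;
}

interface ResultIterator { /* yields scanned or joined rows */ }

class JoinPlan implements QueryPlan {
    private final QueryPlan lhs; // may itself be another JoinPlan, or a ScanPlan/AggregatePlan
    private final QueryPlan rhs;

    JoinPlan(QueryPlan lhs, QueryPlan rhs) {
        this.lhs = lhs;
        this.rhs = rhs;
    }

    @Override
    public ResultIterator iterator() throws Exception {
        // 1. Execute the LHS plan and build a HashCache from its results.
        // 2. Send that HashCache to all region servers covering the RHS table.
        // 3. Run the RHS scan; the region observer joins each scanned row against the cache.
        return runRhsScanAgainstLhsCache(lhs.iterator(), rhs);
    }

    // Hypothetical helper standing in for the client-side cache distribution and RHS scan.
    private ResultIterator runRhsScanAgainstLhsCache(ResultIterator lhsResults, QueryPlan rhs) {
        throw new UnsupportedOperationException("sketch only");
    }
}
```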
Here's a rundown on the server side of the existing/partial hash join implementation and what needs to change.

GlobalCache. We instantiate a GlobalCache singleton on each region server. This GlobalCache has a map of child TenantCache instances (see below). A different, better mechanism for holding onto an object on the region server was surfaced by @lhofhansl that we should switch to - maybe he can remind us what this is?

TenantCacheImpl. This is what's holding on to the HashCache, potentially a list of them if multiple joins are being processed by the same tenant. Don't get confused by the tenant stuff. By default, there's a single global tenant being held onto by the GlobalCache. You can have multiple tenants if the client sets a "TenantId" connection property. It's just a way of potentially rolling up resource usage on the server side to an "owner" so that a single owner doesn't consume too many resources.

AgeOutHashCache. The server-side implementation of HashCache. It builds up a map from the row key to the Result when it deserializes the hash cache sent from the client. It's basically hard coded right now to do a join between a PK and FK (in either direction). Instead, we want to serialize through the LHS expressions from the ON clause (along with the hash cache) and evaluate them against each Result in the HashCache to form a key (exactly like the key that's formed in GroupedAggregateRegionObserver#getKey - just refactor that code). This will be the key of the Map. Use an ImmutableBytesPtr as the key - it's like an ImmutableBytesWritable, but it caches the hash code for better performance. The value should be a Tuple instead of a Result. The cache gets aged out after a configurable amount of time. We should either just switch to using the Guava Map for caching (they have age-out options; see the sketch below), or tie into the HBase mechanism for a "lease". This mechanism was surfaced by @lhofhansl in HBase a while back - maybe another reminder would be good.

HashJoiningRegionObserver. Take a look below at the prior revision. This is the coprocessor that handles a hash join. I think it's probably better to get rid of this coprocessor and put the logic in a class that's usable by both ScanRegionObserver and GroupedAggregateRegionObserver. Otherwise we'll have multiple coprocessors firing (for example, for a top N query that does a join, or a group by with a join), which will get difficult to manage. The changes necessary here are similar to the changes to AgeOutHashCache. You'll want to send the RHS of the ON expression through the Scan attributes and use those to build up a key by evaluating them against each Tuple returned by the scan. You'll then use this key to look for a match in the HashCache, similar to what's being done now. If you find a match, then the two Tuples need to be unioned, including the row keys. The SchemaUtil.joinColumns (which isn't in the code base anymore) didn't do much more than concatenate the list of KeyValues. I think you'll need to devise a scheme where similarly named columns from the LHS and RHS don't conflict. If you pass along the alias used on the RHS and LHS, I think you can use these as the column family name (and ignore the current column family name). This should handle this case. I think the logic for an outer join is the same as in this code.
|
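For the age-out behavior, the Guava option mentioned above could look roughly like this. This is a sketch only, assuming a Guava dependency is available; the cache id and value types are placeholders standing in for ImmutableBytesPtr and the deserialized HashCache.

```java
import java.util.concurrent.TimeUnit;
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;

// Sketch of using a Guava cache to age out server-side hash caches instead of a
// hand-rolled timer. Object is used as a placeholder so the snippet stands alone;
// a real version would key on ImmutableBytesPtr and hold the HashCache instances.
public class ServerCacheSketch {
    private final Cache<Object /* cache id */, Object /* deserialized HashCache */> hashCaches =
        CacheBuilder.newBuilder()
            .expireAfterAccess(30, TimeUnit.MINUTES) // configurable max idle time before age-out
            .maximumSize(100)                        // cap the number of live hash caches
            .build();

    public void addCache(Object cacheId, Object hashCache) {
        hashCaches.put(cacheId, hashCache);
    }

    public Object getCache(Object cacheId) {
        return hashCaches.getIfPresent(cacheId); // null once the entry has aged out
    }
}
```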
Is the implementation in HashJoiningRegionObserver#doPostScanner() meant for star joins, by having multiple join tables? Or do we, for now, just support pipelined joins as described in the first comment, like ((A join B) join C) join D? |
Yes, the multiple join tables in HashJoiningRegionObserver#doPostScanner() are meant to support star joins for queries like below. This saves us from having to push the result of each join back to the region server.
We can stage this, though, and only do the pipelining for v1. |
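In the star-join case, the region observer would scan the large table once and probe one hash cache per small table for each scanned row. A rough sketch of that loop follows; the types and helpers are hypothetical, not actual Phoenix code.

```java
import java.util.List;
import java.util.Map;

// Star-join sketch: the large table is scanned once and each scanned row is probed
// against one hash cache per small table. All types and helpers are illustrative only.
class StarJoinSketch {
    interface Row {}
    interface JoinSpec {
        byte[] key(Row scannedRow);            // evaluates this join's ON expressions against the scanned row
        Map<String, Row> hashCache();          // cache built from the corresponding small table
        Row union(Row scannedRow, Row cached); // merges the KeyValues and concatenates the row keys
    }

    static Row joinOne(Row scannedRow, List<JoinSpec> joins) {
        Row result = scannedRow;
        for (JoinSpec join : joins) {
            String key = java.util.Base64.getEncoder().encodeToString(join.key(scannedRow));
            Row match = join.hashCache().get(key);
            if (match == null) {
                return null; // inner-join semantics: drop the row; a LEFT JOIN would keep it
            }
            result = join.union(result, match);
        }
        return result;
    }
}
```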
Yes. I think it's just the complexity in the compiler to decide between the two join schemes. |
How can we do KeyValueColumnExpression evaluation on a joined result? For example, to evaluate f1:c1 from table A and f1:c1 from table B in (A join B)? |
This would look something like this:
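A hedged illustration of the idea, assuming (as suggested elsewhere in this thread) that the table alias is substituted for the column family in the joined row so that same-named columns from A and B stay distinct. Plain HBase KeyValues are used here; this is not actual Phoenix code.

```java
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative only: shows how f1:c1 from A and f1:c1 from B could coexist in one joined
// row by rewriting the column family to the table alias, so a column expression can
// reference A:c1 versus B:c1 unambiguously.
class JoinedRowSketch {
    static List<KeyValue> joinedRow(byte[] joinedRowKey, byte[] valueFromA, byte[] valueFromB) {
        return Arrays.asList(
            new KeyValue(joinedRowKey, Bytes.toBytes("A"), Bytes.toBytes("c1"), valueFromA),
            new KeyValue(joinedRowKey, Bytes.toBytes("B"), Bytes.toBytes("c1"), valueFromB));
    }
}
```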
Our ExpressionCompiler currently adds cf/cq to the Scan as it compiles in ExpressionCompiler.visit(ColumnParseNode node), but instead I think we'll want to delay doing this and keep a List on ExpressionCompiler instead (this captures the TableRef as well). For our OnClauseExpressionCompiler, we can use this to distinguish between case (1) and (2) above, and we'll end up with two QueryPlans. For the default case, we can project into the Scan as we do now from QueryCompiler. |
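A sketch of that delayed-projection idea; again, the names below are hypothetical, and the real ExpressionCompiler and ColumnParseNode APIs differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: instead of projecting cf/cq into the Scan as each column node is visited,
// record the references (with their table) so the caller can later decide whether they
// feed the hash-cache key, the scan-side join key, or a normal Scan projection.
class ExpressionCompilerSketch {
    static class ColumnRef {
        final String tableAlias; // captures which TableRef the column came from
        final byte[] family;
        final byte[] qualifier;
        ColumnRef(String tableAlias, byte[] family, byte[] qualifier) {
            this.tableAlias = tableAlias;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    private final List<ColumnRef> columnRefs = new ArrayList<>();

    // Called from visit(ColumnParseNode): record the column instead of Scan.addColumn(...)
    void recordColumn(String tableAlias, byte[] family, byte[] qualifier) {
        columnRefs.add(new ColumnRef(tableAlias, family, qualifier));
    }

    List<ColumnRef> getColumnRefs() {
        return columnRefs;
    }
}
```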
issue #20 : Runtime part implementation
With all the assumptions mentioned above (e.g., that we only support AND-connected = conditions in the ON clause, and that conditions in the WHERE clause are compiled either as table filters or as post-scan checks), we decided to compile multiple-join queries in a slightly different way, depending on the situations below:
SELECT ...
So in the first example, we can't actually hash the tables in their order of appearance, because we wouldn't be able to do a LEFT JOIN with the LHS being hashed. However, what we intend to do with this query is most like a star join: scanning the table awards and joining against the other two small tables. Here, we simply define our star join as multiple LEFT or INNER joins whose ON-clause conditions refer only to the current joining table and the first table. The query generation process for multiple (single) joins goes as follows:
For a GROUP BY in the main query, we do the join before grouping. But for sub-queries containing a GROUP BY, we may have to specify in "compile_as_scan()" whether the grouping should be performed before or after the join operation, though in the process mentioned above we prefer putting such sub-queries into the hash. Some cases we currently leave out of the hash join implementation may also be doable. So, as a later improvement, we can try rewriting those join queries, for example by re-grouping sub-queries. |
Great write-up, @maryannxue. When you talk about "sub-queries", are you talking about the traditional sub-query, but expressed (or re-written) as a join? One comment on "we only support AND-connected = conditions in the ON clause...": I think this is true only for conditions that define the join key (i.e. involve multiple tables). We should be able to allow arbitrary filters in the ON clause, which would be parsed out and pushed as predicates to the individual table scans. For inner joins this doesn't make a difference, but for outer joins it can. With outer joins, you're forced to evaluate WHERE clause filters after the join (so you can't push them down). But if those filters are included in the ON clause, you can always push them down. It would also be fine for these filters to include OR conditions; we'd just require AND conditions between the join keys. |
@jtaylor-sfdc The "sub-queries" I mentioned above can be traditional sub-queries, or a JOIN clause in parentheses (which will most probably be compiled to the hashed side), or the LHS of the rightmost top-level JOIN node, e.g., "select ... from A join B on A.p = B.p" as in "select * from A join B on A.p = B.p join C on B.q = C.q;", since we consider JOIN left-associative. Yes, thanks for pointing out the condition handling! Just to confirm: we only support OR conditions as table filters if they are enclosed in a top-level AND condition in the ON clause, right? Like "on A.p = B.p AND (A.a = 'xx' OR A.a = 'yy')". And for those OR filter conditions dealing with more than one table, we also defer the evaluation to the after-join stage. |
Yes, on all counts |
With a runtime enhancement to support iterative join key evaluation on the previously joined result, for example: |
Issue #20 - hashjoin implementation
Issue #20 - hashjoin implementation improvements
Issue #20 - Hash join implementations
Implemented by @maryannxue and pulled into the master branch. Fantastic job! |
Support joins, starting with hash joins. Some work on this has already been done. See HashCache* and HashJoiningRegionObserver.