Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Esql/lookup join grammar #116515

Merged
merged 27 commits into from
Nov 20, 2024
Merged

Esql/lookup join grammar #116515

merged 27 commits into from
Nov 20, 2024

Conversation

costin
Copy link
Member

@costin costin commented Nov 8, 2024

First PR for adding LOOKUP JOIN in ESQL.
Introduces grammar and wires the main building blocks to execute a query; follow-ups are required (see #116208 for more details).

@elasticsearchmachine elasticsearchmachine added v9.0.0 needs:triage Requires assignment of a team area label labels Nov 8, 2024
@costin costin removed the needs:triage Requires assignment of a team area label label Nov 8, 2024
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Nov 8, 2024
@costin costin added the :Analytics/ES|QL AKA ESQL label Nov 8, 2024
@costin costin linked an issue Nov 8, 2024 that may be closed by this pull request
48 tasks
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Nov 8, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Nov 8, 2024
@costin costin marked this pull request as draft November 8, 2024 17:55
Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor questions.

Remove USING syntax - use ON <field> for now
Fix incorrect serialization
@costin costin force-pushed the esql/lookup-join-grammar branch from fae8354 to 7f14d61 Compare November 12, 2024 05:36
@alex-spies alex-spies self-requested a review November 12, 2024 15:32
Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a first pass with some questions and thoughts.

}
for (Attribute a : right) {
// override the existing entry in place
nameToAttribute.compute(a.name(), (name, existing) -> a);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess we should make a nullable here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO plus entry in the meta ticket - making the attributes nullable changes their equality which has side effects in the plan.

@alex-spies alex-spies self-requested a review November 14, 2024 16:40
@costin costin force-pushed the esql/lookup-join-grammar branch from c66ccce to 9b51cef Compare November 15, 2024 00:59
Move writable declarations outside the core classes to avoid errors
(such as subClass.getNamedWritable()) and centralize them in a top
package class for better management.
@costin costin marked this pull request as ready for review November 18, 2024 05:38
@alex-spies alex-spies mentioned this pull request Nov 18, 2024
48 tasks
Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, all done now. I think we're not too far away, but I think there's a couple things that needs to be done before merging - and a bunch of things that can be done after.

Here's what I think we should do:

  • I believe the refactoring of the named writable entries should become its own PR. Currently, our mixed cluster test suite is muted, which means that making such a large scale refactor to our serialization is quite risky.
  • It looks to me like our current planning for lookup joins expects the match fields from the left and right hand side's outputs to have the same name ids. I believe this is wrong and will lead to inconsistent planning; see my comment on the changes to Analyzer.java. The fields in the lookup index are different from fields that we join on. Actually, even if we join the lookup index with itself, we should probably still assign different name ids to the respective field attrs. to avoid bugs. There's a comment in LookupJoinExec that claims we need qualifiers to make this work, but I don't believe so - name ids should disambiguate the attributes just fine.
  • LookupJoin and LookupJoinExec have some inconsistencies and hacks which I think we best address to avoid them from lingering and producing secondary issues. E.g. the fact that LookupJoinExec doesn't have any right hand side attributes in its references(), the fact that LookupJoin.info() is not quite right etc. See the individual comments.
  • Test coverage (can be addressed in a follow up, but then we need to put it in the meta issue as well):
    • The test for lookup joins on the data nodes doesn't look right to me. But we can consider lookup joins on data nodes a follow-up.
    • In general, the test coverage in terms of csv tests for lookup joins even on the coordinator is too small. We can address this in a follow-up, though, but at least should add it to the meta issue.

Comment on lines 65 to 67
// Do not just add the JoinConfig as a whole - this would prevent correctly registering the
// expressions and references.
return NodeInfo.create(this, LookupJoin::new, left(), right(), config(), output);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is missing a TODO? The JoinConfig is currently still added as a whole, which means that methods like QueryPlan.forEachExpression will not work as expected.

Maybe this can just be fixed by copying the approach from Join? There, I solved this by having a constructor that takes the components of the JoinConfig instead of the whole thing, which is also the longest constructor to satisfy the info() contract.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment was no longer relevant and was removed.

Copy link
Contributor

@alex-spies alex-spies Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still have to pass the individual components of the join config to the node info instead of the join config object as a whole.

If we want to, say, rewrite the plan and need to replace an attribute in the join config's left hand side, just calling forEachExpression on the join will be wrong because it ignores the join config.

We had issues because of this with Join before; let's fix it so we don't run into trouble later.

Comment on lines +207 to +218
if (right instanceof FragmentExec fragment
&& fragment.fragment() instanceof EsRelation relation
&& relation.indexMode() == IndexMode.LOOKUP) {
return new LookupJoinExec(
join.source(),
left,
right,
config.matchFields(),
config.leftFields(),
config.rightFields(),
join.output()
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I understand that our plan for the future is to have more general joins where we might push e.g. a filter into the right hand side and then won't perform a LookupJoinExec but a more general join here. But as that's currently not in place, I believe this approach is complicating the planning for no gain, because we could also map LookupJoin directly to LookupJoinExec without first transforming it into a general Join. I believe this increases the risk of mistakes due to premature abstraction and loses information during logical optimization.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree right now the abstraction is simplistic but the point is to decouple the command from the execution.

LookupJoin is syntactic sugar for special left join with special output - it gets folded into a regular join since I'd like to avoid serializing it (and dealing with versioning) and to have the compute engine pick up the pattern not the class.
That means any syntax that gets converted into a left join could be executed against a lookup join. Enrich for example could be attached onto this one but the dependency between the logical and physical happens through the pattern NOT the Lookup class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair - but we aren't at the point yet where an Enrich, or anything else for that matter, is rewritten into a pattern that is translated into a LookupJoinExec. Currently, we only have the Lookup logical plan node, which doesn't need the additional round trip via surrogate + pattern matching.

I really prefer deferring such generalizations until they have actual use cases - because getting the generalization wrong early on leads to pain :/ and it makes the code more complex than it currently needs.

But that's supposed to be a nit anyway :)

@elasticsearchmachine
Copy link
Collaborator

Hi @costin, I've created a changelog YAML for you.

@costin
Copy link
Member Author

costin commented Nov 19, 2024

Thanks @alex-spies for all the comments, see my nested replies. This PR doesn't aim to fully address the lookup join, rather add the basic layers that can later be distributed and worked on as a follow-up from #116208 in which you already commented (thanks for that!).

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/kibana-esql (ES|QL-ui)

Copy link
Contributor

@alex-spies alex-spies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya - approving this to unblock the PR. I think it's fine to address most things in follow ups, as you suggested @costin.

Before merging, I think it'd be great to

  • merge ESQL: Refactor named writeable entries for expressions and plans #117029 first so that the serialization refactoring can be easily backported,
  • carefully consider if this should be backported as well (not doing so can seriously hamper our ability to auto-backport as this PR touches many places, even when factoring out the serialization changes), and
  • fix LookupJoin.info() as commented before.

@costin costin force-pushed the esql/lookup-join-grammar branch from 0c30cb9 to ac9427e Compare November 20, 2024 00:39
@costin costin added ES|QL-ui Impacts ES|QL UI and removed ES|QL-ui Impacts ES|QL UI labels Nov 20, 2024
@costin costin merged commit bc785f5 into elastic:main Nov 20, 2024
16 checks passed
@costin costin deleted the esql/lookup-join-grammar branch November 20, 2024 01:52
costin added a commit to costin/elasticsearch that referenced this pull request Nov 20, 2024
First PR for adding LOOKUP JOIN in ESQL.
Introduces grammar and wires the main building blocks to execute a query
 ; follow-ups are required (see elastic#116208 for more details).

Co-authored-by: Nik Everett <[email protected]>
(cherry picked from commit bc785f5)
elasticsearchmachine pushed a commit that referenced this pull request Nov 20, 2024
First PR for adding LOOKUP JOIN in ESQL.
Introduces grammar and wires the main building blocks to execute a query
 ; follow-ups are required (see #116208 for more details).

Co-authored-by: Nik Everett <[email protected]>
(cherry picked from commit bc785f5)
rjernst pushed a commit to rjernst/elasticsearch that referenced this pull request Nov 20, 2024
First PR for adding LOOKUP JOIN in ESQL.
Introduces grammar and wires the main building blocks to execute a query; follow-ups are required (see elastic#116208 for more details).

Co-authored-by: Nik Everett <[email protected]>
alexey-ivanov-es pushed a commit to alexey-ivanov-es/elasticsearch that referenced this pull request Nov 28, 2024
First PR for adding LOOKUP JOIN in ESQL.
Introduces grammar and wires the main building blocks to execute a query; follow-ups are required (see elastic#116208 for more details).

Co-authored-by: Nik Everett <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL ES|QL-ui Impacts ES|QL UI >feature Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ESQL: Lookup Join meta issue
6 participants