feat(substrait): handle emit_kind when consuming Substrait plans #13127
This looks good to me, pending the emit vs nondeterministic exprs, thanks @vbarua! To solve that, I'd be fine with for example special-casing the "emit exactly the expressions" case (that's what all roundtrip tests will anyways use I believe), and then adding a second project for any other case.
@Blizzara agreed. If I have to choose between this vs calling optimize in tests, I'd say avoiding second project is a lesser evil. @vbarua We could modify
I experimented with using the
That's where my head is at as well. @tokoko I like your idea of hiding all of this in the emit kind utility. I should find some time to update this later this week.
Converting to draft as it sounds like it is waiting on feedback to be incorporated. Please mark it as ready for review when ready for another look (I am trying to keep the review queue under control, not trying to discourage contributions)
looks good. thanks @vbarua this is a big one, it was probably the biggest "oversight" in the consumer.
LGTM, thanks @vbarua!
Which issue does this PR close?
Follows up from #12495
Closes #12347
Rationale for this change
Substrait relations have an emit_kind which is either Direct, in which case the default fields of the relation are output, or Emit, which enables precise control of the order and inclusion of fields.
For example, given a relation with the following emit
The output mapping indicates that, of the default columns output by the relation, only the columns at indices 2, 0 and 1 should be output (in that order).
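To make the remapping concrete, here is a minimal sketch in plain Rust. It does not use the DataFusion or Substrait crates; `EmitKind` and `apply_emit` are hypothetical names introduced only for illustration, and columns are modeled as plain strings.

```rust
/// The two emit kinds described above, modeled as a plain Rust enum.
/// Hypothetical stand-in; this is not the generated Substrait protobuf type.
#[derive(Debug, Clone, PartialEq)]
enum EmitKind {
    /// Output the relation's default fields unchanged.
    Direct,
    /// Output only the fields at these indices, in this order.
    Emit(Vec<usize>),
}

/// Apply an emit to the default output columns of a relation.
/// Returns an error if the mapping refers to a column that does not exist.
fn apply_emit(default_columns: &[String], emit: &EmitKind) -> Result<Vec<String>, String> {
    match emit {
        EmitKind::Direct => Ok(default_columns.to_vec()),
        EmitKind::Emit(mapping) => mapping
            .iter()
            .map(|&i| {
                default_columns.get(i).cloned().ok_or_else(|| {
                    format!(
                        "output mapping index {i} is out of bounds for {} columns",
                        default_columns.len()
                    )
                })
            })
            .collect(),
    }
}
```

With default columns `a`, `b`, `c` and a mapping of `[2, 0, 1]`, this sketch yields `c`, `a`, `b`. A mapping may also repeat or omit indices, which is why the result is built by lookup rather than by permuting in place.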
DataFusion currently ignores the emit_kind field entirely when reading Substrait plans.
What changes are included in this PR?
This PR adds support for handling output mappings by treating them as DataFusion Projections that are layered on top of the default translation of the relation.
The one exception to this is Substrait Project, for which special handling has been added to avoid creating a Projection on top of a Projection.
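The Project special case can be sketched as follows. A Substrait Project's default output is its input fields followed by its new expressions, so an output mapping can be resolved against that combined list to build a single projection directly. This is an illustrative sketch, not the code in this PR: `Expr` and `project_with_emit` are hypothetical names, and expressions are modeled as simple placeholders.

```rust
/// Hypothetical, simplified expression type used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    /// Reference to the input column at this index.
    Column(usize),
    /// Any computed expression, represented here by its display text.
    Computed(String),
}

/// Build the expression list for a single projection from a Substrait-style
/// Project: `input_width` input fields, extra `project_exprs`, and an
/// optional output mapping (None models the Direct emit kind).
fn project_with_emit(
    input_width: usize,
    project_exprs: &[Expr],
    output_mapping: Option<&[usize]>,
) -> Result<Vec<Expr>, String> {
    // Default output of a Project: all input fields, then the new expressions.
    let default: Vec<Expr> = (0..input_width)
        .map(Expr::Column)
        .chain(project_exprs.iter().cloned())
        .collect();

    match output_mapping {
        None => Ok(default),
        // Select from the default output directly, so no second projection
        // needs to be layered on top of the first.
        Some(mapping) => mapping
            .iter()
            .map(|&i| {
                default
                    .get(i)
                    .cloned()
                    .ok_or_else(|| format!("output mapping index {i} is out of bounds"))
            })
            .collect(),
    }
}
```

Note that if a mapping selects the same computed expression more than once, a single projection built this way contains that expression twice. The sketch does not attempt to address the nondeterministic-expression concern raised in the conversation.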
Are these changes tested?
Yes. Two new tests have been added to check the remap logic.
Additionally, DataFusion currently includes output mappings when it produces Substrait Projects, so any test which roundtrips a Projection also serves as a test of this functionality.
Are there any user-facing changes?
Substrait plans generated by DataFusion prior to version 0.42 did not set the output mapping correctly for Substrait Projects (see #12495 for details).
After these changes, attempting to consume Substrait plans generated before version 0.42 will not work.