Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(substrait): handle emit_kind when consuming Substrait plans #13127

Merged

Conversation

vbarua
Copy link
Contributor

@vbarua vbarua commented Oct 26, 2024

Which issue does this PR close?

Follows up from #12495

Closes #12347

Rationale for this change

Substrait relations have an emit_kind which is either Direct, in which case the default fields of the relation are output, or Emit, which enables precise control of the order and inclusion of fields.

For example, given a relation with the following emit

"emit": {
  "outputMapping": [2, 0, 1]
}

The output mapping indicates that from the default columns output from the relation, only the 2nd, 0th and 1st column should be output (in that order).

DataFusion currently ignores the emit_kind field entirely when reading Substrait plans.

What changes are included in this PR?

This PR adds support for handling output mappings by treating them as DataFusion Projections that are layered on top of the default translation of the relation.

The one exception to this is Substrait Project, for which special handling has been added to avoid creating a Projection on top of a Projection.

Are these changes tested?

Yes. Two new tests have been added to check the remap logic.

Additionally, DataFusion currently includes output mappings when it produces Substrait Projects, so any test which roundtrips a Projection also serves as a test of this functionality.

Are there any user-facing changes?

Substrait plans generated by DataFusion prior to version 0.42 did not set the output mapping correctly for Substrait Projects (see #12495 for details).

After these changes, attempting to consume Substrait plans generated before version 0.42 will not work.

@Blizzara
Copy link
Contributor

Blizzara commented Oct 30, 2024

This looks good to me, pending the emit vs nondeterministic exprs, thanks @vbarua! To solve that, I'd be fine with for example special-casing the "emit exactly the expressions" case (that's what all roundtrip tests will anyways use I believe), and then adding a second project for any other case.

@tokoko
Copy link
Contributor

tokoko commented Oct 30, 2024

@Blizzara agreed. If I have to choose between this vs calling optimize in tests, I'd say avoiding second project is a lesser evil. @vbarua We could modify apply_emit_kind function to look something like this. This way "optimization" code would be less tangled with project code.

match retrieve_emit_kind(rel_common) {
    EmitKind::Direct(_) => Ok(plan),
    EmitKind::Emit(emit) => {
    	if (rel is project) and (emit doesn't contain duplicates) {
            return modified project
    	} else {
    	    return second project
    	}
    }
}

@vbarua
Copy link
Contributor Author

vbarua commented Oct 30, 2024

I experimented with using the OptimizeProjections rule to remove redundant Projections, just for testing, and while it did to that it also ended up add projections to the TableScans, because does column pruning and projection pushdown, and also removed some projections entirely.

I'd say avoiding second project is a lesser evil.

That's where my head is at as well. @tokoko I like you're idea of hiding all of this in the emit kind utility. I should find some time to update this later this week.

@alamb alamb marked this pull request as draft October 31, 2024 19:28
@alamb
Copy link
Contributor

alamb commented Oct 31, 2024

Converting to draft as it sounds like it is waiting on feedback to be incorporated. Please mark it as ready for review when ready for another look (I am tying to keep the review queue under control, not tring to discourage contributions)

@vbarua vbarua requested a review from tokoko October 31, 2024 19:53
@vbarua
Copy link
Contributor Author

vbarua commented Oct 31, 2024

I am trying to keep the review queue under control

Makes sense. Amusingly I was mid-push when I saw your update 😅
Re-openning and tagging @Blizzara @tokoko

@vbarua vbarua marked this pull request as ready for review October 31, 2024 19:55
@vbarua vbarua requested a review from tokoko November 1, 2024 16:44
Copy link
Contributor

@tokoko tokoko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. thanks @vbarua this is a big one, it was probably the biggest "oversight" in the consumer.

@Blizzara
Copy link
Contributor

Blizzara commented Nov 1, 2024

LGTM, thanks @vbarua!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a quick skim and it looks good to me

Thank you @vbarua and @tokoko and @Blizzara for pushing substrait forward

@alamb alamb merged commit 592b924 into apache:main Nov 1, 2024
24 checks passed
@vbarua vbarua deleted the vbarua/substrait/consumer-emit-kind-handling branch November 1, 2024 20:53
@alamb alamb mentioned this pull request Nov 5, 2024
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[substrait] add support for Substrait Relation emit kind
4 participants