Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement right semi join and support in HashBuildProbeorder #3958

Merged
merged 15 commits into from
Oct 27, 2022

Conversation

Dandandan
Copy link
Contributor

@Dandandan Dandandan commented Oct 25, 2022

Which issue does this PR close?

Closes #3945

Rationale for this change

Possible perf / memory usage improvements when statistics are available (or when someone hard-codes the join type).

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules labels Oct 25, 2022
| JoinType::Left
| JoinType::Right
| JoinType::Full
| JoinType::Semi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to change JoinType::Semi to JoinType::LeftSemi explictly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be better for clarity, but also a breaking change.
I think we also should change Anti to LeftAnti if we want to do this.

FYI @alamb @andygrove

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add the new variants and mark the old ones as deprecated?

let task_ctx = session_ctx.task_ctx();
let left = build_table(
("a2", &vec![10, 20, 30, 40]),
("b1", &vec![4, 5, 6, 5]), // 5 is double on the left
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

recommend calling this b2 not b1 for clarity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for writing up #3945 -- I was very confused about the usecase for a "right semijoin" until I read that.

I learned something new today!

I like @xudong963 's suggestion:

we It's better to change JoinType::Semi to JoinType::LeftSemi explictly?

I think we also should change Anti to LeftAnti if we want to do this.

This makes sense to me -- I think the more explicit the join types the better 👍

I didn't quite understand the tests -- maybe I am confused.

@Dandandan
Copy link
Contributor Author

Ok - I went ahead and renamed everything to LeftSemi and LeftAnti. I don't think it's worth the complexity to keep Semi and Anti around as depricated (as this would cause extra cases everywhere and making this error prone).

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM -- thanks @Dandandan

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parquet-testing was changed, is it expected?

Maybe the change will break some tests.

@Dandandan
Copy link
Contributor Author

parquet-testing was changed, is it expected?

Maybe the change will break some tests.

My mistake! Thanks, fixed.

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thank you @Dandandan

Lol, btw I just finished all kinds of right-related join in databend community a few days ago to make join reorder work smoothly.

@Dandandan
Copy link
Contributor Author

Nice work, thank you @Dandandan

Lol, btw I just finished all kinds of right-related join in databend community a few days ago to make join reorder work smoothly.

Awesome - would be nice to see if we can learn something from databend 👍

@Dandandan
Copy link
Contributor Author

@tustvold Is there anything I should do to update the generated pbjson and prost files? I have merged master to this version, but can't seem to generate the same code :/

@@ -43,7 +43,6 @@ impl<'de> serde::Deserialize<'de> for AggregateExprNode {
D: serde::Deserializer<'de>,
{
const FIELDS: &[&str] = &[
"aggr_function",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only get aggrFunction

@Dandandan
Copy link
Contributor Author

Dandandan commented Oct 27, 2022

@tustvold Updating the lock-file solved the issue. Might be something to think about if that happens too often (at least document that this could be a reason)?

I saw the related update:

    Updating pbjson-build v0.5.0 -> v0.5.1
    Updating pbjson-types v0.5.0 -> v0.5.1

But I didn't yet realize this was a influxdata dependency ;)

@tustvold
Copy link
Contributor

Filed - #3987

@Dandandan Dandandan merged commit 002165b into apache:master Oct 27, 2022
@Dandandan
Copy link
Contributor Author

Filed - #3987

Thanks @tustvold

@ursabot
Copy link

ursabot commented Oct 27, 2022

Benchmark runs are scheduled for baseline = e73a43c and contender = 002165b. 002165b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dandandan added a commit to yuuch/arrow-datafusion that referenced this pull request Nov 5, 2022
…3958)

* Implement right semi join

* Change error a bit

* protobuf

* protobuf

* protobuf

* Change column name to b2

* Rename everything

* Rename & fmt

* Change display to leftanti

* Fix last expected plan

* Commit generated file

* generated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement right semi join and support it in HashBuildProbeOrder
6 participants