Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-12173][table] Optimize SELECT DISTINCT #25752

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

yiyutian1
Copy link

@yiyutian1 yiyutian1 commented Dec 5, 2024

What is the purpose of the change

This is to optimize SELECT DISTINCT query from using GroupAgg to DeDuplicate.

Brief change log

This PR implements a new Calcite rule that does the following:

  • Check if the user is running SELECT DISTINCT query.
  • Originally we want convert the current plan from StreamPhysicalDeduplicate instead of StreamPhysicalGroupAggregateRule in rowtime. After discussion in OSS community, Lincoln in OSS decided to refactor Deduplicate optimization to defer to StreamPhysicalRank for valid StreamExecDeduplicate node conversion to avoid exceptions. PR
    So instead, we are optimizing by converting the current plan from FlinkLogicalRank instead of FlinkLogicalAggregate in rowtime, which has the similar effect as of the original plan.
  • Modify existing tests to reflect the new optimizer.
  • We decided that because of this bug, https://issues.apache.org/jira/browse/FLINK-35792, we will avoid optimizing for this particular case.

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change is already covered by existing tests in the table/planner module. For example:
flink-table/flink-table-planner/src/test/resources/org/apache/flink/table/planner/plan/common/PartialInsertTest.xml

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): ( no )
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: ( no )
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@flinkbot
Copy link
Collaborator

flinkbot commented Dec 5, 2024

CI report:

Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

@yiyutian1 yiyutian1 marked this pull request as ready for review December 6, 2024 19:39
* e.g. {SELECT DISTINCT a, b, c;} will be converted to [[FlinkLogicalRank]] instead of
* [[FlinkLogicalAggregate]] in rowtime.
*/
class StreamLogicalOptimizeSelectDistinctRule
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yiyutian1 and I worked on this together and we started with a Scala example.

Given that we are working on migrating the Scala rules to Java, could you take a look at migrating this rule to Java?

Copy link
Author

@yiyutian1 yiyutian1 Dec 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment Jim!
Are we migrating from Scala->Java for converters now too, or are we mainly doing it for builtInFunctions?

My impression is that for buildInFunctions we want that migration because auto-generated Java code is hard to maintain, but that doesn't seem to be a problem here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@snuyanzin , could you provide some feedback here?
Could we try merging this ticket as is, so that the optimizer can be available soon, and then we can do the migration in a separate effort?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great to see green ci first

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The CI is green! Woohoo!
Could we get some feedback? @snuyanzin
Many thanks.

@yiyutian1
Copy link
Author

@flinkbot run azure

@yiyutian1
Copy link
Author

@flinkbot run azure

2 similar comments
@yiyutian1
Copy link
Author

@flinkbot run azure

@jnh5y
Copy link
Contributor

jnh5y commented Dec 9, 2024

@flinkbot run azure

@yiyutian1
Copy link
Author

@flinkbot run azure

1 similar comment
@jnh5y
Copy link
Contributor

jnh5y commented Dec 9, 2024

@flinkbot run azure

@yiyutian1 yiyutian1 requested a review from snuyanzin December 10, 2024 14:03
@yiyutian1
Copy link
Author

Hi @xuyangzhong, @wuchong @lincoln-lil, could I get your feedback on this PR?
I don't have access to add you as reviewers, therefore pinging here. Your in-time feedback will be deeply appreciated.

@snuyanzin
Copy link
Contributor

@xuyangzhong I noticed you discussed some of this changes with @jnh5y in FLINK-35792
Would be great to see your thoughts here as well especially about usage of physical rule in logical set (I'm still confused by this)

//cc @lincoln-lil (i think you might be interested in this as well)

* e.g. {SELECT DISTINCT a, b, c;} will be converted to [[FlinkLogicalRank]] instead of
* [[FlinkLogicalAggregate]] in rowtime.
*/
class StreamLogicalOptimizeSelectDistinctRule
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally it would be great to have it converted to java in this PR.
I looked into some previous PR and this is how it was: if a rule was new then it was requested to have it in java
if it is an old one then it's ok to convert it in a separate PR

Comment on lines +1018 to +1019
|SELECT DISTINCT a, b, c
|FROM MyTable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any IT case for that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants