Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize PushDownFilter to avoid recreating schema columns #11211

Merged
merged 2 commits into from
Jul 5, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 2, 2024

Which issue does this PR close?

Follow on to #11203

Rationale for this change

While reviewing #11203 I noticed that a HashSet<Column> (which copies a string for each column in the DFSchema) was re-created for:

  1. Each filter / on predicate for
  2. each input

What changes are included in this PR?

Create the HashSet once per join input rather than once per predicate per join input

Are these changes tested?

Functionally: Covered by existing CI

Performance: minor reported improvements (1-2% maybe)

Details

group                                         main                                   optimize_pushdown
-----                                         ----                                   -----------------
logical_aggregate_with_join                   1.00  1034.9±24.67µs        ? ?/sec    1.00  1038.8±11.80µs        ? ?/sec
logical_plan_tpcds_all                        1.00    156.1±1.18ms        ? ?/sec    1.00    155.5±1.14ms        ? ?/sec
logical_plan_tpch_all                         1.01     17.4±0.22ms        ? ?/sec    1.00     17.2±0.23ms        ? ?/sec
logical_select_all_from_1000                  1.00     17.9±0.12ms        ? ?/sec    1.00     17.9±0.11ms        ? ?/sec
logical_select_one_from_700                   1.00    832.1±7.97µs        ? ?/sec    1.00    832.4±9.22µs        ? ?/sec
logical_trivial_join_high_numbered_columns    1.00    780.7±7.75µs        ? ?/sec    1.00    783.4±8.61µs        ? ?/sec
logical_trivial_join_low_numbered_columns     1.00   772.1±18.00µs        ? ?/sec    1.00    770.0±8.78µs        ? ?/sec
physical_plan_tpcds_all                       1.03   1101.9±7.83ms        ? ?/sec    1.00  1066.8±10.18ms        ? ?/sec
physical_plan_tpch_all                        1.02     73.8±0.64ms        ? ?/sec    1.00     72.2±1.09ms        ? ?/sec
physical_plan_tpch_q1                         1.00      2.6±0.03ms        ? ?/sec    1.00      2.6±0.04ms        ? ?/sec
physical_plan_tpch_q10                        1.03      3.7±0.03ms        ? ?/sec    1.00      3.6±0.03ms        ? ?/sec
physical_plan_tpch_q11                        1.02      3.1±0.03ms        ? ?/sec    1.00      3.1±0.04ms        ? ?/sec
physical_plan_tpch_q12                        1.02      2.5±0.03ms        ? ?/sec    1.00      2.4±0.02ms        ? ?/sec
physical_plan_tpch_q13                        1.01  1837.3±15.75µs        ? ?/sec    1.00  1823.3±27.17µs        ? ?/sec
physical_plan_tpch_q14                        1.01      2.0±0.02ms        ? ?/sec    1.00      2.0±0.01ms        ? ?/sec
physical_plan_tpch_q16                        1.03      3.1±0.03ms        ? ?/sec    1.00      3.0±0.02ms        ? ?/sec
physical_plan_tpch_q17                        1.03      2.8±0.03ms        ? ?/sec    1.00      2.8±0.02ms        ? ?/sec
physical_plan_tpch_q18                        1.02      3.3±0.05ms        ? ?/sec    1.00      3.3±0.03ms        ? ?/sec
physical_plan_tpch_q19                        1.02      4.9±0.05ms        ? ?/sec    1.00      4.8±0.04ms        ? ?/sec
physical_plan_tpch_q2                         1.04      6.5±0.06ms        ? ?/sec    1.00      6.3±0.08ms        ? ?/sec
physical_plan_tpch_q20                        1.02      3.7±0.06ms        ? ?/sec    1.00      3.6±0.02ms        ? ?/sec
physical_plan_tpch_q21                        1.05      5.2±0.06ms        ? ?/sec    1.00      5.0±0.05ms        ? ?/sec
physical_plan_tpch_q22                        1.02      2.8±0.02ms        ? ?/sec    1.00      2.7±0.03ms        ? ?/sec
physical_plan_tpch_q3                         1.02      2.6±0.03ms        ? ?/sec    1.00      2.6±0.02ms        ? ?/sec
physical_plan_tpch_q4                         1.01  1995.3±20.91µs        ? ?/sec    1.00  1981.5±53.19µs        ? ?/sec
physical_plan_tpch_q5                         1.02      3.8±0.04ms        ? ?/sec    1.00      3.7±0.03ms        ? ?/sec
physical_plan_tpch_q6                         1.00  1333.0±11.66µs        ? ?/sec    1.00  1330.0±11.96µs        ? ?/sec
physical_plan_tpch_q7                         1.03      4.8±0.06ms        ? ?/sec    1.00      4.6±0.05ms        ? ?/sec
physical_plan_tpch_q8                         1.04      6.0±0.05ms        ? ?/sec    1.00      5.8±0.07ms        ? ?/sec
physical_plan_tpch_q9                         1.02      4.5±0.04ms        ? ?/sec    1.00      4.4±0.05ms        ? ?/sec
physical_select_all_from_1000                 1.01     45.0±0.17ms        ? ?/sec    1.00     44.7±0.18ms        ? ?/sec
physical_select_one_from_700                  1.00      3.4±0.02ms        ? ?/sec    1.00      3.4±0.02ms        ? ?/sec

Are there any user-facing changes?

Hopefully faster planning

@github-actions github-actions bot added the optimizer Optimizer rules label Jul 2, 2024
@@ -285,16 +321,7 @@ fn extract_or_clauses_for_join<'a>(
filters: &'a [Expr],
schema: &'a DFSchema,
) -> impl Iterator<Item = Expr> + 'a {
let schema_columns = schema
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was replicated, so I moved it into its own function

@alamb alamb marked this pull request as ready for review July 2, 2024 12:44
for predicate in predicates {
if left_preserved && can_pushdown_join_predicate(&predicate, left_schema)? {
if left_preserved && checker.is_left_only(&predicate) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of this PR is that ever call to can_pushdown_join_predicatere-created the same HashSet<Column> for the left or right schema

Now the creation is done one

Copy link
Member

@jonahgao jonahgao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I checked that this refactoring has consistent behavior with previous.

@alamb
Copy link
Contributor Author

alamb commented Jul 5, 2024

Thank you for the review @jonahgao

@alamb alamb merged commit b46d5b7 into apache:main Jul 5, 2024
23 checks passed
comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Jul 8, 2024
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants