Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix join order for TPCH Q17 & Q18 by improving FilterExec statistics #8126

Merged
merged 11 commits into from
Nov 12, 2023
6 changes: 4 additions & 2 deletions datafusion/physical-plan/src/filter.rs
Original file line number Diff line number Diff line change
Expand Up @@ -194,11 +194,13 @@ impl ExecutionPlan for FilterExec {
fn statistics(&self) -> Result<Statistics> {
let predicate = self.predicate();

let input_stats = self.input.statistics()?;
let schema = self.schema();
if !check_support(predicate, &schema) {
return Ok(Statistics::new_unknown(&schema));
// assume worst case, that the filter is highly selective and
andygrove marked this conversation as resolved.
Show resolved Hide resolved
// returns all the rows from its input
return Ok(input_stats.clone().into_inexact());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can make a slightly different assumption that is a better metric, e.g. each filter returning 50% or 20% of input rows?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a configuration option to control the default selectivity. I'll take a look.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting idea. Since statistics will be Inexact, it should never result in an incorrect output, but may improve average-case complexity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like making this configurable will be a larger change. I filed #8133 and linked to it from the comment here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The talk Join Order Optimization with (almost) no Statistics is focused on full join reordering rather than just choosing the build side of a join but talks about selectivity estimates and is very relevant to this discussion. They found that selectivity of 0.2 worked well with TPC-H.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I pushed a change to use 0.2 as the default, and now Q18 has an improved join order as well. I updated the results in the PR description.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default selectivities / cost estimates work ok for TPCH queries where the data is relatively uniformly distributed.

However, in general in my experience they tend to cause problems when the data is skewed or has correlations between the columns.

Hopefully we'll be able to keep the number of hard coded constants / assumptions low in DataFusion (so there are fewer things for the optimizer to get wrong :) )

}
let input_stats = self.input.statistics()?;

let num_rows = input_stats.num_rows;
let total_byte_size = input_stats.total_byte_size;
Expand Down