Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make filter selectivity for statistics configurable #8243

Merged
merged 4 commits into from
Dec 5, 2023

Conversation

edmondop
Copy link
Contributor

@edmondop edmondop commented Nov 17, 2023

Close #8133

Expose an API to set a default value for filter selectivity when the exact value cannot be computed. This value is also exposed as a part of the configuration as datafusion.optimizer.default_filter_selectivity

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions optimizer Optimizer rules core Core DataFusion crate substrait labels Nov 17, 2023
@edmondop
Copy link
Contributor Author

@andygrove I do have a super early draft broken implementation, but it's already enough to ask your view on what's the right thing to do with SQL and when the filter is introduced as a part of an optimization but not as an optimization of an existing filter. Could you please review my comments?

@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Nov 20, 2023
@github-actions github-actions bot removed sql SQL Planner optimizer Optimizer rules substrait labels Nov 20, 2023
@edmondop edmondop changed the title Wip Make filter selectivity for statistics configurable Nov 20, 2023
@edmondop edmondop force-pushed the issue-8133 branch 6 times, most recently from 5b01f5e to cbf9d36 Compare November 20, 2023 21:59
@edmondop edmondop force-pushed the issue-8133 branch 3 times, most recently from 1e9c8d3 to 3609e1b Compare November 21, 2023 02:42
@edmondop edmondop marked this pull request as ready for review November 21, 2023 02:47
@edmondop edmondop requested a review from Dandandan November 21, 2023 02:48
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 21, 2023
/// The default filter selectivity used by Filter Statistics
/// when an exact selectivity cannot be determined. Valid values are
/// between 0 (no selectivity) and 100 (all rows are selected).
pub default_filter_selectivity: u8, default = 20
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to make this a float (0.2).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two main reasons for choosing a uint are the lack of Eq trait implementation for f32, as well as the problem that could arise when serializing numbers that cannot be perfectly represented as f32. If you had already made this consideration and you think f32 is still a better option, let me know and I will proceed

@Dandandan
Copy link
Contributor

Overall looks very good! I think it might make more sense to use a f32 for this :)

@Dandandan
Copy link
Contributor

FYI @alamb @andygrove

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @edmondop and @Dandandan -- this looks pretty neat.

@@ -994,4 +1014,22 @@ mod tests {

Ok(())
}

#[tokio::test]
async fn test_validation_filter_selectivity() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we also add a test showing that changing the default selectivity actually affects the output statistics?

I think if the selectivity got hard coded to 0.2 again, no tests would fail 🤔 Maybe we could add another unit tests here setting selectivity to 0.5 or something and demonstrating the statistics are different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should, however I didn't know how to observe the effect of such a parameter. Is there some form of observable state or result exposed that I can use to perform an assertion about what selectivity has been used ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to look at the output of FilterExec::statistics() and the row number estimates will change with different value of selectivity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will try to finish it between today and tomorrow, thanks for the suggestion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb fixed with 81034c2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure however how this should be handled

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure however how this should be handled

That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like sin(x) = 4.0.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @edmondop -- this looks good to me.

@@ -994,4 +1014,22 @@ mod tests {

Ok(())
}

#[tokio::test]
async fn test_validation_filter_selectivity() -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure however how this should be handled

That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like sin(x) = 4.0.

));
let filter = FilterExec::try_new(predicate, input)?;
let statistics = filter.statistics()?;
assert_eq!(statistics.num_rows, Precision::Inexact(200));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌 very nice

@alamb alamb merged commit c7a6965 into apache:main Dec 5, 2023
23 checks passed
appletreeisyellow pushed a commit to appletreeisyellow/datafusion that referenced this pull request Dec 15, 2023
* Turning filter selectivity as a configurable parameter

* Renaming API to be more consistent with struct value

* Adding a filter with custom selectivity
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make default filter selectivity estimate configurable
3 participants