-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make filter selectivity for statistics configurable #8243
Changes from 3 commits
77ad090
879987b
8135173
81034c2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -62,6 +62,8 @@ pub struct FilterExec { | |
input: Arc<dyn ExecutionPlan>, | ||
/// Execution metrics | ||
metrics: ExecutionPlanMetricsSet, | ||
/// Selectivity for statistics. 0 = no rows, 100 all rows | ||
default_selectivity: u8, | ||
} | ||
|
||
impl FilterExec { | ||
|
@@ -75,13 +77,25 @@ impl FilterExec { | |
predicate, | ||
input: input.clone(), | ||
metrics: ExecutionPlanMetricsSet::new(), | ||
default_selectivity: 20, | ||
}), | ||
other => { | ||
plan_err!("Filter predicate must return boolean values, not {other:?}") | ||
} | ||
} | ||
} | ||
|
||
pub fn with_default_selectivity( | ||
mut self, | ||
default_selectivity: u8, | ||
) -> Result<Self, DataFusionError> { | ||
if default_selectivity > 100 { | ||
return plan_err!("Default flter selectivity needs to be less than 100"); | ||
} | ||
self.default_selectivity = default_selectivity; | ||
Ok(self) | ||
} | ||
|
||
/// The expression to filter on. This expression must evaluate to a boolean value. | ||
pub fn predicate(&self) -> &Arc<dyn PhysicalExpr> { | ||
&self.predicate | ||
|
@@ -91,6 +105,11 @@ impl FilterExec { | |
pub fn input(&self) -> &Arc<dyn ExecutionPlan> { | ||
&self.input | ||
} | ||
|
||
/// The default selectivity | ||
pub fn default_selectivity(&self) -> u8 { | ||
self.default_selectivity | ||
} | ||
} | ||
|
||
impl DisplayAs for FilterExec { | ||
|
@@ -167,6 +186,10 @@ impl ExecutionPlan for FilterExec { | |
mut children: Vec<Arc<dyn ExecutionPlan>>, | ||
) -> Result<Arc<dyn ExecutionPlan>> { | ||
FilterExec::try_new(self.predicate.clone(), children.swap_remove(0)) | ||
.and_then(|e| { | ||
let selectivity = e.default_selectivity(); | ||
e.with_default_selectivity(selectivity) | ||
}) | ||
.map(|e| Arc::new(e) as _) | ||
} | ||
|
||
|
@@ -197,10 +220,7 @@ impl ExecutionPlan for FilterExec { | |
let input_stats = self.input.statistics()?; | ||
let schema = self.schema(); | ||
if !check_support(predicate, &schema) { | ||
// assume filter selects 20% of rows if we cannot do anything smarter | ||
// tracking issue for making this configurable: | ||
// https://github.com/apache/arrow-datafusion/issues/8133 | ||
let selectivity = 0.2_f64; | ||
let selectivity = self.default_selectivity as f64 / 100.0; | ||
let mut stats = input_stats.into_inexact(); | ||
stats.num_rows = stats.num_rows.with_estimated_selectivity(selectivity); | ||
stats.total_byte_size = stats | ||
|
@@ -994,4 +1014,22 @@ mod tests { | |
|
||
Ok(()) | ||
} | ||
|
||
#[tokio::test] | ||
async fn test_validation_filter_selectivity() -> Result<()> { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shall we also add a test showing that changing the default selectivity actually affects the output statistics? I think if the selectivity got hard coded to 0.2 again, no tests would fail 🤔 Maybe we could add another unit tests here setting selectivity to There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I agree we should, however I didn't know how to observe the effect of such a parameter. Is there some form of observable state or result exposed that I can use to perform an assertion about what selectivity has been used ? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think you should be able to look at the output of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Will try to finish it between today and tomorrow, thanks for the suggestion There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure however how this should be handled There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
That code will be invoked for 'complicated' predicates -- maybe we could fake it with something like |
||
let schema = Schema::new(vec![Field::new("a", DataType::Int32, false)]); | ||
let input = Arc::new(StatisticsExec::new( | ||
Statistics::new_unknown(&schema), | ||
schema, | ||
)); | ||
// WHERE a = 10 | ||
let predicate = Arc::new(BinaryExpr::new( | ||
Arc::new(Column::new("a", 0)), | ||
Operator::Eq, | ||
Arc::new(Literal::new(ScalarValue::Int32(Some(10)))), | ||
)); | ||
let filter = FilterExec::try_new(predicate, input)?; | ||
assert!(filter.with_default_selectivity(120).is_err()); | ||
Ok(()) | ||
} | ||
} |
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to make this a float (
0.2
).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The two main reasons for choosing a uint are the lack of Eq trait implementation for f32, as well as the problem that could arise when serializing numbers that cannot be perfectly represented as f32. If you had already made this consideration and you think f32 is still a better option, let me know and I will proceed