[Part2] Partition and Sort Enforcement, ExecutionPlan enhancement #4043
Conversation
@alamb @andygrove @Dandandan @isidentical @yahoNanJing
Thanks @mingmwang -- I will review this carefully tomorrow
I am sorry -- I ran out of time today -- will try and find time tomorrow
@liukun4515 and @Ted-Jiang perhaps you have some time to help review this as well
Looking very impressive @mingmwang -- thank you very much
My biggest question is: how are the changes to distribution tested? I see code that verifies partitioning (or rather, not partitioning) with UnionExec, but there are changes made to all the other physical operators. For example, what about tests for WindowAggregate, outer joins, and sort merge join?
I saw tests for some of the functions operating on EquivalenceProperties 👍 but not all of them.
I left some style questions about encapsulating EquivalenceProperties that might also help.
So the TLDR is: I think the changes to the physical operators need more tests. Maybe you could break out the equivalence class code into a separate PR?
None,
)?;

// join key ordering is different
👍
}

// TODO check the output ordering of CrossJoin
is this still a todo?
Yeah, I'm not sure whether our CrossJoin implementation can preserve the ordering of the right side or not.
///
/// For example, split "a1 = a2 AND b1 <= b2 AND c1 != c2" into ["a1 = a2", "b1 <= b2", "c1 != c2"]
///
pub fn split_predicate(predicate: &Arc<dyn PhysicalExpr>) -> Vec<&Arc<dyn PhysicalExpr>> {
This is called `split_conjunction` in the logical optimizer -- perhaps it could be called the same thing in the physical layer. The logical expr implementation also avoids creating quite as many `Vec`s.
Sure.
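For reference, the recursive shape of such a splitter can be sketched with a toy expression type (a hypothetical `Expr` enum standing in for `Arc<dyn PhysicalExpr>`; the single accumulator `Vec` is what avoids allocating a new `Vec` per recursive call):

```rust
// Hypothetical simplified sketch of a split_conjunction-style helper.
#[derive(Debug, PartialEq)]
enum Expr {
    And(Box<Expr>, Box<Expr>),
    Pred(&'static str), // a leaf predicate such as "a1 = a2"
}

// Recursively collect the leaves of an AND tree into `out`,
// reusing one Vec instead of allocating per recursion level.
fn split_conjunction<'a>(expr: &'a Expr, out: &mut Vec<&'a Expr>) {
    match expr {
        Expr::And(l, r) => {
            split_conjunction(l, out);
            split_conjunction(r, out);
        }
        other => out.push(other),
    }
}

fn main() {
    // "a1 = a2 AND b1 <= b2 AND c1 != c2"
    let e = Expr::And(
        Box::new(Expr::And(
            Box::new(Expr::Pred("a1 = a2")),
            Box::new(Expr::Pred("b1 <= b2")),
        )),
        Box::new(Expr::Pred("c1 != c2")),
    );
    let mut parts = Vec::new();
    split_conjunction(&e, &mut parts);
    assert_eq!(parts.len(), 3);
}
```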
if contains_first && !contains_second {
    prop.insert(new_condition.1.clone());
    idx1 = idx as i32;
} else if !contains_first && contains_second {
    prop.insert(new_condition.0.clone());
    idx2 = idx as i32;
} else if contains_first && contains_second {
    idx1 = idx as i32;
    idx2 = idx as i32;
    break;
}
You could also use a match statement here and let the compiler check that all important cases are covered:
match (contains_first, contains_second) {
    (true, false) => {
        prop.insert(new_condition.1.clone());
        idx1 = idx as i32;
    }
    (false, true) => {
        prop.insert(new_condition.0.clone());
        idx2 = idx as i32;
    }
    (true, true) => {
        idx1 = idx as i32;
        idx2 = idx as i32;
        break;
    }
    (false, false) => {}
}
@@ -96,12 +96,15 @@ impl ExecutionPlan for CoalesceBatchesExec {
        self.input.output_partitioning()
    }

    // Depends on how the CoalesceBatches was implemented, it is possible to keep
There is also `SortPreservingMerge` that can be used to preserve order, but there are tradeoffs there (specifically, it takes more effort to keep the sort order than it does to append batches together).
@@ -231,6 +246,38 @@ impl RecordBatchStream for FilterExecStream {
    }
}

/// Return the equals Column-Pairs and Non-equals Column-Pairs
fn collect_columns_from_predicate(predicate: &Arc<dyn PhysicalExpr>) -> EqualAndNonEqual {
Perhaps this would be better in utils.rs
Since this is only used by FilterExec, I would prefer to keep this as a private func in filter.rs
Distribution::SinglePartition
fn required_input_distribution(&self) -> Vec<Distribution> {
    if self.partition_keys.is_empty() {
        warn!("No partition defined for WindowAggExec!!!");
I don't know why this would generate a warning -- can't this occur with a query like `SELECT ROW_NUMBER() OVER () FROM foo` (as in an empty over clause)?
Yes, this is a valid case, but the SQL might run very slowly without any `Partition By` clause, due to collapsing to `Distribution::SinglePartition`. I can remove the warning if we think it is useless. There is one optimization we can do here in the future, after we add Range Partitioning (I can work on this maybe next month): when there is no `Partition By` clause but only `Order By`, then depending on the window funcs, for some cases we can make the `required_input_distribution` be `SortDistribution`, so that the `WindowAggExec` can still run in parallel.
I would recommend removing the warning because it isn't clear to me what a user / administrator of the system would do in this case and so the warning will end up as spam in the logs I think.
Perhaps we can just change it to debug!
Sure
} else {
    Distribution::UnspecifiedDistribution
    //TODO support PartitionCollections if there is no common partition columns in the window_expr
    vec![Distribution::HashPartitioned(self.partition_keys.clone())]
👍 I agree this sounds good
    eq_properties: &[EquivalenceProperties],
) -> Arc<dyn PhysicalExpr> {
    let mut normalized = expr.clone();
    if let Some(column) = expr.as_any().downcast_ref::<Column>() {
Does this need to recursively rewrite exprs? Like what if `expr` was `A + B` and you had an equivalence class with `B = C`. Wouldn't you have to rewrite `A + B` into `A + C`? But I don't see this code recursing.
This kind of rewrite could be tested as well I think.
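A minimal sketch of the recursive rewrite being asked about, using a toy expression tree (hypothetical types; the real code operates on `Arc<dyn PhysicalExpr>`, and the map sends each column to its equivalence-class representative, e.g. B -> C):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

// Recursively normalize an expression against equivalence classes,
// so columns nested inside compound expressions are rewritten too.
fn normalize(expr: &Expr, classes: &HashMap<String, String>) -> Expr {
    match expr {
        // rewrite the leaf if its column belongs to an equivalence class
        Expr::Column(name) => match classes.get(name) {
            Some(rep) => Expr::Column(rep.clone()),
            None => expr.clone(),
        },
        // recurse into children
        Expr::Add(l, r) => Expr::Add(
            Box::new(normalize(l, classes)),
            Box::new(normalize(r, classes)),
        ),
    }
}

fn main() {
    let mut classes = HashMap::new();
    classes.insert("B".to_string(), "C".to_string()); // B = C
    let expr = Expr::Add(
        Box::new(Expr::Column("A".into())),
        Box::new(Expr::Column("B".into())),
    );
    // A + B normalizes to A + C
    assert_eq!(
        normalize(&expr, &classes),
        Expr::Add(
            Box::new(Expr::Column("A".into())),
            Box::new(Expr::Column("C".into())),
        )
    );
}
```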
Yes, rewriting recursively is safer. Currently the equal join conditions are just Columns, and for AggregateExec the output_group_expr are also Columns. For WindowAggExec, does DataFusion support Partition By with complex exprs?
does DataFusion support Partition by complex exprs ?
Yes I think so:
DataFusion CLI v13.0.0
❯ create table foo as values (1,2), (3,4), (3,2), (2,1), (null, 0);
❯ select first_value(column1) over (partition by (column2%2) order by column2) from foo;
+--------------------------+
| FIRST_VALUE(foo.column1) |
+--------------------------+
| 2 |
| |
| |
| |
| |
+--------------------------+
}

/// Combine the new equal condition with the existing equivalence properties.
pub fn combine_equivalence_properties(
Good interface design. It can be leveraged by both the Join and the Filter.
    }
}

pub fn remove_equivalence_properties(
Why does the eq_properties contain the non-equal columns?
I will remove the related logic.
let matches = eq_properties.get_mut(match_idx as usize).unwrap();
matches.remove(remove_condition.0);
matches.remove(remove_condition.1);
if matches.len() <= 1 {
This logic may not be correct. For example: originally there are two equivalence properties, left side (l1, l2) and right side (r1, r2). After `combine_equivalence_properties`, they become one equivalence property, (l1, l2, r1, r2). Then we come to `remove_equivalence_properties` with remove condition (l1, r1).
I agree it is confusing. I will remove the remove_equivalence_properties related logic.
/// The output ordering
output_ordering: Option<Vec<PhysicalSortExpr>>,
/// The alias map used to normalize out expressions like Partitioning and PhysicalSortExpr
alias_map: HashMap<Column, Vec<Column>>,
Better to add comments to indicate what the key & value stand for. In my understanding, the key is a column in the input schema of this Projection operator, while the values are the corresponding columns in the output schema of this Projection operator.
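That reading of the map can be illustrated with plain string keys (a hypothetical simplification; the real map uses `Column` values carrying schema indexes):

```rust
use std::collections::HashMap;

// Return all output columns that a given input column maps to.
// For a projection like `SELECT a AS a1, a AS a2, b FROM t`, the key is a
// column of the projection's *input* schema and the values are the *output*
// columns derived from it.
fn output_aliases<'a>(
    alias_map: &'a HashMap<&'a str, Vec<&'a str>>,
    input_col: &str,
) -> &'a [&'a str] {
    alias_map.get(input_col).map(|v| v.as_slice()).unwrap_or(&[])
}

fn main() {
    let mut alias_map: HashMap<&str, Vec<&str>> = HashMap::new();
    alias_map.insert("a", vec!["a1", "a2"]); // `a` is projected twice
    alias_map.insert("b", vec!["b"]);        // `b` passes through unchanged

    // A sort order or partitioning on input column `a` can be restated on
    // either alias, which is what the normalization uses the map for.
    assert_eq!(output_aliases(&alias_map, "a").to_vec(), vec!["a1", "a2"]);
    assert!(output_aliases(&alias_map, "c").is_empty());
}
```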
}

pub fn merge_equivalence_properties_with_alias(
    eq_properties: &mut Vec<EquivalenceProperties>,
The `eq_properties` is the `EquivalenceProperties` of some input of the current operator. Here, the goal of this function is to construct a new `EquivalenceProperties` for the current operator.
for (_idx, prop) in eq_properties.iter_mut().enumerate() {
    if prop.contains(column) {
        for col in columns {
            prop.insert(col.clone());
Although it can be corrected by `truncate_equivalence_properties_not_in_schema`, I still think it's better to construct a new one directly rather than do the merge based on the input `EquivalenceProperties`.
Distribution::UnspecifiedDistribution
/// Specifies the data distribution requirements for all the
/// children of this operator. By default it's [[Distribution::UnspecifiedDistribution]] for each child.
fn required_input_distribution(&self) -> Vec<Distribution> {
Maybe we can just use
vec![Distribution::UnspecifiedDistribution; self.children().len()]
@alamb What do you think: for leaf nodes, should we return an empty `vec![]` here, or return `vec![Distribution::UnspecifiedDistribution]`?
I personally think @yahoNanJing's suggestion of `vec![Distribution::UnspecifiedDistribution; self.children().len()]` would make the intent clearer.
Sure, I will change it in the following PR.
    )
})
.unzip();
vec![
Currently it only supports the exactly matched case. Is it possible to support the partial matching case?
It is possible, but this PR will not include it. Originally I planned to implement such optimizations in Phase 2 with more dynamic Enforcement rules, but that has the risk of introducing skewed joins, and currently we do not have a good way to handle skewed joins.
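To illustrate the distinction being discussed (a sketch with plain string keys, not DataFusion's API): an exact match requires the input's hash-partitioning keys to equal the required keys, while a partial match only needs the input's keys to be a subset of them, since rows that agree on all required keys then still land in the same partition; partitioning on fewer columns is also what raises the skew risk mentioned above.

```rust
// Exact match: the input is hash-partitioned on precisely the required keys.
fn satisfies_exact(required: &[&str], actual: &[&str]) -> bool {
    required == actual
}

// Partial match: every key the input is partitioned on appears among the
// required keys, so rows equal on `required` are equal on `actual` too and
// remain co-located -- no shuffle needed, but skew becomes more likely.
fn satisfies_partial(required: &[&str], actual: &[&str]) -> bool {
    !actual.is_empty() && actual.iter().all(|k| required.contains(k))
}

fn main() {
    assert!(satisfies_exact(&["a", "b"], &["a", "b"]));
    assert!(!satisfies_exact(&["a", "b"], &["a"]));
    // partitioned on (a) alone still satisfies a requirement on (a, b)
    assert!(satisfies_partial(&["a", "b"], &["a"]));
}
```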
JoinType::RightSemi | JoinType::RightAnti => {
    self.right.output_partitioning()
}
JoinType::Left
Should these cases exist when the partition mode is CollectLeft?
@alamb @yahoNanJing
retest please
}

/// An Equivalence Class is a set of Columns that are known to have the same value for all tuples in a relation.
/// Equivalence Classes are generated by equality predicates, typically equijoin conditions and equality conditions in filters.
I like this abstraction and the comments.
/// equality predicates in Join or Filter
pub fn add_equal_conditions(&mut self, new_conditions: (&Column, &Column)) {
    let mut idx1: Option<usize> = None;
    let mut idx2: Option<usize> = None;
An option is much more correct now
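The `Option`-based index logic can be sketched over plain string classes (a hypothetical simplification of `add_equal_conditions`; the real code stores `Column`s):

```rust
use std::collections::HashSet;

// Fold a new equality (a = b) into a list of equivalence classes:
// find which class, if any, contains each side, then insert or merge.
fn add_equal_conditions(classes: &mut Vec<HashSet<String>>, new: (&str, &str)) {
    let idx1 = classes.iter().position(|c| c.contains(new.0));
    let idx2 = classes.iter().position(|c| c.contains(new.1));
    match (idx1, idx2) {
        // both sides already in the same class: nothing to do
        (Some(i), Some(j)) if i == j => {}
        // both sides in different classes: merge them
        (Some(i), Some(j)) => {
            let merged: HashSet<String> = classes[j].drain().collect();
            classes[i].extend(merged);
            classes.remove(j);
        }
        // one side known: add the other to that class
        (Some(i), None) => { classes[i].insert(new.1.to_string()); }
        (None, Some(j)) => { classes[j].insert(new.0.to_string()); }
        // neither side known: start a new class
        (None, None) => {
            classes.push([new.0, new.1].iter().map(|s| s.to_string()).collect());
        }
    }
}

fn main() {
    let mut classes: Vec<HashSet<String>> = Vec::new();
    add_equal_conditions(&mut classes, ("l1", "r1"));
    add_equal_conditions(&mut classes, ("l2", "r2"));
    add_equal_conditions(&mut classes, ("r1", "r2")); // merges the two classes
    assert_eq!(classes.len(), 1);
    assert_eq!(classes[0].len(), 4); // {l1, r1, l2, r2}
}
```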
@@ -472,7 +508,10 @@ pub enum Distribution {
    HashPartitioned(Vec<Arc<dyn PhysicalExpr>>),
Maybe we need to add partition number and schema to the `HashPartitioned` in the future.
LGTM
Should we make the Equivalence Properties schema aware?

It would be great to add this schema constraint. Then we can avoid the ambiguity.
Thanks @mingmwang -- I agree this PR is ready to merge as is.
It would be great to file tickets to track your follow on work (like the TODO about cross joins, etc)
Thanks for getting this great stuff into DataFusion
Arc::new(Column::new_with_schema("c1", &join_schema).unwrap()),
Arc::new(Column::new_with_schema("c2", &join_schema).unwrap()),
];
assert_eq!(
👍
@@ -194,6 +258,73 @@ impl ExecutionPlan for UnionExec {
    }
}

/// CombinedRecordBatchStream can be used to combine a Vec of SendableRecordBatchStreams into one
I feel there was already a piece of code that does this -- maybe @tustvold can remind me 🤔
Hi @alamb, should we merge this PR first so that @mingmwang will be able to continue with part 3 of this unnecessary-shuffling optimization?
Yes absolutely!
Sorry for the delay
Benchmark runs are scheduled for baseline = 238e179 and contender = b7a3331. b7a3331 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Partially Closes #3854.
Closes #3653
Closes #3400
Closes #189
You can see the entire work in #3855
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?