[Part2] Partition and Sort Enforcement, ExecutionPlan enhancement #4043
Conversation
@alamb @andygrove @Dandandan @isidentical @yahoNanJing
Thanks @mingmwang -- I will review this carefully tomorrow
I am sorry -- I ran out of time today -- will try and find time tomorrow
@liukun4515 and @Ted-Jiang perhaps you have some time to help review this as well
Looking very impressive @mingmwang -- thank you very much
My biggest question is: how are the changes to distribution tested? I see code that verifies partitioning (or rather, not partitioning) with UnionExec, but there are changes made to all the other physical operators. For example, what about tests for WindowAggregate, outer joins, and sort merge join?
I saw tests for some of the functions operating on EquivalenceProperties 👍 but not all of them.
I left some style questions about encapsulating EquivalenceProperties that might also help.
So the TLDR is: I think the changes to the physical operators need more tests. Maybe you could break out the equivalence class code into a separate PR?
None,
)?;

// join key ordering is different
👍
}

// TODO check the output ordering of CrossJoin
is this still a todo?
Yeah, I'm not sure whether our CrossJoin implementation can preserve the ordering of the right side or not.
///
/// For example, split "a1 = a2 AND b1 <= b2 AND c1 != c2" into ["a1 = a2", "b1 <= b2", "c1 != c2"]
///
pub fn split_predicate(predicate: &Arc<dyn PhysicalExpr>) -> Vec<&Arc<dyn PhysicalExpr>> {
This is called `split_conjunction` in the logical optimizer -- perhaps it could be called the same thing in the physical layer. The logical expr implementation also avoids creating quite as many `Vec`s.
Sure.
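For reference, the recursive shape of such a splitter can be sketched with a toy expression type (a hypothetical `Expr` enum standing in for `Arc<dyn PhysicalExpr>`; the single accumulator `Vec` is what avoids allocating a new `Vec` per recursive call):

```rust
// Hypothetical simplified sketch of a split_conjunction-style helper.
#[derive(Debug, PartialEq)]
enum Expr {
    And(Box<Expr>, Box<Expr>),
    Pred(&'static str), // a leaf predicate such as "a1 = a2"
}

// Recursively collect the leaves of an AND tree into `out`,
// reusing one Vec instead of allocating per recursion level.
fn split_conjunction<'a>(expr: &'a Expr, out: &mut Vec<&'a Expr>) {
    match expr {
        Expr::And(l, r) => {
            split_conjunction(l, out);
            split_conjunction(r, out);
        }
        other => out.push(other),
    }
}

fn main() {
    // "a1 = a2 AND b1 <= b2 AND c1 != c2"
    let e = Expr::And(
        Box::new(Expr::And(
            Box::new(Expr::Pred("a1 = a2")),
            Box::new(Expr::Pred("b1 <= b2")),
        )),
        Box::new(Expr::Pred("c1 != c2")),
    );
    let mut parts = Vec::new();
    split_conjunction(&e, &mut parts);
    assert_eq!(parts.len(), 3);
}
```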
if contains_first && !contains_second {
    prop.insert(new_condition.1.clone());
    idx1 = idx as i32;
} else if !contains_first && contains_second {
    prop.insert(new_condition.0.clone());
    idx2 = idx as i32;
} else if contains_first && contains_second {
    idx1 = idx as i32;
    idx2 = idx as i32;
    break;
}
You could also use a match statement here and let the compiler check that all important cases are covered:
match (contains_first, contains_second) {
    (true, false) => {
        prop.insert(new_condition.1.clone());
        idx1 = idx as i32;
    }
    (false, true) => {
        prop.insert(new_condition.0.clone());
        idx2 = idx as i32;
    }
    (true, true) => {
        idx1 = idx as i32;
        idx2 = idx as i32;
        break;
    }
    (false, false) => {}
}
@@ -96,12 +96,15 @@ impl ExecutionPlan for CoalesceBatchesExec {
        self.input.output_partitioning()
    }

    // Depends on how the CoalesceBatches was implemented, it is possible to keep
There is also `SortPreservingMerge` that can be used to preserve order, but there are tradeoffs there (specifically, it takes more effort to keep the sort order than it does to append batches together).
@@ -231,6 +246,38 @@ impl RecordBatchStream for FilterExecStream {
    }
}

/// Return the equals Column-Pairs and Non-equals Column-Pairs
fn collect_columns_from_predicate(predicate: &Arc<dyn PhysicalExpr>) -> EqualAndNonEqual {
Perhaps this would be better in utils.rs
Since this is only used by FilterExec, I would prefer to keep this as a private func in filter.rs
Distribution::SinglePartition
fn required_input_distribution(&self) -> Vec<Distribution> {
    if self.partition_keys.is_empty() {
        warn!("No partition defined for WindowAggExec!!!");
I don't know why this would generate a warning -- can't this occur with a query like `SELECT ROW_NUMBER() OVER () FROM foo` (as in an empty over clause)?
Yes, this is a valid case, but the SQL might run very slowly without any `Partition By` clause, due to collapsing to `Distribution::SinglePartition`. I can remove the warning if we think it is useless. There is one optimization we can do here in the future, after we add Range Partitioning (I can work on this maybe next month): when there is no `Partition By` clause but only `Order By`, then depending on the window funcs, for some cases we can make the `required_input_distribution` be `SortDistribution`, so that the `WindowAggExec` can still run in parallel.
I would recommend removing the warning because it isn't clear to me what a user / administrator of the system would do in this case and so the warning will end up as spam in the logs I think.
Perhaps we can just change it to debug!
Sure
} else {
    Distribution::UnspecifiedDistribution
    //TODO support PartitionCollections if there is no common partition columns in the window_expr
    vec![Distribution::HashPartitioned(self.partition_keys.clone())]
👍 I agree this sounds good
    eq_properties: &[EquivalenceProperties],
) -> Arc<dyn PhysicalExpr> {
    let mut normalized = expr.clone();
    if let Some(column) = expr.as_any().downcast_ref::<Column>() {
Does this need to recursively rewrite exprs? Like what if `expr` was `A + B` and you had an equivalence class with `B = C`. Wouldn't you have to rewrite `A + B` into `A + C`? But I don't see this code recursing.
This kind of rewrite could be tested as well I think.
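A minimal sketch of the recursive rewrite being asked about, using a toy expression tree (hypothetical types; the real code operates on `Arc<dyn PhysicalExpr>`, and the map sends each column to its equivalence-class representative, e.g. B -> C):

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

// Recursively normalize an expression against equivalence classes,
// so columns nested inside compound expressions are rewritten too.
fn normalize(expr: &Expr, classes: &HashMap<String, String>) -> Expr {
    match expr {
        // rewrite the leaf if its column belongs to an equivalence class
        Expr::Column(name) => match classes.get(name) {
            Some(rep) => Expr::Column(rep.clone()),
            None => expr.clone(),
        },
        // recurse into children
        Expr::Add(l, r) => Expr::Add(
            Box::new(normalize(l, classes)),
            Box::new(normalize(r, classes)),
        ),
    }
}

fn main() {
    let mut classes = HashMap::new();
    classes.insert("B".to_string(), "C".to_string()); // B = C
    let expr = Expr::Add(
        Box::new(Expr::Column("A".into())),
        Box::new(Expr::Column("B".into())),
    );
    // A + B normalizes to A + C
    assert_eq!(
        normalize(&expr, &classes),
        Expr::Add(
            Box::new(Expr::Column("A".into())),
            Box::new(Expr::Column("C".into())),
        )
    );
}
```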
Yes, rewriting recursively is safer. Currently the equal join conditions are just Columns, and for AggregateExec the output_group_expr are also Columns. For WindowAggExec, does DataFusion support Partition By with complex exprs?
does DataFusion support Partition by complex exprs ?
Yes I think so:
DataFusion CLI v13.0.0
❯ create table foo as values (1,2), (3,4), (3,2), (2,1), (null, 0);
❯ select first_value(column1) over (partition by (column2%2) order by column2) from foo;
+--------------------------+
| FIRST_VALUE(foo.column1) |
+--------------------------+
| 2 |
| |
| |
| |
| |
+--------------------------+
}

/// Combine the new equal condition with the existing equivalence properties.
pub fn combine_equivalence_properties(
Good interface design. It can be leveraged by both the Join and the Filter.
    }
}

pub fn remove_equivalence_properties(
Why does the eq_properties contain the non-equal columns?
I will remove the related logic.
let matches = eq_properties.get_mut(match_idx as usize).unwrap();
matches.remove(remove_condition.0);
matches.remove(remove_condition.1);
if matches.len() <= 1 {
This logic may not be correct. For example: originally there are two equivalence properties, left side (l1, l2) and right side (r1, r2). After `combine_equivalence_properties`, they become one equivalence property, (l1, l2, r1, r2). Then we come to `remove_equivalence_properties` with remove condition (l1, r1).
I agree it is confusing. I will remove the remove_equivalence_properties related logic.
/// The output ordering
output_ordering: Option<Vec<PhysicalSortExpr>>,
/// The alias map used to normalize out expressions like Partitioning and PhysicalSortExpr
alias_map: HashMap<Column, Vec<Column>>,
Better to add comments to indicate what the key & value stand for. In my understanding, the key is a column in the input schema of this Projection operator, while the values are the corresponding columns in the output schema of this Projection operator.
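That reading of the map can be illustrated with plain string keys (a hypothetical simplification; the real map uses `Column` values carrying schema indexes):

```rust
use std::collections::HashMap;

// Return all output columns that a given input column maps to.
// For a projection like `SELECT a AS a1, a AS a2, b FROM t`, the key is a
// column of the projection's *input* schema and the values are the *output*
// columns derived from it.
fn output_aliases<'a>(
    alias_map: &'a HashMap<&'a str, Vec<&'a str>>,
    input_col: &str,
) -> &'a [&'a str] {
    alias_map.get(input_col).map(|v| v.as_slice()).unwrap_or(&[])
}

fn main() {
    let mut alias_map: HashMap<&str, Vec<&str>> = HashMap::new();
    alias_map.insert("a", vec!["a1", "a2"]); // `a` is projected twice
    alias_map.insert("b", vec!["b"]);        // `b` passes through unchanged

    // A sort order or partitioning on input column `a` can be restated on
    // either alias, which is what the normalization uses the map for.
    assert_eq!(output_aliases(&alias_map, "a").to_vec(), vec!["a1", "a2"]);
    assert!(output_aliases(&alias_map, "c").is_empty());
}
```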
}

pub fn merge_equivalence_properties_with_alias(
    eq_properties: &mut Vec<EquivalenceProperties>,
The `eq_properties` is the `EquivalenceProperties` of some input of the current operator. Here, the goal of this function is to construct a new `EquivalenceProperties` for the current operator.
for (_idx, prop) in eq_properties.iter_mut().enumerate() {
    if prop.contains(column) {
        for col in columns {
            prop.insert(col.clone());
Although it can be corrected by `truncate_equivalence_properties_not_in_schema`, I still think it's better to construct a new one directly rather than do the merge based on the input `EquivalenceProperties`.
Distribution::UnspecifiedDistribution
/// Specifies the data distribution requirements for all the
/// children of this operator. By default it's [[Distribution::UnspecifiedDistribution]] for each child.
fn required_input_distribution(&self) -> Vec<Distribution> {
Maybe we can just use
vec![Distribution::UnspecifiedDistribution; self.children().len()]
@alamb What do you think: for leaf nodes, should we return an empty `vec![]` here, or return `vec![Distribution::UnspecifiedDistribution]`?
I personally think @yahoNanJing's suggestion of `vec![Distribution::UnspecifiedDistribution; self.children().len()]` would make the intent clearer.
Sure, I will change it in the following PR.
    )
})
.unzip();
vec![
Currently it only supports the exactly matched case. Is it possible to support the partial matching case?
It is possible, but this PR will not include it. Originally I planned to implement such optimizations in Phase 2 with more dynamic Enforcement rules, but that has the risk of introducing skewed joins, and currently we do not have a good way to handle skewed joins.
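To illustrate the distinction being discussed (a sketch with plain string keys, not DataFusion's API): an exact match requires the input's hash-partitioning keys to equal the required keys, while a partial match only needs the input's keys to be a subset of them, since rows that agree on all required keys then still land in the same partition; partitioning on fewer columns is also what raises the skew risk mentioned above.

```rust
// Exact match: the input is hash-partitioned on precisely the required keys.
fn satisfies_exact(required: &[&str], actual: &[&str]) -> bool {
    required == actual
}

// Partial match: every key the input is partitioned on appears among the
// required keys, so rows equal on `required` are equal on `actual` too and
// remain co-located -- no shuffle needed, but skew becomes more likely.
fn satisfies_partial(required: &[&str], actual: &[&str]) -> bool {
    !actual.is_empty() && actual.iter().all(|k| required.contains(k))
}

fn main() {
    assert!(satisfies_exact(&["a", "b"], &["a", "b"]));
    assert!(!satisfies_exact(&["a", "b"], &["a"]));
    // partitioned on (a) alone still satisfies a requirement on (a, b)
    assert!(satisfies_partial(&["a", "b"], &["a"]));
}
```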
JoinType::RightSemi | JoinType::RightAnti => {
    self.right.output_partitioning()
}
JoinType::Left
Should these cases exist when the partition mode is CollectLeft?
@alamb @yahoNanJing
retest please
}

/// An Equivalence Class is a set of Columns that are known to have the same value for all tuples in a relation.
/// Equivalence Classes are generated by equality predicates, typically equijoin conditions and equality conditions in filters.
I like this abstraction and the comments.
/// equality predicates in Join or Filter
pub fn add_equal_conditions(&mut self, new_conditions: (&Column, &Column)) {
    let mut idx1: Option<usize> = None;
    let mut idx2: Option<usize> = None;
An option is much more correct now
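The `Option`-based index logic can be sketched over plain string classes (a hypothetical simplification of `add_equal_conditions`; the real code stores `Column`s):

```rust
use std::collections::HashSet;

// Fold a new equality (a = b) into a list of equivalence classes:
// find which class, if any, contains each side, then insert or merge.
fn add_equal_conditions(classes: &mut Vec<HashSet<String>>, new: (&str, &str)) {
    let idx1 = classes.iter().position(|c| c.contains(new.0));
    let idx2 = classes.iter().position(|c| c.contains(new.1));
    match (idx1, idx2) {
        // both sides already in the same class: nothing to do
        (Some(i), Some(j)) if i == j => {}
        // both sides in different classes: merge them
        (Some(i), Some(j)) => {
            let merged: HashSet<String> = classes[j].drain().collect();
            classes[i].extend(merged);
            classes.remove(j);
        }
        // one side known: add the other to that class
        (Some(i), None) => { classes[i].insert(new.1.to_string()); }
        (None, Some(j)) => { classes[j].insert(new.0.to_string()); }
        // neither side known: start a new class
        (None, None) => {
            classes.push([new.0, new.1].iter().map(|s| s.to_string()).collect());
        }
    }
}

fn main() {
    let mut classes: Vec<HashSet<String>> = Vec::new();
    add_equal_conditions(&mut classes, ("l1", "r1"));
    add_equal_conditions(&mut classes, ("l2", "r2"));
    add_equal_conditions(&mut classes, ("r1", "r2")); // merges the two classes
    assert_eq!(classes.len(), 1);
    assert_eq!(classes[0].len(), 4); // {l1, r1, l2, r2}
}
```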
@@ -472,7 +508,10 @@ pub enum Distribution {
    HashPartitioned(Vec<Arc<dyn PhysicalExpr>>),
Maybe we need to add partition number and schema to the `HashPartitioned` in the future.
LGTM
Should we make the Equivalence Properties schema aware?

It would be great to add this schema constraint. Then we can avoid the ambiguity.
Thanks @mingmwang -- I agree this PR is ready to merge as is.
It would be great to file tickets to track your follow on work (like the TODO about cross joins, etc)
Thanks for getting this great stuff into DataFusion
Arc::new(Column::new_with_schema("c1", &join_schema).unwrap()),
Arc::new(Column::new_with_schema("c2", &join_schema).unwrap()),
];
assert_eq!(
👍
@@ -194,6 +258,73 @@ impl ExecutionPlan for UnionExec {
    }
}

/// CombinedRecordBatchStream can be used to combine a Vec of SendableRecordBatchStreams into one
I feel there was already a piece of code that does this -- maybe @tustvold can remind me 🤔
Hi @alamb, should we merge this PR first so that @mingmwang will be able to continue with part 3 of this unnecessary-shuffling optimization?
Yes absolutely!
Sorry for the delay
Benchmark runs are scheduled for baseline = 238e179 and contender = b7a3331. b7a3331 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Partially Closes #3854.
Closes #3653
Closes #3400
Closes #189
You can see the entire work in #3855
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?