Cherry pick "Derive Combined Hashed Spec For Outer Joins" #804

Merged
merged 27 commits into apache:main from cherry-pick-orca-in-path-order-2
Dec 31, 2024

Conversation

jiaqizho
Contributor

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


@jiaqizho changed the title Cherry pick "Derive Combined Hashed Spec For Outer Joins" → [DNM]Cherry pick "Derive Combined Hashed Spec For Outer Joins" Dec 20, 2024
@my-ship-it added the cherry-pick (cherry-pick upstream commits) label Dec 20, 2024
@jiaqizho force-pushed the cherry-pick-orca-in-path-order-2 branch 3 times, most recently from 6cf4434 to 005dcae December 26, 2024 05:51
@jiaqizho changed the title [DNM]Cherry pick "Derive Combined Hashed Spec For Outer Joins" → Cherry pick "Derive Combined Hashed Spec For Outer Joins" Dec 26, 2024
@jiaqizho force-pushed the cherry-pick-orca-in-path-order-2 branch 4 times, most recently from 5d30d7c to e74f9fc December 30, 2024 08:39
avamingli previously approved these changes Dec 30, 2024
THANATOSLAVA and others added 14 commits December 30, 2024 17:30
Issue: Outer join operations enforce unnecessary data redistribution, causing ORCA plan execution to be much longer than planner execution.

Root cause: Unlike inner join operators, outer join operators only derive a hashed distribution spec from one of the two relations. Child nodes not delivering all the distribution properties led to parent nodes enforcing unnecessary data redistribution.

Solution: To mimic inner join distribution spec derivation, derive a combined hashed spec for outer join operations from both relations. E.g., a 10-relation outer join delivers a combined hashed spec with 10 equivalent specs (including its own). See the sketch after the implementation list.

Implementation:
1. [CPhysicalLeftOuterHashJoin] -- Override PdsDerive (distribution spec derivation) from CPhysicalJoin. Add a case where both the outer and inner relations are hash distributed, and return a combined distribution spec. Since NULLs are only added to unmatched rows, set NullsColocated to false for all equivalent distribution specs of the inner relation.
2. [CPhysicalHashJoin] -- Set NullsColocated to false when requesting or matching the hashed distribution spec.
3. [CDistributionSpecHashed] -- Rewrite the Combine function for hashed distribution specs using linked-list concatenation.
4. [CDistributionSpecHashed] -- Rewrite the Copy function with recursion to ensure a deep copy.
5. [CDistributionSpecHashed] -- Add a Copy function that allows configuring fNullsColocated.
6. [CDistributionSpecHashed] -- Enforce nulls colocation for hash redistribution. This is necessary when the non-null hash distribution request is not met.
7. [CPhysicalFullMergeJoin] -- Fix PdsDerive (distribution spec derivation). In full joins, both tables are outer tables; the join output is hash distributed by non-NULL join keys.
8. [CDistributionSpecTest] -- Add function tests for hash spec combination and copy.
9. [regress] -- Update regression test output. Verified plan equivalency.
10. [minidump] -- MDP plan shape updates: LOJNonNullRejectingPredicates, LOJReorderWithSimplePredicate, Remove-Distinct-From-Subquery. The rest are SpaceSize and scan order changes. Add LeftJoinNullsNotColocated.
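
A hedged SQL sketch of the intended effect, using hypothetical tables (names and schema are illustrative, not taken from this PR):

```sql
-- Each table is distributed by its join key.
CREATE TABLE f  (a int) DISTRIBUTED BY (a);
CREATE TABLE d1 (b int) DISTRIBUTED BY (b);
CREATE TABLE d2 (c int) DISTRIBUTED BY (c);

-- With the combined hashed spec, (f LEFT JOIN d1) is known to be hash
-- distributed on f.a and, equivalently, on d1.b (with NULLs not colocated),
-- so the second left join should not need a Redistribute Motion.
EXPLAIN SELECT *
FROM f
LEFT JOIN d1 ON f.a = d1.b
LEFT JOIN d2 ON d1.b = d2.c;
```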

Co-authored-by: Jingyu Wang <[email protected]>
Following scenario led to crash due to missing statistics:
  ```sql
  CREATE TABLE t1 (c11 varchar, c12 numeric(15,4));
  CREATE TABLE t2 (c2 varchar);
  CREATE TABLE t3 (c3 varchar);

  SET allow_system_table_mods=true;

  UPDATE pg_class SET relpages = 97399::int, reltuples = 9106730.0::real, relallvisible = 0::int WHERE relname = 't1';
  UPDATE pg_class SET relpages = 68553::int, reltuples = 7054520.0::real, relallvisible = 0::int WHERE relname = 't2';

  SET optimizer_join_order=exhaustive;
  SELECT
       (SELECT c11 FROM t1) AS column1,
       (SELECT sum(c12)
          FROM t1
                  INNER JOIN t2 ON c11 = c2
                  INNER JOIN t3 ON c2 = c3
                  INNER JOIN t3 a1 ON a1.c3 = a2.c3
                  LEFT OUTER JOIN t3 a3 ON a1.c3 = a3.c3
                  LEFT OUTER JOIN t3 a4 ON a1.c3 = a4.c3
        ) AS column2
  FROM t3 a2;
  ```

The underlying cause is that derive and reset for group stats were not
symmetric. In the "exhaustive" case, multiple xforms may be run on the
same group, each deriving stats before applying and resetting stats
afterward. Prior to this commit it was possible to have a group with
"dirty" stats, where the child nodes may have been cleaned up but the
group still technically had a stats object. If that group was a duplicate
of another group, it was possible to "trick" the other group into
believing that its stats were already derived. That fake news could lead
to a crash.

Co-authored-by: Jingyu Wang <[email protected]>
During optimization of CTEs for replicated tables, the Sequence operator optimizes the first child with an ANY distribution requirement and computes the distribution request on the other children based on the derived distribution of the first child: if the first child's distribution is Singleton, it requests Singleton on all children; if it is non-Singleton, it requests non-Singleton on all children. When the first child was Replicated/TaintedReplicated we still requested non-Singleton, so the optimizer added a Redistribute Motion on top of the second child, creating a wrong plan and causing the query to hang. We now request non-Singleton without enforcers when the first child is non-Singleton, non-Universal, and Replicated/TaintedReplicated, which avoids adding the Redistribute Motion on top of the second child.
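
A hedged repro sketch, assuming a hypothetical replicated table and a CTE consumed on both sides of a join (the actual schema is not part of this commit message):

```sql
-- DISTRIBUTED REPLICATED makes the Shared Scan's slice
-- Replicated/TaintedReplicated, which is the case this fix targets.
CREATE TABLE testtable (name text, id int) DISTRIBUTED REPLICATED;

WITH w AS (
    SELECT name, max(id) OVER (PARTITION BY name) AS m FROM testtable
)
SELECT *
FROM (SELECT name AS tblnm FROM w) x
LEFT JOIN (SELECT count(*)::text AS tblnm FROM w) y ON x.tblnm = y.tblnm;
```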

Old plan:
                             QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Gather Motion 3:1 (slice4; segments: 3) (cost=0.00..1293.00 rows=1 width=24)
  -> Sequence (cost=0.00..1293.00 rows=1 width=24)
     -> Shared Scan (share slice:id 4:0) (cost=0.00..431.00 rows=1 width=1)
        -> Materialize (cost=0.00..431.00 rows=1 width=1)
           -> WindowAgg (cost=0.00..431.00 rows=1 width=16)
              Partition By: testtable.name
              -> Sort (cost=0.00..431.00 rows=1 width=5)
                 Sort Key: testtable.name
                 -> Seq Scan on testtable (cost=0.00..431.00 rows=1 width=5)
     -> Redistribute Motion 1:3 (slice3) (cost=0.00..862.00 rows=1 width=24)
        -> Hash Left Join (cost=0.00..862.00 rows=1 width=24)
            Hash Cond: ("outer".tblnm = pg_catalog.textin(unknownout("outer".tblnm), ''::void, (-1)))
           -> Result (cost=0.00..431.00 rows=1 width=8)
              -> Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..431.00 rows=1 width=1)
                 -> Result (cost=0.00..431.00 rows=1 width=1)
                    -> Shared Scan (share slice:id 1:0) (cost=0.00..431.00 rows=1 width=1)
           -> Hash (cost=431.00..431.00 rows=1 width=16)
              -> Result (cost=0.00..431.00 rows=1 width=16)
                 -> Aggregate (cost=0.00..431.00 rows=1 width=8)
                    -> Gather Motion 1:1 (slice2; segments: 1) (cost=0.00..431.00 rows=1 width=1)
                       -> Result (cost=0.00..431.00 rows=1 width=1)
                          -> Shared Scan (share slice:id 2:0) (cost=0.00..431.00 rows=1 width=1)
 Optimizer: Pivotal Optimizer (GPORCA)
(23 rows)

New Plan:
                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Sequence (cost=0.00..1293.00 rows=1 width=24) (actual time=1.120..1.120 rows=0 loops=1)
  -> Shared Scan (share slice:id 0:0) (cost=0.00..431.00 rows=1 width=1) (actual time=0.708..0.708 rows=0 loops=1)
     -> Materialize (cost=0.00..431.00 rows=1 width=1) (actual time=0.706..0.707 rows=0 loops=1)
        -> Gather Motion 1:1 (slice1; segments: 1) (cost=0.00..431.00 rows=1 width=16) (actual time=0.697..0.697 rows=0 loops=1)
           -> WindowAgg (cost=0.00..431.00 rows=1 width=16) (never executed)
              Partition By: testtable.name
              -> Sort (cost=0.00..431.00 rows=1 width=10) (never executed)
                 Sort Key: testtable.name
                 Sort Method: quicksort Memory: 33kB
                 -> Seq Scan on testtable (cost=0.00..431.00 rows=1 width=10) (never executed)
  -> Hash Left Join (cost=0.00..862.00 rows=1 width=24) (actual time=0.410..0.410 rows=0 loops=1)
      Hash Cond: ("outer".tblnm = pg_catalog.textin(unknownout("outer".tblnm), ''::void, (-1)))
     Extra Text: Hash chain length 1.0 avg, 1 max, using 1 of 65536 buckets.
     -> Result (cost=0.00..431.00 rows=1 width=8) (actual time=0.001..0.001 rows=0 loops=1)
        -> Shared Scan (share slice:id 0:0) (cost=0.00..431.00 rows=1 width=1) (actual time=0.001..0.001 rows=0 loops=1)
     -> Hash (cost=431.00..431.00 rows=1 width=16) (actual time=0.014..0.014 rows=1 loops=1)
        Buckets: 65536 Batches: 1 Memory Usage: 1kB
        -> Result (cost=0.00..431.00 rows=1 width=16) (actual time=0.006..0.006 rows=1 loops=1)
           -> Aggregate (cost=0.00..431.00 rows=1 width=8) (actual time=0.004..0.004 rows=1 loops=1)
              -> Shared Scan (share slice:id 0:0) (cost=0.00..431.00 rows=1 width=1) (actual time=0.002..0.002 rows=0 loops=1)
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution time: 1.800 ms

Co-authored-by: Hari krishna Maddileti <[email protected]>
Commit 2d49b616fe updated memo group reset to include resetting a group's
duplicate. Previously, group reset would recursively traverse only the
children. However, by also traversing duplicates it became possible to
form cyclic reset paths in the memo (e.g. a group's child is a duplicate
of the parent group). This can lead to an infinite reset loop.

Admittedly, this patch should only be a temporary solution. The intent of
FResetStats() is to reset only if logical operators were added to any
group reachable from the reset group. In order to do that we must first
search the children before resetting ourselves. Ultimately, we need a way
to properly detect cycles.

Co-authored-by: Jingyu Wang <[email protected]>
Issue: The community reported a regression: post-fc662ea plans had redundant redistribution motions in inner joins.

Root cause: A blanket change of Nulls Colocation to false in computing a matching hashed distribution spec.

Solution: In matching a hashed distribution spec in inner join operations, set Nulls Colocation to true; in matching a hashed distribution spec in outer join operations, set Nulls Colocation to false. This reflects the Nulls Colocation property required for / delivered by the outer relation in hash join operations.

Implementation:
[CPhysicalHashJoin] -- Require Nulls Colocation in spec matching for inner joins, and non-Nulls Colocation for outer joins.
[CPhysicalLeftOuterHashJoin] -- Add a TODO comment: a left outer join should be able to return a combined hash spec even when only one relation is hash distributed.
[minidump] -- Space size change only. Added the user's example to verify that an inner join matches the outer relation's Nulls Colocation.
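
A hedged illustration of the regression, reusing the hypothetical tables from the earlier sketch; an inner join on already-colocated keys should not pick up a motion merely because the matching spec was blanket-marked with NullsColocated = false:

```sql
-- Both sides are hash distributed on the join key with NULLs colocated,
-- so the hashed specs should match and no Redistribute Motion is needed.
EXPLAIN SELECT * FROM f INNER JOIN d1 ON f.a = d1.b;
```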

Co-authored-by: Jingyu Wang <[email protected]>
Postgres commit 578b229 (from Postgres 12 merge) removed WITH OIDS
support. That eliminated the "specialness" of oid columns which
previously were not stored as a normal column, but as part of the tuple
header. Now, in pg_class for example, it is a normal column.

ORCA had a framework in place to handle this "specialness".  During the
Postgres 12 merge the framework was kept in place and hardcoded to false
with a FIXME to remove later. This commit does that.
In Orca, we copy group statistics to avoid costly stats re-deriving.
However, we unintentionally didn't copy the relpages, relallvisible, and
rebinds fields. These fields are used in costing, and in some cases the
wrong rebind value caused us to cost an NLJ improperly low, so Orca
selected a non-optimal plan.
AssertOp is used by ORCA for run-time assertion checking. For example,
it guarantees that the following query will not violate implicit
cardinality constraints (i.e. foo cannot contain more than 1 row):

  ```
  CREATE TABLE foo(a int);
  CREATE TABLE bar(b int);

  SELECT * FROM foo WHERE (SELECT a FROM foo) IN (SELECT b FROM bar);
  ```

PLANNER handles that check in the executor subplan node, where it can
determine whether the subquery is used in an expression sublink that
should return only 1 row. However, this is not sufficient for ORCA, which
may generate a de-correlated plan that contains a join node instead of a
subplan node.

Postgres 12 merge commit 2e653c6e54b disabled this feature in ORCA so
that implementation may be fixed at a later date. This commit does that.
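
A hedged sketch of the re-enabled run-time check (the exact error text may differ by version):

```sql
-- With more than one row in foo, the scalar subquery violates the implicit
-- single-row constraint; the AssertOp should raise a runtime error instead
-- of silently producing a join result.
INSERT INTO foo VALUES (1), (2);
SELECT * FROM foo WHERE (SELECT a FROM foo) IN (SELECT b FROM bar);
-- expected: ERROR, at most one row expected from the expression subquery
```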
* Revert "Derive Combined Hashed Spec For Outer Joins - Patch (#13899)"

This reverts commit 512561fe9920df5be844a60926e612562d782d4a.

* Revert "Derive Combined Hashed Spec For Outer Joins (#13714)"

This reverts commit fc662eadf9d4fcbeecdb32d661deccba72c86f1a.

* Rerun mdp, regress/with_clause

Co-authored-by: Jingyu Wang <[email protected]>
It was observed that, for a DELETE-based DML query where we know the data
resides on a particular segment, Orca should issue the command to that
segment only. However, the command was issued to all the segments. This
behavior was cross-checked with the Legacy Planner, which was found to
send the query to a single segment only.

To correct this behavior, changes were made in the following files:

1. FILE NAME : CTranslatorExprToDXL.cpp

In the function CTranslatorExprToDXL::PdxlnDML, the object for
CDXLDirectDispatchInfo is created by calling the GetDXLDirectDispatchInfo
function.

Before the update, the GetDXLDirectDispatchInfo function returned a null
pointer for any DML command other than INSERT.

Now this condition has been changed to include the DELETE command as
well: for INSERT and DELETE, a null pointer is no longer returned and an
object of the CDXLDirectDispatchInfo class is created.

2. FILE NAME : CTranslatorDXLToPlStmt.cpp

The object created above is checked while creating the planned statement
for the query, to enable the direct dispatch flag in the planned
statement.

Earlier this was allowed for the INSERT command only; changes were made
to enable this logic for the DELETE command as well.
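
A hedged sketch of the intended behavior, with a hypothetical table distributed by its key:

```sql
-- With an equality predicate on the distribution key, the DELETE touches a
-- single segment, so Orca can now direct-dispatch the command instead of
-- issuing it to every segment.
CREATE TABLE orders (id int, note text) DISTRIBUTED BY (id);
EXPLAIN DELETE FROM orders WHERE id = 42;
```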
After the partitioning rework absorbed in the Postgres 12 merge, this GUC
became dead code, as demonstrated in commit baad023.
This was already addressed in commit 49049ee67504, but that commit missed this FIXME.
Commit 3ea20ad added the function PdxlnBitmapIndexProbeForChildPart()
and commit 2826c2098a50 removed its usage. Rather than refactor away
the "specialness", just delete it.
dgkimura and others added 12 commits December 30, 2024 17:30
A function that gets system column name, type, and length from attno
already exists in Postgres. Use that function and remove ORCA version.
Prior to this commit, the preprocessing for a supported ordered-set agg
would split the ordered-set agg into an NLJ between total_count and the
CTEConsumer for the input table, with a gp_percentile_* GbAgg on top.
For a skewed dataset this wasn't performant, as the JOIN would return
all the rows. This commit updates the code to split the ordered-set agg
into an NLJ between the CTEConsumer for deduplicated data and
total_count on that CTE, with a gp_percentile_* GbAgg on top. Since we
deduplicate the data, we also pass along the count of each distinct row
as peer_count to the gp_percentile_* agg.
Below are the input query and the preprocessed output:
Input query:
```
+--CLogicalGbAgg( Global )
   |--CLogicalGet "t" ("t")
   +--CScalarProjectList
      +--CScalarProjectElement "percentile_cont"
         +--CScalarAggFunc (percentile_cont , Distinct: false , Aggregate Stage: Global)
            |--CScalarValuesList
            |  +--CScalarIdent "a" (0)
            |--CScalarValuesList
            |  +--CScalarConst (0.250)
            |--CScalarValuesList
            |  +--CScalarSortGroupClause(tleSortGroupRef:0,eqop:96,sortop:97,nulls_first:false,hashable:true)
            +--CScalarValuesList
```
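
In SQL terms, the input tree above presumably corresponds to something like:

```sql
-- Hedged reading of the CLogicalGbAgg/CScalarAggFunc dump; table t and
-- column a are taken from the tree above.
SELECT percentile_cont(0.25) WITHIN GROUP (ORDER BY a) FROM t;
```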

Output preprocessed query:
```
Common Table Expressions:
+--CLogicalCTEProducer (0)
   +--CLogicalGbAgg( Global ) Grp Cols: ["a" (10)]
      |--CLogicalGet "t" ("t")
      +--CScalarProjectList
         +--CScalarProjectElement "ColRef_0009" (11)
            +--CScalarAggFunc (count , Distinct: false , Aggregate Stage: Global)
               |--CScalarValuesList
               |  +--CScalarIdent "a" (10)
               |--CScalarValuesList
               |--CScalarValuesList
               +--CScalarValuesList

Algebrized preprocessed query:
+--CLogicalCTEAnchor (0)
   +--CLogicalGbAgg( Global )
      |--CLogicalLimit ( (97,1.0), "a" (0), NULLsLast )  global
      |  |--CLogicalNAryJoin
      |  |  |--CLogicalCTEConsumer (0), Columns: ["a" (0), "ColRef_0009" (9)]
      |  |  |--CLogicalProject
      |  |  |  |--CLogicalGbAgg( Global )
      |  |  |  |  |--CLogicalCTEConsumer (0), Columns: ["a" (19), "ColRef_0009" (20)]
      |  |  |  |  +--CScalarProjectList
      |  |  |  |     +--CScalarProjectElement "ColRef_0035" (35)
      |  |  |  |        +--CScalarAggFunc (sum , Distinct: false , Aggregate Stage: Global)
      |  |  |  |           |--CScalarValuesList
      |  |  |  |           |  +--CScalarIdent "ColRef_0009" (20)
      |  |  |  |           |--CScalarValuesList
      |  |  |  |           |--CScalarValuesList
      |  |  |  |           +--CScalarValuesList
      |  |  |  +--CScalarProjectList
      |  |  |     +--CScalarProjectElement "ColRef_0036" (36)
      |  |  |        +--CScalarFunc (int8)
      |  |  |           +--CScalarIdent "ColRef_0035" (35)
      |  |  +--CScalarConst (1)
      |  |--CScalarConst (0)
      |  +--CScalarConst (null)
      +--CScalarProjectList
         +--CScalarProjectElement "percentile_disc" (8)
            +--CScalarAggFunc (percentile_disc , Distinct: false , Aggregate Stage: Global)
               |--CScalarValuesList
               |  |--CScalarIdent "a" (0)
               |  |--CScalarConst (0.250)
               |  |--CScalarIdent "ColRef_0036" (36)
               |  +--CScalarIdent "ColRef_0009" (9)
               |--CScalarValuesList
               |--CScalarValuesList
               +--CScalarValuesList
```

This also includes updating the C function for gp_percentile.

Since we pass the peer_count value along with the total_count, we need
to consider it while calculating percentile values.
This commit also updates the transition functions to non-strict, because
for a strict transition function `advance_aggregates()` calls
`ExecInterpExpr()`, which initializes the transition value from the first
row in the group as part of the `EEOP_AGG_INIT_TRANS` step. This results
in the transition function being called from the second row onward, with
the first row passed in as the previous state value. This worked fine
previously, since we read all rows and peer_count was always 1; but now
that we need to read peer_count for each row, initializing from the
first row doesn't work.
Since the transition function isn't strict anymore, we explicitly handle
NULL inputs.
… Orca (#13873)

Previously, Orca disallowed all aggregate functions from being executed
on replicated slices. This meant that the results were broadcasted or
gathered on a single segment to ensure consistency and correct results.
This is necessary because some functions such as array_agg and custom
user-created functions are sensitive to the order of data. This can
cause wrong results in some cases.

However, many functions, especially commonly used ones such as sum, avg,
count, min, and max, are not sensitive to the order of data and can be
safely executed. We now make an exception for these common cases,
currently the above agg functions on ints and count(*).

See https://github.com/greenplum-db/gpdb/pull/10978 for previous
discussion.
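
A hedged example of a newly allowed case, assuming a hypothetical replicated table:

```sql
-- sum and count(*) on ints are order-insensitive, so they can now execute
-- on the replicated slice without a Gather or Broadcast motion.
CREATE TABLE r (v int) DISTRIBUTED REPLICATED;
EXPLAIN SELECT sum(v), count(*) FROM r;
```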
ORCA commit f8990fb enables more datatypes in constraint
evaluation. However, it also exposed an issue in the preprocessor step
PexprInferPredicates(), which can cause ORCA to produce a plan with
duplicate casted predicates.

This commit fixes the issue by deduplicating cast equality predicates.

Example: Date-TimeStamp-HashJoin.mdp
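
The named minidump suggests a date/timestamp join; a hedged sketch of the kind of query involved:

```sql
-- Equality across a cast (date vs. timestamp keys) lets
-- PexprInferPredicates() infer an additional casted predicate; before this
-- fix it could be added in duplicate.
CREATE TABLE td (d date);
CREATE TABLE tt (ts timestamp);
EXPLAIN SELECT * FROM td JOIN tt ON td.d = tt.ts;
```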
Prior to this commit, the partition propagation spec stored a partition
info list in an array that needed to stay sorted, in order to compare for
equality against another partition propagation spec where the order of
the stored partition info list is relevant.

With the array implementation we pay O(N log N × N) to insert N entries
(sorting the data on each insert) and at best O(log N) per lookup using
binary search. By contrast, a hash map implementation costs O(N) to build
and O(1) per lookup.

This commit stores the partition info list in a hash map.
* Remove FIXME for signature change

As there are other functions that also take a Node as input, it's better
to keep it as is instead of changing the signature and breaking something
not captured in ICW.

* Remove dead code

This commit addresses the FIXME to remove CountLeafPartTables(). Looking
at the call hierarchy, CountLeafPartTables() was called by
RetrieveNumChildPartitions() -> GenerateStatsForSystemCols() ->
RetrieveColStats() for attno < 0 (system columns). Since we do not
extract/use the stats on system columns, the entire call stack is dead
code. This commit removes that part of the code altogether.

It also previously called RelPartIsNone(); with that call removed, that
function is removed too.

This commit also removes the FIXME for collation.
Postgres commit fc22b66 implemented the SQL-standard feature for
generated columns. This was turned off in ORCA during the merge.

After this commit the following SQL works as expected:
    ```sql
    CREATE TABLE t_gencol(a int, b int GENERATED ALWAYS AS (a * 2) stored);
    EXPLAIN ANALYZE INSERT INTO t_gencol (a) VALUES (1), (2);
    SELECT * FROM t_gencol;

     a | b
    ---+---
     1 | 2
     2 | 4
    (2 rows)
    ```
There were a lot of asserts on NULL != target_list in the translator,
but most of them were unnecessary. Fix ORCA to handle an empty target list.

- Add trace fallback to union testcase
- Fix up CXformDifference2LeftAntiSemiJoin to handle case of empty columns

The following SQL works:
```
EXPLAIN (COSTS OFF) SELECT UNION SELECT;
```
Issue: Outer join operations enforce unnecessary data redistribution, causing ORCA plan execution to be much longer than planner execution.

Root cause: Unlike inner join operators, outer join operators only derive a hashed distribution spec from one of the two relations. Child nodes not delivering all the distribution properties led to parent nodes enforcing unnecessary data redistribution.

Solution: To mimic inner join distribution spec derivation, derive a combined hashed spec for outer join operations from both relations. E.g., a 10-relation outer join delivers a combined hashed spec with 10 equivalent specs (including its own).

Implementation:

1. [CPhysicalLeftOuterHashJoin] -- Override PdsDerive (distribution spec derivation) from CPhysicalJoin. Add a case where both the outer and inner relations are hash distributed, and return a combined distribution spec. Since NULLs are only added to unmatched rows, set NullsColocated to false for all equivalent distribution specs of the inner relation.
2. [CPhysicalHashJoin] -- In matching a hashed distribution spec in inner join operations, set Nulls Colocation to true. In matching a hashed distribution spec in outer join operations, set Nulls Colocation to false only if the join condition isn't null-aware. This reflects the Nulls Colocation property required for / delivered by the outer relation in hash join operations.
3. [CDistributionSpecHashed] -- (1) Rewrite the Combine function for hashed distribution specs using linked-list concatenation. (2) Rewrite the Copy function with recursion to ensure a deep copy. (3) Add a Copy function that allows configuring fNullsColocated. (4) Enforce nulls colocation for hash redistribution; this is necessary when the non-null hash distribution request is not met. (5) Make ComputeEquivHashExprs recursive to compute hash expressions for all equivalent hashed specs. (6) Make FMatchHashedDistribution public.
4. [CPhysicalFullMergeJoin] -- Fix PdsDerive (distribution spec derivation). In full joins, both tables are outer tables; the join output is hash distributed by non-NULL join keys.
5. [CPhysical*Join] -- Add an is_null_aware member to all classes using the AddHashOrMergeJoinAlternative template. If the join is null-aware, nulls colocation has to be set to true in deriving/requesting hash distribution specs; if the join isn't null-aware, nulls colocation can be set to false.
6. [CXformUtils] -- Check whether the join condition is composed of equality predicates only. The result is passed to AddHashOrMergeJoinAlternative to determine the join condition's null-awareness.
7. [CDistributionSpecTest] -- Add function tests for hash spec combination and copy. Replace GPOS_ASSERT with GPOS_RTL_ASSERT.
8. [regress] -- Test hashed distribution spec derivation and motion enforcement in an outer join with an INDF join condition (see the sketch after this list).
9. [minidump] -- MDP plan shape updates: LOJNonNullRejectingPredicates, LOJReorderWithSimplePredicate, Remove-Distinct-From-Subquery. The rest are SpaceSize and scan order changes. Add LeftJoinNullsNotColocated. Added user examples to verify that an inner join matches the outer relation's Nulls Colocation.
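
A hedged sketch of item 8, reusing the hypothetical tables from the earlier sketch:

```sql
-- IS NOT DISTINCT FROM treats NULL = NULL as a match, so the join condition
-- is null-aware: NULLs must stay colocated and the derived hashed spec keeps
-- NullsColocated = true. A plain '=' left join may leave it false.
EXPLAIN SELECT * FROM d1 LEFT JOIN d2 ON d1.b IS NOT DISTINCT FROM d2.c;
```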

Co-authored-by: Jingyu Wang <[email protected]>
GPDB has the same issue. The current issue was introduced by
"Update ordered-set agg preprocess step for skew" (GPDB commit 5280297).

This is because a non-null DATUM(0) is returned during the call to
gp_percentile_disc_transition.
@jiaqizho force-pushed the cherry-pick-orca-in-path-order-2 branch from 424acc8 to 40aa09e December 30, 2024 09:31
- Fixed GPDB incorrect results in bfv_join.sql
- Fixed some plan diff
@jiaqizho force-pushed the cherry-pick-orca-in-path-order-2 branch from 40aa09e to acfc6c5 December 30, 2024 10:24
@jiaqizho requested a review from avamingli December 31, 2024 01:43
@my-ship-it merged commit 7f919d8 into apache:main Dec 31, 2024
22 checks passed