Removed transforming of hasharray to hashstring (#14121) #816 (Merged)

Conversation
Since the hashing of text arrays was fixed in PR https://github.com/greenplum-db/gpdb/pull/4940, remove the dependency on transforming hasharray to hash string, which was done in the function transformTextArrayCols().
In Greenplum, unlike upstream, the planner will try to pull up a correlated EXISTS sublink that contains an aggregate, for example:

```
postgres=# create table t(a int, b int);
postgres=# create table s(a int, b int);
postgres=# explain (costs off) select * from t where exists (select sum(s.a) from s where s.a = t.a group by s.a);
                QUERY PLAN
------------------------------------------
 Gather Motion 3:1 (slice1; segments: 3)
   ->  Hash Join
         Hash Cond: (t.a = s.a)
         ->  Seq Scan on t
         ->  Hash
               ->  HashAggregate
                     Group Key: s.a
                     ->  Seq Scan on s
 Optimizer: Postgres query optimizer
(9 rows)
```

To support this, Greenplum changed the behavior of `convert_EXISTS_sublink_to_join()` and `simplify_EXISTS_query()`: `simplify_EXISTS_query()` now also simplifies an EXISTS sublink that contains an aggregate node, resetting `targetList` and other clauses to null:

```
query->targetList = NIL;
query->distinctClause = NIL;
query->sortClause = NIL;
query->hasDistinctOn = false;
```

However, when executing an uncorrelated EXISTS sublink that also contains an aggregate node, we must not reset `targetList`; otherwise we cannot build a plan for an aggregate node that has no aggregate column, as in #11849. This patch fixes the issue by **not** simplifying the uncorrelated EXISTS sublink in that situation.
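As a hedged sketch (not the actual planner C code), the decision described above can be modeled in a few lines of Python; the function and parameter names here are invented for illustration:

```python
def should_simplify_exists(has_aggregate: bool, is_correlated: bool) -> bool:
    """Model of the fix: an EXISTS subquery containing an aggregate is
    simplified (targetList and other clauses reset) only when the sublink
    is correlated, because only then is it pulled up into a join. An
    uncorrelated EXISTS with an aggregate keeps its targetList so the
    aggregate node still has a column to aggregate over."""
    if not has_aggregate:
        return True          # no aggregate: simplifying is always safe
    return is_correlated     # aggregate present: simplify only if correlated
```

The sketch only captures the control decision; the real code also resets the individual query fields shown above.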
Commit 4aaf666ff removed system column stats from Orca, but did not update the expected output plan.
…multi-DQA expr. (#14135)

This patch fixes #14096.

How multi-DQA works: when creating a path for multi-DQA (multiple distinct-qualified aggregates), the planner generates a plan of the following shape:

```
Finalize Aggregate
  -> Gather Motion
     -> Partial Aggregate
        -> HashAggregate, to remove duplicates
           -> Redistribute Motion
              -> TupleSplit (according to DISTINCT expr)
                 -> input
```

The plan is generated via the following steps:

1. The planner compares each DQA expr with the input tuples; if the DQA expr is not in the PathTarget, the TupleSplit node appends the corresponding column to the PathTarget, together with a special column 'AggExprId', in the output tuple. E.g., if the input of TupleSplit is '(a: 1, b: 2, c: 3)' and the DQA expressions are 'DQA(a)' and 'DQA(b)', the outputs of TupleSplit are '(a: 1, b: null, c: 3, AggExprId: 1)' and '(a: null, b: 2, c: 3, AggExprId: 2)'. Only one DQA expr's column is set to non-null per tuple, and AggExprId indicates which DQA expr's column is non-null in that tuple.
2. The output tuples of TupleSplit are hashed and sent to different segments via Redistribute Motion. Because only one DQA expr's column is non-null per tuple, DQA exprs with the same value are gathered on the same segment, so HashAggregate can be applied to these tuples to remove duplicate values.
3. We apply the Partial Aggregate function on tuples across segments.
4. We gather tuples to the QD and apply the Finalize Aggregate function.

The RCA is given below:

1. When there are type conversions in DQA exprs, the planner wraps the expr with a type-conversion node. E.g., for binary-compatible datatypes such as 'VARCHAR' and 'TEXT', the expr is wrapped with a 'RelabelType' node.
2. For queries like 'SELECT DQA(a::TEXT) FROM foo(a VARCHAR)', the planner compares the DQA expr with the exprs in the PathTarget directly, without removing the 'RelabelType' node.
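The TupleSplit behavior in step 1 above can be sketched as a small Python model (this is an illustration of the described semantics, not the executor code; `tuple_split` is a hypothetical name):

```python
def tuple_split(row: dict, dqa_cols: list) -> list:
    """Model of TupleSplit: emit one copy of the input row per DQA
    expression. In each copy, every DQA column except one is nulled out,
    and AggExprId records which DQA column was kept non-null."""
    out = []
    for i, keep in enumerate(dqa_cols, start=1):
        copy = dict(row)
        for col in dqa_cols:
            if col != keep:
                copy[col] = None   # null out the other DQA columns
        copy["AggExprId"] = i      # 1-based id of the surviving DQA column
        out.append(copy)
    return out
```

Running `tuple_split({"a": 1, "b": 2, "c": 3}, ["a", "b"])` reproduces the two output tuples from the example in step 1.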
So the TupleSplit looks like:

```
TupleSplit:
    Input: a
    Output: a, a::TEXT
```

3. When generating the Redistribute Motion operator, the planner tries to find the expr of the equivalence class matching the expr in the targetlist to use as the hash key (src/backend/cdb/cdbpullup.c:cdbpullup_findEclassInTargetList). However, at this stage the planner ignores the 'RelabelType' node when comparing two exprs, so it may choose the wrong column as the hash key.

This patch resolves the issue by teaching the 'TupleSplit' node to ignore the 'RelabelType' node when comparing the DQA expr with the one in the PathTarget. The TupleSplit node then looks like:

```
TupleSplit:
    Input: a
    Output: a
```

when planning queries like 'SELECT DQA(a::TEXT) FROM foo(a VARCHAR)'.
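The comparison change can be sketched as follows (a hedged Python model; `Relabel`, `strip_relabel`, and `exprs_match` are invented stand-ins for the planner's node types and matching logic):

```python
class Relabel:
    """Stand-in for a RelabelType node wrapping a base expression,
    as used for binary-compatible casts like VARCHAR -> TEXT."""
    def __init__(self, arg):
        self.arg = arg

def strip_relabel(expr):
    # Unwrap any chain of relabel nodes down to the base expression.
    while isinstance(expr, Relabel):
        expr = expr.arg
    return expr

def exprs_match(a, b) -> bool:
    # Compare two expressions while ignoring binary-compatible
    # relabeling, as the patched TupleSplit does when matching a DQA
    # expr against the PathTarget. Without the stripping step,
    # Relabel("a") and "a" would compare unequal and the PathTarget
    # would grow a redundant a::TEXT column.
    return strip_relabel(a) == strip_relabel(b)
```

With this matching rule, `DQA(a::TEXT)` is recognized as referring to the existing column `a`, so TupleSplit does not add a second output column.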
Background: 1c1f164 introduced the waitOnOutbound() call for non-error cases as well during TCP IC teardown, after "Interconnect error: connection closed prematurely." ERRORs were seen in Azure environments. This is essential because performing a close() does not guarantee that the peer has received any unflushed data in the kernel send buffers, nor does it flag cases where the peer may have crashed after the sender performed the close(). Thus an app-defined protocol is needed to ensure delivery; this is where the EOS message comes in. Motion senders send an EOS to the motion receiver. The receiver then closes the socket (after optionally sending a 'stop' message), which the sender detects in waitOnOutbound() by recv()ing 0 or the stop message. Only after this detection step can the sender safely close() its socket.

Implementation: We currently poll for a stop message during motion sender teardown. However, we were retrying indefinitely (we weren't handling a return value of 0 from select()), which can cause hung queries. So, use a timeout equal to gp_interconnect_transmit_timeout, which is used throughout the interconnect to enforce an upper bound on the time it takes to transmit a packet (in this case the EOS). Once this timeout elapses, an ERROR is raised.

PS: The implementation is inspired by PerformRadiusTransaction().
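The bounded wait described above can be sketched in Python (a simplified model of the protocol, not the GPDB C code; `wait_for_peer_close` and the single-byte 'stop' encoding are illustrative assumptions):

```python
import select
import time

def wait_for_peer_close(sock, timeout_secs):
    """Model of the fixed teardown loop: poll the socket until the peer
    closes it (recv() returns b'') or sends a 'stop' message, and raise
    an error once an overall deadline (standing in for
    gp_interconnect_transmit_timeout) elapses. A select() that returns
    no ready fds now counts against the deadline instead of being
    retried forever."""
    deadline = time.monotonic() + timeout_secs
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("peer did not close within transmit timeout")
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            continue           # timeout slice expired; re-check the deadline
        data = sock.recv(1)
        if data == b"" or data == b"S":   # EOF (peer closed) or 'stop'
            return data
```

The key design point is re-checking an absolute deadline on every iteration, so spurious wakeups cannot extend the total wait.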
MemtupleBinding is column-aligned by 4 or 8 bytes. It can be set to 8 as soon as we find an attribute whose attalign is 'd' (8 bytes on most machines); there is no need to check the remaining attributes. Also add comments about how to compute the space for null-bitmap entries. Authored-by: Mingli Zhang <[email protected]>
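A rough Python sketch of the two computations mentioned above (the function names are invented, and the padding of the null bitmap to the column alignment is an assumption about the layout, not taken from the GPDB source):

```python
def memtuple_alignment(attaligns):
    """Column alignment for a memtuple binding: 8 bytes if any attribute
    has attalign 'd' (double alignment), else 4. Scanning can stop at
    the first 'd' -- later attributes cannot change the answer."""
    for a in attaligns:
        if a == "d":
            return 8
    return 4

def null_bitmap_bytes(natts, align):
    """Assumed model of null-bitmap space: one bit per attribute,
    rounded up to whole bytes, then padded to the column alignment so
    the following data stays aligned."""
    raw = (natts + 7) // 8          # bits -> bytes, rounded up
    return (raw + align - 1) // align * align
```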
User-defined access methods can be created in GPDB 7, so filling in the storage type according to the access method name is more accurate.
The pipeline failed with the log 'Table pg_am has a dependency issue on oid 152563 at content ...'. The access method created in the psql_gp_commands regress test was not cleaned up at the end, so gpcheckcat found that the newly created object in the pg_am catalog has no reference in pg_depend under the objid or refobjid column and reported a dependency issue.
%s/compatable/compatible/g in file gparray.py. Since initFromString is already compatible with repr() formatting, remove the TODO mark. Search for `compatable` globally and change all occurrences to `compatible`. Signed-off-by: Junwang Zhao <[email protected]>
netstat was replaced with ss earlier, but some of the variable names and commented-out code were left as-is. This commit cleans up the old mentions of netstat and renames the variables to avoid any confusion in the future.
my-ship-it approved these changes Dec 26, 2024
yjhjstz approved these changes Dec 27, 2024
Fixes #ISSUE_Number
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions