You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Nested column pruning should work for row inside arbitrary level of nested row/array/map. The current ongoing efforts only focused on the row type only. I would like to bring out this issue and start the discussion.
Example schema:
CREATETABLEtest(
a row(x int, y int, z int),
b array(row(x int, y int, z int)),
c array(array(row(x int, y int, z int))),
d map(int, row(x int, y int, z int)),
e map(row(x int, y int, z int), int)
);
For the following discussion, I assume the user wants to select x, y out of x, y, z. User may select only part of the columns using the following SQL:
SELECTa.x, a.yFROM test;
SELECTr.x, r.yFROM test CROSS JOIN UNNEST(b) as r;
SELECTv.x, v.yFROM test CROSS JOIN UNNEST(d) as m(k,v);
presto:test> EXPLAIN SELECTa.x, a.yFROM test;
Query Plan
--------------------------------------------------------------------------------------------------------------------
Output[x, y]
│ Layout: [expr:integer, expr_0:integer]
│ x := expr
│ y := expr_0
└─ RemoteExchange[GATHER]
│ Layout: [expr:integer, expr_0:integer]
└─ ScanProject[table = memory:2]
Layout: [expr:integer, expr_0:integer]
expr :="a".x
expr_0 :="a".y
a :=0
presto:test> EXPLAIN SELECTr.x, r.yFROM test CROSS JOIN UNNEST(b) as r;
Query Plan
-----------------------------------------------------------------------------------
Output[x, y]
│ Layout: [x:integer, y:integer]
└─ RemoteExchange[GATHER]
│ Layout: [x:integer, y:integer]
└─ Project[]
│ Layout: [x:integer, y:integer]
└─ Unnest[replicate=, unnest=b:array(row(x integer, y integer, z integer))]
│ Layout: [x:integer, y:integer, z:integer]
└─ TableScan[memory:3]
Layout: [b:array(row(x integer, y integer, z integer))]
b :=1
presto:test> EXPLAIN SELECTv.x, v.yFROM test CROSS JOIN UNNEST(d) as m(k,v);
Query Plan
------------------------------------------------------------------------------------------
Output[x, y]
│ Layout: [expr:integer, expr_2:integer]
│ x := expr
│ y := expr_2
└─ RemoteExchange[GATHER]
│ Layout: [expr:integer, expr_2:integer]
└─ Project[]
│ Layout: [expr:integer, expr_2:integer]
│ expr :="field_0".x
│ expr_2 :="field_0".y
└─ Unnest[replicate=, unnest=d:map(integer, row(x integer, y integer, z integer))]
│ Layout: [field:integer, field_0:row(x integer, y integer, z integer)]
└─ TableScan[memory:3]
Layout: [d:map(integer, row(x integer, y integer, z integer))]
d :=3
Objective
Nested column pruning should be pushed down as far as possible. If the connector support nested column pruning, it should be pushed to the connector (eg. Hive Parquet / ORC). Otherwise, the pruning should happen right after tableScan to minimize the data processed by other following operators.
The second example relies on being able to push down the field dereference through unnest. That's something we don't yet support, but it's somewhat orthogonal to being able to push down projections into connectors.
To be able able to push a dereference through unnest, we'd need to make a transformation like:
- unnest[z]
- project[transform_array(a, r -> r.x)] :: [z array(row(x))]
- T :: [a array(row(x, y))]
This works very similar to what I thought. However, it has some drawbacks. transform_array is a function call. This means it is a per row operation which is O(n). The ideal pruning operation should be block pruning operation which is O(1). The transform_array also doesn't support map type and nested pruning very well.
I'm considering a new expression for nested column pruning that can handle arbitrary nested of map/array/row.
// b array(row(x int, y int, z int)),
Output[x,y]
- unnest[b']
- project[prune(b.*.['x','y'])] :: [b' array(row(x, y))]
- T :: [b array(row(x, y, z))]
// c array(array(row(x int, y int, z int))),
Output[x,y]
- unnest
- unnest[c']
- project[prune(c.*.*.['x','y'])] :: [c' array(array(row(x, y)))]
- T :: [c array(array(row(x int, y int, z int)))]
// d map(int, row(x int, y int, z int)),
Output[x,y]
- unnest[b']
- project[prune(d.value.['x','y'])] :: [d' map(int, row(x int, y int, z int))]
- T :: [d map(int, row(x int, y int, z int))]
The reasons for me to consider a new expression:
Nested column pruning should be a block based O(1) operation. It doesn't fit into the current existing expressions
Having a new expression should works well with the new ConnectorExpressionTranslator mechanism
Other questions
How this work with the current dereference expression pushdown which works on rows only
How/Where to implement the nested pruning in operator
Maybe in the future this can be implemented by polymorphic table functions
Others
Other related works, but none of them proposed support on array / map type:
Summary
Nested column pruning should work for
row
inside arbitrary level of nestedrow/array/map
. The current ongoing efforts only focused on the row type only. I would like to bring out this issue and start the discussion.Example schema:
For the following discussion, I assume the user wants to select
x, y
out ofx, y, z
. User may select only part of the columns using the following SQL:Objective
Nested column pruning should be pushed down as far as possible. If the connector support nested column pruning, it should be pushed to the connector (eg. Hive Parquet / ORC). Otherwise, the pruning should happen right after tableScan to minimize the data processed by other following operators.
I would like to quote the inputs from @martint on Slack:
https://prestosql.slack.com/archives/CGJS76ZSR/p1560395355022100
This works very similar to what I thought. However, it has some drawbacks.
transform_array
is a function call. This means it is a per row operation which is O(n). The ideal pruning operation should be block pruning operation which is O(1). Thetransform_array
also doesn't support map type and nested pruning very well.I'm considering a new expression for nested column pruning that can handle arbitrary nested of map/array/row.
The reasons for me to consider a new expression:
ConnectorExpressionTranslator
mechanismOther questions
Others
Other related works, but none of them proposed support on array / map type:
The text was updated successfully, but these errors were encountered: