Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generic nested column pruning on row/array/map types #1202

Open
oneonestar opened this issue Jul 29, 2019 · 0 comments
Open

Generic nested column pruning on row/array/map types #1202

oneonestar opened this issue Jul 29, 2019 · 0 comments
Labels
enhancement New feature or request performance

Comments

@oneonestar
Copy link
Member

Summary

Nested column pruning should work for row inside arbitrary level of nested row/array/map. The current ongoing efforts only focused on the row type only. I would like to bring out this issue and start the discussion.

Example schema:

CREATE TABLE test(
   a row(x int, y int, z int),
   b array(row(x int, y int, z int)),
   c array(array(row(x int, y int, z int))),
   d map(int, row(x int, y int, z int)),
   e map(row(x int, y int, z int), int)
);

For the following discussion, I assume the user wants to select x, y out of x, y, z. User may select only part of the columns using the following SQL:

SELECT a.x, a.y FROM test;
SELECT r.x, r.y FROM test CROSS JOIN UNNEST(b) as r;
SELECT v.x, v.y FROM test CROSS JOIN UNNEST(d) as m(k,v);

presto:test> EXPLAIN SELECT a.x, a.y FROM test;
                                                     Query Plan
--------------------------------------------------------------------------------------------------------------------
 Output[x, y]
 │   Layout: [expr:integer, expr_0:integer]
 │   x := expr
 │   y := expr_0
 └─ RemoteExchange[GATHER]
    │   Layout: [expr:integer, expr_0:integer]
    └─ ScanProject[table = memory:2]
           Layout: [expr:integer, expr_0:integer]
           expr := "a".x
           expr_0 := "a".y
           a := 0

presto:test> EXPLAIN SELECT r.x, r.y FROM test CROSS JOIN UNNEST(b) as r;
                                    Query Plan
-----------------------------------------------------------------------------------
 Output[x, y]
 │   Layout: [x:integer, y:integer]
 └─ RemoteExchange[GATHER]
    │   Layout: [x:integer, y:integer]
    └─ Project[]
       │   Layout: [x:integer, y:integer]
       └─ Unnest[replicate=, unnest=b:array(row(x integer, y integer, z integer))]
          │   Layout: [x:integer, y:integer, z:integer]
          └─ TableScan[memory:3]
                 Layout: [b:array(row(x integer, y integer, z integer))]
                 b := 1

presto:test> EXPLAIN SELECT v.x, v.y FROM test CROSS JOIN UNNEST(d) as m(k,v);
                                        Query Plan
------------------------------------------------------------------------------------------
 Output[x, y]
 │   Layout: [expr:integer, expr_2:integer]
 │   x := expr
 │   y := expr_2
 └─ RemoteExchange[GATHER]
    │   Layout: [expr:integer, expr_2:integer]
    └─ Project[]
       │   Layout: [expr:integer, expr_2:integer]
       │   expr := "field_0".x
       │   expr_2 := "field_0".y
       └─ Unnest[replicate=, unnest=d:map(integer, row(x integer, y integer, z integer))]
          │   Layout: [field:integer, field_0:row(x integer, y integer, z integer)]
          └─ TableScan[memory:3]
                 Layout: [d:map(integer, row(x integer, y integer, z integer))]
                 d := 3

Objective

Nested column pruning should be pushed down as far as possible. If the connector support nested column pruning, it should be pushed to the connector (eg. Hive Parquet / ORC). Otherwise, the pruning should happen right after tableScan to minimize the data processed by other following operators.

I would like to quote the inputs from @martint on Slack:
https://prestosql.slack.com/archives/CGJS76ZSR/p1560395355022100

The second example relies on being able to push down the field dereference through unnest. That's something we don't yet support, but it's somewhat orthogonal to being able to push down projections into connectors.
To be able able to push a dereference through unnest, we'd need to make a transformation like:

    - unnest :: [r row(x, y)]
        - T :: [a array(row(x, y))]

->

- unnest[z]
    - project[transform_array(a, r -> r.x)] :: [z array(row(x))]
        - T :: [a array(row(x, y))]

This works very similar to what I thought. However, it has some drawbacks. transform_array is a function call. This means it is a per row operation which is O(n). The ideal pruning operation should be block pruning operation which is O(1). The transform_array also doesn't support map type and nested pruning very well.

I'm considering a new expression for nested column pruning that can handle arbitrary nested of map/array/row.

// b array(row(x int, y int, z int)),
Output[x,y]
- unnest[b']
   - project[prune(b.*.['x','y'])] :: [b' array(row(x, y))]
      - T :: [b array(row(x, y, z))]

// c array(array(row(x int, y int, z int))),
Output[x,y]
- unnest
  - unnest[c']
   - project[prune(c.*.*.['x','y'])] :: [c' array(array(row(x, y)))]
      - T :: [c array(array(row(x int, y int, z int)))]

// d map(int, row(x int, y int, z int)),
Output[x,y]
- unnest[b']
   - project[prune(d.value.['x','y'])] :: [d' map(int, row(x int, y int, z int))]
      - T :: [d map(int, row(x int, y int, z int))]

The reasons for me to consider a new expression:

  • Nested column pruning should be a block based O(1) operation. It doesn't fit into the current existing expressions
  • Having a new expression should works well with the new ConnectorExpressionTranslator mechanism

Other questions

  • How this work with the current dereference expression pushdown which works on rows only
  • How/Where to implement the nested pruning in operator
  • Maybe in the future this can be implemented by polymorphic table functions

Others

Other related works, but none of them proposed support on array / map type:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance
Development

No branches or pull requests

2 participants