Support partition pruning #266

jychen7 · 2023-03-04T14:11:11Z

Background

Assume the following Hive-style partitioned layout:

table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet

Consider the query:

select count(1) from table where year = '2022' and month = '03' and day = '20'

Actual

The query above scans all Parquet files instead of only the one it needs.

Currently, roapi supports reading all Parquet files in a directory:

tables:
  - name: "blogs"
    uri: "table/"
    option:
      format: "parquet"
      use_memory_table: false
    schema:
      # columns: [] # can be omitted if the table source supports schema inference, e.g. csv, parquet, etc.

Expect

The query above should scan only one Parquet file.

However, since the partition columns are not part of the Parquet file schema, one idea is to extend the roapi config so they can be declared:

tables:
  - name: "blogs"
    uri: "table/"
    option:
      format: "parquet"
      use_memory_table: false
    schema:
      # columns: [] # can be omitted if the table source supports schema inference, e.g. csv, parquet, etc.
      partitions:
        - name: "year"
          data_type: "Utf8"
        - name: "month"
          data_type: "Utf8"
        - name: "day"
          data_type: "Utf8"
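
To make the intent concrete, here is a minimal sketch of how declared partition columns could drive pruning for Hive-style paths. It is not roapi or DataFusion code: the function names `parse_hive_partitions` and `prune` are hypothetical, only equality filters are handled, and all partition values are treated as strings (matching the "Utf8" type above).

```rust
use std::collections::HashMap;

/// Extract `key=value` directory segments from a path, keeping only keys
/// that were declared as partition columns (hypothetical helper).
fn parse_hive_partitions<'a>(path: &'a str, declared: &[&str]) -> HashMap<&'a str, &'a str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .filter(|(key, _)| declared.iter().any(|d| *d == *key))
        .collect()
}

/// Keep only the files whose partition values satisfy every equality
/// filter on a declared partition column (hypothetical helper).
fn prune<'a>(files: &[&'a str], declared: &[&str], filters: &[(&str, &str)]) -> Vec<&'a str> {
    files
        .iter()
        .copied()
        .filter(|path| {
            let parts = parse_hive_partitions(path, declared);
            filters.iter().all(|(col, want)| {
                // Filters on non-partition columns cannot be decided from
                // the path alone, so they never exclude a file here.
                let is_partition_col = declared.iter().any(|d| *d == *col);
                // A file that lacks a declared partition segment is
                // treated as non-matching in this sketch.
                !is_partition_col || parts.get(*col).copied() == Some(*want)
            })
        })
        .collect()
}
```

With the two files from the Background section and the filters year = '2022', month = '03', day = '20', `prune` keeps only table/year=2022/month=03/day=20/log.parquet, so only that file would need to be scanned.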

Reference

CreateExternalTable DDL supports table_partition_cols: apache/datafusion#2061 (from 2022; DataFusion restructured its files in late 2022, so this is just for reference)

roapi/columnq/src/table/mod.rs

Lines 50 to 54 in 51e01ef

    
    #[derive(Deserialize, Clone, Debug, Eq, PartialEq)]
    #[serde(deny_unknown_fields)]
    pub struct TableSchema {
        pub columns: Vec<TableColumn>,
    }

https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/datasource/listing/table.rs#L318-L324


houqp · 2023-03-06T08:30:32Z

Definitely a good feature to add 👍

chitralverma · 2023-04-24T16:24:02Z

@jychen7 Shouldn't such partition directories be inferred automatically, as Spark does, instead of being supplied manually?

jychen7 · 2023-06-05T02:20:09Z

@chitralverma Yes, auto-detecting partitions is a good idea. On the other hand, it may also be reasonable to declare partitions manually for non-Hive-style layouts, e.g.

table/2022/03/20/log.parquet
table/2022/03/21/log.parquet
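
For a layout like this, the directory names carry no column names, so the declared partition list has to supply them by position. A minimal sketch of that mapping follows, assuming the partition directories sit directly under the table URI and the last path component is the file name; `parse_positional_partitions` is a hypothetical name, not an existing roapi or DataFusion function.

```rust
/// Pair each directory segment under `root` with the declared partition
/// column at the same position. Returns None when the path is not under
/// `root` or the directory depth does not match the declared columns.
fn parse_positional_partitions<'n, 'p>(
    root: &str,
    path: &'p str,
    declared: &[&'n str],
) -> Option<Vec<(&'n str, &'p str)>> {
    let relative = path.strip_prefix(root)?;
    let mut segments: Vec<&'p str> = relative
        .split('/')
        .filter(|segment| !segment.is_empty())
        .collect();
    // The last component is the data file itself, not a partition value.
    segments.pop()?;
    if segments.len() != declared.len() {
        return None;
    }
    Some(declared.iter().copied().zip(segments).collect())
}
```

Given the root "table/" and the declared columns year, month, day, the path table/2022/03/20/log.parquet maps to year=2022, month=03, day=20, after which the same equality-based pruning as for Hive-style paths can apply.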

houqp · 2023-06-05T04:32:09Z

We also have some tables that are created outside of spark with non-hive style partitions, so being able to provide a custom partition scheme would be very useful to us.

jychen7 added the "good first issue" (Good for newcomers) and "help wanted" (Extra attention is needed) labels on Mar 4, 2023
