Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support partition pruning #266

Open
jychen7 opened this issue Mar 4, 2023 · 4 comments
Open

Support partition pruning #266

jychen7 opened this issue Mar 4, 2023 · 4 comments
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@jychen7
Copy link
Collaborator

jychen7 commented Mar 4, 2023

Background

assume we have

table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet

consider the query

select count(1) from table where year = '2022' and month = '03' and day = '20'

Actual

The above query will scan all parquet files, instead of only one (necessary)

currently, roapi supports reading all parquet files in a directory

tables:
  - name: "blogs"
    uri: "table/"
    option:
      format: "parquet"
      use_memory_table: false
    schema:
       # columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc

Expect

The above query will scan only one parquet file

however, since partition column is not in parquet schema, one idea is to improve roapi config to

tables:
  - name: "blogs"
    uri: "table/"
    option:
      format: "parquet"
      use_memory_table: false
    schema:
      # columns: [] # can ignore if table source support schema infer, e.g. csv, parquet, etc
      partitions:
        - name: "year"
           data_type: "Utf8"
        - name: "month"
           data_type: "Utf8"
        - name: "day"
           data_type: "Utf8"

Reference

  1. CreateExternalTable DDL supports table_partition_cols apache/datafusion#2061 (2022, Datafusion has file re-structure at late 2022, just for reference)
  2. #[derive(Deserialize, Clone, Debug, Eq, PartialEq)]
    #[serde(deny_unknown_fields)]
    pub struct TableSchema {
    pub columns: Vec<TableColumn>,
    }
  3. https://github.com/apache/arrow-datafusion/blob/e9852074bacd8c891d84eba38b3417aa16a2d18c/datafusion/core/src/datasource/listing/table.rs#L318-L324
@jychen7 jychen7 added good first issue Good for newcomers help wanted Extra attention is needed labels Mar 4, 2023
@houqp
Copy link
Member

houqp commented Mar 6, 2023

Definitely a good feature to add 👍

@chitralverma
Copy link
Contributor

@jychen7 shouldn't such partition directories be inferred automatically as Spark does instead of manually supplying them?

@jychen7
Copy link
Collaborator Author

jychen7 commented Jun 5, 2023

@chitralverma Yes, it is a good idea to support auto-detect partitions. On the other hand, it may be also reasonable to declare partitions manually for non-hive style partitions. E.g.

table/2022/03/20/log.parquet
table/2022/03/21/log.parquet

@houqp
Copy link
Member

houqp commented Jun 5, 2023

We also have some tables that are created outside of spark with non-hive style partitions, so being able to provide a custom partition scheme would be very useful to us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants