CreateExternalTable DDL supports table_partition_cols #2061

Closed
jychen7 opened this issue Mar 23, 2022 · 1 comment · Fixed by #2099
Labels
enhancement New feature or request

Comments

@jychen7
Contributor

jychen7 commented Mar 23, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do

Assume we have a data lake stored as

table/year=2022/month=03/day=20/log.parquet
table/year=2022/month=03/day=21/log.parquet

Currently, CreateExternalTable supports defining columns and a location (e.g. table/):

https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/sql/parser.rs#L70-L81

As a result, a SQL query such as select * from table where year = '2022' and month = '03' and day = '20' seems to scan all files under table/.
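To make the desired behaviour concrete, here is a minimal std-only Rust sketch of partition pruning over Hive-style paths. It is not DataFusion's implementation; the helper names parse_partition_values and prune_files are hypothetical, and it assumes every partition column appears as a key=value path segment.

```rust
use std::collections::HashMap;

/// Extract Hive-style `key=value` segments from a file path.
/// Segments without `=` (the table root, the file name) are ignored.
fn parse_partition_values(path: &str) -> HashMap<&str, &str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Keep only the files whose partition values match every
/// `(column, value)` equality filter. Hypothetical helper for illustration.
fn prune_files<'a>(files: &[&'a str], filters: &[(&str, &str)]) -> Vec<&'a str> {
    files
        .iter()
        .copied()
        .filter(|path| {
            let values = parse_partition_values(path);
            filters
                .iter()
                .all(|(column, expected)| values.get(column) == Some(expected))
        })
        .collect()
}
```

With the two paths above and the filters year = '2022', month = '03', day = '20', only table/year=2022/month=03/day=20/log.parquet survives, so the day=21 file would never need to be opened.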

Describe the solution you'd like

CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
PARTITIONED BY (p1, p2)
LOCATION '/path/to/';

As with the existing ListingOptions, PARTITIONED BY would only support String partition columns:
https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/datasource/listing/table.rs#L178

Describe alternatives you've considered

Additional context

PARTITIONED BY is also used in Trino and AWS Athena:
https://trino.io/episodes/5.html
https://docs.aws.amazon.com/athena/latest/ug/create-table.html

I notice that ListingOptions already supports table_partition_cols and partition pruning; CreateExternalTable just does not accept such input and pass it through:
https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/datasource/listing/table.rs#L165-L186
https://github.com/apache/arrow-datafusion/blob/5936edc2a94d5fb20702a41eab2b80695961b9dc/datafusion/src/datasource/listing/table.rs#L358-L365
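The pass-through could look roughly like the std-only Rust sketch below. Everything in it is a hypothetical stand-in: ListingOptionsSketch only mimics the table_partition_cols field, and parse_partition_columns and listing_options_for are illustrative helpers, not the real parser or the real ListingOptions API.

```rust
/// Hypothetical stand-in for the listing options; it carries only the
/// field relevant to this request.
#[derive(Debug, Default, PartialEq)]
struct ListingOptionsSketch {
    table_partition_cols: Vec<String>,
}

/// Parse a parenthesised column list such as "(p1, p2)" into column names.
/// Returns None when the parentheses are missing or a name is empty.
fn parse_partition_columns(clause: &str) -> Option<Vec<String>> {
    let inner = clause.trim().strip_prefix('(')?.strip_suffix(')')?;
    inner
        .split(',')
        .map(|name| {
            let name = name.trim();
            if name.is_empty() {
                None
            } else {
                Some(name.to_string())
            }
        })
        .collect()
}

/// Forward the optional PARTITIONED BY column list into the options.
/// A missing clause means "no partition columns", which matches today's behaviour.
fn listing_options_for(clause: Option<&str>) -> Option<ListingOptionsSketch> {
    let table_partition_cols = match clause {
        Some(text) => parse_partition_columns(text)?,
        None => Vec::new(),
    };
    Some(ListingOptionsSketch { table_partition_cols })
}
```

The design point is that the DDL layer only needs to collect the column names as strings and hand them over unchanged; the pruning itself stays where it already lives.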

@jychen7 jychen7 added the enhancement New feature or request label Mar 23, 2022
@alamb
Contributor

alamb commented Mar 23, 2022

sounds like a good enhancement to me

jychen7 added a commit to jychen7/arrow-datafusion that referenced this issue Mar 27, 2022
alamb pushed a commit that referenced this issue Apr 3, 2022
* #2061 support "PARTITIONED BY" in CreateExternalTable DDL for datafusion

* support table_partition_cols in ballista and add ParquetReadOptions

* fix a few usage of read_parquet

* fix CsvReadOption clone due to removing the copy trait

* fix CsvReadOption clone due to removing the copy trait

* fix "missing documentation for a struct field"

* fix a few usage of register_parquet

* Allow ParquetReadOption to receive parquet_pruning from execution::Context::SessionConfig

https://github.com/apache/arrow-datafusion/blob/73ea6e16f5c8f34526c01490a5ec277a68f33791/datafusion/tests/parquet_pruning.rs#L143

* fix benches import

* Apply suggestions from code review (lint)