feat: 2061 create external table ddl table partition cols #2099
Conversation
datafusion/src/execution/options.rs (outdated)
```rust
    listing::ListingOptions,
};

/// CSV file read option
#[derive(Copy, Clone)]
```
The Avro and JSON read options do not derive the `Copy` trait. Removing it here allows the struct to hold a `Vec<String>`, which is not `Copy`.
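A minimal sketch of the underlying Rust rule (the struct name and fields below are hypothetical, not the actual DataFusion types): a type that owns a `Vec<String>` cannot derive `Copy`, because `Vec` is not `Copy`, so only `Clone` is derived.

```rust
// Hypothetical stand-in for a read-options struct; not the real DataFusion type.
// Adding Copy to this derive would not compile, because Vec<String> is not Copy.
#[derive(Clone)]
struct ReadOptionsSketch {
    file_extension: String,
    table_partition_cols: Vec<String>,
}

fn main() {
    let opts = ReadOptionsSketch {
        file_extension: ".csv".to_string(),
        table_partition_cols: vec!["year".to_string(), "month".to_string()],
    };
    // An explicit clone works where an implicit Copy would not.
    let cloned = opts.clone();
    println!("{}", cloned.table_partition_cols.len()); // prints 2
}
```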
datafusion/src/execution/options.rs (outdated)
```diff
@@ -61,7 +67,8 @@ impl<'a> CsvReadOptions<'a> {
             schema: None,
             schema_infer_max_records: 1000,
             delimiter: b',',
-            file_extension: ".csv",
+            file_extension: DEFAULT_CSV_EXTENSION,
```
Replaced the hard-coded literal with the shared constant.
datafusion/src/execution/options.rs (outdated)
```rust
        ListingOptions {
            format: Arc::new(file_format),
            collect_stat: true,
```
This `true` aligns with the old code in datafusion/src/logical_plan/builder.rs.
datafusion/src/sql/parser.rs (outdated)
```diff
@@ -192,6 +194,35 @@ impl<'a> DFParser<'a> {
         }
     }

     fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
```
This is similar to `parse_columns`. I didn't reuse `parse_columns` because partitions only have an identifier/name; there is no type or other options as there are for columns.
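The shape of such a parser can be sketched in a self-contained way. The function below is an illustrative stand-in operating on pre-split string tokens, not the actual `DFParser::parse_partitions` (which consumes sqlparser tokens); it parses a parenthesized, comma-separated identifier list such as `(year, month)`.

```rust
// Illustrative sketch only: parse "( ident [, ident]* )" from string tokens.
fn parse_partitions(tokens: &[&str]) -> Result<Vec<String>, String> {
    let mut iter = tokens.iter();
    if iter.next() != Some(&"(") {
        return Err("expected '('".to_string());
    }
    let mut partitions = Vec::new();
    loop {
        // Expect an identifier (or an immediate ')' for an empty list).
        match iter.next() {
            Some(&")") => {
                if partitions.is_empty() {
                    return Ok(partitions);
                }
                return Err("trailing comma before ')'".to_string());
            }
            Some(ident) => partitions.push(ident.to_string()),
            None => return Err("unexpected end of input".to_string()),
        }
        // Expect a ',' to continue or a ')' to finish.
        match iter.next() {
            Some(&",") => continue,
            Some(&")") => return Ok(partitions),
            _ => return Err("expected ',' or ')'".to_string()),
        }
    }
}

fn main() {
    let parts = parse_partitions(&["(", "year", ",", "month", ")"]).unwrap();
    println!("{:?}", parts); // prints ["year", "month"]
}
```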
Sorry, I missed this comment yesterday 👍
Force-pushed from 28f6a8b to 9492d17.
datafusion/src/execution/context.rs (outdated)
```rust
            table_partition_cols: vec![],
        };
        let listing_options = options
            .parquet_pruning(parquet_pruning)
```
This is for the following tests' use: disabling parquet pruning has one use case from #723.
Will merge master or rebase, because of the whole folder re-org in #2081.
```diff
@@ -43,8 +47,10 @@ pub struct CsvReadOptions<'a> {
     /// Max number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
     pub schema_infer_max_records: usize,
     /// File extension; only files with this extension are selected for data input.
-    /// Defaults to ".csv".
+    /// Defaults to DEFAULT_CSV_EXTENSION.
```
👍
```rust
    }

    /// Helper to convert these user facing options to `ListingTable` options
    pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
```
👍
```rust
        projection: Option<Vec<usize>>,
        target_partitions: usize,
        table_name: impl Into<String>,
    ) -> Result<Self> {
        // TODO remove hard coded enable_pruning
```
❤️
```rust
            return Ok(partitions);
        }

        loop {
```
Given that this code to parse a comma-separated list is duplicated in `parse_columns` below, perhaps we could refactor it into a common function to reduce the replication -- not needed for this PR though.
Yes, it may be possible. Or maybe we can expose this as a public method in the sqlparser crate? (Agreed, not needed for this PR.) This is a copy of the equivalent implementation in sqlparser.
````diff
@@ -55,6 +55,21 @@
 WITH HEADER ROW
 LOCATION '/path/to/aggregate_test_100.csv';
 ```
````

If data sources are already partitioned in Hive style, `PARTITIONED BY` can be used for partition pruning.
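For instance, a hypothetical Hive-partitioned CSV source laid out as `/path/to/events/year=2021/month=01/...` could be registered with a statement along these lines (table name and paths are made up; exact clause order follows the existing examples in this docs file):

```sql
CREATE EXTERNAL TABLE events
STORED AS CSV
WITH HEADER ROW
PARTITIONED BY (year, month)
LOCATION '/path/to/events';
```

Queries filtering on `year` or `month` can then skip non-matching partition directories entirely.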
❤️
Force-pushed from b97bf56 to e8993eb.
Thanks again @jychen7
Which issue does this PR close?
Closes #2061
Rationale for this change
Support partition pruning using the `CREATE EXTERNAL TABLE` DDL.
What changes are included in this PR?
Changes to `read_parquet` and `register_parquet`.
It may be easier to review commit by commit, since the first commit adds the functionality and the remaining commits just modify a few dependent files/tests that use `read_parquet` and `register_parquet` with the default `ParquetReadOptions`.
Are there any user-facing changes?
No change if the user is using SQL.
However, there is a change to the `context.read_parquet` and `context.register_parquet` parameters, which now receive a `ParquetReadOptions` carrying `table_partition_cols`. This is more aligned with the other file formats, since CSV, Avro, and JSON already have their own `ReadOptions`.
Manual Test in datafusion-cli
create tmp sample file
run in datafusion-cli
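The "create tmp sample file" step might look like the following sketch, which builds a Hive-style partitioned directory of tiny CSV files. The base path, partition column (`year`), and data are illustrative, not taken from the PR:

```shell
# Illustrative only: create a Hive-style partitioned CSV layout under /tmp.
BASE=/tmp/partitioned_csv
mkdir -p "$BASE/year=2021" "$BASE/year=2022"
printf 'id,value\n1,a\n' > "$BASE/year=2021/data.csv"
printf 'id,value\n2,b\n' > "$BASE/year=2022/data.csv"
```

In datafusion-cli, a statement such as `CREATE EXTERNAL TABLE t STORED AS CSV WITH HEADER ROW PARTITIONED BY (year) LOCATION '/tmp/partitioned_csv';` (per the DDL this PR adds) could then register the table, and filters on `year` would prune the non-matching directory.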