Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: 2061 create external table ddl table partition cols #2099

Merged

Conversation

jychen7
Copy link
Contributor

@jychen7 jychen7 commented Mar 27, 2022

Which issue does this PR close?

Closes #2061

Rationale for this change

support partition pruning using CreateExternalTable DDL

What changes are included in this PR?

  • datafusion CreateExternalTable DDL
  • datafusion and ballista context about read_parquet and register_parquet

It maybe easier to review by commits, since first commit adds the functionality and other commits just modify a few dependent file/tests that use read_parquet and register_parquet with default ParquetReadOptions

Are there any user-facing changes?

No change if user is using SQL.
However, there is change in context.read_parquet and context.register_parquet params, receiving a ParquetReadOptions for table_partition_cols. It is more align with other file format, since Csv, Avro, Json already have their own ReadOptions

Manual Test in datafusion-cli

create tmp sample file

echo "1,2" > tmp/year=2022/data.csv
echo "3,4" > tmp/year=2021/data.csv

run in datafusion-cli

❯ CREATE EXTERNAL TABLE t1 (a INT, b INT) STORED AS CSV LOCATION 'tmp';
0 rows in set. Query took 0.002 seconds.
❯ select * from t1;
+---+---+
| a | b |
+---+---+
| 3 | 4 |
| 1 | 2 |
+---+---+
2 rows in set. Query took 0.012 seconds.
❯ CREATE EXTERNAL TABLE t2 (a INT, b INT) STORED AS CSV PARTITIONED BY (year) LOCATION 'tmp';
0 rows in set. Query took 0.001 seconds.
❯ select * from t2;
+---+---+------+
| a | b | year |
+---+---+------+
| 1 | 2 | 2022 |
| 3 | 4 | 2021 |
+---+---+------+
2 rows in set. Query took 0.011 seconds.
❯ select * from t2 where year = '2022';
+---+---+------+
| a | b | year |
+---+---+------+
| 1 | 2 | 2022 |
+---+---+------+
1 row in set. Query took 0.013 seconds.

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate sql SQL Planner labels Mar 27, 2022
listing::ListingOptions,
};

/// CSV file read option
#[derive(Copy, Clone)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arvo and Json read options do not use Copy trait. Removing it allows to use Vec<String>

@@ -61,7 +67,8 @@ impl<'a> CsvReadOptions<'a> {
schema: None,
schema_infer_max_records: 1000,
delimiter: b',',
file_extension: ".csv",
file_extension: DEFAULT_CSV_EXTENSION,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replace with same constant


ListingOptions {
format: Arc::new(file_format),
collect_stat: true,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -192,6 +194,35 @@ impl<'a> DFParser<'a> {
}
}

fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar to parse_columns. I didn't reuse parse_columns because partitions only have identifier/name, there is no type or other options like column

https://github.com/apache/arrow-datafusion/blob/f50e8000ee4de8432d7b59b63c18a8e0171a072b/datafusion/src/sql/parser.rs#L227-L229

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry missed this comment yesterday 👍

@jychen7 jychen7 force-pushed the 2061-CreateExternalTable-DDL-table_partition_cols branch from 28f6a8b to 9492d17 Compare March 27, 2022 15:31
@houqp houqp added the api change Changes the API exposed to users of the crate label Mar 27, 2022
table_partition_cols: vec![],
};
let listing_options = options
.parquet_pruning(parquet_pruning)
Copy link
Contributor Author

@jychen7 jychen7 Mar 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jychen7 jychen7 marked this pull request as ready for review March 27, 2022 22:57
@jychen7 jychen7 changed the title 2061 create external table ddl table partition cols feat: 2061 create external table ddl table partition cols Mar 27, 2022
…Table-DDL-table_partition_cols

# Conflicts:
#	ballista/rust/client/src/context.rs
#	datafusion/src/execution/context.rs
@jychen7
Copy link
Contributor Author

jychen7 commented Mar 30, 2022

will merge master or rebase, because of the whole folder re-org in #2081

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jychen7 -- Sorry for the delay reviewing this PR. This PR looks really good 🏅

I think the addition of read options to the parquet APIs is very nice

cc @rdettai and @tustvold

Hopefully fixing up the merge conflicts won't be too bad

datafusion/core/src/execution/context.rs Show resolved Hide resolved
@@ -43,8 +47,10 @@ pub struct CsvReadOptions<'a> {
/// Max number of rows to read from CSV files for schema inference if needed. Defaults to 1000.
pub schema_infer_max_records: usize,
/// File extension; only files with this extension are selected for data input.
/// Defaults to ".csv".
/// Defaults to DEFAULT_CSV_EXTENSION.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

datafusion/core/src/execution/options.rs Outdated Show resolved Hide resolved
}

/// Helper to convert these user facing options to `ListingTable` options
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

projection: Option<Vec<usize>>,
target_partitions: usize,
table_name: impl Into<String>,
) -> Result<Self> {
// TODO remove hard coded enable_pruning
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

return Ok(partitions);
}

loop {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this code to parse a comma separated list is duplicated in parse_columns below, perhaps we could refactor into a common function to reduce the replication -- not needed for this PR though

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2099 (comment)

yes, it maybe possible. Or maybe we can expose this as public method in sqlparser crate?
(agree not needed for this PR)

This is a copy of the equivalent implementation in sqlparser

datafusion/core/src/sql/parser.rs Outdated Show resolved Hide resolved
@@ -55,6 +55,21 @@ WITH HEADER ROW
LOCATION '/path/to/aggregate_test_100.csv';
```

If data sources are already partitioned in Hive style, `PARTITIONED BY` can be used for partition pruning.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@jychen7
Copy link
Contributor Author

jychen7 commented Apr 2, 2022

Hopefully fixing up the merge conflicts won't be too bad

thanks @alamb , if you mean merge master because of #2081, it is actually super easy (than I thought), since #2081 is just rename and this PR didn't introduce any new file

@jychen7 jychen7 force-pushed the 2061-CreateExternalTable-DDL-table_partition_cols branch from b97bf56 to e8993eb Compare April 2, 2022 14:06
@alamb alamb merged commit d54ba4e into apache:master Apr 3, 2022
@alamb
Copy link
Contributor

alamb commented Apr 3, 2022

Thanks again @jychen7

@jychen7 jychen7 deleted the 2061-CreateExternalTable-DDL-table_partition_cols branch April 3, 2022 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate datafusion Changes in the datafusion crate sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CreateExternalTable DDL supports table_partition_cols
3 participants