
File partitioning for ListingTable #1141

Merged · 18 commits · Nov 1, 2021

Conversation

@rdettai (Contributor) commented Oct 18, 2021

Which issue does this PR close?

Closes #1139.

Rationale for this change

Adds the capability to parse file partitioning and to prune unnecessary files.

What changes are included in this PR?

  • add partition_values: Vec<ScalarValue> to PartitionedFile (see the sketch after this list)
  • implement pruned_partition_list
  • add extra column for the partition dimensions to execute() record batch result in file format execution plans
    • avro/csv/json
    • parquet
  • add the proper TableProviderFilterPushDown value to supports_filter_pushdown() to avoid re-evaluation of the partition pruning [1]
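
As a rough sketch of the first item in the list above (a paraphrase for illustration, not the verbatim code from the PR; the `path` field and doc comments are assumptions, only `partition_values: Vec<ScalarValue>` is named by the PR):

```rust
use datafusion::scalar::ScalarValue;

/// Illustrative shape of the extended file descriptor: each file listed for
/// a scan now carries the partition values parsed from its path, e.g.
/// `mytable/date=2021-01-01/file.csv` contributes one `ScalarValue` for the
/// `date` partition column.
pub struct PartitionedFile {
    /// Path of the file (simplified; the real struct holds richer metadata).
    pub path: String,
    /// Values of the partition columns, in the same order as
    /// `ListingOptions.table_partition_cols`.
    pub partition_values: Vec<ScalarValue>,
}
```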

What changes are planned in further issues/PRs?

Are there any user-facing changes?

Renames ListingOptions.partitions to ListingOptions.table_partition_cols to make it a bit more explicit.


[1] Re-evaluating the filters on the partition column would be expensive:

  • it requires the column to be pushed down, and thus materialized in the source execution plan. This is acceptable if we use DictionaryArray<uint8>, which is pretty cheap (see the sketch below).
  • when applying the filtering expression, the dictionary needs to be expanded, because many kernel ops are not supported on Dictionaries for now (see Add better and faster support for dictionary types #87). This could be very expensive!
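
To make the first bullet of [1] concrete, here is a minimal arrow-rs sketch (the value and row count are made up) of why a dictionary-encoded partition column is cheap to materialize: the partition value is stored once, and each row carries only a 1-byte key.

```rust
use arrow::array::DictionaryArray;
use arrow::datatypes::UInt8Type;

fn main() {
    // A partition column materialized for a file with 1024 rows: the value
    // "2021-01-01" is stored once in the dictionary; every row just holds a
    // 1-byte key pointing at it.
    let partition_col: DictionaryArray<UInt8Type> =
        vec!["2021-01-01"; 1024].into_iter().collect();

    assert_eq!(partition_col.len(), 1024);
    // Only one distinct value is stored, regardless of the row count.
    assert_eq!(partition_col.values().len(), 1);
}
```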

@github-actions bot added the ballista and datafusion (Changes in the datafusion crate) labels Oct 18, 2021
@rdettai changed the title from "[feat] adding partition_values to PartitionedFile" to "File partitioning for ListingTable" on Oct 18, 2021
@alamb (Contributor) commented Oct 18, 2021

I plan to check this out carefully tomorrow

@rdettai (Contributor, Author) commented Oct 19, 2021

Thanks! You can take a quick look, but this will not be ready before 1 or 2 days... If you want, I can ping you once it's in a more reviewable state 😃

@alamb (Contributor) commented Oct 19, 2021

> Thanks! You can take a quick look, but this will not be ready before 1 or 2 days... If you want, I can ping you once it's in a more reviewable state 😃

Sounds good. Thank you 🙏

@rdettai force-pushed the file-partitioning branch 2 times, most recently from b1a9db2 to 34df752 on October 22, 2021 at 16:49
@rdettai (Contributor, Author) commented Oct 22, 2021

This is a bit harder than I thought it would be 😅. I will have to keep working on this next week. Feel free to give some feedback already, most of the important elements are already there.

datafusion/src/datasource/listing/table.rs (resolved)
/// The minimum number of records required from this source plan
pub limit: Option<usize>,
/// The partitioning column names
pub table_partition_dims: Vec<String>,
Member:

Curious why not name it table_partition_cols to better align with the comment? The type should make it clear that it's storing column names.

Contributor (Author):

Haha, I have to admit that I'm hesitating a lot about the naming 😅. I am wondering if it's not the comment that should be changed. These partitions are originally encoded in the file path, which we then parse and project into a column if necessary. So they end up as columns, but they are not columns per se.

Member:

Conceptually they are handled as "virtual columns" during compute, right? For example, when a user writes a SQL query that filters on a partition, the filter is applied to that partition just like to other regular columns. I am suggesting partition column here because it's the term used in Hive and Spark, so readers will be more familiar with it. Are there systems that use partition dimensions as the naming convention?

Contributor (Author):

not that I am aware of, I'll change this to cols 😉

datafusion/src/datasource/listing/helpers.rs (outdated, resolved)
datafusion/src/datasource/listing/helpers.rs (outdated, resolved)
datafusion/src/datasource/listing/helpers.rs (outdated, resolved)
datafusion/src/datasource/listing/helpers.rs (resolved)
datafusion/src/datasource/listing/helpers.rs (outdated, resolved)
datafusion/src/datasource/listing/helpers.rs (outdated, resolved)
} else {
    let applicable_filters = filters
        .iter()
        .filter(|f| expr_applicable_for_cols(table_partition_dims, f));
Member:

To avoid the complexity of expr_applicable_for_cols, perhaps we could just throw a runtime error to the user if invalid filters are provided? This also makes sure that if there is a typo in the filter, we won't silently swallow the error.

Contributor (Author):

I don't think we are silencing any error here. Typos in the query were already caught (or should already be caught) upstream in the SQL parser with checks like https://github.com/apache/arrow-datafusion/blob/fe1a9e2c55392b934c85098430f78a26ef71380e/datafusion/src/sql/planner.rs#L769-L775

Here expr_applicable_for_cols is different: it only checks whether a given expression can be resolved using the partitioning dimensions alone. For instance, suppose the table partitioning is of the form mytable/date=2021-01-01/file.csv with a file schema of the form Schema([Field("a",int), Field("b",string)]). A filter such as WHERE b='hello' or WHERE b=date is perfectly valid, but it should not be kept in the list of applicable_filters because it cannot be resolved from the partitioning columns alone (see the sketch below).
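
To illustrate, here is a toy, standalone sketch of that check (the Expr enum below is a simplified stand-in for DataFusion's real expression type, not the actual implementation):

```rust
// An expression can be evaluated against the partition columns alone iff
// every column it references is a partition column; literals always pass.
enum Expr {
    Column(String),
    Literal(String),
    BinaryOp { left: Box<Expr>, right: Box<Expr> },
}

fn expr_applicable_for_cols(partition_cols: &[String], expr: &Expr) -> bool {
    match expr {
        Expr::Column(name) => partition_cols.iter().any(|c| c == name),
        Expr::Literal(_) => true,
        Expr::BinaryOp { left, right } => {
            expr_applicable_for_cols(partition_cols, left)
                && expr_applicable_for_cols(partition_cols, right)
        }
    }
}

fn main() {
    let partition_cols = vec!["date".to_string()];
    // WHERE date = '2021-01-01': resolvable from the path alone -> kept.
    let on_partition = Expr::BinaryOp {
        left: Box::new(Expr::Column("date".into())),
        right: Box::new(Expr::Literal("2021-01-01".into())),
    };
    // WHERE b = 'hello': valid, but needs file contents -> not kept.
    let on_data = Expr::BinaryOp {
        left: Box::new(Expr::Column("b".into())),
        right: Box::new(Expr::Literal("hello".into())),
    };
    assert!(expr_applicable_for_cols(&partition_cols, &on_partition));
    assert!(!expr_applicable_for_cols(&partition_cols, &on_data));
}
```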

@houqp (Member) commented Oct 25, 2021

Epic work @rdettai !

> when applying the filtering expression the dictionary needs to be expanded because many kernel ops are not supported on Dictionaries for now

I think we won't need to worry too much about this for this particular use case, because it won't be a problem anymore after we move to using scalar columnar values to store the partition values.

@alamb (Contributor) commented Oct 25, 2021

I plan to review this PR tomorrow (when I am fresher and can give it the look it deserves)

@alamb (Contributor) commented Oct 26, 2021

> I plan to review this PR tomorrow (when I am fresher and can give it the look it deserves)

@rdettai mentioned he has some more work planned for this PR so I will postpone additional review until that is complete

@rdettai (Contributor, Author) commented Oct 28, 2021

I just noticed that there seems to be a small irregularity in the length of the column statistics vector. I'm adding one last test for that, with the appropriate patch if required 😉

@alamb (Contributor) left a comment

Epic work. I got through almost all of this PR, but I still need to finish up helpers.rs -- I ran out of time today; will finish tomorrow.

I did leave some comments / feedback, but I don't think any of it is absolutely required to merge.

ballista/rust/core/proto/ballista.proto (resolved)
@@ -613,33 +614,28 @@ message ScanLimit {
   uint32 limit = 1;
 }

-message ParquetScanExecNode {
+message FileScanExecConf {
Contributor:

makes sense to me

.collect(),
batch_size: exec.batch_size() as u32,
base_conf: Some(exec.base_config().try_into()?),
// TODO serialize predicates
Contributor:

is this a TODO you plan for this PR? Or a follow-on one?

Contributor (Author):

This was already there.

projection: conf
    .projection
    .as_ref()
    .unwrap_or(&vec![])
Contributor:

it looks like an improvement to me, but previously the code would error if projection: None was passed and this code will simply convert that to an empty list.

Is that intentional?

Contributor (Author):

None is encoded as an empty vec. Not the cleanest, but it works here, as a projection with no columns could not have another meaning.

@@ -61,16 +62,9 @@ impl FileFormat for AvroFormat {
     async fn create_physical_plan(
         &self,
         conf: PhysicalPlanConfig,
+        _filters: &[Expr],
Contributor:

this is neat that the creation of the physical plan gets the filters

)
.await;

let result = ctx
Contributor:

this is very cool

datafusion/src/datasource/listing/helpers.rs (resolved)
Comment on lines +55 to +56
/// Note that only `DEFAULT_PARTITION_COLUMN_DATATYPE` is currently
/// supported for the column type.
Contributor:

As mentioned elsewhere, I think it would be fine to say "these are always dictionary-coded string columns" rather than "currently", which hints at changing it in the future.

Contributor (Author):

Let's discuss this in the other #1141 (comment)

 ) -> Result<TableProviderFilterPushDown> {
-    Ok(TableProviderFilterPushDown::Inexact)
+    if expr_applicable_for_cols(&self.options.table_partition_cols, filter) {
Contributor:

❤️
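
For context, here is a self-contained mock of the pushdown decision shown in the diff above (the enum mirrors DataFusion's TableProviderFilterPushDown; expr_applicable_for_cols is stubbed with a naive substring check, whereas the real code walks the Expr tree):

```rust
#[derive(Debug, PartialEq)]
enum TableProviderFilterPushDown {
    Exact,
    Inexact,
}

// Stub: treat a filter as partition-only if it mentions a partition column.
// The real check recurses over the expression tree (see the sketch above).
fn expr_applicable_for_cols(partition_cols: &[String], filter: &str) -> bool {
    partition_cols.iter().any(|c| filter.contains(c.as_str()))
}

fn supports_filter_pushdown(
    partition_cols: &[String],
    filter: &str,
) -> TableProviderFilterPushDown {
    if expr_applicable_for_cols(partition_cols, filter) {
        // Fully applied while pruning the file list, so the optimizer does
        // not need to re-evaluate it against the data.
        TableProviderFilterPushDown::Exact
    } else {
        // May still help prune (e.g. via statistics) but must be re-checked.
        TableProviderFilterPushDown::Inexact
    }
}

fn main() {
    let cols = vec!["date".to_string()];
    assert_eq!(
        supports_filter_pushdown(&cols, "date = '2021-01-01'"),
        TableProviderFilterPushDown::Exact
    );
    assert_eq!(
        supports_filter_pushdown(&cols, "b = 'hello'"),
        TableProviderFilterPushDown::Inexact
    );
}
```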

datafusion/src/datasource/listing/helpers.rs (outdated, resolved)
rdettai added a commit to rdettai/arrow-datafusion that referenced this pull request Oct 29, 2021
@alamb (Contributor) left a comment

I think this is great and ready to go. Any other thoughts @houqp or @Dandandan ?


lazy_static! {
/// The datatype used for all partitioning columns for now
pub static ref DEFAULT_PARTITION_COLUMN_DATATYPE: DataType = DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
Contributor:

I see -- so we are envisioning somehow allowing users of this code to specify the type used in the partitioning (and to provide their own code to determine the partition values). That makes sense.

}

impl ExpressionVisitor for ApplicabilityVisitor<'_> {
    fn pre_visit(self, expr: &Expr) -> Result<Recursion<Self>> {
Contributor:

👍

datafusion/src/datasource/listing/helpers.rs (resolved)
}

/// convert the paths of the files to a record batch with the following columns:
/// - one column for the file size named `_df_part_file_size_`
Contributor:

This is a clever way to apply filtering -- convert the paths to batches, run the evaluation on the batches, and then turn the result back into paths 👍
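
A minimal sketch of that idea (illustrative only: the `_df_part_file_size_` column name comes from the doc comment above, while the paths and the single `date=` partition key are made up):

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray, UInt64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn main() -> arrow::error::Result<()> {
    // Hypothetical listing of (path, size) pairs for a partitioned table.
    let files = [
        ("mytable/date=2021-01-01/file1.csv", 100u64),
        ("mytable/date=2021-01-02/file2.csv", 200u64),
    ];

    // Parse the partition value out of each path (naive `key=value` split).
    let dates: Vec<&str> = files
        .iter()
        .filter_map(|(path, _)| {
            path.split('/').find_map(|seg| seg.strip_prefix("date="))
        })
        .collect();
    let sizes: Vec<u64> = files.iter().map(|(_, size)| *size).collect();

    // One row per file: the file size plus one column per partition key.
    let schema = Arc::new(Schema::new(vec![
        Field::new("_df_part_file_size_", DataType::UInt64, false),
        Field::new("date", DataType::Utf8, false),
    ]));
    let columns: Vec<ArrayRef> = vec![
        Arc::new(UInt64Array::from(sizes)),
        Arc::new(StringArray::from(dates)),
    ];
    let batch = RecordBatch::try_new(schema, columns)?;

    // Partition filters such as `date = '2021-01-01'` can now be evaluated
    // against `batch` with the regular expression machinery; the surviving
    // row indices map back to the file paths to keep.
    assert_eq!(batch.num_rows(), 2);
    Ok(())
}
```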

assert_eq!(&parsed_files[0].partition_values, &[]);
assert_eq!(&parsed_files[1].partition_values, &[]);

let parsed_metas = parsed_files
Contributor:

👍

@alamb mentioned this pull request Nov 1, 2021
@alamb (Contributor) commented Nov 1, 2021

I plan to merge from master and if all the tests pass put this one in. FYI @houqp / @Dandandan @jimexist

Please let me know if you want more time to review

@alamb (Contributor) commented Nov 1, 2021

FYI fixed a logical conflict in 5d34be6

repeated uint32 projection = 6;
ScanLimit limit = 7;
Statistics statistics = 8;
uint32 batch_size = 3;
Member:

would this be back-compatible?

Contributor (Author):

As long as you don't have DataFusion nodes with different versions, it should be ok!

@@ -70,7 +70,7 @@ pub enum ListEntry {
 }

 /// The path and size of the file.
-#[derive(Debug, Clone)]
+#[derive(Debug, Clone, PartialEq)]
Member:

Suggested change:
-#[derive(Debug, Clone, PartialEq)]
+#[derive(Debug, Clone, PartialEq, Eq)]

Contributor (Author):

Not sure if it is really useful to add a trait that is not used 😊


use arrow::datatypes::{DataType, Field, Schema, SchemaRef};

pub fn aggr_test_schema() -> SchemaRef {
Member:

Suggested change:
-pub fn aggr_test_schema() -> SchemaRef {
+pub(crate) fn aggr_test_schema() -> SchemaRef {

@rdettai (Contributor, Author) commented Nov 1, 2021

The tests module is enabled for tests only anyway, so the (crate) modifier does not have much effect here (you would need to build DataFusion in test mode to use it).

@alamb (Contributor) commented Nov 1, 2021

As this has been outstanding for a long time and is a fairly large change, the potential for conflict is large -- I am going to merge it in now and hopefully we can keep iterating in future PRs. Thanks again @rdettai @jimexist @houqp and @Dandandan -- 🚀

@alamb merged commit b2a5028 into apache:master Nov 1, 2021
@Dandandan (Contributor) commented

Thank you @rdettai ! Really nice work

Labels
api change (Changes the API exposed to users of the crate) · datafusion (Changes in the datafusion crate) · enhancement (New feature or request)
Development

Successfully merging this pull request may close these issues.

Implement partitioned read in listing table provider
5 participants