feat: logical Node for find files #2194
Conversation
ACTION NEEDED delta-rs follows the Conventional Commits specification for release automation. The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
There's quite a bit of work that's still required. When implementing this operator my recommendation is to reuse as much of the DeltaScan functionality as possible. This will help avoid bugs in the future whenever support for schema evolution / column mapping is introduced.
pub struct FindFilesNode {
    pub id: String,
    pub input: LogicalPlan,
    pub predicates: Vec<Expr>,
    pub files: Vec<String>,
    pub schema: DFSchemaRef,
}
The internals of this structure should be opaque, and corresponding `new` functions should be added. Users will not provide the list of files to scan; instead they will provide some reference to the DeltaTable (i.e. EagerSnapshot). From that snapshot you can obtain the schema and files. You also shouldn't need input here. This operation should function as a source.
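A possible shape for such a node, sketched with plain-std stand-ins (`Snapshot` here stands in for delta-rs's `EagerSnapshot`, and a `String` predicate stands in for DataFusion's `Expr` — both are assumptions for illustration). The point is the opaque fields plus a `new` constructor that derives files and schema from the snapshot rather than taking them from the caller:

```rust
// Stand-in for delta-rs's EagerSnapshot: holds the table version,
// file list, and schema, so callers never pass those directly.
#[derive(Debug, Clone)]
pub struct Snapshot {
    pub version: i64,
    pub files: Vec<String>,
    pub schema: Vec<String>,
}

// Opaque node: fields are private; construction goes through `new`.
#[derive(Debug)]
pub struct FindFilesNode {
    id: String,
    snapshot: Snapshot, // stands in for EagerSnapshot
    predicate: String,  // stands in for a DataFusion Expr
}

impl FindFilesNode {
    // The constructor takes a snapshot reference; files and schema
    // are obtained from it, not supplied by the user.
    pub fn new(id: impl Into<String>, snapshot: Snapshot, predicate: impl Into<String>) -> Self {
        Self { id: id.into(), snapshot, predicate: predicate.into() }
    }

    pub fn version(&self) -> i64 {
        self.snapshot.version
    }

    pub fn files(&self) -> &[String] {
        &self.snapshot.files
    }
}
```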
fn fmt_for_explain(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
    write!(
        f,
        "FindFiles id={}, predicate={:?}, files={:?}",
Having the files here will be a bit too verbose. It would be better to print the version of the table.
if let Some(column) = batch.column_by_name(PATH_COLUMN) {
    let mut column_iter = column.as_string::<i32>().into_iter();
    while let Some(Some(row)) = column_iter.next() {
        let df = ctx
            .read_parquet(row, ParquetReadOptions::default())
            .await?
            .filter(predicate.to_owned())?;
        if df.count().await? > 0 {
            batches.push(row);
        }
    }
}
let str_array = Arc::new(StringArray::from(batches));
RecordBatch::try_new(only_file_path_schema(), vec![str_array]).map_err(Into::into)
Multiple enhancements can be made here. The current implementation of find_files is able to perform a projection which reads only the data required to evaluate the predicate. Keeping the projection behavior is a must, since it reduces how much is read.
Something else to consider: FindFiles only needs to know whether at least one record satisfies the predicate, hence once one match is found the scan of that file can be stopped. Further optimization can be done by having a limit of 1 on the filter.
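Both ideas can be sketched std-only, with plain `Vec`s standing in for a file's contents (in the real operator this would be a projected Parquet scan): touch only the column the predicate needs, and let `any` short-circuit at the first matching row, which is the moral equivalent of a limit of 1 on the filter:

```rust
// Decide whether a file contains at least one row matching the predicate.
// `rows` stands in for the file's contents; `predicate_col` is the index of
// the single column the predicate needs.
fn file_has_match(rows: &[Vec<i64>], predicate_col: usize, threshold: i64) -> bool {
    rows.iter()
        .map(|row| row[predicate_col]) // projection: read only the needed column
        .any(|v| v > threshold)        // early exit on the first matching row
}
```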
Another thing I have to add: when the predicate involves only partition columns, we don't need to read the file at all, since the values can be inferred from the path itself.
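For illustration, partition values can be recovered from a Hive-style path without opening the file. This std-only helper assumes `key=value` directory segments in the path layout; it is a sketch, not delta-rs API:

```rust
// Parse `key=value` directory segments out of a Hive-style file path,
// skipping segments (like the file name) that carry no partition value.
fn partition_values(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|seg| seg.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}
```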
let e = p.execute(0, state.task_ctx())?;
let s = collect_sendable_stream(e).await.unwrap();
print_batches(&s)?;
Should assert on the final output result here. See other operations for an example of comparing a string batch representation to a batch.
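As a sketch of that assertion style, here is a std-only stand-in for Arrow's pretty-printed batch comparison (in the spirit of `assert_batches_eq!`; the renderer below is a hypothetical simplification of `arrow::util::pretty`): render the batch as table lines, then assert on the exact expected text.

```rust
// Render a one-column "batch" the way a pretty printer lays out a table,
// so a test can assert on the exact expected lines.
fn render_batch(header: &str, rows: &[&str]) -> String {
    // Column width fits the widest of the header and all rows.
    let width = std::iter::once(header.len())
        .chain(rows.iter().map(|r| r.len()))
        .max()
        .unwrap();
    let sep = format!("+-{}-+", "-".repeat(width));
    let mut lines = vec![sep.clone(), format!("| {:<w$} |", header, w = width), sep.clone()];
    for row in rows {
        lines.push(format!("| {:<w$} |", row, w = width));
    }
    lines.push(sep);
    lines.join("\n")
}
```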
fn output_partitioning(&self) -> Partitioning {
    Partitioning::UnknownPartitioning(0)
This operation should not be limited to a single partition. I.e. think of each partition as a CPU thread; ideally we should be able to divide the files being scanned across the available CPU threads.
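A minimal sketch of the idea: round-robin the file list into `n` chunks, one per execution partition, so an `execute(i, ...)` implementation could scan only chunk `i`. The chunking helper below is an assumption for illustration, not delta-rs code:

```rust
// Split the files round-robin into `n` roughly equal chunks, one per
// execution partition (think: one per CPU thread).
fn partition_files(files: Vec<String>, n: usize) -> Vec<Vec<String>> {
    let mut chunks: Vec<Vec<String>> = (0..n).map(|_| Vec::new()).collect();
    for (i, file) in files.into_iter().enumerate() {
        chunks[i % n].push(file);
    }
    chunks
}
```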
This was just a lazy initial implementation, I can fix that
Significant improvements.
How much effort / what would be required to remove the dictionary output schema?
    PATH_COLUMN,
    DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
    false,
));
Yeah, there's definitely a bit of a weird interface mismatch. This operation must only output a file path at most once, so there is minimal value in returning a dictionary to the end user; however, dictionary encoding should be used when we perform a scan.
Is it easy to simply return a string?
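Conceptually, dropping the dictionary on output is just a key lookup. This std-only sketch mirrors what casting a `DictionaryArray<UInt16, Utf8>` to plain `Utf8` does (the real conversion would go through Arrow's cast kernel):

```rust
// Expand dictionary-encoded values (keys indexing into a values table,
// as in Arrow's DictionaryArray<UInt16, Utf8>) back into plain strings.
fn unwrap_dictionary(keys: &[u16], values: &[&str]) -> Vec<String> {
    keys.iter().map(|&k| values[k as usize].to_string()).collect()
}
```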
I added to the delta scan config the ability to turn off the dictionary encoding. It's on by default everywhere else, but here we use just a simple string, as you mentioned.
    .map(|batch| batch.project(&[field_idx]).unwrap())
    .collect();

let result_batches = concat_batches(&ONLY_FILES_SCHEMA.clone(), &path_batches)?;
I was thinking for this operation to output file paths as soon as they are discovered, to allow operations downstream to begin their work ASAP. When performing a memory-only scan it makes sense to output a single batch, since IO is minimal.
For this PR I think it's okay since it aligns with the previous behaviour, but it's something we can change.
Can you elaborate a bit more? I understand what you are suggesting here, but I'm not sure what you are expecting. Do you want the return type to be a vec of record batches, or a stream, or something else?
Yes, output record batches, and these record batches should simply contain strings. There is no benefit to using dictionaries in the output.
I was thinking longer term. The current implementation waits for all files to be scanned prior to sending a record batch downstream. There might be some benefit to sending a record batch with a single record as soon as a file match is determined.
Oh, you're talking about the actual schema of the record batch. I thought you were saying the return type of the function was wrong, or something of the sort, for it to be "immediately available".
…ether or not to wrap partition columns in dictionary encodings, this is on by default
) -> Result<RecordBatch> {
    register_store(log_store.clone(), state.runtime_env().clone());
    let scan_config = DeltaScanConfigBuilder::new()
        .wrap_partition_values(false)
So we don't want to completely disable dictionary values when performing a scan, since they can provide significant memory savings. I'd prefer we wrap the partition values during a scan; however, in the output batches for the operation they should be provided as strings.
…coding, now just turn it on and rebuild the batch
LGTM. Thanks for making the changes.
Description
Some of my first workings on David's proposal in #2006, this is also meant to push #2048 and general CDF forward as well by making the logical operations of delta tables more composable than they are today.
Related Issue(s)
#2006
#2048
I think, and @Blajda correct me here, we can build upon this and eventually move towards a DeltaPlanner-esque enum for operations and their associated logical plan building.
Still to do:
- DeltaScan, so delta scan can make use of this logical node