
feat: logical Node for find files #2194

Merged
merged 13 commits into delta-io:main on Mar 24, 2024

Conversation

@hntd187 (Collaborator) commented on Feb 19, 2024

Description

This is some of my first work on David's proposal in #2006. It is also meant to push #2048 and CDF in general forward by making the logical operations of Delta tables more composable than they are today.

Related Issue(s)

#2006
#2048

I think (and @Blajda, correct me here) we can build upon this and eventually move towards a DeltaPlanner-esque enum for operations and their associated logical plan building.

Still to do

  • Implement a separate path for partition columns that doesn't require scanning the files
  • Plumb this into DeltaScan so the scan can make use of this logical node
  • General polish and cleanup; there are lots of unnecessary fields and awkward ways things are built
  • More tests; there is currently one large integration-style end-to-end test, but it can and should be broken down

@hntd187 marked this pull request as draft on February 19, 2024 21:19
@github-actions bot added the binding/rust (Issues for the Rust crate) label on Feb 19, 2024

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@hntd187 changed the title from "feat: Logical Node for find files" to "feat: logical Node for find files" on Feb 19, 2024
@Blajda (Collaborator) left a comment:

There's quite a bit of work that's still required. When implementing this operator, my recommendation is to reuse as much of the DeltaScan functionality as possible. This will help avoid bugs in the future when support for schema evolution / column mapping is introduced.

Comment on lines 7 to 13
pub struct FindFilesNode {
    pub id: String,
    pub input: LogicalPlan,
    pub predicates: Vec<Expr>,
    pub files: Vec<String>,
    pub schema: DFSchemaRef,
}
@Blajda (Collaborator) commented on Feb 20, 2024:
The internals of this structure should be opaque, and corresponding constructor functions should be made.
Users will not provide the list of files to scan; instead they will provide some reference to the DeltaTable (i.e. an EagerSnapshot). From that snapshot you can obtain the schema and files.

You also shouldn't need input here. This operation should function as a source.
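A very rough sketch of the kind of constructor this could mean; the field set, the EagerSnapshot import path, and the new() signature below are assumptions for illustration, not the final delta-rs API:

use datafusion_expr::Expr;
// Path is approximate; EagerSnapshot lives in the delta-rs kernel snapshot module.
use deltalake_core::kernel::snapshot::EagerSnapshot;

pub struct FindFilesNode {
    id: String,
    predicate: Expr,
    snapshot: EagerSnapshot,
}

impl FindFilesNode {
    /// Callers hand over a table snapshot and a predicate; the schema and the
    /// candidate file list are derived from the snapshot internally rather
    /// than being exposed as public fields.
    pub fn new(id: String, snapshot: EagerSnapshot, predicate: Expr) -> Self {
        Self { id, predicate, snapshot }
    }
}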

fn fmt_for_explain(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
    write!(
        f,
        "FindFiles id={}, predicate={:?}, files={:?}",
@Blajda (Collaborator) commented:

Having files here will be a bit too verbose. It would be better to print the version of the table.
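For example, assuming the node keeps version and predicate fields (which the current code does not yet have), the explain output could look something like:

fn fmt_for_explain(&self, f: &mut std::fmt::Formatter) -> std::fmt::Result {
    // Print the table version instead of the full file list to keep EXPLAIN output readable.
    write!(
        f,
        "FindFiles id={}, version={}, predicate={:?}",
        self.id, self.version, self.predicate
    )
}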

Comment on lines 77 to 90
if let Some(column) = batch.column_by_name(PATH_COLUMN) {
    let mut column_iter = column.as_string::<i32>().into_iter();
    while let Some(Some(row)) = column_iter.next() {
        let df = ctx
            .read_parquet(row, ParquetReadOptions::default())
            .await?
            .filter(predicate.to_owned())?;
        if df.count().await? > 0 {
            batches.push(row);
        }
    }
}
let str_array = Arc::new(StringArray::from(batches));
RecordBatch::try_new(only_file_path_schema(), vec![str_array]).map_err(Into::into)
@Blajda (Collaborator) commented on Feb 20, 2024:

Multiple enhancements can be made here. The current implementation of find_files is able to perform a projection which reads only the data required to evaluate the predicate. Keeping the projection behavior is a must, since it reduces how much data is read.

Something else to consider here is that FindFiles only needs to know whether at least one record satisfies the predicate, so once one match is found the scan of that file can be stopped. Further optimization can be done by applying a limit of 1 to the filter.
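A hedged sketch of what that could look like with the DataFusion DataFrame API; predicate_columns, matching_files, and the other surrounding variables are assumptions, not code from this PR:

// Read only the columns the predicate needs, and stop after the first matching row.
let df = ctx
    .read_parquet(path, ParquetReadOptions::default())
    .await?
    // projection: restrict the scan to the columns referenced by the predicate
    .select_columns(&predicate_columns)?
    .filter(predicate.clone())?
    // one surviving row is enough to mark the file as a match
    .limit(0, Some(1))?;
if df.count().await? > 0 {
    matching_files.push(path);
}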

@hntd187 (Collaborator, Author) commented:

Another thing I have to add: when the predicate only involves partition columns, we don't need to read the file at all, since the values can be inferred from the path itself.
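A small sketch of that idea: for a Hive-style layout, the partition values can be parsed straight from the path (the function name is purely illustrative):

// e.g. "year=2024/month=02/part-00000.snappy.parquet" -> [("year", "2024"), ("month", "02")]
fn partition_values_from_path(path: &str) -> Vec<(String, String)> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}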


let e = p.execute(0, state.task_ctx())?;
let s = collect_sendable_stream(e).await.unwrap();
print_batches(&s)?;
@Blajda (Collaborator) commented:

Should assert on the final output result here. See other operations for an example of comparing a string batch representation to a batch.
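For instance, with DataFusion's batch-comparison macro; the column name and expected rows below are placeholders, not the real test data:

use datafusion::assert_batches_sorted_eq;

let expected = vec![
    "+---------------------------------+",
    "| __delta_rs_path                 |",
    "+---------------------------------+",
    "| part-00000-c000.snappy.parquet  |",
    "+---------------------------------+",
];
assert_batches_sorted_eq!(&expected, &s);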

}

fn output_partitioning(&self) -> Partitioning {
    Partitioning::UnknownPartitioning(0)
@Blajda (Collaborator) commented:

This operation should not be limited to a single partition. I.e. think of each partition as a CPU thread here; ideally we should be able to divide the files being scanned across the available CPU threads.
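Sketching what that could look like; the files field and the hard-coded target of 8 partitions are assumptions, and the real value would normally come from the session config:

fn output_partitioning(&self) -> Partitioning {
    // Split the candidate files across partitions so DataFusion can scan them in parallel.
    let target_partitions = 8; // assumption: normally read from the session/runtime config
    let partitions = self.files.len().clamp(1, target_partitions);
    Partitioning::UnknownPartitioning(partitions)
}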

@hntd187 (Collaborator, Author) commented:

This was just a lazy initial implementation, I can fix that

@Blajda (Collaborator) left a comment:

Significant improvements.
How much effort / what would be required to remove the dictionary output schema?

PATH_COLUMN,
DataType::Dictionary(Box::new(DataType::UInt16), Box::new(DataType::Utf8)),
false,
));
@Blajda (Collaborator) commented:

Yeah, there's definitely a bit of a weird interface mismatch. This operation must only output each file path at most once, so there is minimal value in returning a dictionary to the end user; however, dictionary encoding should still be used when we perform a scan.
Is it easy to simply have a string be returned?

@hntd187 (Collaborator, Author) commented:

I added the ability to turn off dictionary encoding to the delta scan config; it's on by default everywhere else, but here we use just a simple string, as you mentioned.

.map(|batch| batch.project(&[field_idx]).unwrap())
.collect();

let result_batches = concat_batches(&ONLY_FILES_SCHEMA.clone(), &path_batches)?;
@Blajda (Collaborator) commented:

I was thinking this operation could output file paths as soon as they are discovered, to allow operations downstream to begin their work ASAP. When performing a memory-only scan it makes sense to output a single batch, since IO is minimal.
For this PR I think it's okay since it aligns with the previous behaviour, but it's something we can change.

@hntd187 (Collaborator, Author) commented:

Can you elaborate a bit more? I understand what you are suggesting here, but I'm not sure what you are expecting. Do you want the return type to be a vec of record batches, or a stream, or something else?

@Blajda (Collaborator) commented on Mar 14, 2024:

Yes, output record batches, and these record batches should simply contain strings. There is no benefit to using dictionaries in the output.
I was thinking longer term: the current implementation waits for all files to be scanned before sending a record batch downstream. There might be some benefit to sending a record batch with a single record as soon as a file match is determined.
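A minimal sketch of emitting one matched path as its own plain-string batch; the column name and helper function are illustrative only:

use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Build a one-row Utf8 batch for a single matched file so it can be sent
// downstream immediately instead of waiting for the whole scan to finish.
fn single_path_batch(path: &str) -> Result<RecordBatch, ArrowError> {
    let schema = Arc::new(Schema::new(vec![Field::new("path", DataType::Utf8, false)]));
    let column: ArrayRef = Arc::new(StringArray::from(vec![path]));
    RecordBatch::try_new(schema, vec![column])
}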

@hntd187 (Collaborator, Author) commented:

Oh, you're talking about the actual schema of the record batch. I thought you were saying the return type of the function was wrong, or that it needed to change for results to be "immediately available", or something of the sort.

@hntd187 requested a review from @Blajda on March 17, 2024 18:44
@hntd187 marked this pull request as ready for review on March 19, 2024 20:04
) -> Result<RecordBatch> {
    register_store(log_store.clone(), state.runtime_env().clone());
    let scan_config = DeltaScanConfigBuilder::new()
        .wrap_partition_values(false)
@Blajda (Collaborator) commented:

So we don't want to completely disable dictionary values when performing a scan, since they can provide significant memory savings. I'd prefer we wrap the partition values during a scan; however, in the output batches for the operation they should be provided as strings.
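One way to do that, sketched with arrow's cast kernel; the helper name and exact wiring are assumptions, not code from this PR:

use std::sync::Arc;

use arrow::compute::cast;
use arrow::datatypes::{DataType, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Cast one dictionary-encoded column of a batch back to plain Utf8 before the
// batch is handed to the caller, while the scan itself keeps dictionary encoding.
fn column_as_utf8(batch: &RecordBatch, column: &str) -> Result<RecordBatch, ArrowError> {
    let idx = batch.schema().index_of(column)?;

    let mut columns = batch.columns().to_vec();
    columns[idx] = cast(&columns[idx], &DataType::Utf8)?;

    // Rebuild the schema with that field's type switched to Utf8.
    let fields: Vec<_> = batch
        .schema()
        .fields()
        .iter()
        .enumerate()
        .map(|(i, field)| {
            if i == idx {
                Arc::new(field.as_ref().clone().with_data_type(DataType::Utf8))
            } else {
                field.clone()
            }
        })
        .collect();

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}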

@github-actions bot added the binding/python (Issues for the Python package) label on Mar 24, 2024
@Blajda (Collaborator) left a comment:

LGTM. Thanks for making the changes.

@Blajda merged commit f56d8c9 into delta-io:main on Mar 24, 2024
22 of 23 checks passed