Bin packing optimization #607
Conversation
Hi @houqp, let me know if there's anything additional you would like to see in the PR; I'd appreciate any feedback on the items listed under Issues. I'll work on finishing this up by adding statistics on writes.
Thanks @Blajda - this is a great update to delta-rs. Personally I think it would be good to re-use the existing writers and fix/extend them where they fall short for the optimize scenario. Right now the writer does not handle multi-part files or desired file sizes, which might be a use case here. I.e. we could extend the writer to either eagerly flush files when the size of the in-memory writer reaches the desired size, or create a new underlying arrow writer and flush only when flush is called. In the case of multiple files, these could be the multi-part files. A downside of the current implementation is that we always try to partition the data written into the writer, which might be a fairly expensive operation when it's not needed - not sure, however, how it will behave if we don't have the partition columns present, since there is nothing to partition by :). We could try and use the ...
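A minimal sketch of the eager-flush idea, assuming a hypothetical buffer that tracks the approximate in-memory size of batches and rolls over to a new file once a target size is reached (names and the size heuristic are illustrative, not the existing delta-rs writer API):

```
use arrow::record_batch::RecordBatch;

/// Hypothetical buffer that splits buffered data into multiple files once
/// the in-memory size estimate reaches the desired target file size.
struct SizeBoundedBuffer {
    target_file_size: usize,
    buffered: Vec<RecordBatch>,
    buffered_size: usize,
    /// Batches grouped per output file, ready to be flushed as parquet.
    completed_files: Vec<Vec<RecordBatch>>,
}

impl SizeBoundedBuffer {
    fn new(target_file_size: usize) -> Self {
        Self {
            target_file_size,
            buffered: Vec::new(),
            buffered_size: 0,
            completed_files: Vec::new(),
        }
    }

    /// Buffer a batch; when the estimate exceeds the target, close the
    /// current file and start a new one (the "multi-part" case above).
    fn write(&mut self, batch: RecordBatch) {
        // In-memory size is only a rough proxy for on-disk parquet size,
        // since compression and encodings shrink the data considerably.
        self.buffered_size += batch.get_array_memory_size();
        self.buffered.push(batch);
        if self.buffered_size >= self.target_file_size {
            self.completed_files.push(std::mem::take(&mut self.buffered));
            self.buffered_size = 0;
        }
    }
}
```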
Very cool @Blajda! I agree with @roeap on the code reuse suggestion.
Ideally, we should also not write out partition columns into the actual parquet file. I think this is an oversight from our end. Users of delta-rs should be responsible for populating the partition columns. We could track it as a follow-up clean-up task.
As far as I know, there is no easy way around this other than manually comparing the table schema with the schema we get after reading the parquet file, and backfilling new columns in the arrow record batches before writing them out to the new parquet file.
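A rough sketch of that back-fill step using plain arrow APIs (the helper name is made up; it simply adds an all-null column for every table field that is missing from a batch read back from parquet):

```
use std::sync::Arc;

use arrow::array::new_null_array;
use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Align a batch read from an existing parquet file with the current table
/// schema, filling columns the file does not contain with nulls.
fn backfill_missing_columns(
    table_schema: Arc<Schema>,
    batch: &RecordBatch,
) -> Result<RecordBatch, ArrowError> {
    let columns = table_schema
        .fields()
        .iter()
        .map(|field| match batch.schema().index_of(field.name()) {
            // Column exists in the file: reuse it as-is.
            Ok(idx) => Ok(batch.column(idx).clone()),
            // Column was added to the table later: back-fill with nulls.
            Err(_) => Ok(new_null_array(field.data_type(), batch.num_rows())),
        })
        .collect::<Result<Vec<_>, ArrowError>>()?;
    RecordBatch::try_new(table_schema, columns)
}
```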
Just had a quick look into the RecordBatchWriter. To me it seems we do make sure the partition columns are not part of the data written out to storage: delta-rs/rust/src/writer/record_batch.rs, lines 184 to 194 in 5b2a09c.
We use that partitioned schema to initialize the PartitionWriter and validate the schema of data written into the PartitionWriter. Then again, I might have missed something, or there could be a bug ...
While we currently cannot fully handle schema evolution, in the case of added columns it is permissible to omit these from the written data rather than filling them. Of course this only makes sense if the data is missing from all files being binned together. In other cases back-filling with null seems like the only way to go...
Hi @roeap, I want to avoid the usage of ... Next steps for me are factoring out the drop-partition-columns functionality and creating some utility for back-filling.
@Blajda - what do you think about adding a write_partition method to the existing writer? That way you get the create_add (foremost stats) for free. More importantly though, as we move to support higher writer versions we will need to handle things like column aliases / renames, column invariants, identity and calculated columns etc., which I suspect will be non-trivial logic. By leveraging a single writer struct, I feel our lives might be a lot easier, since we'd have a single code path that writes out data. I know @wjones127 has been thinking a lot about our writer designs as well.
Yes, that sounds like the right approach to ensure the writers are unified. I assume the signature would look something like this:
```
async fn write_partition(&mut self, values: RecordBatch, partition: &?) -> Result<(), DeltaWriterError>
```
Now it's mostly a decision on how to represent the full partition path. We can reuse ... I think that abstraction would be very helpful and would also help clean up the bin packing implementation. In terms of tracking the desired file size, my understanding is that in-memory size doesn't exactly match on-disk size due to parquet compression and RLE. But if that calculation is trivial then I'll go for it.
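For the partition path question, one option is to derive it from the partition values the same way Hive-style paths are built (key=value segments). A small illustrative sketch, not an actual delta-rs helper; the null placeholder literal is also just illustrative:

```
use std::collections::HashMap;

/// Illustrative only: build a Hive-style partition prefix such as
/// `date=2022-05-22/country=US` from partition column values.
fn partition_path(
    partition_columns: &[String],
    partition_values: &HashMap<String, Option<String>>,
) -> String {
    partition_columns
        .iter()
        .map(|col| {
            let value = partition_values
                .get(col)
                .and_then(|v| v.as_deref())
                // Placeholder segment for null partition values.
                .unwrap_or("__HIVE_DEFAULT_PARTITION__");
            format!("{}={}", col, value)
        })
        .collect::<Vec<_>>()
        .join("/")
}
```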
It would be nice if write_partition() took a stream of record batches instead of a single one. Then the optimize command would bin-pack to collect a set of record batch streams and pass those streams on to write_partition() (one at a time, or maybe even in parallel). The function would then be responsible for merging the schemas. Also need to think about how the optimize metrics are passed back; maybe the write_partition() function returns metrics?
Left some other general comments as well.
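A sketch of what a stream-based signature could look like, assuming futures::Stream and a made-up metrics struct (purely illustrative, not the API that was eventually settled on):

```
use std::collections::HashMap;

use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;
use futures::stream::{Stream, StreamExt};

/// Hypothetical metrics returned for each written partition.
struct PartitionWriteMetrics {
    files_written: usize,
    rows_written: usize,
}

/// Illustrative signature: consume a stream of batches for one partition,
/// merge schemas as needed, and report back what was written.
async fn write_partition<S>(
    mut batches: S,
    partition_values: &HashMap<String, Option<String>>,
) -> Result<PartitionWriteMetrics, ArrowError>
where
    S: Stream<Item = Result<RecordBatch, ArrowError>> + Unpin,
{
    let mut metrics = PartitionWriteMetrics { files_written: 0, rows_written: 0 };
    while let Some(batch) = batches.next().await {
        let batch = batch?;
        metrics.rows_written += batch.num_rows();
        // ... buffer/flush the batch for this partition here ...
    }
    // Placeholder: a real implementation would count rolled-over files and
    // use `partition_values` to build the output path.
    let _ = partition_values;
    metrics.files_written = 1;
    Ok(metrics)
}
```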
rust/tests/optimize_test.rs (outdated)
```
@@ -0,0 +1,91 @@
#[cfg(feature = "datafusion-ext")]
mod optiize {
```
I think you may want a few more tests. Here are some properties we likely want to test for (a rough sketch for the first one follows the list):
- Optimize is idempotent. If you run twice, the second time it won't write any new files.
- Optimize bin packs. For example, with a max file size of 100MB, files of size 70MB, 100MB, 30MB will turn into two files of 100MB, regardless of the order they show up in the log. (I think this is handled, but might be nice to test.)
- Optimize fails if a concurrent writer overwrites the table. It might be able to succeed if a concurrent writer appends to the table, at the very least in the case where the append happens in a different partition.
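A rough shape for the idempotency test, with a hypothetical optimize entry point and metric names (placeholders for whatever API the PR ends up exposing):

```
// Hypothetical test outline; `create_test_table_with_small_files`, `Optimize`
// and the metric field names are placeholders, not the PR's actual API.
#[tokio::test]
async fn optimize_is_idempotent() {
    let mut table = create_test_table_with_small_files().await;

    // First run should coalesce the small files into larger ones.
    let first = Optimize::default().execute(&mut table).await.unwrap();
    assert!(first.num_files_added > 0);

    // Second run should find nothing left to do and add no new files.
    let second = Optimize::default().execute(&mut table).await.unwrap();
    assert_eq!(second.num_files_added, 0);
    assert_eq!(second.num_files_removed, 0);
}
```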
I changed the implementation to sort files to be optimized to ensure that it is idempotent. Added a couple of tests to validate that behavior too. For item 3 I think it will have to wait until we have a generalized pattern for non-append writers.
Co-authored-by: Will Jones <[email protected]>
rust/src/optimize.rs (outdated)
```
//Check for remove actions on the optimized partitions
let mut dtx = table.create_transaction(None);
dtx.add_actions(actions);
dtx.commit(None, None).await?;
```
Should we add an Optimize operation to the DeltaOperation enum, and pass it here to have it generate the respective data in the commit info?
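For context, a sketch of what such a variant could look like, assuming serde-based camelCase serialization of the operation parameters into commit info (the enum and field names here are illustrative, not necessarily what the PR adds):

```
use serde::{Deserialize, Serialize};

/// Illustrative stand-in for an enum of Delta operations recorded in commit info.
#[derive(Serialize, Deserialize, Debug, Clone)]
#[serde(rename_all = "camelCase")]
enum DeltaOperationSketch {
    /// Compact small files into larger ones.
    #[serde(rename_all = "camelCase")]
    Optimize {
        /// Stringified partition predicate the optimize was scoped to, if any.
        predicate: Option<String>,
        /// Desired maximum size of the produced files, in bytes.
        target_size: i64,
    },
}
```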
Added a new struct and associated tests for this.
Looking good! Left some minor questions. Reading this I realized that we should probably treat other operations - like vacuum - the same as here, i.e. having each as a separate struct that can be executed... But this is something for a follow-up.
Hi @roeap @wjones127
Thanks for adding those concurrency tests!
I just have one question on that commit info test.
```
    last_commit["operationParameters"]["targetSize"],
    json!(2_000_000)
);
assert_eq!(last_commit["operationParameters"]["predicate"], Value::Null);
```
Should predicate equal the filter used above?
Yes it should. I added an additional TODO here. There isn't a function that obtains the String representation of PartitionFilters yet; it should be fairly simple to implement.
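A minimal sketch of what that string representation could look like, using a simplified stand-in type (the real PartitionFilter in delta-rs supports several operators; the names here are illustrative):

```
use std::fmt;

/// Simplified stand-in for a partition filter: column, operator, value.
struct SimplePartitionFilter {
    key: String,
    op: String,    // e.g. "=", "!=", "in"
    value: String, // e.g. "'2022-05-22'"
}

impl fmt::Display for SimplePartitionFilter {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Produces e.g. `date = '2022-05-22'`.
        write!(f, "{} {} {}", self.key, self.op, self.value)
    }
}
```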
Okay. That seems sufficient for now.
rust/src/writer/utils.rs (outdated)
```
@@ -165,3 +193,33 @@ pub(crate) fn stringified_partition_value(
    Ok(Some(s))
}

/// Remove any partition related fields from the schema
pub(crate) fn schema_without_partitions(
```
Also, do you still need this function?
Nope. Reverted and restored the original function from where it was sourced.
Looks good to me. I'll wait a few days before merging to give others a chance to provide any final feedback.
Great job on this!
```
let partition_value = partition_value
    .as_deref()
    .unwrap_or(NULL_PARTITION_VALUE_DATA_PATH);
```
Just realized that this crate traditionally uses the constant, but the protocol specifies a different behaviour when it comes to nulls. I opened a ticket to track this question #619 - since it was also not introduced here...
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#partition-value-serialization
LGTM - great work @Blajda!
Thanks @wjones127 and @roeap for helping carry this to completion.
Amazing work on this @Blajda!
# Description
Implement string representation for PartitionFilter and add it to the optimize command commit info for a TODO. The following JSON representations of CommitInfo from the Delta Scala project were found:
```
"operationParameters":{"predicate":"[\"(id#480 < 50)\"]"}
"operationParameters":{"predicate":"[\"(col1#1939 = 2)\"]"}
"operationParameters":{"predicate":"[\"`col2` LIKE 'data-2-%'\"]"}
"operationParameters":{"predicate":"[\"(id#378L < 2)\"]"}
"operationParameters":{ "predicate":"[\"(spark_catalog.delta.`/table-uuid`.id IN (0, 180, 308, 225, 756, 1007, 1503))\"]" }
```
So I think representing the predicate as a list of filters in string representation would be logical, for example:
```
// single PartitionFilter
"operationParameters":{"predicate":"[\"date = '2022-05-22'\"]"}
// two filters
"operationParameters":{"predicate":"[\"date = '2022-05-22'\", \"country = 'US'\"]"}
```
# Related Issue(s)
Implements the TODO in #607.
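A small sketch of producing that predicate value with serde_json, assuming each filter already has a string form like `date = '2022-05-22'` (the surrounding commit-info plumbing is omitted):

```
use serde_json::json;

/// Encode filter strings as the JSON list-in-a-string used for the
/// `predicate` entry of `operationParameters`.
fn predicate_parameter(filters: &[String]) -> serde_json::Value {
    let encoded = serde_json::to_string(filters).unwrap();
    json!({ "operationParameters": { "predicate": encoded } })
}

// predicate_parameter(&["date = '2022-05-22'".to_string(), "country = 'US'".to_string()])
// serializes to {"operationParameters":{"predicate":"[\"date = '2022-05-22'\",\"country = 'US'\"]"}}
```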
Description
An optimization implementation provided by Databricks is bin-packing which coalesces small files into a larger file. This reduces the number of calls to the underlying storage and results in faster table reads.
This is a high-level description of the process. A user can provide a filter for which partitions they want to optimize. Active add actions for those partitions are obtained and then placed into bins labeled by partition. Once actions are in their respective partition bin, additional bins are built, each consisting of a list of files to be merged together. Files with a size larger than delta.targetFileSize, or bins containing only a single file, are not optimized. Each bin is then processed and the smaller files are written out to a larger one. Corresponding Add and Remove actions are created. Metrics on how many files were considered and the total file size are also captured.
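A condensed sketch of the bin-building step described above (the file representation and target-size handling are simplified; the actual implementation also tracks partitions, Add/Remove actions, and metrics):

```
/// Simplified view of a candidate file: its relative path and size in bytes.
struct FileMeta {
    path: String,
    size: i64,
}

/// Group files into bins so that each bin stays around `target_size`.
/// Files already at or above the target are skipped, and the caller
/// ignores bins that end up holding a single file.
fn build_bins(mut files: Vec<FileMeta>, target_size: i64) -> Vec<Vec<FileMeta>> {
    // Sorting makes the packing deterministic, which keeps optimize idempotent.
    files.sort_by(|a, b| a.path.cmp(&b.path));

    let mut bins: Vec<Vec<FileMeta>> = Vec::new();
    let mut current: Vec<FileMeta> = Vec::new();
    let mut current_size = 0;

    for file in files.into_iter().filter(|f| f.size < target_size) {
        if !current.is_empty() && current_size + file.size > target_size {
            bins.push(std::mem::take(&mut current));
            current_size = 0;
        }
        current_size += file.size;
        current.push(file);
    }
    if !current.is_empty() {
        bins.push(current);
    }
    bins
}
```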
Related Issue(s)
Issues
1. Currently the writers provided by delta-rs write partition information to the parquet file. This differs from the Databricks implementation, which does not. This causes a schema mismatch when packing files produced by different writers. It is currently handled by dropping partition columns when rewriting (see the sketch after this list).
2. Similar to (1.), schema evolution allows the addition of new columns, converting NullTypes -> any other type, and upcasts. Currently this is not handled and will cause a schema mismatch error. Looking for some guidance on how these scenarios should be handled.
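A sketch of the partition-column drop mentioned in item 1, using plain arrow record batches (illustrative; not necessarily how the PR factors it out):

```
use std::sync::Arc;

use arrow::datatypes::Schema;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Drop partition columns from a batch before rewriting the data file,
/// since the partition values are carried in the Delta log instead.
fn drop_partition_columns(
    batch: &RecordBatch,
    partition_columns: &[String],
) -> Result<RecordBatch, ArrowError> {
    let schema = batch.schema();
    let keep: Vec<usize> = schema
        .fields()
        .iter()
        .enumerate()
        .filter(|(_, field)| !partition_columns.iter().any(|c| c.as_str() == field.name()))
        .map(|(idx, _)| idx)
        .collect();

    let fields: Vec<_> = keep.iter().map(|&i| schema.field(i).clone()).collect();
    let columns: Vec<_> = keep.iter().map(|&i| batch.column(i).clone()).collect();
    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}
```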
Todo