Bin packing optimization #607

Merged: 29 commits, merged into delta-io:main on Jun 2, 2022

Conversation

@Blajda (Collaborator) commented May 16, 2022

Description

One optimization implemented by Databricks is bin packing, which coalesces small files into larger ones. This reduces the number of calls to the underlying storage and results in faster table reads.

At a high level, the process works as follows. A user can provide a filter for which partitions they want to optimize. Active add actions for those partitions are obtained and placed into bins keyed by partition. Once actions are in their respective partition bin, additional bins are built, each consisting of a list of files to be merged together. Files larger than delta.targetFileSize and bins containing only a single file are not optimized.

Each bin is then processed: the smaller files are rewritten into a single larger file, and corresponding Add and Remove actions are created. Metrics on how many files were considered and the total file size are also captured.
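To illustrate the binning step described above, here is a minimal, self-contained sketch. The FileMeta type, the first-fit strategy, and the deterministic sort are assumptions for this example, not the exact code in this PR.

```rust
/// Minimal stand-in for a file entry taken from an Add action (illustrative only).
struct FileMeta {
    path: String,
    size: u64,
}

/// Group files smaller than `target_size` into bins whose combined size stays
/// at or below `target_size` (first-fit). Bins with a single file are dropped,
/// since rewriting one file alone brings no benefit.
fn bin_pack(mut files: Vec<FileMeta>, target_size: u64) -> Vec<Vec<FileMeta>> {
    // Sorting keeps the result deterministic regardless of log order.
    files.sort_by_key(|f| f.size);
    let mut bins: Vec<(u64, Vec<FileMeta>)> = Vec::new();
    for file in files {
        // Files already at or above the target size are left untouched.
        if file.size >= target_size {
            continue;
        }
        match bins
            .iter_mut()
            .find(|(used, _)| used + file.size <= target_size)
        {
            Some((used, bin)) => {
                *used += file.size;
                bin.push(file);
            }
            None => bins.push((file.size, vec![file])),
        }
    }
    bins.into_iter()
        .map(|(_, bin)| bin)
        .filter(|bin| bin.len() > 1)
        .collect()
}
```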

Related Issue(s)

Issues

  1. Currently the writers provided by delta-rs write partition information to the parquet file. This differs from the Databricks implementation, which does not. This causes a schema mismatch when packing files from different writers. It is currently handled by dropping partition columns when rewriting.

  2. Similar to (1), schema evolution allows adding new columns, converting NullType to any other type, and upcasting. Currently this is not handled and will cause a schema mismatch error. Looking for some guidance on how these scenarios should be handled.

Todo

  • Add file statistics using new writers
  • Implement partition column drop on partition write
  • Determine criteria for when optimize should fail
  • Idempotency tests

@Blajda (Collaborator, Author) commented May 16, 2022

Hi @houqp, let me know if there's anything additional you would like to see in this PR; I would appreciate any feedback on the items listed under Issues. I'll work on finishing this up by adding statistics on writes.

@roeap (Collaborator) commented May 16, 2022

Thanks @Blajda - this is a great update to delta-rs. Personally I think it would be good to re-use the existing writers and fix/extend them where they fall short for the optimize scenario. At least the RecordBatchWriter should handle dropping columns that are represented as partition folders within the divide_by_partitions function. There are some tests for that, but if it doesn't, that's a bug we should definitely fix. The writers also handle file naming and computing file statistics, so we don't duplicate that logic.

Right now the writer does not handle multi-part files or desired file sizes, which might be a use case here. I.e. we could extend the writer to either eagerly flush files when the size of the in-memory writer reaches the desired size, or create a new underlying arrow writer and flush only when flush is called. In case of multiple files, these could be the multi-part files.

A downside of the current implementation is that we always try to partition the data written into the writer, which might be a fairly expensive operation when it's not needed. I'm not sure, however, how it will behave if the partition columns aren't present, since there is nothing to partition by :).

We could try to use the PartitionWriter directly, but the logic for add actions / statistics is currently part of the higher-level writer, I think.

@houqp (Member) commented May 16, 2022

Very cool @Blajda! I agree with @roeap's code-reuse suggestion.

> Currently the writers provided by delta-rs write partition information to the parquet file. This differs from the Databricks implementation, which does not. This causes a schema mismatch when packing files from different writers. It is currently handled by dropping partition columns when rewriting.

Ideally, we should also not write out partition columns into the actual parquet file. I think this is an oversight on our end. Users of delta-rs should be responsible for populating the partition columns. We could track it as a follow-up cleanup task.

> Similar to (1), schema evolution allows adding new columns, converting NullType to any other type, and upcasting. Currently this is not handled and will cause a schema mismatch error. Looking for some guidance on how these scenarios should be handled.

As far as I know, there is no easy way around this other than manually comparing the table schema with the schema we get after reading from the parquet file and back-filling new columns in the arrow record batches before writing them out to the new parquet file.
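For illustration, a minimal sketch of what such back-filling could look like with arrow record batches; the function name and error handling are assumptions, and type upcasts are not covered here.

```rust
// Sketch: project a batch onto the table schema, filling missing columns with nulls.
use arrow::array::{new_null_array, ArrayRef};
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

/// Reuse columns present in `batch`; back-fill columns missing from it
/// (added later via schema evolution) with all-null arrays of the target type.
fn backfill_missing_columns(
    batch: &RecordBatch,
    table_schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    let columns: Vec<ArrayRef> = table_schema
        .fields()
        .iter()
        .map(|field| match batch.schema().column_with_name(field.name()) {
            // Column exists in the file being rewritten: keep it as-is.
            Some((idx, _)) => batch.column(idx).clone(),
            // Column was added later: back-fill with nulls.
            None => new_null_array(field.data_type(), batch.num_rows()),
        })
        .collect();
    RecordBatch::try_new(table_schema, columns)
}
```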

@roeap (Collaborator) commented May 16, 2022

Just had a quick look into the RecordBatchWriter. To me it seems we do make sure the partition columns are not part of the data written out to storage.

```rust
/// Returns the arrow schema representation of the partitioned files written to table
pub fn partition_arrow_schema(&self) -> ArrowSchemaRef {
    Arc::new(ArrowSchema::new(
        self.arrow_schema_ref
            .fields()
            .iter()
            .filter(|f| !self.partition_columns.contains(f.name()))
            .map(|f| f.to_owned())
            .collect::<Vec<_>>(),
    ))
}
```

We use that partitioned schema to initialize the PartitionWriter and to validate the schema of data written into it. Then again, I might have missed something, or there could be a bug...

While we currently cannot fully handle schema evolution, in the case of added columns it is permissible to omit these from the written data rather than filling them. Of course this only makes sense if the data is missing from all files being binned together. In other cases, back-filling with null seems like the only way to go...

https://github.com/delta-io/delta/blob/7103115962ab795272d9a259b0c069c277777939/PROTOCOL.md?plain=1#L477-L480

@Blajda (Collaborator, Author) commented May 17, 2022

Hi @roeap,
I originally wrote this at the start of April with the original JSON writer and I haven't tested that scenario with the new writers. Glad to hear it was resolved, and sorry about the confusion. The protocol link was helpful; I was trying to determine whether default values are supported when adding columns, and this confirms that back-filling with null is acceptable.

I want to avoid using divide_by_partition_values since the sort is not required in this case, but I do agree there are utilities such as next_data_path that can be reused.

My next steps are factoring out the drop-partition-columns functionality and creating some utility for back-filling.

@roeap (Collaborator) commented May 17, 2022

@Blajda - what do you think about adding a write_partition method to the Writer trait and essentially carving out the inner loop over partitions in the current write method? It would then be great to have the writer aware of the desired file size as well, since right now it is not. In the case of optimize you do have more prior knowledge about how to bin the files, which the optimize command can already leverage, but the general logic would be a great addition to the writer.

That way you get the creation of the Add action (foremost the stats) for free. More importantly though, as we move to support higher writer versions we will need to handle things like column aliases/renames, column invariants, identity and calculated columns, etc., which I suspect will be non-trivial logic. I feel our lives might be a lot easier if we leverage a single writer struct with a single code path that writes out data.

I know @wjones127 has been thinking a lot about our writer designs as well.

@Blajda (Collaborator, Author) commented May 18, 2022

> what do you think about adding a write_partition method to the Writer trait and essentially carving out the inner loop over partitions in the current write method?

Yes, that sounds like the right approach to ensure the writers are unified. I assume the signature would look something like this:

```rust
async fn write_partition(&mut self, values: RecordBatch, partition: &?) -> Result<(), DeltaWriterError>
```

Now it's mostly a decision on how to represent the full partition path. We can reuse get_partition_key and have ? be a String. My only concern is that it might be a bit confusing for new users. Maybe we can make a new type called PartitionPath which can be created in the same way as get_partition_key and simply encapsulates a String.
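A rough sketch of what such a PartitionPath newtype could look like; the field layout, method names, and the null-value encoding are placeholder assumptions, not a final API.

```rust
/// Hive-style partition path, e.g. `date=2022-05-22/country=US`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PartitionPath(String);

impl PartitionPath {
    /// Build the path from `(column, value)` pairs ordered by the table's
    /// partition columns, the same way get_partition_key builds its String today.
    pub fn from_partition_values(values: &[(String, Option<String>)]) -> Self {
        let segments: Vec<String> = values
            .iter()
            .map(|(col, val)| match val {
                Some(v) => format!("{}={}", col, v),
                // Placeholder encoding for null partition values.
                None => format!("{}=__HIVE_DEFAULT_PARTITION__", col),
            })
            .collect();
        PartitionPath(segments.join("/"))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}
```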

I think that abstraction would be very helpful and would also help cleanup the bin packing implementation.

In terms of tracking the desired file size, my understanding is that the in-memory size doesn't exactly match the on-disk size due to parquet compression and RLE. But if that calculation is trivial then I'll go for it.

@wjones127 (Collaborator) left a comment

It would be nice if write_partition() took a stream of record batches instead of a single one. Then the optimize command would bin-pack to collect a set of record batch streams and pass those streams on to write_partition() (one at a time, or maybe even in parallel). The function would then be responsible for merging the schemas. We also need to think about how the optimize metrics are passed back; maybe write_partition() returns metrics? One possible shape is sketched below.

Left some other general comments as well.
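One possible shape for a stream-based write_partition, sketched only to illustrate the suggestion above; the metrics struct and error handling are assumptions, and no actual writing is performed.

```rust
use arrow::record_batch::RecordBatch;
use futures::stream::{Stream, StreamExt};

/// Hypothetical per-partition metrics returned to the optimize command.
pub struct PartitionWriteMetrics {
    pub num_batches: usize,
    pub num_rows: usize,
}

/// Consume a stream of batches destined for a single partition.
pub async fn write_partition<S, E>(
    mut batches: S,
    partition_path: &str,
) -> Result<PartitionWriteMetrics, E>
where
    S: Stream<Item = Result<RecordBatch, E>> + Unpin,
{
    let _ = partition_path;
    let mut metrics = PartitionWriteMetrics { num_batches: 0, num_rows: 0 };
    while let Some(batch) = batches.next().await {
        let batch = batch?;
        // A real implementation would reconcile the batch schema with the table
        // schema and hand the data to the underlying parquet writer for the
        // partition; here we only count what flowed through.
        metrics.num_batches += 1;
        metrics.num_rows += batch.num_rows();
    }
    Ok(metrics)
}
```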

rust/src/optimize.rs (review thread resolved)
```
@@ -0,0 +1,91 @@
#[cfg(feature = "datafusion-ext")]
mod optiize {
```
@wjones127 (Collaborator) commented May 18, 2022

I think you may want a few more tests. Here are some properties we likely want to test for (a rough sketch of the first two follows the list):

  1. Optimize is idempotent. If you run twice, the second time it won't write any new files.
  2. Optimize bin packs. For example, with a max file size of 100MB, files of size 70MB, 100MB, 30MB will turn into two files of 100MB, regardless of the order they show up in the log. (I think this is handled, but might be nice to test.)
  3. Optimize fails if a concurrent writer overwrites the table. It might be able to succeed if a concurrent writer appends to the table, at the very least when the append happens in a different partition.
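For illustration, properties (1) and (2) could be expressed roughly like this against the hypothetical bin_pack helper sketched earlier in the description; these are not the tests added in this PR.

```rust
// Assumes the hypothetical `FileMeta` / `bin_pack` sketch from earlier in this thread.
#[test]
fn bin_pack_packs_to_target_and_is_idempotent() {
    let target: u64 = 100 * 1024 * 1024; // 100 MB target file size
    let files = |sizes: &[u64]| -> Vec<FileMeta> {
        sizes
            .iter()
            .enumerate()
            .map(|(i, &size)| FileMeta {
                path: format!("part-{}.parquet", i),
                size,
            })
            .collect()
    };

    // Property 2: 70 MB + 30 MB are packed into one bin; the 100 MB file is
    // already at the target size and is not rewritten.
    let bins = bin_pack(files(&[70 << 20, 100 << 20, 30 << 20]), target);
    assert_eq!(bins.len(), 1);
    assert_eq!(bins[0].iter().map(|f| f.size).sum::<u64>(), 100 << 20);

    // Property 1: a second pass over already-compacted files produces no bins,
    // i.e. running optimize again writes no new files.
    let bins = bin_pack(files(&[100 << 20, 100 << 20]), target);
    assert!(bins.is_empty());
}
```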

@Blajda (Collaborator, Author) replied:

I changed the implementation to sort the files to be optimized, which ensures that it is idempotent. Added a couple of tests to validate that behavior too. For item 3, I think it will have to wait until we have a generalized pattern for non-append writers.

```rust
//Check for remove actions on the optimized partitions
let mut dtx = table.create_transaction(None);
dtx.add_actions(actions);
dtx.commit(None, None).await?;
```
Reviewer (Collaborator):

Should we add an Optimize operation to the DeltaOperation enum, and pass it here so it generates the respective data in the commit info?

@Blajda (Collaborator, Author) replied:

Added a new struct and associated tests for this.

rust/tests/optimize_test.rs (review thread resolved)
@roeap (Collaborator) commented May 31, 2022

Looking good!

Left some minor questions. Reading this I realized that we should probably treat other operations, like vacuum, the same way as here, i.e. as a separate struct that can be executed... But this is something for a follow-up.

@Blajda (Collaborator, Author) commented Jun 2, 2022

Hi @roeap @wjones127
I've implemented the suggested changes and added additional tests. Let me know what you think! 😃

@wjones127 (Collaborator) left a comment

Thanks for adding those concurrency tests!

I just have one question on that commit info test.

last_commit["operationParameters"]["targetSize"],
json!(2_000_000)
);
assert_eq!(last_commit["operationParameters"]["predicate"], Value::Null);
Reviewer (Collaborator):

Should predicate equal the filter used above?

@Blajda (Collaborator, Author) replied:

Yes, it should. I added an additional TODO here. There isn't a function that obtains the String representation of PartitionFilters yet; it should be fairly simple to implement.

Reviewer (Collaborator) replied:

Okay. That seems sufficient for now.

```
@@ -165,3 +193,33 @@ pub(crate) fn stringified_partition_value(

    Ok(Some(s))
}

/// Remove any partition related fields from the schema
pub(crate) fn schema_without_partitions(
```
Reviewer (Collaborator):

Also, do you still need this function?

@Blajda (Collaborator, Author) replied:

Nope. Reverted and restored the original function from where it was sourced.

@wjones127 (Collaborator) left a comment

Looks good to me. I'll wait a few days before merging to give others a chance to provide any final feedback.

Great job on this!


```rust
let partition_value = partition_value
    .as_deref()
    .unwrap_or(NULL_PARTITION_VALUE_DATA_PATH);
```
Reviewer (Collaborator):

Just realized that this crate traditionally uses the constant, but the protocol specifies a different behaviour when it comes to nulls. I opened a ticket, #619, to track this question, since it was also not introduced here...

https://github.com/delta-io/delta/blob/master/PROTOCOL.md#partition-value-serialization

@roeap (Collaborator) left a comment

LGTM - great work @Blajda!

@wjones127 merged commit 92c35d8 into delta-io:main on Jun 2, 2022
@Blajda (Collaborator, Author) commented Jun 2, 2022

Thanks @wjones127 and @roeap for helping carry this to completion.

@Blajda deleted the bin-optimize branch June 2, 2022 23:32
@houqp (Member) commented Jun 5, 2022

Amazing work on this @Blajda !

ion-elgreco pushed a commit that referenced this pull request Feb 14, 2024
# Description
Implement a string representation for PartitionFilter and add it to the
optimize command's commit info, resolving a TODO.

The following JSON representations of CommitInfo from the Delta Scala project
were found:
```
"operationParameters":{"predicate":"[\"(id#480 < 50)\"]"}
"operationParameters":{"predicate":"[\"(col1#1939 = 2)\"]"}
"operationParameters":{"predicate":"[\"`col2` LIKE 'data-2-%'\"]"}
"operationParameters":{"predicate":"[\"(id#378L < 2)\"]"}
"operationParameters":{
  "predicate":"[\"(spark_catalog.delta.`/table-uuid`.id IN (0, 180, 308, 225, 756, 1007, 1503))\"]"
}
```

So I think representing the predicate as a JSON array of filter strings would
be logical, for example:
```
// single PartitionFilter
"operationParameters":{"predicate":"[\"date = '2022-05-22'\"]"}
// two filters
"operationParameters":{"predicate":"[\"date = '2022-05-22'\", \"country = 'US'\"]"}
```
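An assumed helper (not necessarily the merged implementation) showing how a list of partition-filter strings could be encoded into that predicate format:

```rust
use serde_json::Value;

fn predicate_from_filters(filters: &[String]) -> Option<Value> {
    if filters.is_empty() {
        // No filter provided: the predicate stays null in the commit info.
        return None;
    }
    // The predicate value is itself a string containing a JSON array of filter
    // expressions, matching the Scala examples above.
    let encoded = serde_json::to_string(filters).ok()?;
    Some(Value::String(encoded))
}

// predicate_from_filters(&["date = '2022-05-22'".to_string()])
// => Some(Value::String(r#"["date = '2022-05-22'"]"#.to_string()))
```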

# Related Issue(s)
Implements the TODO in #607.