Support partitioned writes #208
Hi @Fokko and Iceberg community, @syun64 and I are continuing to test the write capability in the Write support PR. We are excited about it, as it will help our use case a lot. Our use case also includes overwriting partitions of tables, so I am highly interested in contributing to this issue. Would it be alright for me to start working on it, based on the Write support PR, if no one else has already begun? |
Hey @jqin61, thanks for replying here. I'm not aware of anyone having started on this. It would be great if you could take a stab at it 🚀 |
It seems Spark's Iceberg support has the following overwrite behaviors under partition scheme evolution:
As Fokko mentioned, we need to make sure the implementation uses the latest partition spec_id when overwriting partitions, so that data written under the old partition spec is not touched. |
It just came to me that when Spark writes to Iceberg, it requires the input dataframe to be sorted by the partition value; otherwise an error is raised during writing. Do we want to make the same assumption for PyIceberg? If not, and we have to use arrow.compute.filter() to extract each partition before serialization, a global sort of the entire table before the filter() seems unnecessary, since the filter makes no assumption about how the array is organized. To extract the partitions by filter(), would it be helpful to first build an API in PyArrow that does a full scan of the array, bucket-sorts it into partitions, and returns the buckets (partitions) as a list of Arrow arrays? These arrays could then be passed as input to writing jobs executed in a multi-threaded way. |
I currently see two approaches:
I'm not sure which one is best. I think the first one works better if you have few partitions, and the latter is more efficient when you have many partitions.
Starting with the API is always a great idea. My only concern is that we make sure that we don't take copies of the data, since that might blow up the memory quite quickly. Hope this helps! |
@jqin61 I did some more thinking over the weekend, and I think that the approach that you suggested is the most flexible. I forgot about the sort-order that we also want to add at some point. Then we would need to sort twice 👎 |
Based on the existing discussion, there are three major possible directions for detecting partitions and writing each partition in a multi-threaded way to maximize I/O. It seems there isn't any approach simple enough that we could purely leverage the existing PyArrow APIs from within PyIceberg. I marshalled Fokko's suggestions and list these approaches for discussion purposes:
Filter Out Partitions
With it, we could filter the table to get partitions and provide them as inputs to concurrent jobs.
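For illustration, a rough sketch of this filter direction using existing PyArrow APIs (the partition column name and the write function are placeholders, not PyIceberg's actual API):

```python
import pyarrow as pa
import pyarrow.compute as pc
from concurrent.futures import ThreadPoolExecutor

table = pa.table({"key": ["001", "001", "002", "002"], "value": [1, 2, 3, 4]})

def write_partition(part: pa.Table) -> None:
    ...  # serialize one partition into a data file

# One filter pass per distinct partition value, so O(rows * partitions) work overall.
keys = pc.unique(table["key"]).to_pylist()
partitions = [table.filter(pc.equal(table["key"], k)) for k in keys]
with ThreadPoolExecutor() as pool:
    list(pool.map(write_partition, partitions))
```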
Sort and Single-direction Writing
Then we could do
Bucketing
We could write each batch:
As Fokko pointed out, the filter method will not be efficient if there are many partitions: the filter takes O(table_length) time, and although each thread can filter on its own, on a single node the overall execution is O(table_length * number_of_partitions) across all jobs, while technically we only need a single scan to get all the buckets. The sort method also seems less efficient than the bucketing method, because the relative order of partitions does not matter, so a general sort algorithm on the partition column might be overkill compared with bucketing. I feel like all three directions require some implementation on Arrow itself (I did not find any approach simple enough that we could purely leverage the existing PyArrow APIs to implement any of them), and I want to get opinions on whether pursuing Arrow-API-level utilities sounds like a good direction. Thank you! Specifically, for the third direction of bucketing and returning materialized tables/batches, since Arrow has dataset.write_dataset(), which supports partition-respecting writes, I did some reading to see how it partitions and whether we could leverage anything from it. https://github.com/apache/arrow/blob/main/cpp/src/arrow/dataset/partition.cc#L118 is where the partitioning happens. The partition algorithm is a full scan with a bucket sort, leveraging the Grouper class utilities in Arrow's compute component. Specifically: Besides being used in dataset writing, the Grouper from Arrow's compute component also backs other exposed compute APIs such as the aggregation functions. At the end of the day, what we want (in order to support PyIceberg's partitioned write) is an API that returns record batches/tables based on an input table and an input partition scheme, so maybe we could expose such a new API under compute, leveraging the Grouper. |
@jqin61 just wondering if we can use this directly |
Thank you Ashish! I had overlooked it. As you mention, we could just use write_dataset() with the partitioning and basename_template arguments specified to write out the partitioned data files as Iceberg needs. |
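For reference, a hedged sketch of what such a call might look like (the base directory and naming template are placeholders, not what PyIceberg actually uses):

```python
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"key": ["001", "001", "002"], "value": [1, 2, 3]})

ds.write_dataset(
    table,
    base_dir="warehouse/my_table/data",  # placeholder location
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("key", pa.string())]), flavor="hive"),
    basename_template="00000-{i}.parquet",  # placeholder naming pattern
    existing_data_behavior="overwrite_or_ignore",
)
```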
Yes - I think we learned this from our earlier attempts: https://github.com/apache/iceberg-python/pull/41/files/1398a2fb01341087a1334482db84a193843a2362#r1427302782 As @jqin61 pointed out in a previous PR, adding these to the schema should output parquet files with the correct field_id. |
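For context, PyArrow can carry Parquet field IDs as field-level metadata under the PARQUET:field_id key, which is (as far as I understand) what schema_to_pyarrow relies on. A minimal hand-rolled sketch with illustrative names and IDs:

```python
import pyarrow as pa

# Attach Iceberg field IDs as Parquet field-id metadata so the written
# Parquet files carry the IDs Iceberg expects when reading them back.
schema = pa.schema([
    pa.field("key", pa.string(), metadata={"PARQUET:field_id": "1"}),
    pa.field("value_1", pa.int64(), metadata={"PARQUET:field_id": "2"}),
    pa.field("value_2", pa.string(), metadata={"PARQUET:field_id": "3"}),
])
```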
@Fokko @jqin61
from pyiceberg.io.pyarrow import schema_to_pyarrow
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType
import pyarrow as pa
import pyarrow.dataset
from pyarrow import parquet as pq
data = {'key': ['001', '001', '002', '002'],
'value_1': [10, 20, 100, 200],
'value_2': ['a', 'b', 'a', 'b']}
my_partitioning = pa.dataset.partitioning(pa.schema([pa.field("key", pa.string())]), flavor='hive')
TABLE_SCHEMA = Schema(
NestedField(field_id=1, name="key", field_type=StringType(), required=False),
NestedField(field_id=2, name="value_1", field_type=StringType(), required=False),
NestedField(field_id=3, name="value_2", field_type=StringType(), required=False),
)
schema = schema_to_pyarrow(TABLE_SCHEMA)
patbl = pa.Table.from_pydict(data)
pq.write_to_dataset(patbl, 'partitioned_data', partitioning=my_partitioning, schema=schema)
If I don't use the schema in the write, it works fine. But if I pass the schema created above, it fails with
I also tried the parquet write the way we are doing currently:
Do we do any other transformation on the schema before writing in the current write support? |
Hey @jqin61 Thanks for the elaborate post, and sorry for my slow reply. I did want to take the time to write a good answer. Probably the following statement needs another map step: partitions: list[dict] = pyarrow.compute.unique(arrow_table) The above is true for an identity partition, but often we truncate to the month, day or hour of a field and use that as a partition. Another example is the bucketing partition, where we hash the field and determine which bucket it falls into. With regard to utilizing the Arrow primitives that are already there: I think that's a great idea, we just have to make sure that they are flexible enough for Iceberg. There are a couple of questions that pop into my mind:
@asheeshgarg Thanks for giving it a try. Looking at the schema, there is a discrepancy: the test data that you generate has integers for value_1, while the Iceberg schema declares it as a string. |
@Fokko thanks for pointing out the mismatch. After modifying the datatype, it worked. |
@Fokko Thank you! These two points, supporting hidden partitioning and extracting metrics efficiently during writing, are very insightful! As for pyarrow.dataset.write_dataset(), its behavior of removing the partition columns from the written-out Parquet files is, I think, the deal breaker for using it. So we either extend pyarrow.dataset.write_dataset() or fall back to the Arrow API direction. Some findings from chasing a solution with dataset.write_dataset():
|
Right, as @jqin61 mentioned, if we only had to support transformed partitions, we could have employed a hack to add a partition column to the dataset, which gets consumed by the write_dataset API when we pass the column in pyarrow.dataset.partitioning. But we can't apply the same hack with identity partitions, where the Hive partition scheme on the file path shares the same name as the partition column that needs to be persisted into the data file. Arrow does not allow two columns to share the same name, and this hack leads to an exception in write_dataset. So it sounds like we might be running out of options with the existing APIs... If we agree that we need a new PyArrow API to optimally bucket-sort the partitions and produce partitioned PyArrow tables or record batches to pass into WriteTask, do we see any value in introducing a simpler PyIceberg feature in the interim, where write_file can support partitioned tables as long as the provided arrow_table only has a single partition of data? I think introducing this first would have two upsides:
|
Maybe another approach we could take if we want to use existing PyArrow functions is:
If there was an existing PyArrow API that gave us the outcome of (1) + (2) in one pass, it would have been the most optimal, but it seems like there isn't... so I think taking just one more pass to find the indices is maybe not the worst idea. We could also argue that (1) should be a requirement that we check on the provided PyArrow table, rather than running the sort within the PyIceberg API. Please let me know your thoughts! |
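A rough sketch of that two-step idea with existing PyArrow calls (assuming a single identity partition column; the column name and helper function are illustrative, not an agreed API):

```python
import pyarrow as pa

def split_by_partition(table: pa.Table, col: str) -> list[pa.Table]:
    """Sort by the partition column, then slice out each contiguous group."""
    sorted_tbl = table.sort_by([(col, "ascending")])           # step (1): sort
    keys = sorted_tbl[col].to_pylist()
    parts, start = [], 0
    for i in range(1, len(keys) + 1):                          # step (2): find boundaries
        if i == len(keys) or keys[i] != keys[i - 1]:
            parts.append(sorted_tbl.slice(start, i - start))   # zero-copy slice
            start = i
    return parts

tbl = pa.table({"key": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]})
for part in split_by_partition(tbl, "key"):
    print(part["key"][0], part.num_rows)
```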
@jqin61 I have also seen this behavior: pyarrow.dataset.write_dataset() removes the partition columns from the written-out Parquet files. It would have been ideal if the partitioned write could be done directly with the Arrow dataset API, with metadata-based hidden partitioning handled by the PyIceberg API, but we would need to do a good amount of lifting for that. I also haven't seen support for bucket partitioning. I think we can add the write directly using the PyArrow API as suggested above. |
@Fokko @syun64 another option I can think of is to use Polars; a simple example below with hashing and sorting within partitions, where all the partitioning is handled by the Rust layer in Polars and we write Parquet based on the Arrow tables returned.
import pyarrow as pa
N = 2
for tbl in tables: |
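Since the original snippet is truncated, here is a hedged reconstruction of what such a Polars-based approach might look like (bucket count, column names, and output paths are illustrative; the hash here is Polars' own, not Iceberg's bucket transform):

```python
import polars as pl
import pyarrow as pa
from pyarrow import parquet as pq

arrow_table = pa.table({
    "key": ["001", "001", "002", "002"],
    "value_1": [10, 20, 100, 200],
})

N = 2  # number of hash buckets
df = pl.from_arrow(arrow_table)
# Hash the key into N buckets and let Polars (its Rust layer) split the frame per bucket.
df = df.with_columns((pl.col("key").hash() % N).alias("bucket"))
tables = [part.drop("bucket").to_arrow() for part in df.partition_by("bucket")]
for i, tbl in enumerate(tables):
    pq.write_table(tbl, f"bucket-{i}.parquet")
```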
@jqin61 and I discussed this offline, and just wanted to follow up on possible options for step (2). If we wanted to use existing PyArrow functions, I think we could use a two-pass algorithm to figure out the row index of each permutation of partition groups on a partition-sorted pyarrow table:
Then, how do we handle transformed partitions? Going back to the previous idea, I think we could create intermediate helper columns that hold the transformed partition values and use them for sorting. We can keep track of these columns and make sure we drop them after we use the above algorithm to split up the table into partition slices. Regardless of whether we choose to support sorting within the PyIceberg write API or have it as a requirement, maybe we can create a helper function that takes the PartitionSpec of the Iceberg table and the pyarrow table and makes sure that the table is sorted by partition using the above method. |
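A hedged sketch of the helper-column idea for a non-identity transform, using Iceberg's month transform (months since the Unix epoch) as the example; the column names are illustrative:

```python
import pyarrow as pa
import pyarrow.compute as pc

tbl = pa.table({
    "ts": pa.array([1_600_000_000_000_000, 1_700_000_000_000_000], type=pa.timestamp("us")),
    "value": [1, 2],
})

# Iceberg's month transform is the number of months since 1970-01-01,
# not the calendar month, so derive it from the year/month components.
months_since_epoch = pc.add(
    pc.multiply(pc.subtract(pc.year(tbl["ts"]), 1970), 12),
    pc.subtract(pc.month(tbl["ts"]), 1),
)

# Add the transformed value as a temporary helper column and sort by it;
# the boundary-finding pass from the earlier sketch can then split the table,
# and the helper column is dropped from each slice before writing.
with_helper = tbl.append_column("_partition_month", months_since_epoch)
sorted_tbl = with_helper.sort_by([("_partition_month", "ascending")])
```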
Here is the design document on data file writes that was discussed during the monthly sync; it summarizes all of the approaches discussed above. |
I have an incoming PR with working code samples that conform to the design above and cover identity transform + append as the first step of supporting partitioned writes. During implementation, I found that if the partition column has nulls, the code breaks. The same issue exists in the current write path, where append() or overwrite() breaks for any Arrow table with a column consisting only of nulls, so I opened Issue #348 to track it separately. |
Opened draft PR with working code samples (it supports partitioned append with identity transform for now): #353 |
Updates for monthly sync:
|
Idea from @Fokko - support day/month/year transforms first |
You can also try using the transforms that Daft has already implemented. Full list of transforms:
There were a lot of intricacies in these transforms that we had to make sure to get exactly right, so as to be compatible with the existing Java implementations, especially w.r.t. hashing. Should be zero-copy conversions between Arrow and Daft as well (cheap!):
import pyarrow as pa
from daft import Series
pyarrow_array = pa.array(list(range(10000)))
# Should be very cheap! Under the hood just uses the same arrow buffers
daft_series = Series.from_arrow(pyarrow_array)
print(daft_series)
╭──────────────╮
│ arrow_series │
│ --- │
│ Int64 │
╞══════════════╡
│ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ … │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9995 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9996 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9997 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9998 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 9999 │
╰──────────────╯
partitioned = daft_series.partitioning.iceberg_bucket(32)
print(partitioned)
╭─────────────────────╮
│ arrow_series_bucket │
│ --- │
│ Int32 │
╞═════════════════════╡
│ 28 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 20 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 19 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ … │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 10 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 28 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 13 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 28 │
╰─────────────────────╯
# Convert back to arrow
partitioned_arrow_arr = partitioned.to_arrow() |
The PR says "Support partitioned writes", but on version 0.6.1 it still fails for me with
|
@Pilipets that's correct. The partitioned writes are not yet released. You can use it by installing directly from Github as @mike-luabase suggested above, or wait for 0.7.0 to be released, which will happen soon. |
@Fokko, also note that pyiceberg doesn't write the partition specification in the manifest files the way spark-sql does, which results in parsing errors when trying to read the manifest files with the Java SDK. I don't know if it's a bug or not, but I'm sharing it with you as part of the Iceberg community.
In particular, pyiceberg doesn't write "partition-spec". There is a workaround for this, though: request the table metadata and use the table's specs. The error above is for a non-partitioned table created with pyiceberg. |
@Pilipets That looks like a bug, let me check if I can reproduce this. Thanks for sharing! |
@Fokko, spark-sql can still read it, so maybe it's expected, I don't know. In order to read it with the Java SDK, I need to provide
What happens in Java SDK
For the repro, create a table with no partitioning as below.
|
Thanks again for the code! The json that's being generated through the SQLCatalog/SQLite is: {
"location": "file:/tmp/whatever.db/my_table",
"table-uuid": "d394d593-ed82-4c3b-8c58-ad3405965407",
"last-updated-ms": 1718918248213,
"last-column-id": 3,
"schemas": [
{
"type": "struct",
"fields": [
{
"id": 1,
"name": "city",
"type": "string",
"required": false
},
{
"id": 2,
"name": "lat",
"type": "double",
"required": false
},
{
"id": 3,
"name": "long",
"type": "double",
"required": false
}
],
"schema-id": 0,
"identifier-field-ids": []
}
],
"current-schema-id": 0,
"partition-specs": [
{
"spec-id": 0,
"fields": []
}
],
"default-spec-id": 0,
"last-partition-id": 1000,
"properties": {},
"current-snapshot-id": 2465573600625458700,
"snapshots": [
{
"snapshot-id": 2465573600625458700,
"sequence-number": 1,
"timestamp-ms": 1718918248213,
"manifest-list": "file:/tmp/whatever.db/my_table/metadata/snap-2465573600625458552-0-c5284c9e-77bc-4a94-b412-1a3a9db4100d.avro",
"summary": {
"operation": "append",
"added-files-size": "1656",
"added-data-files": "1",
"added-records": "4",
"total-data-files": "1",
"total-delete-files": "0",
"total-records": "4",
"total-files-size": "1656",
"total-position-deletes": "0",
"total-equality-deletes": "0"
},
"schema-id": 0
}
],
"snapshot-log": [
{
"snapshot-id": 2465573600625458700,
"timestamp-ms": 1718918248213
}
],
"metadata-log": [],
"sort-orders": [
{
"order-id": 0,
"fields": []
}
],
"default-sort-order-id": 0,
"refs": {
"main": {
"snapshot-id": 2465573600625458700,
"type": "branch"
}
},
"format-version": 2,
"last-sequence-number": 1
} With SQLCatalog: {
"format-version" : 2,
"table-uuid" : "70ed598e-01e2-48d2-98c4-33396b3b4bc4",
"location" : "s3://warehouse/nyc/taxis",
"last-sequence-number" : 0,
"last-updated-ms" : 1718918405513,
"last-column-id" : 19,
"current-schema-id" : 0,
"schemas" : [ {
"type" : "struct",
"schema-id" : 0,
"fields" : [ {
"id" : 1,
"name" : "VendorID",
"required" : false,
"type" : "long"
}, {
"id" : 2,
"name" : "tpep_pickup_datetime",
"required" : false,
"type" : "timestamptz"
}, {
"id" : 3,
"name" : "tpep_dropoff_datetime",
"required" : false,
"type" : "timestamptz"
}, {
"id" : 4,
"name" : "passenger_count",
"required" : false,
"type" : "double"
}, {
"id" : 5,
"name" : "trip_distance",
"required" : false,
"type" : "double"
}, {
"id" : 6,
"name" : "RatecodeID",
"required" : false,
"type" : "double"
}, {
"id" : 7,
"name" : "store_and_fwd_flag",
"required" : false,
"type" : "string"
}, {
"id" : 8,
"name" : "PULocationID",
"required" : false,
"type" : "long"
}, {
"id" : 9,
"name" : "DOLocationID",
"required" : false,
"type" : "long"
}, {
"id" : 10,
"name" : "payment_type",
"required" : false,
"type" : "long"
}, {
"id" : 11,
"name" : "fare_amount",
"required" : false,
"type" : "double"
}, {
"id" : 12,
"name" : "extra",
"required" : false,
"type" : "double"
}, {
"id" : 13,
"name" : "mta_tax",
"required" : false,
"type" : "double"
}, {
"id" : 14,
"name" : "tip_amount",
"required" : false,
"type" : "double"
}, {
"id" : 15,
"name" : "tolls_amount",
"required" : false,
"type" : "double"
}, {
"id" : 16,
"name" : "improvement_surcharge",
"required" : false,
"type" : "double"
}, {
"id" : 17,
"name" : "total_amount",
"required" : false,
"type" : "double"
}, {
"id" : 18,
"name" : "congestion_surcharge",
"required" : false,
"type" : "double"
}, {
"id" : 19,
"name" : "airport_fee",
"required" : false,
"type" : "double"
} ]
} ],
"default-spec-id" : 0,
"partition-specs" : [ {
"spec-id" : 0,
"fields" : [ ]
} ],
"last-partition-id" : 999,
"default-sort-order-id" : 0,
"sort-orders" : [ {
"order-id" : 0,
"fields" : [ ]
} ],
"properties" : {
"owner" : "root",
"write.parquet.compression-codec" : "zstd"
},
"current-snapshot-id" : -1,
"refs" : { },
"snapshots" : [ ],
"statistics" : [ ],
"partition-statistics" : [ ],
"snapshot-log" : [ ],
"metadata-log" : [ ]
} I don't see any discrepancies with the sort-orders, but I can continue tomorrow. @adrianqin feel free to jump in here :) |
Try generating the table with spark-sql to compare, and it will be there. Just use any catalog...
|
Thanks again @Pilipets, I really appreciate the examples. So the difference is in the format versions: I was generating V2 tables, and you have a V1 table (Java 1.3.0 defaults to V1, which should be changed later on). I think we need to fix this on the Java side. |
Yeah, I actually don't know if partition-spec is the culprit, but that's the difference that I spotted. Lake Formation, spark-sql, Snowflake and other writers probably add it, so the read succeeds.
Below is the full error stack.
|
I should have gone deeper into the stack trace; it looks like the Python:
Java:
Thanks again @Pilipets for raising this, I'll come up with a fix shortly. |
@Fokko any expected timeline you can share on support for bucket transform? Is there a separate issue I can follow for that? Thanks for all the hard work so far!! |
@RLashofRegas Sorry for the long wait, @sungwy has been working on adding a rust extension to efficiently run the bucketing transform 🥳 We're blocked on a release on the rust side of things, but we're working on a release there as well 👍 |
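For anyone curious what the bucket transform does, here is a minimal pure-Python sketch for integer values (assuming the mmh3 murmur3 package is available; the actual implementation lives in the Rust extension mentioned above):

```python
import struct
import mmh3  # assumption: provides murmur3_x86_32 hashing

def iceberg_bucket_long(value: int, num_buckets: int) -> int:
    # Iceberg hashes int/long values as 8-byte little-endian with murmur3_x86_32 (seed 0),
    # then maps the non-negative part of the hash onto N buckets.
    h = mmh3.hash(struct.pack("<q", value), 0, signed=True)
    return (h & 0x7FFFFFFF) % num_buckets

print(iceberg_bucket_long(34, 16))
```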
#1345 has been merged, closing this one :) |
This was already being discussed back here: apache#208 (comment)

This PR changes from doing a sort and then a single pass over the table to the approach where we determine the unique partition tuples and then filter on them one by one.

Fixes apache#1491, because the sort caused buffers to be joined where it would overflow in Arrow. I think this is an issue on the Arrow side, and it should automatically break up into smaller buffers. The `combine_chunks` method does this correctly.

Now:
```
0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```

Before:
```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```

So it comes with a nice speedup as well :)
Second attempt of #1539

This was already being discussed back here: #208 (comment)

This PR changes from doing a sort and then a single pass over the table to the approach where we determine the unique partition tuples and filter on them individually.

Fixes #1491, because the sort caused buffers to be joined where it would overflow in Arrow. I think this is an issue on the Arrow side, and it should automatically break up into smaller buffers. The `combine_chunks` method does this correctly.

Now:
```
0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```

Before:
```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```

So it comes with a nice speedup as well :)

---------

Co-authored-by: Kevin Liu <[email protected]>
Feature Request / Improvement
Support partitioned writes
So I think we want to tackle the static overwrite first, and then we can compute the predicate for the dynamic overwrite to support that. We can come up with a separate API. I haven't really thought this through, and we can still change this.
I think the most important step is the breakdown of the work. There is a lot involved, but luckily we already get the test suite from the full overwrite.
Steps I can see:
Other things on my mind:
The good part: