[Bug]: Snapshot validation fails when running table compaction #9969
Comments
I tried to reproduce this, also adding delete files into the mix in our Spark integration tests, which are executed against Nessie (with and without Iceberg REST), but no luck so far. See #9974. Can you provide a full reproducer, i.e. all the statements that are necessary?
@snazy as far as I understand (based on the Zulip discussion), the key point is inserting via the Iceberg REST API but compacting via the Nessie API.
The full Spark action command is:

```scala
sparkActions
  .rewriteDataFiles(table)
  .option("partial-progress.enabled", "true")
  .option("target-file-size-bytes", (1024 * 1024 * 1024).toString)
  .filter(Expressions.greaterThanOrEqual("created_at", startMillis))
  .filter(Expressions.lessThanOrEqual("created_at", endMillis))
  .zOrder("watermark", "user_id")
  .execute()
```
The Spark settings are as follows:
@gpathak128 we need the whole scenario - from
The table was created via Trino (set up using the REST catalog settings):

```sql
CREATE TABLE corp.user_data (
    contains_pii boolean,
    watermark timestamp(3) WITH TIME ZONE,
    created_at bigint,
    <snipped other cols>
)
WITH (
    partitioning = ARRAY['hour(watermark)']
)
```

Partitioning was changed a few times to add/remove user_id buckets. There is a Spark streaming job that is writing to the table continuously:

```scala
df
  .writeStream
  .format("iceberg")
  .outputMode("append")
  .option("fanout-enabled", "true")
  .trigger(Trigger.ProcessingTime(triggerIntervalValue))
  .option("checkpointLocation", outputCheckpointPath)
  .options(optionsValue)
  .toTable(s"${catalogName.get}.$schemaNameValue.$tableNameValue")
```

I am running a "rewriteDataFiles" compaction job based on the Spark actions described above.
I was able to run compaction after stopping the streaming Spark job that was writing to the same table. Looks like this is a multiple-writer issue.
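The multiple-writer observation above matches Iceberg's optimistic-concurrency model. The following is a hypothetical, heavily simplified sketch (the `TableRef` class and its methods are illustrative, not Iceberg or Nessie API): two writers read the same base snapshot, the first commit succeeds, and the second is rejected because the table moved underneath it.

```java
// Hypothetical sketch (not Iceberg/Nessie code) of the optimistic-commit
// race between the streaming writer and the compaction job.
public class MultiWriterDemo {
    static final class TableRef {
        private long snapshotId;
        TableRef(long initial) { this.snapshotId = initial; }
        // A commit succeeds only if the base snapshot is still current.
        synchronized boolean commit(long expectedBase, long next) {
            if (snapshotId != expectedBase) return false;
            snapshotId = next;
            return true;
        }
        synchronized long current() { return snapshotId; }
    }

    public static void main(String[] args) {
        TableRef table = new TableRef(10L);
        long base = table.current();                 // both jobs read snapshot 10
        System.out.println(table.commit(base, 11L)); // streaming writer wins: true
        System.out.println(table.commit(base, 12L)); // stale compaction commit: false
    }
}
```

In the real system, a rejected writer is expected to refresh the table metadata and retry; the bug here is that the retry path cannot validate against history the catalog does not serve.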
The current behavior of Nessie's Iceberg REST is to return only the most recent Iceberg snapshot. However, this conflicts with some Iceberg operations that are not only maintenance operations but are related to "merge on read" / (equality) deletes. This change alters Nessie's behavior by returning older snapshots from load-table and update-table operations. The register-table operation, however, does not change, because only the latest snapshot is actually imported; it does now return an error if the table to be registered has more than one snapshot. Fixes projectnessie#10013 Fixes projectnessie#9969
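To illustrate why serving only the latest snapshot breaks these operations, here is a hypothetical, simplified model of a snapshot-lineage walk (the names `Snapshot` and `isAncestor` are illustrative, not Iceberg's actual API): a validation that must reach an older snapshot dead-ends when the catalog serves only the most recent one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: walking parent pointers through snapshot history.
public class SnapshotLineageDemo {
    record Snapshot(long id, Long parentId) {}

    // Follows parent pointers from 'from', looking for 'target'.
    static boolean isAncestor(Map<Long, Snapshot> history, long from, long target) {
        Snapshot cur = history.get(from);
        while (cur != null) {
            if (cur.id() == target) return true;
            cur = cur.parentId() == null ? null : history.get(cur.parentId());
        }
        return false;
    }

    public static void main(String[] args) {
        Map<Long, Snapshot> full = new HashMap<>();
        full.put(1L, new Snapshot(1L, null));
        full.put(2L, new Snapshot(2L, 1L));
        full.put(3L, new Snapshot(3L, 2L));

        // A catalog that serves only the latest snapshot.
        Map<Long, Snapshot> latestOnly = Map.of(3L, new Snapshot(3L, 2L));

        System.out.println(isAncestor(full, 3L, 1L));       // true: full history
        System.out.println(isAncestor(latestOnly, 3L, 1L)); // false: walk dead-ends
    }
}
```

Under this model, returning older snapshots from load-table and update-table lets the lineage walk complete, which is the intent of the change described above.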
What happened
I am running table compaction using Spark Actions; my Spark action code is shown above. The job runs fine; however, none of the file groups are able to commit. The commits fail with the following error: