
iceberg connector cannot perform write operations when use Nessie catalog #17813

Closed
junshiwang opened this issue Jun 9, 2023 · 11 comments · Fixed by #19524
Labels
bug (Something isn't working), iceberg (Iceberg connector)

Comments

@junshiwang

junshiwang commented Jun 9, 2023

If the table properties contain nessie.commit.id, the connector cannot perform any write operations.

[screenshot: error message]

@junshiwang
Author

Trino log: [screenshot]

@ebyhr added the bug (Something isn't working) and iceberg (Iceberg connector) labels on Jun 9, 2023
@ajantha-bhat
Member

ajantha-bhat commented Jun 22, 2023

Hi, I saw this ticket today.

@junshiwang: can you please provide the steps for how the table was created? Was it from Spark?

I tried creating the table locally in Trino with Nessie, and it can insert the data:

trino:db1> CREATE TABLE yearly_clicks (
        ->     year,
        ->     clicks
        -> )
        -> WITH (
        ->     partitioning = ARRAY['year']
        -> )
        -> AS VALUES
        ->     (2021, 10000),
        ->     (2022, 20000);
CREATE TABLE: 2 rows

Query 20230622_110108_00007_5c9cy, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
2.73 [0 rows, 0B] [0 rows/s, 0B/s]

trino:db1> select * from "yearly_clicks$properties";
         key          | value 
----------------------+-------
 write.format.default | ORC   
(1 row)

Query 20230622_110127_00008_5c9cy, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.28 [1 rows, 33B] [3 rows/s, 119B/s]

trino:db1> insert into yearly_clicks select 2021, 22;
INSERT: 1 row

Query 20230622_110403_00009_5c9cy, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
1.44 [0 rows, 0B] [0 rows/s, 0B/s]

trino:db1> insert into yearly_clicks select 2021, 22;
INSERT: 1 row

Query 20230622_110405_00010_5c9cy, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
0.91 [0 rows, 0B] [0 rows/s, 0B/s]

trino:db1> select * from yearly_clicks;
 year | clicks 
------+--------
 2022 |  20000 
 2021 |     22 
 2021 |  10000 
 2021 |     22 
(4 rows)

Query 20230622_110420_00011_5c9cy, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
0.48 [4 rows, 1.62KB] [8 rows/s, 3.37KB/s]

trino:db1> select * from "yearly_clicks$properties";
         key          | value 
----------------------+-------
 write.format.default | ORC   
(1 row)

Query 20230622_110423_00012_5c9cy, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.23 [1 rows, 33B] [4 rows/s, 143B/s]

@ajantha-bhat
Member

I was able to reproduce this locally only when Trino tries to write to a Spark-created Iceberg table using the Nessie catalog.
Analysis in progress.

@ajantha-bhat
Member

ajantha-bhat commented Jun 23, 2023

Problem:
Cannot insert data from Trino into Spark-written Iceberg tables with the Nessie catalog.

Analysis:
Spark sets the table property NESSIE_COMMIT_ID_PROPERTY in NessieTableOperations#loadTableMetadata. NessieIcebergClient.commitTable then uses this property.

In Trino, this property is never set but is still read in NessieIcebergClient.commitTable, as that is common code. Hence the commit id is stale and new commits are not allowed.

Solution:
Move NessieTableOperations#loadTableMetadata as a static method to NessieIcebergClient and use it in both the Spark and Trino integrations.
So this fix needs an Iceberg-side Nessie change, and we need to wait for the next Iceberg release :(

Note: an alternative solution is to duplicate NessieTableOperations#loadTableMetadata in Trino's IcebergNessieTableOperations, but we cannot get the reference because UpdateableReference is not a public class, and that still needs changes on the Iceberg side. Compared to this, I prefer the static method on NessieIcebergClient.
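The failure mode described above can be sketched abstractly. The following is a minimal Python model of a Nessie-style conditional commit, not the actual NessieIcebergClient API; the class, method, and hash names are invented for illustration. The point is that a commit is accepted only when the caller's expected hash matches the current branch head, so carrying a stale nessie.commit.id from table metadata makes every commit fail.

```python
# Minimal model (not Nessie's real API): a branch that only accepts
# commits conditional on the caller knowing the current head.
class Branch:
    def __init__(self, head: str):
        self.head = head

    def commit(self, expected_hash: str, new_hash: str) -> None:
        # Server-side check: reject commits based on a stale head.
        if expected_hash != self.head:
            raise RuntimeError(
                f"commit conflict: expected {expected_hash}, head is {self.head}"
            )
        self.head = new_hash


branch = Branch(head="h2")

# Spark stored an old commit id ("h1") in the table metadata; if Trino
# reuses that stale value instead of the client's current reference
# state, the commit is rejected.
stale_id = "h1"
try:
    branch.commit(expected_hash=stale_id, new_hash="h3")
except RuntimeError as e:
    print(e)

# Using the client's current reference state succeeds.
branch.commit(expected_hash=branch.head, new_hash="h3")
print(branch.head)  # h3
```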

@ajantha-bhat
Member

As a quick fix, I can probably remove NESSIE_COMMIT_ID_PROPERTY (if it exists) from the table properties during a commit from Trino, so that the commit uses the latest reference state from the client.

But the other functionality (like disabling GC) from NessieTableOperations#loadTableMetadata is still needed for Trino in the long run, and can be added with the next Iceberg version bump.
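The quick fix amounts to filtering one key out of the table properties before committing. This is a hedged sketch, not Trino's actual implementation; the property key string comes from the issue discussion, and the helper name is invented.

```python
# Sketch of the quick fix: drop the stale nessie.commit.id entry from the
# table properties before committing, so the commit falls back to the
# client's latest reference state.
NESSIE_COMMIT_ID = "nessie.commit.id"  # key referred to as NESSIE_COMMIT_ID_PROPERTY


def strip_commit_id(properties: dict) -> dict:
    """Return a copy of the table properties without the Nessie commit id."""
    return {k: v for k, v in properties.items() if k != NESSIE_COMMIT_ID}


props = {"write.format.default": "ORC", NESSIE_COMMIT_ID: "deadbeef"}
print(strip_commit_id(props))  # {'write.format.default': 'ORC'}
```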

@nastra
Contributor

nastra commented Jun 23, 2023

I added a fix for this a while ago in nastra@34455ed and forgot to follow up after Nessie support was merged into Trino. The changes are obviously outdated, but they should give enough info on how to fix this @ajantha-bhat

@ajantha-bhat
Member

ajantha-bhat commented Jun 23, 2023

> I added a fix for this a while ago in nastra@34455ed and forgot to follow up after Nessie support was merged into Trino. The changes are obviously outdated, but they should give enough info on how to fix this @ajantha-bhat

This approach seems to be what I mentioned above as the alternative solution in the note.
It still needs an Iceberg-side change to make UpdateableReference public. (At that time, the NessieClient code lived on the Trino side, but now it lives in the Iceberg repo.)

But instead of making UpdateableReference public, I prefer moving NessieTableOperations#loadTableMetadata to the client, which also avoids replicating the same code on the Trino side.

@ajantha-bhat
Copy link
Member

The Iceberg-side fix apache/iceberg#7893 is merged. With a new Iceberg release, we can work on a Trino-side PR to fix this issue.

@marvin-roesch

@ajantha-bhat What is the current state of this? We have mixed write access to some of our tables from Trino and Spark, and this is currently blocking us from migrating to Iceberg. Iceberg has had a release since your PR was merged (1.3.1), but it doesn't appear to include the change. Do you have a rough estimate of when these changes might land in an Iceberg release?

@ajantha-bhat
Member

@marvin-roesch: Yeah, we need the Iceberg 1.4.0 release for this to work; 1.3.1 didn't include the change.
Iceberg usually takes 2-3 months between releases.

As a temporary workaround, maybe you can try removing the table property (nessie.commit.id) from Spark, so that the table can be written from Trino.
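The workaround above can be applied with Spark SQL's UNSET TBLPROPERTIES statement. This is a hedged example; the catalog and table name (nessie.db1.yearly_clicks) are placeholders for your own setup:

```sql
-- Remove the stale commit id property from the Spark side so that
-- Trino's commit no longer sees it. Table name is a placeholder.
ALTER TABLE nessie.db1.yearly_clicks UNSET TBLPROPERTIES ('nessie.commit.id');
```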

@ajantha-bhat
Copy link
Member

@junshiwang: #19524 fixes this bug.
If it is useful for you, please comment on the PR that it is needed, so it can be prioritized for review.
