Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support nested namespaces for Iceberg REST catalog #22916

Closed
TomerHeber opened this issue Aug 2, 2024 · 4 comments · Fixed by #23453
Closed

Support nested namespaces for Iceberg REST catalog #22916

TomerHeber opened this issue Aug 2, 2024 · 4 comments · Fixed by #23453
Assignees
Labels
enhancement New feature or request iceberg Iceberg connector

Comments

@TomerHeber
Copy link

Tested on Trino 450.
Iceberg Rest API is used.

In spark I created the namespace:
pyspark_sample

And a child namespace:
cool

full path would be iceberg.pyspark_sample.cool

In Trino:
If I call show schemas from iceberg; it will return only the high level namespaces.
Therefore, any child namespaces will not be returned.

Nice to have:
Support show scehmas from iceberg.pyspark_sample - this will return cool.

Trying to switch to a child schema in Trino isn't supported:

trino> use iceberg.pyspark_sample.cool;
Query 20240802_013127_00008_4x4p5 failed: line 1:27: mismatched input '.'. Expecting: <EOF>
use iceberg.pyspark_sample.cool

Nice to have:
Add support to it.

With that said, I had expected the following to work:

trino> use iceberg."pyspark_sample.cool";
Query 20240802_013207_00009_4x4p5 failed: Schema does not exist: iceberg.pyspark_sample.cool

Root cause:

Trino is sending the following API request -

GET /iceberg/v1/.../namespaces/pyspark_sample.cool

This in incorrect and does not follow the Iceberg rest API spec: https://github.com/apache/iceberg/blob/main/open-api/rest-catalog-open-api.yaml

The correct API request should be -

GET /iceberg/v1/.../namespaces/pyspark_sample%1Fcool

In other words, if a namespace value has a "." it should be sent as %1F.
I haven't reviewed the code, but this probably a simple fix.

@ebyhr
Copy link
Member

ebyhr commented Aug 2, 2024

This is exactly a bug. For instance, creating a new schema with a dot result in unexpected results.

trino:tpch> CREATE SCHEMA "a.test";
CREATE SCHEMA
trino:tpch> SHOW SCHEMAS;
       Schema
--------------------
 a
 information_schema
 tpch
(3 rows)

@ebyhr ebyhr added the iceberg Iceberg connector label Aug 5, 2024
@sftwrdvlpr
Copy link
Contributor

sftwrdvlpr commented Aug 23, 2024

Hi @ebyhr,

I researched this issue and discovered that it is not the only problem. I've noticed that the string attribute representing the schema, which is passed to Iceberg catalog methods, is transformed into a Namespace object using the Namespace.of(schema) method. If the schema represents a hierarchy, it will not be processed correctly.

To address this issue, I propose adding a configurable delimiter that will be applied to each schema string attribute, like this: Namespace.of(Splitter.on(DELIMITER).omitEmptyStrings().trimResults().splitToList(schema)). This will resolve all API calls, such as creating tables or views. However, schema creation will be a bit more complex, as it will require checking each level for existence.

The second part of the issue relates to changes needed for the listNamespaces methods. In TrinoNessieCatalog, making a call to nessieCatalog.getApi().getMultipleNamespaces() is sufficient to retrieve all namespaces. However, TrinoRestCatalog and the Iceberg REST API do not have an endpoint to do the same, so the method will need to make recursive calls to collect all levels of namespaces.

During my research, I only checked TrinoRestCatalog and TrinoNessieCatalog. If you're okay with this approach, I can prepare a PR with the changes I've described.

If you have any questions or concerns, I'd be happy to discuss them.

@mayankvadariya mayankvadariya self-assigned this Sep 13, 2024
@mayankvadariya
Copy link
Contributor

@sftwrdvlpr please feel free to create a PR at the earliest and let me know if there is anything I can help you with.

@findinpath
Copy link
Contributor

findinpath commented Sep 18, 2024

Adding for context a way to test first hand the functionality of nested namespaces in Spark

testing/bin/ptl env up --environment singlenode-spark-iceberg-rest
docker container exec -ti ptl-spark /bin/bash
spark-sql ()> create schema grandfather.father.child;
Time taken: 1.176 seconds
spark-sql ()> create table grandfather.t1 (c integer);
Time taken: 0.514 seconds
spark-sql ()> create table grandfather.father.t2 (c integer);
Time taken: 0.546 seconds
spark-sql ()> create table grandfather.father.child.t3 (c integer);
Time taken: 0.564 seconds
spark-sql ()> show tables in grandfather;
t1
Time taken: 0.257 seconds, Fetched 1 row(s)
spark-sql ()> show tables in grandfather.father;
t2
Time taken: 0.041 seconds, Fetched 1 row(s)
spark-sql ()> show tables in grandfather.father.child;
t3
Time taken: 0.056 seconds, Fetched 1 row(s)

Related issue #1135

@findepi findepi changed the title [BUG] Iceberg Rest API - multipart namespaces aren't working. Support nested namespaces for Iceberg REST catalog Sep 30, 2024
@findepi findepi added the enhancement New feature or request label Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request iceberg Iceberg connector
Development

Successfully merging a pull request may close this issue.

6 participants