Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Iceberg table support #7712

Open
1 task done
lostmygithubaccount opened this issue Dec 11, 2023 · 6 comments
Open
1 task done

feat: Iceberg table support #7712

lostmygithubaccount opened this issue Dec 11, 2023 · 6 comments
Labels
feature Features or general enhancements

Comments

@lostmygithubaccount
Copy link
Member

Is your feature request related to a problem?

Support Iceberg tables in Ibis

Using: https://github.com/apache/iceberg-python

Main blocker is write support, tracked here: apache/iceberg-python#23

Describe the solution you'd like

ibis.read_iceberg

table.to_iceberg

What version of ibis are you running?

n/a

What backend(s) are you using, if any?

local backends that would support this

Code of Conduct

  • I agree to follow this project's Code of Conduct
@cpcloud
Copy link
Member

cpcloud commented Dec 11, 2023

I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:

  1. It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that they allow results to fit in memory.
  2. The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.

At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.

@cpcloud
Copy link
Member

cpcloud commented Dec 11, 2023

I think a better option might be https://duckdb.org/docs/extensions/iceberg.html at least for DuckDB.

@deepyaman
Copy link
Contributor

Main blocker is write support, tracked here: apache/iceberg-python#23

This, at least, is resolved. :)

@lostmygithubaccount
Copy link
Member Author

@deepyaman any interest in taking a stab at this?

@deepyaman
Copy link
Contributor

@deepyaman any interest in taking a stab at this?

Is this ready for implementation? It seems @cpcloud's first concern is resolved, but is the second one? We could get a pa.Table from DataScan.to_pyarrow, but that would mean not pushing down the projection and/or leaving the execution up to pyiceberg.

@mfatihaktas
Copy link
Contributor

I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:

  1. It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that they allow results to fit in memory.
  2. The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.

As far as I know, Iceberg is a table format for compute engines (e.g., Spark) to work with. Along that line, I think it is expected for pyiceberg to execute projections and filters in memory in the absence of an intermediate compute engine. Iceberg maintains a rich set of meta-data for the tables, which enables scanning the meta-data to (significantly) reduce the number of (partition) files pulled. However, yes, it is on the user to make sure the results fit in memory.

As raised in 2. above, pyiceberg.to_arrow() first calls plan_files() to get a list of relevant files, then calls project_table() to run the projections and filters in-memory and returns the data in a pyarrow table.

At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.

Looking at the implementation of pyiceberg.to_arrow(), my initial impression is that it should be straightforward to (1) scan the Iceberg table and pull only the relevant files, (2) put the files in a pyarrow dataset.

@cpcloud Do these points make sense to you? If they do, I can take a stab at this issue.

Disclaimer: My understanding of Iceberg might not be fully correct as my knowledge of Iceberg is limited :)

Refs:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Features or general enhancements
Projects
Status: backlog
Development

No branches or pull requests

4 participants