feat: Iceberg table support #7712

lostmygithubaccount · 2023-12-11T15:24:20Z

Is your feature request related to a problem?

Support Iceberg tables in Ibis

Using: https://github.com/apache/iceberg-python

Main blocker is write support, tracked here: apache/iceberg-python#23

Describe the solution you'd like

ibis.read_iceberg

table.to_iceberg

What version of ibis are you running?

n/a

What backend(s) are you using, if any?

local backends that would support this

Code of Conduct

I agree to follow this project's Code of Conduct


cpcloud · 2023-12-11T15:40:16Z

I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:

1. It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that they allow results to fit in memory.
2. The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.

At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.

cpcloud · 2023-12-11T15:42:06Z

I think a better option might be https://duckdb.org/docs/extensions/iceberg.html at least for DuckDB.
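A rough sketch of that DuckDB route, assuming the `iceberg` extension can be installed in the environment. The helper name `scan_iceberg_with_duckdb` and its parameters are hypothetical; `con` is expected to be a DuckDB connection (for example from `duckdb.connect()`), and `table_path` points at an Iceberg table directory or metadata file.

```python
def scan_iceberg_with_duckdb(con, table_path, columns=("*",)):
    """Return a lazy DuckDB relation over an Iceberg table.

    `con` is assumed to be a DuckDB connection. Column names are
    interpolated as-is, so only pass trusted identifiers.
    """
    con.execute("INSTALL iceberg")  # fetches the extension if it is missing
    con.execute("LOAD iceberg")
    escaped_path = table_path.replace("'", "''")  # escape single quotes for SQL
    column_list = ", ".join(columns)
    return con.sql(f"SELECT {column_list} FROM iceberg_scan('{escaped_path}')")
```

With a real connection, the returned relation could then be filtered or aggregated in DuckDB, so the engine owns the compute rather than the Iceberg client.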

deepyaman · 2024-01-18T18:21:35Z

> Main blocker is write support, tracked here: apache/iceberg-python#23

This, at least, is resolved. :)

lostmygithubaccount · 2024-01-18T22:33:40Z

@deepyaman any interest in taking a stab at this?

deepyaman · 2024-01-18T23:15:13Z

> @deepyaman any interest in taking a stab at this?

Is this ready for implementation? It seems @cpcloud's first concern is resolved, but is the second one? We could get a pa.Table from DataScan.to_arrow, but that would mean not pushing down the projection and/or leaving the execution up to pyiceberg.

mfatihaktas · 2024-02-08T20:29:15Z

> I see a number of technical issues with the iceberg python client that I think are blockers for using it as the basis for iceberg support in Ibis:
>
> 1. It appears to only support in-memory results. This seems like it defeats the purpose of using iceberg in python unless you can guarantee your projections and filters are selective enough that they allow results to fit in memory.
> 2. The python client seems to want to own any compute related to projections and filters, which again seems to defeat the purpose of decoupling storage and compute.

As far as I know, Iceberg is a table format meant to be used through a compute engine (e.g., Spark). Along that line, I think it is expected that pyiceberg executes projections and filters in memory when there is no intermediate compute engine. Iceberg maintains rich metadata for its tables, and scanning that metadata can significantly reduce the number of (partition) files that need to be pulled. However, yes, it is on the user to make sure the results fit in memory.

Regarding point 2 above, pyiceberg's DataScan.to_arrow() first calls plan_files() to get the list of relevant files, then calls project_table() to run the projections and filters in memory, and returns the data as a PyArrow table.
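To make the two phases concrete, here is a toy, pure-Python model of the flow described above. It is not pyiceberg's implementation: the `DataFile` class, the `id` range statistics, and both functions are hypothetical stand-ins that only mirror the idea of pruning files from metadata first and then filtering and projecting rows in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a data file plus its column statistics."""
    path: str
    min_id: int
    max_id: int
    rows: tuple  # the file's rows as dicts; a real file would be Parquet


def plan_files(files, lo, hi):
    """Phase 1: use metadata only to keep files whose id range overlaps [lo, hi]."""
    return [f for f in files if f.max_id >= lo and f.min_id <= hi]


def project_rows(files, lo, hi, columns):
    """Phase 2: read the planned files, then filter and project rows in memory."""
    return [
        {c: row[c] for c in columns}
        for f in files
        for row in f.rows
        if lo <= row["id"] <= hi
    ]


files = [
    DataFile("a.parquet", 1, 10, ({"id": 1, "name": "a"}, {"id": 10, "name": "b"})),
    DataFile("b.parquet", 11, 20, ({"id": 11, "name": "c"}, {"id": 20, "name": "d"})),
    DataFile("c.parquet", 21, 30, ({"id": 21, "name": "e"},)),
]

planned = plan_files(files, 10, 15)  # c.parquet is pruned without being read
result = project_rows(planned, 10, 15, ["name"])
```

Phase 1 touches only metadata, which is where most of the savings come from; phase 2 is the part that has to fit in memory.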

> At the very least, we'd need to be able to get back a PyArrow Dataset that can be streamed into a query engine like DuckDB before we can consider using the iceberg python client.

Looking at the implementation of pyiceberg's DataScan.to_arrow(), my initial impression is that it should be straightforward to (1) scan the Iceberg table and pull only the relevant files, and (2) put those files in a pyarrow dataset.

@cpcloud Do these points make sense to you? If they do, I can take a stab at this issue.

Disclaimer: My understanding of Iceberg might not be fully correct as my knowledge of Iceberg is limited :)


lostmygithubaccount added the feature Features or general enhancements label Dec 11, 2023

lostmygithubaccount added this to Ibis planning and roadmap Dec 11, 2023

lostmygithubaccount moved this to backlog in Ibis planning and roadmap Dec 11, 2023

lostmygithubaccount self-assigned this Dec 11, 2023

lostmygithubaccount removed their assignment Jan 25, 2024

mfatihaktas mentioned this issue Feb 6, 2024: feat(flink): Support temporal join on Iceberg table #8254 (Open, 1 task)

mfatihaktas mentioned this issue Feb 13, 2024: feat: support iceberg read/write #8343 (Closed)

chloeh13q mentioned this issue Feb 29, 2024: meta: increase Flink's streaming backend coverage #8250 (Closed, 1 task)

noklam mentioned this issue Sep 25, 2024: Expose PyIceberg table as PyArrow Dataset apache/iceberg-python#30 (Open)

noklam mentioned this issue Nov 26, 2024: [Versioning]: Explore Kedro + Iceberg for versioning kedro-org/kedro#4241 (Closed)
