Support Iceberg in Pandas #9

ehariri · 2022-06-30T06:22:17Z

Apache Iceberg is an open source table format designed for efficient reads from large datasets. By leveraging detailed statistics and hidden partitioning, reads stay efficient even for extremely large datasets, up to petabytes according to the author.

As a result, there is a lot of interest in Iceberg in the data engineering community, and many people seem eager to try running their production workloads on Iceberg. To help meet this growing demand, we believe that Pandas should be able to read DataFrames from datasets in the Iceberg format.

Bodo can be used underneath Pandas to read Iceberg, probably through pd.read_sql/pd.read_sql_table.

datapythonista · 2022-07-14T16:05:18Z

Personally, I think it's limiting and not very scalable if pandas has to support every format. I think pandas should provide a standard way to load I/O plugins, and this (and many other readers/writers) should be implemented as third-party projects. Not only would maintaining them be easier and more efficient, but having competing options for certain formats would be beneficial: for example, one CSV reader that makes lots of assumptions to make things easier for users, and another that requires more programming but is faster and safer.
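
To make the plugin idea concrete, here is a minimal sketch of what a reader registry could look like. Everything in it is hypothetical: pandas does not expose `register_reader` or `read`, and the stub reader returns a plain dict where a real plugin would return a DataFrame. Keying the registry on both format and engine is one possible way to let competing implementations of the same format coexist.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical registry, keyed by (format, engine). Not a pandas API.
_READERS: Dict[Tuple[str, str], Callable[..., Any]] = {}


def register_reader(fmt: str, engine: str):
    """Decorator a third-party package could use to register a reader."""
    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        key = (fmt, engine)
        if key in _READERS:
            raise ValueError(f"reader already registered for {key!r}")
        _READERS[key] = func
        return func
    return decorator


def read(fmt: str, *args: Any, engine: str, **kwargs: Any) -> Any:
    """Dispatch to the reader registered for (fmt, engine)."""
    try:
        reader = _READERS[(fmt, engine)]
    except KeyError:
        raise ValueError(
            f"no reader registered for format {fmt!r}, engine {engine!r}"
        ) from None
    return reader(*args, **kwargs)


# Stand-in plugin: a real one would read the table and return a DataFrame.
@register_reader("iceberg", "stub")
def _read_iceberg_stub(table: str, **options: Any) -> dict:
    return {"table": table, "options": options}
```

With this in place, `read("iceberg", "table1", engine="stub")` calls the stub, and asking for an engine nobody registered raises `ValueError`. A real mechanism would more likely discover plugins through package entry points than through import-time decorators.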

mroeschke · 2022-07-14T16:11:24Z

If a Python wrapper exists to read Iceberg via read_sql, I think it would be wise to develop an interface in pandas for users to plug and play any "SQL" engine.
pandas-dev/pandas#41728
pandas-dev/pandas#36893

ehsantn · 2022-08-03T16:26:38Z

Yes, this could be a plugin. From the user perspective, something like pd.read_sql_table("table1", "iceberg+thrift://...", "my_schema") should just work.

https://docs.bodo.ai/2022.7/file_io/#iceberg-section
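
As a rough sketch of how a connection string like the one above could select a plugin, the scheme can be split into an engine name and a transport using only the standard library. The helper names (`split_scheme`, `pick_engine`), the engine table, and the host and port in the usage note are all assumptions for illustration; this is not how pandas or Bodo actually resolve connection strings.

```python
from typing import Any, Dict, Tuple
from urllib.parse import urlsplit


def split_scheme(con: str) -> Tuple[str, str]:
    """Split a scheme such as 'iceberg+thrift' into (engine, transport)."""
    scheme = urlsplit(con).scheme
    engine, _, transport = scheme.partition("+")
    return engine, transport


def pick_engine(con: str, engines: Dict[str, Any]) -> Any:
    """Look up the plugin registered for the engine named in `con`."""
    engine, _ = split_scheme(con)
    if engine not in engines:
        raise ValueError(f"no engine registered for {engine!r}")
    return engines[engine]
```

For a made-up address, `split_scheme("iceberg+thrift://example-host:9083")` yields the engine `iceberg` and the transport `thrift`, so a dispatcher could route the call to whichever plugin registered itself under `iceberg`.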
