For now, you need to either clone the repo or install directly from GitHub:
pip install git+
This software is not released and is very poorly tested. With luck, you might find it useful.
Let's say you have access to an Iceberg dataset at a known path. The "root" path of an Iceberg dataset is the one above the "metadata/" and "data/" directories. In this quickstart, I will demo with the test data included in this repo, so assume you are in the root directory of the repo.
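As a quick sanity check before opening a dataset, the following minimal sketch tests whether a directory has the layout described above. The helper name `looks_like_iceberg_root` is hypothetical and not part of this package. It assumes a local filesystem, and a freshly created table with no data written yet might fail the check.

```python
from pathlib import Path


def looks_like_iceberg_root(path):
    """Return True if `path` contains both "metadata/" and "data/" subdirectories.

    Hypothetical helper for illustration only; it checks the directory layout
    and does not validate any Iceberg metadata.
    """
    root = Path(path)
    return (root / "metadata").is_dir() and (root / "data").is_dir()
```

For example, `looks_like_iceberg_root("./test-data/my_table/")` checks the test table used in this quickstart.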
In [14]: ORIG_DIR = "/Users/mdurant/temp/warehouse/db/my_table"
In [15]: ice ="./test-data/my_table/", ORIG_DIR)
In [16]: ice.version # latest version file found
Out[16]: 5
In [17]: ice.schema
[{'id': 1, 'name': 'name', 'required': False, 'type': 'string'},
{'id': 2, 'name': 'age', 'required': False, 'type': 'int'},
{'id': 3, 'name': 'email', 'required': False, 'type': 'string'}]
In [18]: len(ice.snapshots)
Out[18]: 3
In [19]:
Dask DataFrame Structure:
name age email
object Int32 object
... ... ...
... ... ... ...
... ... ...
... ... ...
Dask Name: read-parquet, 1 graph layer
In [20]:
name age email
0 Bob 20 None
0 John 56
0 Fiona 25 None
0 Roger 25 None
0 Alex 36 None
In [21]: ice.open_snapshot(-1)
In [22]:
name age
0 Bob 20
0 Fiona 25
0 Roger 25
0 Alex 36
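To give a feel for what the version and schema values in the session above correspond to on disk, here is a rough sketch using only the standard library. It is not how this package is implemented. It assumes the metadata files follow the `v<N>.metadata.json` naming used by Hadoop-style tables (other catalogs name them differently) and that the JSON follows the Iceberg table spec, with `schema` in format version 1 and `schemas` plus `current-schema-id` in version 2. Both function names are made up for this example.

```python
import json
import re
from pathlib import Path

# Assumption: Hadoop-style naming, e.g. "v5.metadata.json".
VERSION_FILE = re.compile(r"^v(\d+)\.metadata\.json$")


def latest_metadata_file(root):
    """Return (version, path) of the highest-numbered metadata file under root/metadata."""
    found = []
    for p in (Path(root) / "metadata").iterdir():
        m = VERSION_FILE.match(p.name)
        if m:
            found.append((int(m.group(1)), p))
    if not found:
        raise FileNotFoundError(f"no v<N>.metadata.json under {root}")
    return max(found, key=lambda item: item[0])


def current_schema_fields(metadata):
    """Return the list of field dicts for the table's current schema."""
    if "schemas" in metadata:  # format version 2 layout
        current_id = metadata.get("current-schema-id", 0)
        for schema in metadata["schemas"]:
            if schema.get("schema-id", 0) == current_id:
                return schema["fields"]
        raise KeyError(f"schema id {current_id} not found")
    return metadata["schema"]["fields"]  # format version 1 layout


def describe(root):
    """Read only the metadata JSON; no data files are opened."""
    version, path = latest_metadata_file(root)
    metadata = json.loads(path.read_text())
    return {
        "version": version,
        "fields": current_schema_fields(metadata),
        "n_snapshots": len(metadata.get("snapshots", [])),
    }
```

Calling `describe("./test-data/my_table/")` would only touch files under `metadata/`, which is why this kind of introspection is cheap.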
Some notes:
- The data were created in a different location from where they are now found. Iceberg doesn't normally allow you to do this, but we can correct for it with ORIG_DIR.
- We can introspect the schema and any partitioning without touching any data files.
- We create dask dataframes by default, and you can use these on a distributed cluster if all the workers can access the data files.
- You can move to different snapshots. Here we went one step back in time. See how the schema changed.
(The data were created with PySpark SQL, following a Dremio community tutorial.)
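The relocation correction mentioned in the first note can be pictured as a prefix rewrite: paths recorded in the metadata start with the original table location and need to be mapped onto the current one. The sketch below is one plausible way to do that, not necessarily what this package does internally. The function name `relocate` and the example file name are hypothetical, and URL schemes such as `file:` in recorded paths are not handled.

```python
def relocate(recorded_path, original_root, current_root):
    """Map a path recorded under `original_root` onto `current_root`.

    Paths that do not live under `original_root` are returned unchanged.
    Hypothetical helper for illustration; schemes like "file:" are not stripped.
    """
    original = original_root.rstrip("/")
    inside = recorded_path == original or recorded_path.startswith(original + "/")
    if not inside:
        return recorded_path
    return current_root.rstrip("/") + recorded_path[len(original):]


ORIG_DIR = "/Users/mdurant/temp/warehouse/db/my_table"
# "data/00000.parquet" is a made-up file name used only for this example.
print(relocate(ORIG_DIR + "/data/00000.parquet", ORIG_DIR, "./test-data/my_table/"))
```

Matching on `original + "/"` rather than a bare prefix avoids rewriting a sibling directory such as `my_table_backup`.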
What works:
- most data types
- filtering (meaning you don't load data files, or even manifest files, that cannot match the filter)
- derived partitions in filters
- some basic operations with the Iceberg REST service, particularly finding the location of the current metadata file for a given table
- Reading from any storage backend supported by
Testing was mostly done with fastparquet, which newly supports schema evolution.
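To illustrate what filtering on a derived partition means, here is a toy sketch that is independent of this package. Iceberg's `day` transform maps a timestamp to the number of whole days since 1970-01-01, so a filter on the timestamp column can be turned into a filter on the partition value and whole files can be skipped without opening them. The file listing format (dicts with `path` and `ts_day` keys) is invented for the example, and naive datetimes are assumed to be UTC.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_transform(ts):
    """Iceberg `day` transform: whole days since 1970-01-01 (naive datetimes taken as UTC)."""
    return (ts.date() - EPOCH).days


def prune_files(files, ts_min):
    """Keep only files whose day partition could hold rows with ts >= ts_min.

    `files` is a hypothetical listing: dicts with "path" and "ts_day" keys.
    Rows inside the kept files still need the exact timestamp filter applied.
    """
    min_day = day_transform(ts_min)
    return [f for f in files if f["ts_day"] >= min_day]
```

The pruning is deliberately conservative: a file for the boundary day is kept even though some of its rows may fall before `ts_min`.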
What does not work:
- any writing at all
- we do not make use of much of the available metadata, as dask's API was not designed with the expectation that you might already have such information
- only Parquet data files are handled
- the REST client does no auth (!) and most routes are not implemented.
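For orientation, looking up a table's current metadata file through an Iceberg REST catalog amounts to one GET request. The sketch below uses only the standard library and is not this package's client. It assumes a catalog with no path prefix, a single-level namespace, and no authentication, and it relies on the `metadata-location` field that the REST catalog spec defines in the load-table response. The function names and any base URL you pass in are placeholders.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def table_url(base_url, namespace, table):
    """Build the load-table URL (assumes no catalog prefix, single-level namespace)."""
    ns = quote(namespace, safe="")
    tbl = quote(table, safe="")
    return f"{base_url.rstrip('/')}/v1/namespaces/{ns}/tables/{tbl}"


def metadata_location(load_table_response):
    """Extract the current metadata file location from a parsed load-table response."""
    return load_table_response["metadata-location"]


def fetch_metadata_location(base_url, namespace, table):
    """Perform the request with no authentication and return the metadata location."""
    with urlopen(table_url(base_url, namespace, table)) as resp:
        return metadata_location(json.load(resp))
```

A real deployment would normally need an auth token and possibly a catalog prefix in the path, neither of which is handled here.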