Integrate pyiceberg with Dask #5800
Here's a rough version of a `to_dask_dataframe`:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual
from pyiceberg.io.pyarrow import _file_to_table
from pyiceberg.io.pyarrow import (
    PyArrowFileIO, bind, extract_field_ids, schema_to_pyarrow, MapType, ListType,
)
import pyarrow as pa
import dask
import dask.dataframe as dd


def _file_to_pandas(fs, task, bound_row_filter, projected_schema, projected_field_ids, case_sensitive):
    return _file_to_table(
        fs, task, bound_row_filter, projected_schema, projected_field_ids, case_sensitive
    ).to_pandas()


def to_dask_dataframe(sc):
    """Convert a DataScan to a Dask DataFrame."""
    # arguments
    tasks = list(sc.plan_files())
    table = sc.table
    row_filter = sc.row_filter
    projected_schema = sc.projection()
    case_sensitive = sc.case_sensitive

    # stuff stolen from to_arrow()
    if isinstance(table.io, PyArrowFileIO):
        scheme, _ = PyArrowFileIO.parse_location(table.location())
        fs = table.io.get_fs(scheme)
    else:
        raise ValueError(f"Expected PyArrowFileIO, got: {table.io}")
    bound_row_filter = bind(table.schema(), row_filter, case_sensitive=case_sensitive)
    projected_field_ids = {
        id for id in projected_schema.field_ids
        if not isinstance(projected_schema.find_type(id), (MapType, ListType))
    }.union(extract_field_ids(bound_row_filter))

    # build the Dask DataFrame
    schema = schema_to_pyarrow(projected_schema)
    # an empty table with the right schema, used as Dask's metadata
    meta = pa.table([[]] * len(schema.names), schema=schema).to_pandas()
    # TODO: ensure deterministic
    token = dask.base.tokenize(fs, bound_row_filter, projected_schema, projected_field_ids, case_sensitive)
    name = f'from-iceberg-{token}'
    dsk = {
        (name, i): (
            _file_to_pandas, fs, task, bound_row_filter, projected_schema, projected_field_ids, case_sensitive
        )
        for i, task in enumerate(tasks)
    }
    # divisions needs one entry per partition boundary, i.e. npartitions + 1
    divisions = [None] * (len(dsk) + 1)
    return dd.DataFrame(dsk, name, meta, divisions)
```

It seems to work, but I haven't tested it beyond a basic […]
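
For context, here is a hypothetical way the above could be driven end to end. The catalog name `default`, the table `db.events`, and its `ts`/`user_id` columns are made-up placeholders, not anything from the original post:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Hypothetical setup: assumes a catalog named "default" is configured and
# contains a table "db.events" with "ts" and "user_id" columns.
catalog = load_catalog("default")
table = catalog.load_table("db.events")

# Plan a filtered, projected scan and hand it to the converter above.
scan = table.scan(
    row_filter=GreaterThanOrEqual("ts", "2023-01-01T00:00:00+00:00"),
    selected_fields=("ts", "user_id"),
)
ddf = to_dask_dataframe(scan)
print(ddf.head())
```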
I'm keen to push this forward. @TomAugspurger's implementation works in single-threaded mode but fails in a distributed scenario due to the current lack of pickle support (I raised a separate issue, #7644). Extending Tom's approach, here is a solution that utilises Dask's `DataFrameIOFunction`:

```python
from typing import List, Sequence

import dask.dataframe as dd
import pandas as pd
from dask.dataframe.io.utils import DataFrameIOFunction
from pyarrow.fs import FileSystem, FSSpecHandler, PyFileSystem

from pyiceberg.expressions import BooleanExpression
from pyiceberg.io.pyarrow import (
    PyArrowFileIO, _file_to_table, bind, extract_field_ids, schema_to_pyarrow,
    MapType, ListType,
)
from pyiceberg.schema import Schema
from pyiceberg.table import DataScan, FileScanTask


class IcebergFunctionWrapper(DataFrameIOFunction):
    """Pickleable per-file read function that supports Dask column projection."""

    def __init__(
        self,
        fs: FileSystem,
        bound_row_filter: BooleanExpression,
        projected_schema: Schema,
        case_sensitive: bool,
    ):
        self._fs = fs
        self._bound_row_filter = bound_row_filter
        self._projected_schema = projected_schema
        self._case_sensitive = case_sensitive
        self._projected_field_ids = {
            id for id in projected_schema.field_ids
            if not isinstance(projected_schema.find_type(id), (MapType, ListType))
        }.union(extract_field_ids(bound_row_filter))
        super().__init__()

    @property
    def columns(self) -> List[str]:
        return self._projected_schema.column_names

    @property
    def empty_table(self) -> pd.DataFrame:
        return schema_to_pyarrow(self._projected_schema).empty_table().to_pandas(date_as_object=False)

    def project_columns(self, columns: Sequence[str]) -> 'IcebergFunctionWrapper':
        if list(columns) == self.columns:
            return self
        return IcebergFunctionWrapper(
            self._fs,
            self._bound_row_filter,
            self._projected_schema.select(*columns),
            self._case_sensitive,
        )

    def __call__(self, task: FileScanTask) -> pd.DataFrame:
        table = _file_to_table(
            self._fs,
            task,
            self._bound_row_filter,
            self._projected_schema,
            self._projected_field_ids,
            self._case_sensitive,
            0,  # no limit support yet
        )
        if table is None:
            return self.empty_table
        return table.to_pandas(date_as_object=False)


def to_dask_dataframe(scan: DataScan) -> dd.DataFrame:
    tasks = scan.plan_files()
    table = scan.table
    row_filter = scan.row_filter
    projected_schema = scan.projection()
    case_sensitive = scan.case_sensitive

    scheme, _ = PyArrowFileIO.parse_location(table.location())
    if isinstance(table.io, PyArrowFileIO):
        fs = table.io.get_fs(scheme)
    else:
        try:
            from pyiceberg.io.fsspec import FsspecFileIO

            if isinstance(table.io, FsspecFileIO):
                fs = PyFileSystem(FSSpecHandler(table.io.get_fs(scheme)))
            else:
                raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {table.io}")
        except ModuleNotFoundError as e:
            # when fsspec is not installed
            raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {table.io}") from e

    bound_row_filter = bind(table.schema(), row_filter, case_sensitive=case_sensitive)
    io_func = IcebergFunctionWrapper(fs, bound_row_filter, projected_schema, case_sensitive)
    return dd.from_map(
        io_func,
        tasks,
        meta=io_func.empty_table,
        enforce_metadata=False,
    )
```

I'm also looking into adding […]. Generally, should this be part of the Dask library instead of PyIceberg?
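
To illustrate why the `DataFrameIOFunction` protocol matters here: it lets Dask's graph optimisation push column selections down into the per-file reads. A hypothetical usage sketch, reusing the made-up `table` and `user_id` names from the earlier snippet:

```python
# Hypothetical usage of the sketch above.
ddf = to_dask_dataframe(table.scan())

# Because the IO function implements DataFrameIOFunction, Dask's graph
# optimisation can rewrite the read to call project_columns(["user_id"]),
# so only that column needs to be read from the underlying data files.
user_ids = ddf[["user_id"]].compute()
```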
I was wondering that too. I think either would be sensible, but I'd lean slightly towards putting the implementation in Dask (and pyiceberg could add a […]). That said, the current implementation does use a private pyiceberg function (`_file_to_table`).
PyIceberg 0.4.0 has been released today, and should fix the pickling issues! 👍🏻
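
If so, a quick way to verify would be a pickle round-trip of the objects that end up in the task graph. A hypothetical smoke test, reusing names bound inside `to_dask_dataframe` above:

```python
import pickle

# Hypothetical smoke test: anything embedded in the Dask graph must survive
# a pickle round-trip before it can be shipped to distributed workers.
# fs, bound_row_filter, projected_schema, and case_sensitive are assumed to
# be bound as in to_dask_dataframe above.
io_func = IcebergFunctionWrapper(fs, bound_row_filter, projected_schema, case_sensitive)
restored = pickle.loads(pickle.dumps(io_func))
assert restored.columns == io_func.columns
```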
What's the latest on this issue? Also keen to write to Iceberg directly from Dask.
I think the main decision point was around where this should live: dask or pyiceberg. I don't see the private […]
Hello, is there any update on this?
Feature Request / Improvement
It would be awesome to integrate pyiceberg with Dask, to allow reading data from Iceberg tables into Dask.
Query engine
Other