
Add Daft examples and code into PyIceberg docs and Table #355

Merged · 7 commits into apache:main · Feb 6, 2024

Conversation

@jaychia (Contributor) commented on Feb 2, 2024

  1. Adds a new optional installation extra, daft, so that pip install pyiceberg[daft] pulls Daft in as a dependency
  2. Adds a new Table.to_daft() method to convert a table into a Daft dataframe (a usage sketch follows below)
  3. Adds documentation with examples of Daft usage with PyIceberg
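
A minimal usage sketch of the API described above (the catalog name and table identifier here are placeholders, not names from this PR):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("nyc.taxis")  # placeholder table identifier

# New in this PR: convert the Iceberg table into a Daft DataFrame.
# Daft evaluates lazily, so this does not eagerly scan the data.
df = table.to_daft()
df.show()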

@Fokko (Contributor) left a review:

This looks good @jaychia

pyproject.toml (outdated)
@@ -105,6 +105,7 @@ pyarrow = ["pyarrow"]
 pandas = ["pandas", "pyarrow"]
 duckdb = ["duckdb", "pyarrow"]
 ray = ["ray", "pyarrow", "pandas"]
+daft = ["getdaft>=0.2.12"]
@Fokko (Contributor) commented:

In Poetry you need to define daft as a requirement above, and you can reference it here. Does Daft ship with PyArrow by default?
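
(For reference, the Poetry pattern being described is roughly the following sketch; the version bound is copied from the diff above:)

[tool.poetry.dependencies]
# Declare the dependency itself as optional...
getdaft = { version = ">=0.2.12", optional = true }

[tool.poetry.extras]
# ...then reference it by name in the extra.
daft = ["getdaft"]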

@jaychia (Contributor, Author) replied:

Daft does define a transitive dependency on pyarrow!

@Fokko (Contributor) commented on Feb 4, 2024

Should we also have some sanity checks, for example:

def test_ray_nan(catalog: Catalog) -> None:
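
(A Daft analogue of that check might look like the sketch below; the table identifier and column name are assumptions for illustration, not fixtures from this PR:)

from pyiceberg.catalog import Catalog

def test_daft_nan(catalog: Catalog) -> None:
    # Hypothetical fixture table seeded with NULL and NaN values.
    table = catalog.load_table("default.test_null_nan")
    df = table.to_daft()    # conversion method added in this PR
    pdf = df.to_pandas()    # materialize for easy assertions
    # Expect at least one NaN to survive the round-trip through Daft.
    assert pdf["col_numeric"].isna().any()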

@jaychia (Contributor, Author) commented on Feb 5, 2024

Should we also have some sanity checks, for example:

def test_ray_nan(catalog: Catalog) -> None:

We could either do this in this PR, or as a follow-up. Let me know your preference!

I have some tests written up, but I'm having some trouble starting the dev environment locally.

Is make test-integration still the recommended way for running integration tests? I get some errors when provisioning data:

docker-compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/05 20:33:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
24/02/05 20:33:58 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
24/02/05 20:33:58 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
24/02/05 20:34:04 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
24/02/05 20:34:05 WARN metastore: Failed to connect to the MetaStore Server...
24/02/05 20:34:06 WARN metastore: Failed to connect to the MetaStore Server...
24/02/05 20:34:07 WARN metastore: Failed to connect to the MetaStore Server...
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File /opt/spark/provision.py:51
     27 catalogs = {
     28     'rest': load_catalog(
     29         "rest",
   (...)
     47     ),
     48 }
     50 for catalog_name, catalog in catalogs.items():
---> 51     spark.sql(
     52         f"""
     53       CREATE DATABASE IF NOT EXISTS {catalog_name}.default;
     54     """
     55     )
     57     schema = Schema(
     58         NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False),
     59         NestedField(field_id=2, name="fixed_col", field_type=FixedType(25), required=False),
     60     )
     62     catalog.create_table(identifier="default.test_uuid_and_fixed_unpartitioned", schema=schema)

File /opt/spark/python/pyspark/sql/session.py:1440, in SparkSession.sql(self, sqlQuery, args, **kwargs)
   1438 try:
   1439     litArgs = {k: _to_java_column(lit(v)) for k, v in (args or {}).items()}
-> 1440     return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
   1441 finally:
   1442     if len(kwargs) > 0:

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /opt/spark/python/pyspark/errors/exceptions/captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
    167 def deco(*a: Any, **kw: Any) -> Any:
    168     try:
--> 169         return f(*a, **kw)
    170     except Py4JJavaError as e:
    171         converted = convert_exception(e.java_exception)

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o41.sql.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore

@Fokko (Contributor) commented on Feb 6, 2024

@jaychia Recently Hive integration tests have been added. First, you want to make sure that you're on a recent version of Docker. Also, it is good to periodically run make test-integration-rebuild to build a fresh copy of the image.
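
(That is, the sequence would be:)

make test-integration-rebuild   # rebuild the integration images from scratch
make test-integration           # then re-run the integration suite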

@Fokko merged commit cc0fd86 into apache:main on Feb 6, 2024 · 6 checks passed