
Add Daft examples and code into PyIceberg docs and Table #355

Merged · 7 commits into apache:main · Feb 6, 2024

Conversation

@jaychia (Contributor) commented on Feb 2, 2024

  1. Adds a new optional installation extra, daft, so that pip install pyiceberg[daft] pulls Daft in as a dependency
  2. Adds a new Table.to_daft() method to convert a table into a Daft dataframe (a usage sketch follows below)
  3. Adds documentation with examples of Daft usage with PyIceberg
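
A minimal usage sketch of the API described above (the catalog name and table identifier here are placeholders, not names from this PR):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # placeholder catalog name
table = catalog.load_table("nyc.taxis")  # placeholder table identifier

# New in this PR: convert the Iceberg table into a Daft DataFrame.
# Daft evaluates lazily, so this does not eagerly scan the data.
df = table.to_daft()
df.show()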

@Fokko (Contributor) left a review:

This looks good @jaychia

pyproject.toml (outdated)
@@ -105,6 +105,7 @@ pyarrow = ["pyarrow"]
 pandas = ["pandas", "pyarrow"]
 duckdb = ["duckdb", "pyarrow"]
 ray = ["ray", "pyarrow", "pandas"]
+daft = ["getdaft>=0.2.12"]
@Fokko (Contributor) commented:

In Poetry you need to define daft as a requirement above, and you can reference it here. Does Daft ship with PyArrow by default?
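
(For reference, the Poetry pattern being described is roughly the following sketch; the version bound is copied from the diff above:)

[tool.poetry.dependencies]
# Declare the dependency itself as optional...
getdaft = { version = ">=0.2.12", optional = true }

[tool.poetry.extras]
# ...then reference it by name in the extra.
daft = ["getdaft"]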

@jaychia (Contributor, Author) replied:

Daft does define a transitive dependency on pyarrow!

@Fokko (Contributor) commented on Feb 4, 2024

Should we also have some sanity checks, for example:

def test_ray_nan(catalog: Catalog) -> None:
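
(A Daft analogue of that check might look like the sketch below; the table identifier and column name are assumptions for illustration, not fixtures from this PR:)

from pyiceberg.catalog import Catalog

def test_daft_nan(catalog: Catalog) -> None:
    # Hypothetical fixture table seeded with NULL and NaN values.
    table = catalog.load_table("default.test_null_nan")
    df = table.to_daft()    # conversion method added in this PR
    pdf = df.to_pandas()    # materialize for easy assertions
    # Expect at least one NaN to survive the round-trip through Daft.
    assert pdf["col_numeric"].isna().any()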

@jaychia (Contributor, Author) commented on Feb 5, 2024

Should we also have some sanity checks, for example:

def test_ray_nan(catalog: Catalog) -> None:

We could either do this in this PR, or as a follow-up. Let me know your preference!

I have some tests written up, but I'm having some trouble starting the dev environment locally.

Is make test-integration still the recommended way for running integration tests? I get some errors when provisioning data:

docker-compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/05 20:33:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
24/02/05 20:33:58 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
24/02/05 20:33:58 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
24/02/05 20:34:04 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
24/02/05 20:34:05 WARN metastore: Failed to connect to the MetaStore Server...
24/02/05 20:34:06 WARN metastore: Failed to connect to the MetaStore Server...
24/02/05 20:34:07 WARN metastore: Failed to connect to the MetaStore Server...
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
File /opt/spark/provision.py:51
     27 catalogs = {
     28     'rest': load_catalog(
     29         "rest",
   (...)
     47     ),
     48 }
     50 for catalog_name, catalog in catalogs.items():
---> 51     spark.sql(
     52         f"""
     53       CREATE DATABASE IF NOT EXISTS {catalog_name}.default;
     54     """
     55     )
     57     schema = Schema(
     58         NestedField(field_id=1, name="uuid_col", field_type=UUIDType(), required=False),
     59         NestedField(field_id=2, name="fixed_col", field_type=FixedType(25), required=False),
     60     )
     62     catalog.create_table(identifier="default.test_uuid_and_fixed_unpartitioned", schema=schema)

File /opt/spark/python/pyspark/sql/session.py:1440, in SparkSession.sql(self, sqlQuery, args, **kwargs)
   1438 try:
   1439     litArgs = {k: _to_java_column(lit(v)) for k, v in (args or {}).items()}
-> 1440     return DataFrame(self._jsparkSession.sql(sqlQuery, litArgs), self)
   1441 finally:
   1442     if len(kwargs) > 0:

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File /opt/spark/python/pyspark/errors/exceptions/captured.py:169, in capture_sql_exception.<locals>.deco(*a, **kw)
    167 def deco(*a: Any, **kw: Any) -> Any:
    168     try:
--> 169         return f(*a, **kw)
    170     except Py4JJavaError as e:
    171         converted = convert_exception(e.java_exception)

File /opt/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o41.sql.
: org.apache.iceberg.hive.RuntimeMetaException: Failed to connect to Hive Metastore

@Fokko (Contributor) commented on Feb 6, 2024

@jaychia Recently Hive integration tests have been added. First, you want to make sure that you're on a recent version of Docker. Also, it is good to periodically run make test-integration-rebuild to build a fresh copy of the image.
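
(That is, the sequence would be:)

make test-integration-rebuild   # rebuild the integration images from scratch
make test-integration           # then re-run the integration suite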

@Fokko merged commit cc0fd86 into apache:main on Feb 6, 2024 · 6 checks passed