
Add class to write Avro files and add PoC update_table api call #377

Open · wants to merge 1 commit into base: fd-avro
Conversation

maxdebayser

Hi @Fokko, I was trying to implement writing to Iceberg from Python without using pyspark when I found your PR that already implemented some of the steps that are required. With the additions in this PR, I can build the avro files, upload them to S3 and call the API to update the table metadata to point to the metadata and data.

This PR is by no means complete, but I hope to get some feedback on this as I'm keen on having this feature in the pyiceberg library.

@github-actions github-actions bot added the python Pull requests that update Python code label Jun 4, 2023
@@ -199,3 +202,76 @@ def __next__(self) -> D:

def _read_header(self) -> AvroFileHeader:
return construct_reader(META_SCHEMA, {-1: AvroFileHeader}).read(self.decoder)


With this class I can write Avro files like this:

with avro.AvroOutputFile(PyArrowFileIO().new_output("writetest.avro"), MANIFEST_ENTRY_SCHEMA, "manifest_entry") as out:
    out.write_block([entry])

But perhaps we could also refactor the AvroFile and AvroOutputFile classes into a single RW class.

if field.required:
return field_result
else:
return OptionWriter(field_result)

Without this change, the Avro file structure for an optional field was identical to that of a required field, so the file couldn't be parsed as containing an optional field.
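For context, Avro encodes an optional field as the two-branch union `["null", T]`: the writer first emits the zero-based branch index as a zigzag-encoded int, and only then the value itself (nothing for the null branch). A minimal stdlib-only sketch of that encoding; the `zigzag_encode` and `write_optional_long` helpers are illustrative, not part of this PR:

```python
def zigzag_encode(n: int) -> bytes:
    """Avro int/long encoding: zigzag, then variable-length base-128."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def write_optional_long(val):
    """Encode an optional long as the Avro union ["null", "long"]."""
    if val is None:
        return zigzag_encode(0)                    # branch 0: null, no payload
    return zigzag_encode(1) + zigzag_encode(val)   # branch 1: the long itself
```

This is what `OptionWriter` contributes on top of the plain field writer: the extra branch-index prefix that a required field never has.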

self.key_writer.write(encoder, k)
self.value_writer.write(encoder, v)
if len(val) > 0:
encoder.write_int(0)

From https://avro.apache.org/docs/1.11.1/specification/#schema-maps : "A block with count zero indicates the end of the map."
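The spec's rule can be sketched with a toy stdlib-only encoder for a `map<long, long>` (the `zigzag` and `encode_long_map` helpers are illustrative, not the PR's code): a non-empty map is one count-prefixed block of entries followed by the mandatory zero-count terminator, while an empty map is just the zero count.

```python
def zigzag(n: int) -> bytes:
    """Avro int/long encoding: zigzag, then variable-length base-128."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_long_map(val: dict) -> bytes:
    out = bytearray()
    if val:
        out += zigzag(len(val))        # one block holding all entries
        for k, v in val.items():
            out += zigzag(k) + zigzag(v)
        out += zigzag(0)               # "A block with count zero indicates the end of the map."
    else:
        out += zigzag(0)               # an empty map is just the terminator
    return bytes(out)
```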

@@ -184,6 +184,8 @@ def write(self, encoder: BinaryEncoder, val: List[Any]) -> None:
encoder.write_int(len(val))
for v in val:
self.element_writer.write(encoder, v)
if len(val) > 0:
encoder.write_int(0)

From https://avro.apache.org/docs/1.11.1/specification/#arrays-1 : "A block with count zero indicates the end of the array."
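Arrays follow the same block layout as maps, which is what the hunk above implements: the length is written as the block count, the items follow, and a non-empty array needs the trailing zero-count block. A toy stdlib-only sketch for an array of longs (illustrative helpers, not the PR's code):

```python
def zigzag(n: int) -> bytes:
    """Avro int/long encoding: zigzag, then variable-length base-128."""
    n = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_long_array(vals: list) -> bytes:
    out = bytearray()
    out += zigzag(len(vals))  # block count (0 for an empty array)
    for v in vals:
        out += zigzag(v)
    if vals:
        out += zigzag(0)      # end-of-array marker, mirroring the fix above
    return bytes(out)
```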

@@ -100,14 +100,14 @@ def __repr__(self) -> str:
field_id=108,
name="column_sizes",
field_type=MapType(key_id=117, key_type=IntegerType(), value_id=118, value_type=LongType()),
required=True,
required=False,

See PR apache#7796

doc="Map of column id to total size on disk",
),
NestedField(
field_id=109,
name="value_counts",
field_type=MapType(key_id=119, key_type=IntegerType(), value_id=120, value_type=LongType()),
required=True,
required=False,

Same as above

@@ -192,7 +192,7 @@ def __init__(self, *data: Any, **named_data: Any) -> None:
NestedField(1, "snapshot_id", LongType(), required=False),
NestedField(3, "sequence_number", LongType(), required=False),
NestedField(4, "file_sequence_number", LongType(), required=False),
NestedField(2, "data_file", DATA_FILE_TYPE, required=False),
NestedField(2, "data_file", DATA_FILE_TYPE, required=True),

Same as above

@@ -76,7 +76,7 @@


class AvroSchemaConversion:
def avro_to_iceberg(self, avro_schema: dict[str, Any]) -> Schema:
def avro_to_iceberg(self, avro_schema: Dict[str, Any]) -> Schema:

With Python 3.8, the builtin dict wasn't recognized as a subscriptable type annotation; typing.Dict works on all supported versions.
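For reference, subscripting builtin generics like `dict[str, Any]` only works at runtime from Python 3.9 on (PEP 585); on 3.8 the annotation raises `TypeError: 'type' object is not subscriptable` when the `def` statement is evaluated, while `typing.Dict` works everywhere. A tiny sketch (`avro_to_iceberg_stub` is a hypothetical stand-in for the real method, not the PR's code):

```python
from typing import Any, Dict

# typing.Dict is subscriptable on every supported Python version;
# the builtin dict[...] form requires 3.9+ (or a deferred-annotations
# __future__ import, which only helps for annotations, not runtime use).
def avro_to_iceberg_stub(avro_schema: Dict[str, Any]) -> str:
    return avro_schema.get("type", "unknown")
```

An alternative on 3.8 is `from __future__ import annotations`, which defers annotation evaluation, but `typing.Dict` is the simpler fix here.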

@@ -335,6 +428,70 @@ def _handle_non_200_response(self, exc: HTTPError, error_handler: Dict[int, Type

raise exception(response) from exc


This function is basically a PoC for now. Since there are many table update operations, we should look at the Java API to decide whether this should be a single function with complex parameters or several simpler functions.
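For a sense of the shape such a call might take, the Iceberg REST catalog spec's commit request pairs a list of requirements (preconditions) with a list of updates. The payload below is a purely illustrative sketch of that shape with made-up values; it is not the PR's actual request:

```python
# Hypothetical CommitTableRequest-style payload (all values illustrative).
commit_request = {
    "requirements": [
        # precondition: fail the commit if the table changed underneath us
        {"type": "assert-table-uuid", "uuid": "11111111-2222-3333-4444-555555555555"},
    ],
    "updates": [
        {"action": "add-snapshot", "snapshot": {"snapshot-id": 1234567890}},
        {"action": "set-snapshot-ref", "ref-name": "main",
         "type": "branch", "snapshot-id": 1234567890},
    ],
}
```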

@@ -155,3 +156,6 @@ def __eq__(self, other: Any) -> bool:

def __repr__(self) -> str:
return f"{self.__class__.__name__}[{', '.join(f'{key}={repr(value)}' for key, value in self.__dict__.items() if not key.startswith('_'))}]"

def fields(self) -> List[str]:
return [self.__getattribute__(v) if hasattr(self, v) else None for v in self._position_to_field_name.values()]

I couldn't find a more elegant way to return an array of fields preserving declaration order and preserving None for unset optional fields.
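One small simplification that keeps the same behavior: `getattr` with a default collapses the `hasattr`/`__getattribute__` pair into a single call while still preserving declaration order and yielding None for unset optional fields. The `Record` class below is a toy stand-in, not the PR's class:

```python
class Record:
    # Toy declaration-order mapping, mimicking _position_to_field_name.
    _position_to_field_name = {0: "snapshot_id", 1: "data_file"}

    def __init__(self, **named):
        for key, value in named.items():
            setattr(self, key, value)

    def fields(self):
        # getattr's default argument handles the missing-attribute case,
        # so unset optional fields come back as None.
        return [getattr(self, name, None)
                for name in self._position_to_field_name.values()]
```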

Fokko (Owner) commented Jun 18, 2023

@maxdebayser May I make the following suggestion:

  • Cherry-pick my work on the Avro writer.
  • Apply the fixes.
  • Add a couple of tests where we read and write Avro files using our reader/writer, and also ones written by fastavro (just to be sure).

maxdebayser (Author)

Sure. Just to be clear, should I create a new branch off the apache/iceberg master, cherry pick the work from this branch over there and add the read/write tests?

Fokko (Owner) commented Jun 18, 2023

I would suggest the following:
