Add metadata tables for data_files and delete_files #1066

Merged: 4 commits into apache:main from soumya-ghosh:feat/metadata_tables on Sep 20, 2024

Conversation

soumya-ghosh (Contributor):

Implements metadata tables for data_files and delete_files - #1053

Reused the logic of the `files` table to derive `data_files` and `delete_files`. Also reused the test cases for `files` (`test_inspect_files`), since the schema is the same as `files`.

@soumya-ghosh soumya-ghosh marked this pull request as ready for review August 15, 2024 22:25
@@ -4365,6 +4365,10 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType:
        for manifest_list in snapshot.manifests(io):
            for manifest_entry in manifest_list.fetch_manifest_entry(io):
                data_file = manifest_entry.data_file
                if file_content_type == "data" and data_file.content != DataFileContent.DATA:
                    continue
                if file_content_type == "delete" and data_file.content == DataFileContent.DATA:
                    continue
ndrluis (Collaborator), Aug 16, 2024:

What do you think about moving the file_content_type options to constants?

soumya-ghosh (Contributor, Author):

Updated the signature of `_files()` to use the `DataFileContent` enum; had to use a list, as there are two content types in the case of delete files.
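
For illustration, a minimal sketch of the filtering described above. The helper and parameter names here are hypothetical, inferred from this thread rather than taken from the merged code:

    from typing import Iterable, List, Optional

    from pyiceberg.manifest import DataFileContent, ManifestEntry

    def _filter_by_content(
        entries: Iterable[ManifestEntry],
        data_file_filter: Optional[List[DataFileContent]] = None,
    ) -> Iterable[ManifestEntry]:
        # Yield only entries whose content type is in the filter.
        # data_files would pass [DataFileContent.DATA]; delete_files would pass
        # [DataFileContent.POSITION_DELETES, DataFileContent.EQUALITY_DELETES],
        # which is why a list is needed: there are two delete content types.
        for entry in entries:
            if data_file_filter is not None and entry.data_file.content not in data_file_filter:
                continue
            yield entry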

"nan_value_counts": dict(data_file.nan_value_counts),
"lower_bounds": dict(data_file.lower_bounds),
"upper_bounds": dict(data_file.upper_bounds),
"column_sizes": dict(data_file.column_sizes) if data_file.column_sizes is not None else None,
Collaborator:

I believe that using dict(data_file.column_sizes or {}) will achieve the same behavior without the need for an if statement. What do you think?

soumya-ghosh (Contributor, Author):

Agreed.
But these values are null when queried through Spark (Spark 3.5, Iceberg 1.5.0), hence I am assigning them None here.
(Screenshot, 2024-08-16: Spark query output showing null values for these columns.)
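
For clarity on the distinction being made here: the two expressions agree when the map is present but differ when it is absent, and only the explicit check preserves the null that Spark shows:

    column_sizes = None  # e.g. a data file written without column-size metadata

    dict(column_sizes) if column_sizes is not None else None  # -> None (null, matches Spark)
    dict(column_sizes or {})                                  # -> {} (empty map, not null)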

kevinjqliu (Contributor):

nit: push this logic to the line above

                column_sizes = data_file.column_sizes or {}

kevinjqliu (Contributor):

Also, I wonder if the null/None behavior is due to the `.toPandas()` conversion when queried through Spark.

@ndrluis ndrluis added this to the PyIceberg 0.8.0 release milestone Aug 22, 2024
@ndrluis ndrluis requested review from kevinjqliu and sungwy August 29, 2024 15:29
kevinjqliu (Contributor) left a comment:

added a few comments

"nan_value_counts": dict(data_file.nan_value_counts),
"lower_bounds": dict(data_file.lower_bounds),
"upper_bounds": dict(data_file.upper_bounds),
"column_sizes": dict(data_file.column_sizes) if data_file.column_sizes is not None else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: push this logic to the line above

                column_sizes = data_file.column_sizes or {}

"nan_value_counts": dict(data_file.nan_value_counts),
"lower_bounds": dict(data_file.lower_bounds),
"upper_bounds": dict(data_file.upper_bounds),
"column_sizes": dict(data_file.column_sizes) if data_file.column_sizes is not None else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i wonder if the null/None behavior is due to the .toPandas() conversion when queried through Spark

Comment on lines +675 to +679
    # configure table properties
    if format_version == 2:
        with tbl.transaction() as txn:
            txn.set_properties({"write.delete.mode": "merge-on-read"})
    spark.sql(f"DELETE FROM {identifier} WHERE int = 1")
Contributor:

is this to produce delete files?

soumya-ghosh (Contributor, Author):

Yes
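
(Context, as general Iceberg behavior rather than something stated in this thread: under the default `write.delete.mode` of `copy-on-write`, a DELETE rewrites the affected data files and produces no delete files; `merge-on-read` instead makes Spark write positional delete files, giving `delete_files` entries to return.)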

soumya-ghosh (Contributor, Author):

Regarding #1066 (comment): I have checked in the Spark shell as well; the values are null.

Contributor:

do you know if this produces positional deletes, equality deletes, or both?

soumya-ghosh (Contributor, Author):

I think Spark only produces positional delete files. Flink might produce equality delete files.

Contributor:

Hm, good point. I can't find any reference to Spark producing equality deletes.

soumya-ghosh (Contributor, Author):

Spark streaming might produce equality deletes. I tried with Flink: I created an Iceberg table and upserted some data into it, and observed both positional and equality delete files, which looked weird to me.

Contributor:

This was a nit comment, btw, not blocking. We don't necessarily have to test for equality delete files here.

soumya-ghosh (Contributor, Author):

Alright. Other than this one, all other comments are resolved.

Collaborator:

> Spark only produces positional delete files. Flink might produce equality delete files.

That's my understanding as well. I think just testing for positional deletes is okay for now, and maybe we can think of ways to add equality deletes to our test suite in the future.
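
To close the loop for readers, a minimal usage sketch of the two metadata tables this PR adds. It assumes the same `inspect` entry point as the existing `files` table; per the PR description, all three share the `files` schema:

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default")
    tbl = catalog.load_table("db.my_table")

    # Both return pyarrow.Table objects with the same schema as
    # tbl.inspect.files(), restricted to data files / delete files.
    data_files = tbl.inspect.data_files()
    delete_files = tbl.inspect.delete_files()

    # The "content" column distinguishes positional (1) from equality (2) deletes.
    print(delete_files.column("content"))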

@kevinjqliu kevinjqliu requested a review from Fokko September 5, 2024 22:16
soumya-ghosh (Contributor, Author):

@kevinjqliu could you approve the workflow run for this? I've updated this PR to incorporate the refactored `__init__.py`. Also wanted to check whether the integration tests are failing here.

soumya-ghosh (Contributor, Author):

@sungwy @Fokko could you review this?

kevinjqliu (Contributor):

Thanks again @soumya-ghosh. I'll wait for another approval before merging.

sungwy (Collaborator), Sep 20, 2024:

Thanks for making this contribution, @soumya-ghosh! And thank you @kevinjqliu and @ndrluis for the detailed reviews!

@sungwy sungwy merged commit 41a3c8e into apache:main Sep 20, 2024
8 checks passed
@soumya-ghosh soumya-ghosh deleted the feat/metadata_tables branch October 20, 2024 18:09
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* Add metadata tables for data_files and delete_files

* Update API docs for `data_files` and `delete_files`

* Update method signature of `_files()`

* Migrate implementation of files() table from __init__.py