Add metadata tables for data_files and delete_files #1066

Merged: 4 commits into apache:main from soumya-ghosh:feat/metadata_tables on Sep 20, 2024

Conversation

soumya-ghosh (Contributor):

Implements metadata tables for data_files and delete_files - #1053

Reused the logic of the `files` table to derive `data_files` and `delete_files`. Also reused the test cases for `files` (`test_inspect_files`), since the schema is the same as `files`.

@soumya-ghosh soumya-ghosh marked this pull request as ready for review August 15, 2024 22:25
@@ -4365,6 +4365,10 @@ def _readable_metrics_struct(bound_type: PrimitiveType) -> pa.StructType:
        for manifest_list in snapshot.manifests(io):
            for manifest_entry in manifest_list.fetch_manifest_entry(io):
                data_file = manifest_entry.data_file
                if file_content_type == "data" and data_file.content != DataFileContent.DATA:
                    continue
                if file_content_type == "delete" and data_file.content == DataFileContent.DATA:
                    continue
ndrluis (Collaborator), Aug 16, 2024:

What do you think about moving the file_content_type options to constants?

soumya-ghosh (Contributor, Author):

Updated the signature of `_files()` to use the `DataFileContent` enum; had to use a list, as there are two content types in the case of delete files.
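
For illustration, a minimal sketch of the filtering described above. The helper and parameter names here are hypothetical, inferred from this thread rather than taken from the merged code:

    from typing import Iterable, List, Optional

    from pyiceberg.manifest import DataFileContent, ManifestEntry

    def _filter_by_content(
        entries: Iterable[ManifestEntry],
        data_file_filter: Optional[List[DataFileContent]] = None,
    ) -> Iterable[ManifestEntry]:
        # Yield only entries whose content type is in the filter.
        # data_files would pass [DataFileContent.DATA]; delete_files would pass
        # [DataFileContent.POSITION_DELETES, DataFileContent.EQUALITY_DELETES],
        # which is why a list is needed: there are two delete content types.
        for entry in entries:
            if data_file_filter is not None and entry.data_file.content not in data_file_filter:
                continue
            yield entry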

"nan_value_counts": dict(data_file.nan_value_counts),
"lower_bounds": dict(data_file.lower_bounds),
"upper_bounds": dict(data_file.upper_bounds),
"column_sizes": dict(data_file.column_sizes) if data_file.column_sizes is not None else None,
Collaborator:

I believe that using dict(data_file.column_sizes or {}) will achieve the same behavior without the need for an if statement. What do you think?

soumya-ghosh (Contributor, Author):

Agreed.
But these values are null when queried through Spark (Spark 3.5, Iceberg 1.5.0), hence I am assigning them None here.
(Screenshot, 2024-08-16: Spark query output showing null values for these columns.)
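
For clarity on the distinction being made here: the two expressions agree when the map is present but differ when it is absent, and only the explicit check preserves the null that Spark shows:

    column_sizes = None  # e.g. a data file written without column-size metadata

    dict(column_sizes) if column_sizes is not None else None  # -> None (null, matches Spark)
    dict(column_sizes or {})                                  # -> {} (empty map, not null)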

kevinjqliu (Contributor):

nit: push this logic to the line above

                column_sizes = data_file.column_sizes or {}

kevinjqliu (Contributor):

Also, I wonder if the null/None behavior is due to the `.toPandas()` conversion when queried through Spark.

@ndrluis ndrluis added this to the PyIceberg 0.8.0 release milestone Aug 22, 2024
@ndrluis ndrluis requested review from kevinjqliu and sungwy August 29, 2024 15:29
kevinjqliu (Contributor) left a comment:

added a few comments

"nan_value_counts": dict(data_file.nan_value_counts),
"lower_bounds": dict(data_file.lower_bounds),
"upper_bounds": dict(data_file.upper_bounds),
"column_sizes": dict(data_file.column_sizes) if data_file.column_sizes is not None else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: push this logic to the line above

                column_sizes = data_file.column_sizes or {}

"nan_value_counts": dict(data_file.nan_value_counts),
"lower_bounds": dict(data_file.lower_bounds),
"upper_bounds": dict(data_file.upper_bounds),
"column_sizes": dict(data_file.column_sizes) if data_file.column_sizes is not None else None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also i wonder if the null/None behavior is due to the .toPandas() conversion when queried through Spark

Comment on lines +675 to +679
    # configure table properties
    if format_version == 2:
        with tbl.transaction() as txn:
            txn.set_properties({"write.delete.mode": "merge-on-read"})
    spark.sql(f"DELETE FROM {identifier} WHERE int = 1")
Contributor:

is this to produce delete files?

soumya-ghosh (Contributor, Author):

Yes
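
(Context, as general Iceberg behavior rather than something stated in this thread: under the default `write.delete.mode` of `copy-on-write`, a DELETE rewrites the affected data files and produces no delete files; `merge-on-read` instead makes Spark write positional delete files, giving `delete_files` entries to return.)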

soumya-ghosh (Contributor, Author):

Regarding #1066 (comment): I have checked in the Spark shell as well; the values are null.

Contributor:

do you know if this produces positional deletes, equality deletes, or both?

soumya-ghosh (Contributor, Author):

I think Spark only produces positional delete files. Flink might produce equality delete files.

Contributor:

Hm, good point. I can't find any reference to Spark producing equality deletes.

soumya-ghosh (Contributor, Author):

Spark streaming might produce equality deletes. I tried with Flink: I created an Iceberg table and upserted some data into it, and observed both positional and equality delete files, which looked weird to me.

Contributor:

This was a nit comment, btw, not blocking. We don't necessarily have to test for equality delete files here.

soumya-ghosh (Contributor, Author):

Alright. Other than this one, all other comments are resolved.

Collaborator:

> Spark only produces positional delete files. Flink might produce equality delete files.

That's my understanding as well. I think just testing for positional deletes is okay for now, and maybe we can think of ways to add equality deletes to our test suite in the future.
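
To close the loop for readers, a minimal usage sketch of the two metadata tables this PR adds. It assumes the same `inspect` entry point as the existing `files` table; per the PR description, all three share the `files` schema:

    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("default")
    tbl = catalog.load_table("db.my_table")

    # Both return pyarrow.Table objects with the same schema as
    # tbl.inspect.files(), restricted to data files / delete files.
    data_files = tbl.inspect.data_files()
    delete_files = tbl.inspect.delete_files()

    # The "content" column distinguishes positional (1) from equality (2) deletes.
    print(delete_files.column("content"))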

@kevinjqliu kevinjqliu requested a review from Fokko September 5, 2024 22:16
soumya-ghosh (Contributor, Author):

@kevinjqliu could you approve the workflow run for this? I've updated this PR to incorporate the refactored `__init__.py`. Also wanted to check whether the integration tests are failing here.

soumya-ghosh (Contributor, Author):

@sungwy @Fokko could you review this?

kevinjqliu (Contributor):

Thanks again @soumya-ghosh. I'll wait for another approval before merging.

sungwy (Collaborator), Sep 20, 2024:

Thanks for making this contribution, @soumya-ghosh! And thank you @kevinjqliu and @ndrluis for the detailed reviews!

@sungwy sungwy merged commit 41a3c8e into apache:main Sep 20, 2024
8 checks passed
@soumya-ghosh soumya-ghosh deleted the feat/metadata_tables branch October 20, 2024 18:09
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* Add metadata tables for data_files and delete_files

* Update API docs for `data_files` and `delete_files`

* Update method signature of `_files()`

* Migrate implementation of files() table from __init__.py