
prevent adding duplicate files #1036

Merged: 8 commits merged into apache:main on Aug 26, 2024

Conversation

amitgilad3 (Contributor)

This resolves #998, where duplicate files could be added via the add_files method. It handles two cases:

  1. The file_paths list itself contains duplicates
  2. One of the files being added is already referenced by the current snapshot

@kevinjqliu (Contributor) left a comment:

Thanks for contributing this! A few minor comments.

unique_files = set(file_paths)
return len(unique_files) != len(file_paths)

def find_referenced_files(self, file_paths: List[str]) -> list[str]:
Contributor:

FYI, there's also the table.inspect.files() API (https://py.iceberg.apache.org/api/#files), which returns all data files in the current snapshot.
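A minimal sketch of what that check could look like with inspect.files() (assuming tbl is a loaded pyiceberg Table and candidate_paths is a hypothetical list of paths being added):

import pyarrow.compute as pc

# inspect.files() returns a pyarrow.Table describing every data file
# in the current snapshot, including a "file_path" column.
files_table = tbl.inspect.files()

# Keep only the rows whose file_path appears in the candidate list.
expr = pc.field("file_path").isin(candidate_paths)
already_referenced = files_table.filter(expr).column("file_path").to_pylist()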

Contributor:

I guess we only need to worry about the current snapshot.
If a data file existed in previous snapshots, but not in the current, we can still add the file.

@@ -621,6 +621,24 @@ def delete(self, delete_filter: Union[str, BooleanExpression], snapshot_properti
if not delete_snapshot.files_affected and not delete_snapshot.rewrites_needed:
warnings.warn("Delete operation did not match any records")

def has_duplicates(self, file_paths: List[str]) -> bool:
Contributor:

nit: this is effectively a static method; it may be better inlined.
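
For illustration, the inlined form being suggested might look like this (a sketch, reusing the file_paths parameter from the diff above):

# Reject the call early if the incoming list repeats any path.
if len(file_paths) != len(set(file_paths)):
    raise ValueError("File paths must be unique")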

@@ -630,7 +648,15 @@ def add_files(self, file_paths: List[str], snapshot_properties: Dict[str, str] =

Raises:
FileNotFoundError: If the file does not exist.
ValueError: Raises a ValueError in case file_paths is not unique
Contributor:

nit: suggest wording this as "given file_paths contains duplicate files"

@@ -630,7 +648,15 @@ def add_files(self, file_paths: List[str], snapshot_properties: Dict[str, str] =

Raises:
FileNotFoundError: If the file does not exist.
ValueError: Raises a ValueError in case file_paths is not unique
ValueError: Raises a ValueError in case file already referenced in table
Contributor:

nit: suggest wording this as "given file_paths already referenced by table"

@sungwy (Collaborator) left a comment:

Hi @amitgilad3 thanks again for putting together this PR! This will be a stellar safeguard to have on this new API.

I left a comment regarding the method definition, but the functionality looks 💯

@@ -621,6 +621,13 @@ def delete(self, delete_filter: Union[str, BooleanExpression], snapshot_properti
if not delete_snapshot.files_affected and not delete_snapshot.rewrites_needed:
warnings.warn("Delete operation did not match any records")

def find_referenced_files(self, file_paths: List[str]) -> list[str]:
Collaborator:

We want to be intentional about introducing public methods. Since this is only used by add_files, could we make this method private? The lowercase list in the return type also needs to be typing.List until we deprecate Python 3.8 support:

Suggested change
def find_referenced_files(self, file_paths: List[str]) -> list[str]:
def _find_referenced_files(self, file_paths: List[str]) -> List[str]:

An alternative is to just keep this logic within add_files, since this method isn't reused and is only 2 or 3 lines of code.
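
For background on the typing.List point above, a minimal illustration (hypothetical snippet showing Python 3.8 behavior):

from typing import List

# On Python 3.8, subscripting the builtin (e.g. -> list[str]) raises
# TypeError at function-definition time, unless postponed annotations
# are enabled; typing.List[str] works on all supported versions.
def _find_referenced_files(file_paths: List[str]) -> List[str]:
    ...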

@amitgilad3 (Contributor, Author) commented Aug 12, 2024:

You are right, it makes more sense to move the logic into add_files since nothing else uses it and it's only 3 lines of code.

Comment on lines 641 to 645
expr = pc.field("file_path").isin(file_paths)
referenced_files = [file["file_path"] for file in self._table.inspect.files().filter(expr).to_pylist()]

if referenced_files:
raise ValueError(f"Cannot add files that are already referenced by table, files: {', '.join(referenced_files)}")
Collaborator:

Just one more suggestion here @amitgilad3 - should we make this behavior the default, but have a boolean flag available to disable it in the add_files API?

A similar Spark procedure has an equivalent flag, check_duplicate_files, and I feel it could be useful because inspecting the files table requires reading all of the active manifest files. If the user already knows they won't run into this issue, it would be useful for them to be able to disable this check.

Collaborator:

So my suggestion is something like:

def add_files(self, file_paths: List[str], snapshot_properties: Dict[str, str] = EMPTY_DICT, check_duplicate_files: bool = True) -> None:
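
A sketch of how that flag might gate the check inside add_files (the merged implementation may differ in detail; the body reuses the snippet under review above):

if check_duplicate_files:
    # Only pay the cost of scanning the active manifests when the
    # caller has not opted out of the duplicate check.
    expr = pc.field("file_path").isin(file_paths)
    referenced_files = [f["file_path"] for f in self._table.inspect.files().filter(expr).to_pylist()]
    if referenced_files:
        raise ValueError(f"Cannot add files that are already referenced by table, files: {', '.join(referenced_files)}")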

@amitgilad3 (Contributor, Author):

I also added the flag to the Table API, and added tests to make sure the flag works with False.

@kevinjqliu (Contributor) left a comment:

Thanks for the PR, I added a few comments.

Comment on lines 776 to 778
with pytest.raises(ValueError) as exc_info:
tbl.add_files(file_paths=[referenced_file])
assert f"Cannot add files that are already referenced by table, files: {referenced_file}" in str(exc_info.value)
Contributor:

nit: it's not clear to me how referenced_file is already "referenced"



@pytest.mark.integration
def test_add_files_that_referenced_by_current_snapshot_with_check_duplicate_files_false(
Contributor:

nit: split this into two tests: one for the happy path, another for check_duplicate_files=False.

What happens when check_duplicate_files=False is set and we add files that are already referenced? Does this leave the table in an inconsistent (bad) state?

@amitgilad3 (Contributor, Author):

When you set check_duplicate_files to False, you are essentially taking responsibility for scenarios where duplicate files can be added; the default is to validate.
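
In other words, usage might look like this (hypothetical table and paths):

# The caller opts out of the manifest scan and takes responsibility
# for guaranteeing that none of these paths are already referenced.
tbl.add_files(file_paths=new_parquet_paths, check_duplicate_files=False)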

Comment on lines 737 to 738
@pytest.mark.integration
def test_add_files_with_duplicate_files_in_file_paths(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
Collaborator:

Suggested change
@pytest.mark.integration
def test_add_files_with_duplicate_files_in_file_paths(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
@pytest.mark.integration
@pytest.mark.parametrize("format_version", [1, 2])
def test_add_files_with_duplicate_files_in_file_paths(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:

@sungwy (Collaborator) commented Aug 15, 2024:

@amitgilad3 - this looks almost ready to merge. @kevinjqliu has made some great suggestions here, so I'm thinking we can take another round of reviews once we've taken a pass through adopting those review comments.

@amitgilad3 (Contributor, Author):

Hey @sungwy + @kevinjqliu, thanks again for all the help and guidance. I went over all the comments and fixed them.

@amitgilad3 (Contributor, Author):

Hey @sungwy, I fixed all the comments and the tests pass. Is there anything else, or can we merge?

@sungwy (Collaborator) commented Aug 23, 2024:

> Hey @sungwy, I fixed all the comments and the tests pass. Is there anything else, or can we merge?

I will take another pass at it today. Thank you @amitgilad3!

@sungwy sungwy merged commit 53a0b73 into apache:main Aug 26, 2024
7 checks passed
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* prevent add_files from adding a file that's already referenced by the iceberg table

* fix method that searches files that are already referenced + docs

* move function to locate duplicate files into add_files

* add check_duplicate_files flag to add_files api to align the behaviour with the java api

* add check_duplicate_files flag to table level api and add tests

* add check_duplicate_files flag to table level api and add tests

* fix tests to check the newly added flag check_duplicate_files and fix checks

* fix linting
Closes: Prevent add_files from adding a file that's already referenced by the Iceberg Table (#998)