prevent adding duplicate files #1036
Conversation
Thanks for contributing this! A few minor comments
pyiceberg/table/__init__.py
Outdated
```python
unique_files = set(file_paths)
return len(unique_files) != len(file_paths)


def find_referenced_files(self, file_paths: List[str]) -> list[str]:
```
FYI there's also the `table.inspect.files()` API (https://py.iceberg.apache.org/api/#files), which returns all data files in the current snapshot.
I guess we only need to worry about the current snapshot. If a data file existed in previous snapshots, but not in the current one, we can still add the file.
pyiceberg/table/__init__.py
Outdated
```diff
@@ -621,6 +621,24 @@ def delete(self, delete_filter: Union[str, BooleanExpression], snapshot_properti
     if not delete_snapshot.files_affected and not delete_snapshot.rewrites_needed:
         warnings.warn("Delete operation did not match any records")

     def has_duplicates(self, file_paths: List[str]) -> bool:
```
nit: this is a static method; it may be better inlined.
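The inlining idea can be sketched as follows; this is a hypothetical standalone fragment, not the actual pyiceberg method:

```python
from typing import List


def add_files(file_paths: List[str]) -> None:
    # Duplicate check inlined instead of living in a has_duplicates helper:
    # a set drops repeated paths, so a length mismatch means duplicates.
    if len(set(file_paths)) != len(file_paths):
        raise ValueError("File paths must be unique")


add_files(["a.parquet", "b.parquet"])  # passes silently
```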
pyiceberg/table/__init__.py
Outdated
```diff
@@ -630,7 +648,15 @@ def add_files(self, file_paths: List[str], snapshot_properties: Dict[str, str] =

         Raises:
             FileNotFoundError: If the file does not exist.
             ValueError: Raises a ValueError in case file_paths is not unique
```
nit: "given file_paths contains duplicate files"
pyiceberg/table/__init__.py
Outdated
```diff
@@ -630,7 +648,15 @@ def add_files(self, file_paths: List[str], snapshot_properties: Dict[str, str] =

         Raises:
             FileNotFoundError: If the file does not exist.
             ValueError: Raises a ValueError in case file_paths is not unique
             ValueError: Raises a ValueError in case file already referenced in table
```
nit: "given file_paths already referenced by table"
Hi @amitgilad3 thanks again for putting together this PR! This will be a stellar safeguard to have on this new API.
I left a comment regarding the method definition, but the functionality looks 💯
pyiceberg/table/__init__.py
Outdated
```diff
@@ -621,6 +621,13 @@ def delete(self, delete_filter: Union[str, BooleanExpression], snapshot_properti
     if not delete_snapshot.files_affected and not delete_snapshot.rewrites_needed:
         warnings.warn("Delete operation did not match any records")

     def find_referenced_files(self, file_paths: List[str]) -> list[str]:
```
We want to be intentional about introducing public methods. Since this is just used by `add_files`, could we make this method private? The type notation of `list` in the return type also needs to be `typing.List` until we deprecate Python 3.8 support:

```diff
-    def find_referenced_files(self, file_paths: List[str]) -> list[str]:
+    def _find_referenced_files(self, file_paths: List[str]) -> List[str]:
```

An alternative is to just keep this logic within `add_files`, since this method isn't reused and is just 2 or 3 lines of code.
You are right, it makes more sense to move the logic into `add_files` since no one else uses it and it's only 3 lines of code.
pyiceberg/table/__init__.py
Outdated
```python
expr = pc.field("file_path").isin(file_paths)
referenced_files = [file["file_path"] for file in self._table.inspect.files().filter(expr).to_pylist()]

if referenced_files:
    raise ValueError(f"Cannot add files that are already referenced by table, files: {', '.join(referenced_files)}")
```
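The same check can be mimicked without a live table. In this pure-Python sketch, `existing` stands in for the file paths that `inspect.files()` would return; it is an assumed stand-in, not a real pyiceberg call:

```python
from typing import List


def find_referenced(file_paths: List[str], existing: List[str]) -> List[str]:
    # Plays the role of the pc.field("file_path").isin(...) filter above:
    # keep only the candidate paths the table already references.
    existing_set = set(existing)
    return [p for p in file_paths if p in existing_set]


referenced = find_referenced(
    ["s3://bucket/a.parquet", "s3://bucket/new.parquet"],
    ["s3://bucket/a.parquet", "s3://bucket/b.parquet"],
)
print(referenced)  # ['s3://bucket/a.parquet']
```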
Just one more suggestion here @amitgilad3 - should we make this behavior the default, but have a boolean flag available to disable it in the `add_files` API? A similar Spark procedure has an equivalent flag, `check_duplicate_files`, and I feel it could be useful because inspecting the files table requires us to read all of the active manifest files. If the user already knows that they won't be running into this issue, I think it would be useful for them to be able to disable this check.
So my suggestion is something like:

```python
def add_files(self, file_paths: List[str], snapshot_properties: Dict[str, str] = EMPTY_DICT, check_duplicate_files: bool = True) -> None:
```
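A hypothetical sketch of how the flag could gate the check; the append logic is elided and the function is standalone, not the real pyiceberg implementation:

```python
from typing import Dict, List

EMPTY_DICT: Dict[str, str] = {}


def add_files(
    file_paths: List[str],
    snapshot_properties: Dict[str, str] = EMPTY_DICT,
    check_duplicate_files: bool = True,
) -> List[str]:
    # With the default True, duplicate inputs are rejected up front;
    # passing False skips the check, mirroring Spark's
    # check_duplicate_files procedure argument.
    if check_duplicate_files and len(set(file_paths)) != len(file_paths):
        raise ValueError("File paths must be unique")
    return file_paths  # stand-in for the real append logic


add_files(["a.parquet", "a.parquet"], check_duplicate_files=False)  # allowed
```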
I also added the flag to the table API and added tests to make sure the flag works with `False`.
Thanks for the PR, I added a few comments.
tests/integration/test_add_files.py
Outdated
```python
with pytest.raises(ValueError) as exc_info:
    tbl.add_files(file_paths=[referenced_file])
assert f"Cannot add files that are already referenced by table, files: {referenced_file}" in str(exc_info.value)
```
nit: it's not clear to me how `referenced_file` is already "referenced".
```python
@pytest.mark.integration
def test_add_files_that_referenced_by_current_snapshot_with_check_duplicate_files_false(
```
nit: split this into 2 tests: one for the happy path, another for `check_duplicate_files=False`.

What happens when `check_duplicate_files=False` is set and we add files that are already referenced? Does this leave the table in an inconsistent (bad) state?
So when you set `check_duplicate_files` to `False` you are essentially taking responsibility for scenarios where duplicate files can be added; the default is to validate.
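A toy model of that trade-off (a stand-in class, not pyiceberg): with the check disabled, an already-referenced file is appended again, so its rows would effectively be counted twice:

```python
from typing import List


class FakeTable:
    # Toy stand-in for a table's referenced-file list.
    def __init__(self) -> None:
        self.files: List[str] = []

    def add_files(self, file_paths: List[str], check_duplicate_files: bool = True) -> None:
        if check_duplicate_files:
            dup = [p for p in file_paths if p in self.files]
            if dup:
                raise ValueError(
                    f"Cannot add files that are already referenced by table, files: {', '.join(dup)}"
                )
        self.files.extend(file_paths)


t = FakeTable()
t.add_files(["a.parquet"])
t.add_files(["a.parquet"], check_duplicate_files=False)
print(t.files)  # ['a.parquet', 'a.parquet']
```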
tests/integration/test_add_files.py
Outdated
Suggested change:

```diff
 @pytest.mark.integration
+@pytest.mark.parametrize("format_version", [1, 2])
 def test_add_files_with_duplicate_files_in_file_paths(spark: SparkSession, session_catalog: Catalog, format_version: int) -> None:
```
@amitgilad3 - this looks almost ready to merge. @kevinjqliu's made some great suggestions here, so I'm thinking that we can take another round of reviews once we take a pass through adopting those review comments.
Hey @sungwy + @kevinjqliu, again thanks for all the help and guidance, I went over all the comments and fixed them.

Hey @sungwy, fixed all the comments and tests pass. Is there anything else or can we merge?

I will take another pass at it today. Thank you @amitgilad3!
* prevent add_files from adding a file that's already referenced by the iceberg table
* fix method that searches files that are already referenced + docs
* move function to locate duplicate files into add_files
* add check_duplicate_files flag to add_files api to make the behaviour match the java api
* add check_duplicate_files flag to table level api and add tests
* add check_duplicate_files flag to table level api and add tests
* fix tests to check the newly added check_duplicate_files flag and fix checks
* fix linting
This resolves #998, where duplicate files are added with the add_files method. It handles 2 cases: