
Python: TableScan Plan files API implementation without residual evaluation #6069

Closed

Conversation

dhruv-pratap
Contributor

Taking a first dig at the TableScan Plan Files API. The implementation currently evaluates partition filters but does not yet evaluate row-level filters, as that requires projections.

The API is exposed by the CLI tool and lists all the data files that would need to be scanned in order to execute the scan plan.

[Screenshot: CLI output listing the data files selected for the scan]

High-level changes done as part of this PR (a usage sketch follows the list):

  1. Add a new_scan() API to the Table interface.
  2. Add a TableScan interface with just a plan_files abstract method.
  3. Add a FileScanTask model that represents a scan task over the existing DataFile model.
  4. Add a DataTableScan class that extends TableScan and implements the plan_files API to return a collection of FileScanTasks, using the existing _ManifestEvalVisitor to perform partition pruning.
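
Put together, the intended usage would look roughly like this. The catalog name and table identifier are made up, and the calls follow the list above rather than a finalized API:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # illustrative catalog name
table = catalog.load_table("db.events")  # illustrative identifier

# new_scan() returns a DataTableScan; plan_files() yields a FileScanTask
# for every data file that survives partition pruning.
for task in table.new_scan().plan_files():
    print(task.data_file.file_path)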

@click.argument("identifier")
@click.pass_context
@catch_exception()
def scan_table(ctx: Context, identifier: str):
Contributor

I like that we can easily visualize the files. I think we should merge this one with the files command since they are so similar. WDYT?

➜  python git:(master) ✗ pyiceberg --verbose true --catalog fokko files hive_test.h1
Snapshots: fokko.hive_test.h1
└── Snapshot 4362649644256929116, schema 0: 
    s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/snap-4362649644256929116-1-63e88d4d-1b30-4cae-b7a0-f53b0b54758a.avro
    ├── Manifest: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/63e88d4d-1b30-4cae-b7a0-f53b0b54758a-m0.avro
    │   └── Datafile: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/data/0bfca5ff//00000-0-dweeks_20220901095219_f181f97a-43ce-4aa6-8cec-6956d83b7a42-job_local2002913683_0002-00001.parquet
    └── Manifest: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/508fd9ac-520f-47d8-900f-00f1dda051ac-m0.avro
        └── Datafile: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/data/3d030bcf/00000-0-44897130-c494-4dca-a164-e874c1e37c36-00001.parquet

Contributor Author

Not sure I follow you, @Fokko. Do you mean making scan a sub-command of the files command, or just having the files command use the table.new_scan().plan_files() API underneath?

If it's the latter, I like that.

Contributor

I think the latter is more flexible, but then we should remove the files command because they have so much overlap.

Contributor Author

The only difference is that files currently lists all snapshots, while the table scan's plan_files() will only list files for the snapshot being used for the scan.

Contributor Author

Maybe we can make files support three options (sketched below):

  • files --snapshot=all - The current files behavior
  • files --snapshot=current - Files under the current snapshot
  • files --snapshot=<snapshot_id> - Files under the snapshot_id specified
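
A minimal sketch of what that could look like with click; the option wiring is hypothetical, and the real files command would resolve the table from the catalog first:

import click


@click.command()
@click.argument("identifier")
@click.option("--snapshot", default="all", help="'all', 'current', or a snapshot id")
def files(identifier: str, snapshot: str) -> None:
    if snapshot == "all":
        ...  # current behavior: list files across every snapshot
    elif snapshot == "current":
        ...  # only files under the current snapshot
    else:
        ...  # interpret the value as a snapshot id and list its files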

Contributor

Why not move it to the scan subcommand? This way we don't have similar commands

Contributor

@Fokko left a comment

Thanks for working on this @dhruv-pratap I have some suggestions, let me know what you think!

@@ -14,9 +14,11 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
from __future__ import annotations
Contributor

As a community, we should probably make a decision on this one. Sometimes we use it, but not all the time. Personally, I find Optional[str] easier to read than str | None.

Contributor Author

I like Optional[str] too, but for some reason when I run make lint locally it turns it into the latter format, i.e. str | None. Do we need to tweak the linter config to enforce one format over the other?

Contributor

This is because of the import statement. PyUpgrade will automatically rewrite the imports: https://github.com/asottile/pyupgrade#pep-604-typing-rewrites. We can change this in another PR.
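
For illustration, with the __future__ import present pyupgrade rewrites Optional annotations into PEP 604 unions, roughly like this (the function is made up):

from __future__ import annotations

from typing import Optional


# Before the rewrite:
def lower_bound(value: Optional[str] = None) -> Optional[str]:
    return value


# After pyupgrade's PEP 604 rewrite:
def lower_bound_rewritten(value: str | None = None) -> str | None:
    return value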

Contributor Author

I've reverted to not using from __future__ import annotations in a recent commit.

Contributor

It should be fixed now that #6114 is in. Delayed loading of the annotations (that's what the import does) can be helpful sometimes :)

dhruv-pratap and others added 3 commits October 31, 2022 11:35
…gic to copy and rewrite a table's metadata along with its data. This logic can be used to stage test table in a temporary directory to perform scan operations on it.
@dhruv-pratap
Contributor Author

dhruv-pratap commented Nov 1, 2022

Added tests for table scan with partition filters. Also added logic to copy and rewrite a table's metadata along with its data; this can be used to stage a test table in a temporary directory and perform scan operations on it. This seems to be a necessity in the absence of write-path APIs for staging test tables to exercise the read-path APIs.

Happy to hear suggestions if there are alternatives for testing this.

CC: @Fokko @samredai

@dhruv-pratap changed the title from "Python: TableScan Plan files API implementation without row-level filters evaluation" to "Python: TableScan Plan files API implementation without residual evaluation" on Nov 3, 2022
self, io: FileIO, table: Table, snapshot: Optional[Snapshot] = None, expression: Optional[BooleanExpression] = ALWAYS_TRUE
):
self.io = io
super().__init__(table, snapshot, expression)
Contributor

Nit: the convention is to call super() first, and then do the rest.
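
Applied to the snippet above, the reordered initializer would be:

def __init__(
    self, io: FileIO, table: Table, snapshot: Optional[Snapshot] = None, expression: Optional[BooleanExpression] = ALWAYS_TRUE
):
    super().__init__(table, snapshot, expression)  # call the parent initializer first
    self.io = io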

Contributor Author

Will make the change.



@pytest.fixture
def temperatures_table() -> Table:
Contributor

I'm hesitant to keep actual binaries in the repository. We already generate the manifest list and manifest itself:

@pytest.fixture(scope="session")
def generated_manifest_file_file(
    avro_schema_manifest_file: Dict[str, Any], generated_manifest_entry_file: str
) -> Generator[str, None, None]:
    from fastavro import parse_schema, writer

    parsed_schema = parse_schema(avro_schema_manifest_file)
    # Make sure that a valid manifest_path is set
    manifest_file_records[0]["manifest_path"] = generated_manifest_entry_file
    with TemporaryDirectory() as tmpdir:
        tmp_avro_file = tmpdir + "/manifest.avro"
        with open(tmp_avro_file, "wb") as out:
            writer(out, parsed_schema, manifest_file_records)
        yield tmp_avro_file
How would you feel about generating some Parquet files too?
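
For instance, a Parquet data file could be generated on the fly in the same spirit; a sketch with made-up columns, assuming pyarrow is available in the test environment:

from tempfile import TemporaryDirectory

import pyarrow as pa
import pyarrow.parquet as pq

with TemporaryDirectory() as tmpdir:
    # Illustrative schema and rows; a real fixture would mirror the test table.
    arrow_table = pa.table({"city": ["Amsterdam", "Oslo"], "temperature": [21.3, 9.8]})
    tmp_parquet_file = tmpdir + "/data.parquet"
    pq.write_table(arrow_table, tmp_parquet_file)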

Contributor

Yes, this should generate a sample table rather than putting binaries in the repo.

Initially, we can build all of this based on just Avro and add Parquet tests later.

Contributor Author

Alright, let me take a look at that example, reverse engineer the binaries, and remove them from the repo.

@@ -526,7 +528,7 @@ def visit_equal(self, term: BoundTerm, literal: Literal[Any]) -> bool:
         if lower > literal.value:
             return ROWS_CANNOT_MATCH
 
-        upper = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
Contributor

@Fokko, can you check why tests didn't catch this case? We should ensure this is tested properly and separate the fix into its own PR.

return self.data_file.file_size_in_bytes


class TableScan(ABC):
Contributor

Why does this API not match the Java API? Is there something more pythonic about this?

Contributor Author

Just wanted to build the API incrementally and add things as and when needed; otherwise it becomes a bloated PR.

from pyiceberg.table import PartitionSpec, Snapshot, Table
from pyiceberg.utils.iceberg_base_model import IcebergBaseModel

ALWAYS_TRUE = AlwaysTrue()
Contributor

I'd prefer not adding these constants everywhere. Just use AlwaysTrue() in place of this. It's a singleton.
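
Since AlwaysTrue is a singleton, constructing it inline always returns the same instance; assuming the pyiceberg.expressions import path:

from pyiceberg.expressions import AlwaysTrue

# Every call returns the same singleton instance, so a module-level
# ALWAYS_TRUE constant adds nothing:
assert AlwaysTrue() is AlwaysTrue()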

raise ValueError("Unable to resolve to a Snapshot to use for the table scan.")

@abstractmethod
def plan_files(self) -> Iterable[FileScanTask]:
Contributor

In Java, there are several different types of tasks. I'd recommend using the same pattern and not requiring this to be FileScanTask. Introduce an abstract ScanTask and return an iterable of that instead.
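
A sketch of the suggested shape, with ScanTask as the new abstraction and FileScanTask as one concrete task type, mirroring the Java hierarchy:

from abc import ABC, abstractmethod
from typing import Iterable


class ScanTask(ABC):
    """Abstract base for all scan task types."""


class FileScanTask(ScanTask):
    """A scan task over a single data file."""


class TableScan(ABC):
    @abstractmethod
    def plan_files(self) -> Iterable[ScanTask]:
        """Yield scan tasks; callers should not assume every task is a FileScanTask."""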

@rdblue
Contributor

rdblue commented Nov 6, 2022

@Fokko, @dhruv-pratap, I posted an alternative scan API in a draft as #6131. Please take a look. That behaves like this one and allows you to specify optional arguments when creating a scan, but also allows the same syntax that Java uses.

@dhruv-pratap
Contributor Author

dhruv-pratap commented Nov 7, 2022

In retrospect, I think this is becoming too large of a PR and would benefit from being broken down into smaller tasks. I'm going to go ahead and close this PR, and if you're on board I can create issues for the following:

  1. Add a ScanTask interface.
  2. Add a FileScanTask class that extends the ScanTask interface.
  3. Fix the _ManifestEvalVisitor.visit_equal typo (#6117).
  4. Add a MetricsEvalVisitor that evaluates an expression on a DataFile to test whether rows in the file may match.
  5. Add a ResidualEvalVisitor that evaluates the residuals for an expression given the partitions in a PartitionSpec.
  6. Add a TableScan interface. @rdblue just picked this up in #6131 (Python: Add initial TableScan implementation).
  7. Add a DataTableScan class that extends TableScan and implements the plan_files() API.
  8. Expose the plan_files API as a scan subcommand via the CLI.

CC: @Fokko @rdblue @samredai

@Fokko
Contributor

Fokko commented Nov 7, 2022

Hey @dhruv-pratap, that makes a lot of sense. Maybe we should create issues for the list you mentioned above, to make sure we're aligned on who's working on what. Smaller PRs make it much easier to get stuff merged.

@rdblue
Contributor

rdblue commented Nov 21, 2022

@dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add MetricsEvalVisitor to prune by column stats, (5) add ResidualEvalVisitor to produce residuals (LOW priority), and (8) expose plan_files via the CLI.

For the last one, we may just want to add filters to the existing files command?

@dhruv-pratap
Contributor Author

> @dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add MetricsEvalVisitor to prune by column stats, (5) add ResidualEvalVisitor to produce residuals (LOW priority), and (8) expose plan_files via the CLI.
>
> For the last one, we may just want to add filters to the existing files command?

@rdblue We discussed this more in this thread. We can extend files with three options:

  • files --snapshot=all - The current files behavior
  • files --snapshot=current - Files under the current snapshot
  • files --snapshot=<snapshot_id> - Files under the snapshot_id specified

Adding an expression option would be a bigger undertaking, though, as it would require an expression parser. Not sure if we want to prioritize that for the CLI at this point.

@rdblue
Contributor

rdblue commented Jan 8, 2023

> Adding an expression option would be a bigger undertaking, though, as it would require an expression parser. Not sure if we want to prioritize that for the CLI at this point.

We added an expression parser a few weeks ago, so this would be unblocked. I'm all for updating the CLI with your suggestions.
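
With the parser in, the CLI could turn a filter string into an expression before planning. A sketch, where the parse entry point and the scan wiring are assumptions on my part:

from pyiceberg.expressions.parser import parse  # assumed entry point


def plan_with_filter(table, predicate: str):
    # Hypothetical wiring: parse the CLI-supplied filter string and hand
    # the resulting expression to the scan behind the files command.
    row_filter = parse(predicate)
    return table.new_scan().filter(row_filter).plan_files()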

@Fokko
Contributor

Fokko commented Mar 3, 2023

Hey @dhruv-pratap. I'm closing this draft as this has already been implemented. Feel free to open a new PR if you feel something is missing.

@Fokko closed this Mar 3, 2023