
Python: TableScan Plan files API implementation without residual evaluation #6069

Closed

Conversation

dhruv-pratap
Contributor

Taking a first dig at the TableScan Plan Files API. The implementation currently evaluates partition filters but does not yet evaluate row-level filters, as that requires projections.

The API is exposed by the CLI tool and lists all the data files that would need to be scanned in order to execute the scan plan.

[Screenshot: CLI output listing the data files selected for the scan]

High-level changes done as part of this PR (a usage sketch follows the list):

  1. Add a new_scan() API to the Table interface.
  2. Add a TableScan interface with just a plan_files abstract method.
  3. Add a FileScanTask model that represents a scan task over the existing DataFile model.
  4. Add a DataTableScan class that extends TableScan and implements the plan_files API to return a collection of FileScanTasks, using the existing _ManifestEvalVisitor to perform partition pruning.
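
Put together, the intended usage would look roughly like this. The catalog name and table identifier are made up, and the calls follow the list above rather than a finalized API:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")        # illustrative catalog name
table = catalog.load_table("db.events")  # illustrative identifier

# new_scan() returns a DataTableScan; plan_files() yields a FileScanTask
# for every data file that survives partition pruning.
for task in table.new_scan().plan_files():
    print(task.data_file.file_path)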

@click.argument("identifier")
@click.pass_context
@catch_exception()
def scan_table(ctx: Context, identifier: str):
Contributor

I like that we can easily visualize the files. I think we should merge this one with the files command since they are so similar. WDYT?

➜  python git:(master) ✗ pyiceberg --verbose true --catalog fokko files hive_test.h1
Snapshots: fokko.hive_test.h1
└── Snapshot 4362649644256929116, schema 0: 
    s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/snap-4362649644256929116-1-63e88d4d-1b30-4cae-b7a0-f53b0b54758a.avro
    ├── Manifest: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/63e88d4d-1b30-4cae-b7a0-f53b0b54758a-m0.avro
    │   └── Datafile: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/data/0bfca5ff//00000-0-dweeks_20220901095219_f181f97a-43ce-4aa6-8cec-6956d83b7a42-job_local2002913683_0002-00001.parquet
    └── Manifest: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/508fd9ac-520f-47d8-900f-00f1dda051ac-m0.avro
        └── Datafile: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/data/3d030bcf/00000-0-44897130-c494-4dca-a164-e874c1e37c36-00001.parquet

Contributor Author

Not sure I follow you, @Fokko. Do you mean making scan a sub-command of the files command, or just having the files command use the table.new_scan().plan_files() API underneath?

If it's the latter, I like that.

Contributor

I think the latter is more flexible, but then we should remove the files command because they have so much overlap.

Contributor Author

The only difference is that files currently lists all snapshots, while the table scan's plan_files() will only list files for the snapshot being used for the scan.

Contributor Author

Maybe we can make files support three options (sketched below):

  • files --snapshot=all - The current files behavior
  • files --snapshot=current - Files under the current snapshot
  • files --snapshot=<snapshot_id> - Files under the snapshot_id specified
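
A minimal sketch of what that could look like with click; the option wiring is hypothetical, and the real files command would resolve the table from the catalog first:

import click


@click.command()
@click.argument("identifier")
@click.option("--snapshot", default="all", help="'all', 'current', or a snapshot id")
def files(identifier: str, snapshot: str) -> None:
    if snapshot == "all":
        ...  # current behavior: list files across every snapshot
    elif snapshot == "current":
        ...  # only files under the current snapshot
    else:
        ...  # interpret the value as a snapshot id and list its files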

Contributor

Why not move it to the scan subcommand? This way we don't have similar commands

Contributor

@Fokko left a comment

Thanks for working on this @dhruv-pratap I have some suggestions, let me know what you think!

@@ -14,9 +14,11 @@
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
from __future__ import annotations
Contributor

As a community, we should probably make a decision on this one. Sometimes we use it, but not all the time. Personally, I find Optional[str] easier to read than str | None.

Contributor Author

I like Optional[str] too, but for some reason when I run make lint locally it turns it into the latter format, i.e. str | None. Do we need to tweak the linter config to enforce one format over the other?

Contributor

This is because of the import statement. PyUpgrade will automatically rewrite the imports: https://github.com/asottile/pyupgrade#pep-604-typing-rewrites. We can change this in another PR.
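
For illustration, with the __future__ import present pyupgrade rewrites Optional annotations into PEP 604 unions, roughly like this (the function is made up):

from __future__ import annotations

from typing import Optional


# Before the rewrite:
def lower_bound(value: Optional[str] = None) -> Optional[str]:
    return value


# After pyupgrade's PEP 604 rewrite:
def lower_bound_rewritten(value: str | None = None) -> str | None:
    return value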

Contributor Author

I've reverted to not using from __future__ import annotations in a recent commit.

Contributor

It should be fixed now that #6114 is in. Delayed loading of the annotations (that's what the import does) can be helpful sometimes :)

dhruv-pratap and others added 3 commits October 31, 2022 11:35
…gic to copy and rewrite a table's metadata along with its data. This logic can be used to stage test table in a temporary directory to perform scan operations on it.
@dhruv-pratap
Contributor Author

dhruv-pratap commented Nov 1, 2022

Added tests for table scan with partition filters. Also added logic to copy and rewrite a table's metadata along with its data; this can be used to stage a test table in a temporary directory and perform scan operations on it. This seems to be a necessity in the absence of write-path APIs for staging test tables to exercise the read-path APIs.

Happy to hear suggestions if there are alternatives for testing this.

CC: @Fokko @samredai

@dhruv-pratap changed the title from "Python: TableScan Plan files API implementation without row-level filters evaluation" to "Python: TableScan Plan files API implementation without residual evaluation" on Nov 3, 2022
self, io: FileIO, table: Table, snapshot: Optional[Snapshot] = None, expression: Optional[BooleanExpression] = ALWAYS_TRUE
):
self.io = io
super().__init__(table, snapshot, expression)
Contributor

Nit: the convention is to call super() first, and then do the rest.
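
Applied to the snippet above, the reordered initializer would be:

def __init__(
    self, io: FileIO, table: Table, snapshot: Optional[Snapshot] = None, expression: Optional[BooleanExpression] = ALWAYS_TRUE
):
    super().__init__(table, snapshot, expression)  # call the parent initializer first
    self.io = io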

Contributor Author

Will make the change.



@pytest.fixture
def temperatures_table() -> Table:
Contributor

I'm hesitant to keep actual binaries in the repository. We already generate the manifest list and manifest itself:

@pytest.fixture(scope="session")
def generated_manifest_file_file(
    avro_schema_manifest_file: Dict[str, Any], generated_manifest_entry_file: str
) -> Generator[str, None, None]:
    from fastavro import parse_schema, writer

    parsed_schema = parse_schema(avro_schema_manifest_file)
    # Make sure that a valid manifest_path is set
    manifest_file_records[0]["manifest_path"] = generated_manifest_entry_file
    with TemporaryDirectory() as tmpdir:
        tmp_avro_file = tmpdir + "/manifest.avro"
        with open(tmp_avro_file, "wb") as out:
            writer(out, parsed_schema, manifest_file_records)
        yield tmp_avro_file
How would you feel about generating some Parquet files too?
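
For instance, a Parquet data file could be generated on the fly in the same spirit; a sketch with made-up columns, assuming pyarrow is available in the test environment:

from tempfile import TemporaryDirectory

import pyarrow as pa
import pyarrow.parquet as pq

with TemporaryDirectory() as tmpdir:
    # Illustrative schema and rows; a real fixture would mirror the test table.
    arrow_table = pa.table({"city": ["Amsterdam", "Oslo"], "temperature": [21.3, 9.8]})
    tmp_parquet_file = tmpdir + "/data.parquet"
    pq.write_table(arrow_table, tmp_parquet_file)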

Contributor

Yes, this should generate a sample table rather than putting binaries in the repo.

Initially, we can build all of this based on just Avro and add Parquet tests later.

Contributor Author

Alright, let me take a look at that example, reverse engineer the binaries, and remove them from the repo.

@@ -526,7 +528,7 @@ def visit_equal(self, term: BoundTerm, literal: Literal[Any]) -> bool:
         if lower > literal.value:
             return ROWS_CANNOT_MATCH
 
-        upper = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
Contributor

@Fokko, can you check why tests didn't catch this case? We should ensure this is tested properly and separate the fix into its own PR.

return self.data_file.file_size_in_bytes


class TableScan(ABC):
Contributor

Why does this API not match the Java API? Is there something more pythonic about this?

Contributor Author

Just wanted to build the API incrementally and add things as and when needed; otherwise it becomes a bloated PR.

from pyiceberg.table import PartitionSpec, Snapshot, Table
from pyiceberg.utils.iceberg_base_model import IcebergBaseModel

ALWAYS_TRUE = AlwaysTrue()
Contributor

I'd prefer not adding these constants everywhere. Just use AlwaysTrue() in place of this. It's a singleton.
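
Since AlwaysTrue is a singleton, constructing it inline always returns the same instance; assuming the pyiceberg.expressions import path:

from pyiceberg.expressions import AlwaysTrue

# Every call returns the same singleton instance, so a module-level
# ALWAYS_TRUE constant adds nothing:
assert AlwaysTrue() is AlwaysTrue()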

raise ValueError("Unable to resolve to a Snapshot to use for the table scan.")

@abstractmethod
def plan_files(self) -> Iterable[FileScanTask]:
Contributor

In Java, there are several different types of tasks. I'd recommend using the same pattern and not requiring this to be FileScanTask. Introduce an abstract ScanTask and return an iterable of that instead.
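
A sketch of the suggested shape, with ScanTask as the new abstraction and FileScanTask as one concrete task type, mirroring the Java hierarchy:

from abc import ABC, abstractmethod
from typing import Iterable


class ScanTask(ABC):
    """Abstract base for all scan task types."""


class FileScanTask(ScanTask):
    """A scan task over a single data file."""


class TableScan(ABC):
    @abstractmethod
    def plan_files(self) -> Iterable[ScanTask]:
        """Yield scan tasks; callers should not assume every task is a FileScanTask."""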

@rdblue
Contributor

rdblue commented Nov 6, 2022

@Fokko, @dhruv-pratap, I posted an alternative scan API in a draft as #6131. Please take a look. That behaves like this one and allows you to specify optional arguments when creating a scan, but also allows the same syntax that Java uses.

@dhruv-pratap
Contributor Author

dhruv-pratap commented Nov 7, 2022

In retrospect, I think this is becoming too large of a PR and would benefit from being broken down into smaller tasks. I'm going to go ahead and close this PR, and if you're on board I can create issues for the following:

  1. Add a ScanTask interface.
  2. Add a FileScanTask class that extends the ScanTask interface.
  3. Fix the _ManifestEvalVisitor.visit_equal typo (#6117).
  4. Add a MetricsEvalVisitor that evaluates an expression on a DataFile to test whether rows in the file may match.
  5. Add a ResidualEvalVisitor that evaluates the residuals for an expression given the partitions in a PartitionSpec.
  6. Add a TableScan interface. @rdblue just picked this up in #6131 (Python: Add initial TableScan implementation).
  7. Add a DataTableScan class that extends TableScan and implements the plan_files() API.
  8. Expose the plan_files API as a scan subcommand via the CLI.

CC: @Fokko @rdblue @samredai

@Fokko
Contributor

Fokko commented Nov 7, 2022

Hey @dhruv-pratap, that makes a lot of sense. Maybe we should create issues for the list you mentioned above, to make sure we're aligned on who's working on what. Smaller PRs make it much easier to get stuff merged.

@rdblue
Contributor

rdblue commented Nov 21, 2022

@dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add MetricsEvalVisitor to prune by column stats, (5) add ResidualEvalVisitor to produce residuals (LOW priority), and (8) expose plan_files via the CLI.

For the last one, we may just want to add filters to the existing files command?

@dhruv-pratap
Contributor Author

> @dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add MetricsEvalVisitor to prune by column stats, (5) add ResidualEvalVisitor to produce residuals (LOW priority), and (8) expose plan_files via the CLI.
>
> For the last one, we may just want to add filters to the existing files command?

@rdblue We discussed this more in this thread. We can extend files with three options:

  • files --snapshot=all - The current files behavior
  • files --snapshot=current - Files under the current snapshot
  • files --snapshot=<snapshot_id> - Files under the snapshot_id specified

Adding an expression option would be a bigger undertaking, though, as it would require an expression parser. Not sure if we want to prioritize that for the CLI at this point.

@rdblue
Contributor

rdblue commented Jan 8, 2023

> Adding an expression option would be a bigger undertaking, though, as it would require an expression parser. Not sure if we want to prioritize that for the CLI at this point.

We added an expression parser a few weeks ago, so this would be unblocked. I'm all for updating the CLI with your suggestions.
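
With the parser in, the CLI could turn a filter string into an expression before planning. A sketch, where the parse entry point and the scan wiring are assumptions on my part:

from pyiceberg.expressions.parser import parse  # assumed entry point


def plan_with_filter(table, predicate: str):
    # Hypothetical wiring: parse the CLI-supplied filter string and hand
    # the resulting expression to the scan behind the files command.
    row_filter = parse(predicate)
    return table.new_scan().filter(row_filter).plan_files()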

@Fokko
Contributor

Fokko commented Mar 3, 2023

Hey @dhruv-pratap. I'm closing this draft as this has already been implemented. Feel free to open a new PR if you feel something is missing.

@Fokko closed this Mar 3, 2023