Python: TableScan Plan files API implementation without residual evaluation #6069
Conversation
…ters evaluation. (apache#3229)
@click.argument("identifier")
@click.pass_context
@catch_exception()
def scan_table(ctx: Context, identifier: str):
I like that we can easily visualize the files. I think we should merge this one with the files command since they are so similar. WDYT?
➜ python git:(master) ✗ pyiceberg --verbose true --catalog fokko files hive_test.h1
Snapshots: fokko.hive_test.h1
└── Snapshot 4362649644256929116, schema 0:
s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/snap-4362649644256929116-1-63e88d4d-1b30-4cae-b7a0-f53b0b54758a.avro
├── Manifest: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/63e88d4d-1b30-4cae-b7a0-f53b0b54758a-m0.avro
│ └── Datafile:
│ s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/data/0bfca5ff//00000-0-dweeks_20220901095219_f181f97a-43ce-4aa6-8cec-6956d83b7a42-job_lo
│ cal2002913683_0002-00001.parquet
└── Manifest: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/metadata/508fd9ac-520f-47d8-900f-00f1dda051ac-m0.avro
└── Datafile: s3://bucket/41246f47-e1bc-465e-89b4-6d039a9d55dc/67ad77ac-d90b-4472-b1e9-672d91a1ac41/data/3d030bcf/00000-0-44897130-c494-4dca-a164-e874c1e37c36-00001.parquet
Not sure I follow you @Fokko. Do you mean make `scan` a sub-command of the `files` command? Or just have the `files` command use the `table.new_scan().plan_files()` API underneath? If it's the latter, I like that.
I think the latter one is more flexible, but then we should remove the `files` command, because they have so much overlap.
The only difference being that `files` at present lists all snapshots, while the table scan `plan_files()` will only list the snapshot that is being used for the scan.
Maybe we can make `files` with three options:
- `files --snapshot=all` - the current `files` behavior
- `files --snapshot=current` - files under the current snapshot
- `files --snapshot=<snapshot_id>` - files under the specified snapshot_id
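The dispatch for those three modes is straightforward; here is a minimal sketch in plain Python, where the snapshot map and the function name are hypothetical stand-ins for the real catalog/CLI plumbing:

```python
# Sketch of dispatching the proposed --snapshot option. The snapshot
# dictionary (snapshot_id -> data files) is a hypothetical stand-in
# for what pyiceberg would return.
def files_for_snapshot_option(snapshots: dict, current_id: int, option: str) -> list:
    """Return data files for --snapshot=all|current|<snapshot_id>."""
    if option == "all":
        # The current `files` behavior: every snapshot's files.
        return [f for files in snapshots.values() for f in files]
    if option == "current":
        # Only the snapshot the scan would use.
        return snapshots[current_id]
    # Otherwise the option is an explicit snapshot id.
    return snapshots[int(option)]


snapshots = {1: ["a.parquet"], 2: ["b.parquet", "c.parquet"]}
print(files_for_snapshot_option(snapshots, 2, "all"))      # every file
print(files_for_snapshot_option(snapshots, 2, "current"))  # current snapshot only
print(files_for_snapshot_option(snapshots, 2, "1"))        # explicit snapshot id
```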
Why not move it to the scan subcommand? This way we don't have similar commands
Thanks for working on this, @dhruv-pratap. I have some suggestions; let me know what you think!
@@ -14,9 +14,11 @@
 # KIND, either express or implied. See the License for the
 # specific language governing permissions and limitations
 # under the License.
+from __future__ import annotations
As a community, we probably should make a decision on this one. Sometimes we use it, but not all the time. Personally I find `Optional[str]` easier to read than `str | None`.
I like `Optional[str]` too, but for some reason when I run `make lint` locally it turns it into the latter format, i.e. `str | None`. Do we need to tweak the linter config to enforce one format over the other?
This is because of the import statement. PyUpgrade will automatically rewrite the imports: https://github.com/asottile/pyupgrade#pep-604-typing-rewrites. We can change this in another PR.
I've reverted to not use `from __future__ import annotations` in a recent commit.
It should be fixed now that #6114 is in. Delayed loading of the annotations (that's what the import does) can be helpful sometimes :)
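For context, the import postpones annotation evaluation (PEP 563): annotations are stored as their source text and never evaluated at definition time, which is what makes forward references and the PEP 604 `X | None` syntax work on pre-3.10 interpreters. A self-contained illustration (the class is made up for the example):

```python
from __future__ import annotations  # must be the first statement in the module


class LinkedNode:
    # Both the forward reference to LinkedNode and the PEP 604 union are
    # fine here: with the future import they are kept as plain strings
    # instead of being evaluated at class-definition time.
    def __init__(self, value: int, next: LinkedNode | None = None):
        self.value = value
        self.next = next


# The annotation was never evaluated; it is stored as the literal source text.
print(LinkedNode.__init__.__annotations__["next"])  # LinkedNode | None
```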
Co-authored-by: Fokko Driesprong <[email protected]>
…gic to copy and rewrite a table's metadata along with its data. This logic can be used to stage test table in a temporary directory to perform scan operations on it.
Added tests for table scan with partition filters. Also added logic to copy and rewrite a table's metadata along with its data. This logic can be used to stage a test table in a temporary directory and perform scan operations on it. This seems to be a necessity in the absence of write-path APIs for staging a test table to exercise the read-path APIs. Happy to hear suggestions if there are any alternatives for testing these.
    self, io: FileIO, table: Table, snapshot: Optional[Snapshot] = None, expression: Optional[BooleanExpression] = ALWAYS_TRUE
):
    self.io = io
    super().__init__(table, snapshot, expression)
Nit: convention is to call the super constructor first, and then do the rest.
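The suggested ordering, sketched on simplified stand-ins for the classes under review (the names mirror the PR, but the bodies are assumptions):

```python
# Simplified stand-ins; only the constructor ordering is the point here.
class TableScan:
    def __init__(self, table, snapshot=None):
        self.table = table
        self.snapshot = snapshot


class DataTableScan(TableScan):
    def __init__(self, io, table, snapshot=None):
        # Convention: call the super constructor first, ...
        super().__init__(table, snapshot)
        # ... then do the subclass-specific initialization.
        self.io = io


scan = DataTableScan(io="local-io", table="t1")
print(scan.table, scan.io)  # t1 local-io
```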
Will make the change.
@pytest.fixture
def temperatures_table() -> Table:
I'm hesitant to keep actual binaries in the repository. We already generate the manifest list and manifest itself:
iceberg/python/tests/conftest.py
Lines 952 to 967 in b215c48
@pytest.fixture(scope="session")
def generated_manifest_file_file(
    avro_schema_manifest_file: Dict[str, Any], generated_manifest_entry_file: str
) -> Generator[str, None, None]:
    from fastavro import parse_schema, writer

    parsed_schema = parse_schema(avro_schema_manifest_file)
    # Make sure that a valid manifest_path is set
    manifest_file_records[0]["manifest_path"] = generated_manifest_entry_file
    with TemporaryDirectory() as tmpdir:
        tmp_avro_file = tmpdir + "/manifest.avro"
        with open(tmp_avro_file, "wb") as out:
            writer(out, parsed_schema, manifest_file_records)
        yield tmp_avro_file
Yes, this should generate a sample table rather than putting binaries in the repo.
Initially, we can build all of this based on just Avro and add Parquet tests later.
Alright, let me take a look at that example, reverse engineer the binaries, and remove them from the repo.
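The generate-instead-of-commit fixture pattern can be sketched without fastavro: a generator-style fixture writes the file into a temporary directory and yields its path, so no binary ever lands in the repo. The file name and payload below are made up for illustration:

```python
import os
from contextlib import contextmanager
from tempfile import TemporaryDirectory


@contextmanager
def generated_manifest_file():
    # Stand-in for writing a real Avro manifest with fastavro: any
    # generated payload is enough to show the fixture shape.
    with TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "manifest.avro")
        with open(path, "wb") as out:
            out.write(b"generated-manifest-bytes")
        # The file (and its directory) are cleaned up when the
        # context exits, so nothing is ever checked in.
        yield path


with generated_manifest_file() as path:
    print(os.path.exists(path))  # True
```

In a real test suite the same shape works as a `@pytest.fixture` with `yield`, exactly like the `generated_manifest_file_file` fixture quoted above.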
@@ -526,7 +528,7 @@ def visit_equal(self, term: BoundTerm, literal: Literal[Any]) -> bool:
         if lower > literal.value:
             return ROWS_CANNOT_MATCH

-        upper = _from_byte_buffer(term.ref().field.field_type, field.lower_bound)
+        upper = _from_byte_buffer(term.ref().field.field_type, field.upper_bound)
@Fokko, can you check why tests didn't catch this case? We should ensure this is tested properly and separate the fix into its own PR.
        return self.data_file.file_size_in_bytes


class TableScan(ABC):
Why does this API not match the Java API? Is there something more pythonic about this?
Just wanted to build the API incrementally and add things as and when needed, otherwise it becomes a bloated PR.
from pyiceberg.table import PartitionSpec, Snapshot, Table
from pyiceberg.utils.iceberg_base_model import IcebergBaseModel

ALWAYS_TRUE = AlwaysTrue()
I'd prefer not to add these constants everywhere. Just use `AlwaysTrue()` in place of this; it's a singleton.
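The singleton point is easy to illustrate. This is only a sketch of the behavior (pyiceberg's actual `AlwaysTrue` implementation differs), but it shows why hoisting `AlwaysTrue()` into a module constant buys nothing:

```python
class AlwaysTrue:
    # Sketch of singleton behavior: every call to AlwaysTrue() returns
    # the same instance, so a module-level ALWAYS_TRUE constant is
    # redundant.
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


print(AlwaysTrue() is AlwaysTrue())  # True
```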
        raise ValueError("Unable to resolve to a Snapshot to use for the table scan.")

    @abstractmethod
    def plan_files(self) -> Iterable[FileScanTask]:
In Java, there are several different types of tasks. I'd recommend using the same pattern and not requiring this to be `FileScanTask`. Introduce an abstract `ScanTask` and return an iterable of that instead.
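The suggested hierarchy could look like the following sketch; the class names follow the review comment, but the constructor signatures and bodies are assumptions:

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class ScanTask(ABC):
    """Abstract base for all scan tasks, mirroring Java's ScanTask."""


class FileScanTask(ScanTask):
    """One concrete task type: scan a single data file."""

    def __init__(self, file_path: str):
        self.file_path = file_path


class TableScan(ABC):
    @abstractmethod
    def plan_files(self) -> Iterable[ScanTask]:
        """Return tasks of any ScanTask subtype, not just FileScanTask."""


class DataTableScan(TableScan):
    def __init__(self, data_files: List[str]):
        self.data_files = data_files

    def plan_files(self) -> Iterable[ScanTask]:
        # Typed against the abstract base, so other task types
        # (combined tasks, delete-file tasks, ...) can be added later
        # without changing the TableScan contract.
        return [FileScanTask(p) for p in self.data_files]


tasks = DataTableScan(["a.parquet", "b.parquet"]).plan_files()
print([t.file_path for t in tasks])  # ['a.parquet', 'b.parquet']
```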
@Fokko, @dhruv-pratap, I posted an alternative scan API in a draft as #6131. Please take a look. That behaves like this one and allows you to specify optional arguments when creating a scan, but also allows the same syntax that Java uses.
In retrospect, I think this is becoming too large of a PR and would benefit from being broken down into smaller tasks. I'm going to go ahead and close this PR, and if you are all on board I can create issues for the items below:
Hey @dhruv-pratap, that makes a lot of sense. Maybe we should create issues for the list you mentioned above, to make sure that we're aligned on who's working on what. Smaller PRs make it much easier to get stuff merged.
@dhruv-pratap, we've been working on the list lately and I think the remaining items are: (4) add For the last one, we may just want to add filters to the existing
@rdblue We discussed this more in this thread. We can extend
Adding
We added an expression parser a few weeks ago, so this would be unblocked. I'm all for updating the CLI with your suggestions.
Hey @dhruv-pratap. I'm closing this draft as this has already been implemented. Feel free to open a new PR if you feel something is missing.
Taking a first dig at the TableScan plan files API. The implementation at present evaluates partition filters, but does not yet evaluate row-level filters, as that requires projections.
The API is exposed through the CLI tool and lists all the data files that would need to be scanned in order to execute the scan plan.
High-level changes done as part of this PR:
- Added a `new_scan()` API to the `Table` interface.
- Added a `TableScan` interface with just a `plan_files` abstract method.
- Added a `FileScanTask` model that represents a scan task over the existing `DataFile` model.
- Added a `DataTableScan` class that extends `TableScan` and implements the `plan_files` API to return a collection of `FileScanTask`s, using the existing `_ManifestEvalVisitor` to perform partition pruning.
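The partition-pruning flow described above can be sketched end to end: walk the snapshot's manifests, evaluate the partition filter against each manifest's partition summary, and only emit tasks for data files in manifests that might match. Everything below is a simplified stand-in; the real implementation uses `_ManifestEvalVisitor` over Avro manifest metadata rather than a plain predicate over value ranges:

```python
from typing import Callable, Dict, Iterable, List, Tuple

# A "manifest" here is just a partition-value range summary plus the
# data files it tracks: ({"day": (lo, hi)}, ["file.parquet", ...]).
Manifest = Tuple[Dict[str, Tuple[int, int]], List[str]]


def plan_files(
    manifests: Iterable[Manifest],
    partition_filter: Callable[[Dict[str, Tuple[int, int]]], bool],
) -> List[str]:
    tasks: List[str] = []
    for summary, data_files in manifests:
        # Prune whole manifests whose partition summary cannot match;
        # only surviving manifests contribute file scan tasks.
        if partition_filter(summary):
            tasks.extend(data_files)
    return tasks


manifests = [
    ({"day": (1, 10)}, ["m0-a.parquet"]),
    ({"day": (11, 20)}, ["m1-a.parquet", "m1-b.parquet"]),
]
# Filter day == 15: only the second manifest's range can contain it.
print(plan_files(manifests, lambda s: s["day"][0] <= 15 <= s["day"][1]))
# ['m1-a.parquet', 'm1-b.parquet']
```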