-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: Add initial TableScan implementation #6131
Conversation
|
||
class Table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko, I removed Pydantic from this because it was conflicting with the schema
method. I think we should probably make these changes to the Table
class either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could still override the method and it will work fine, only mypy will complain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The schema method was already there before this PR, why is removing it now? 🤔
|
||
|
||
class TableScan: | ||
_always_true: ClassVar[BooleanExpression] = AlwaysTrue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was needed to avoid calling functions in arg defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is necessary. AlwaysTrue
is a singleton it will always return the same instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that AlwaysTrue
is a class, and you have to use AlwaysTrue()
to get the singleton.
I'm just an outside observer here, but isn't there already a Python implementation that followed the Java API, but folks thought it would be good to do all this work to rewrite it to be more readable/usable for a Python-native developer? Trying to include both options seems like it might bring added complexity and scope creep. For me the second example above (the Java-style) feels very Java and the first feels very Python. If the whole goal is keeping the API Pythonic, shouldn't we stick with one Python-like API? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me in general. I also agree with @ddrinka that just creating a table through the constructor is more pythonic, the builder introduces a lot of noise to me. Apart from that it looks good 👍🏻
|
||
|
||
class TableScan: | ||
_always_true: ClassVar[BooleanExpression] = AlwaysTrue() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is necessary. AlwaysTrue
is a singleton it will always return the same instance.
|
||
class Table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could still override the method and it will work fine, only mypy will complain.
"""Return the identifer of this table""" | ||
return self.identifier | ||
|
||
def scan(self, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def scan(self, **kwargs): | |
def scan(self, **kwargs) -> TableScan: |
table: Table, | ||
row_filter: BooleanExpression = _always_true, | ||
partition_filter: BooleanExpression = _always_true, | ||
selected_fields: tuple[str] = ("*",), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selected_fields: tuple[str] = ("*",), | |
selected_fields: Tuple[str] = ("*",), |
|
||
return self.table.current_snapshot() | ||
|
||
def projection(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def projection(self): | |
def projection(self) -> Schema: |
"""Creates a copy of this table scan with updated fields.""" | ||
return TableScan(**{**self.__dict__, **overrides}) | ||
|
||
def snapshot(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def snapshot(self): | |
def snapshot(self) -> Snapshot: |
def select(self, *field_names: str) -> "TableScan": | ||
if "*" in self.selected_fields: | ||
return self.update(selected_fields=field_names) | ||
return self.update(selected_fields=tuple(set(self.selected_fields).intersection(field_names))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small bug:
return self.update(selected_fields=tuple(set(self.selected_fields).intersection(field_names))) | |
return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names)))) |
Closing in favor of #6145. |
This adds an implementation of
TableScan
that is an alternative to the one in #6069. This doesn't implementplan_files
, it is just to demonstrate a possible scan API.This scan API works like the Java scan API, but also allows passing scan options when creating an initial scan. Both of these are supported:
I think this is a reasonable way to get more pythonic (by passing optional arguments) and also mostly match the API and behavior in the JVM implementation.