Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python: Add initial TableScan implementation #6131

Closed
wants to merge 1 commit into from

Conversation

rdblue
Copy link
Contributor

@rdblue rdblue commented Nov 6, 2022

This adds an implementation of TableScan that is an alternative to the one in #6069. This doesn't implement plan_files, it is just to demonstrate a possible scan API.

This scan API works like the Java scan API, but also allows passing scan options when creating an initial scan. Both of these are supported:

scan = table.scan(
    row_filter=In("id", [5, 6, 7]),
    selected_fields=("id", "data"),
    snapshot_id=1234567890
  )
# OR
scan = table.scan() \
    .filter_rows(In("id", [5, 6, 7])) \
    .select("id", "data") \
    .use_snapshot(1234567890)

I think this is a reasonable way to get more pythonic (by passing optional arguments) and also mostly match the API and behavior in the JVM implementation.


class Table:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Fokko, I removed Pydantic from this because it was conflicting with the schema method. I think we should probably make these changes to the Table class either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could still override the method and it will work fine, only mypy will complain.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema method was already there before this PR, why is removing it now? 🤔



class TableScan:
_always_true: ClassVar[BooleanExpression] = AlwaysTrue()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was needed to avoid calling functions in arg defaults.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is necessary. AlwaysTrue is a singleton it will always return the same instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that AlwaysTrue is a class, and you have to use AlwaysTrue() to get the singleton.

@rdblue rdblue requested a review from Fokko November 6, 2022 23:38
@ddrinka
Copy link
Contributor

ddrinka commented Nov 7, 2022

I'm just an outside observer here, but isn't there already a Python implementation that followed the Java API, but folks thought it would be good to do all this work to rewrite it to be more readable/usable for a Python-native developer? Trying to include both options seems like it might bring added complexity and scope creep.

For me the second example above (the Java-style) feels very Java and the first feels very Python. If the whole goal is keeping the API Pythonic, shouldn't we stick with one Python-like API?

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me in general. I also agree with @ddrinka that just creating a table through the constructor is more pythonic, the builder introduces a lot of noise to me. Apart from that it looks good 👍🏻



class TableScan:
_always_true: ClassVar[BooleanExpression] = AlwaysTrue()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this is necessary. AlwaysTrue is a singleton it will always return the same instance.


class Table:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could still override the method and it will work fine, only mypy will complain.

"""Return the identifer of this table"""
return self.identifier

def scan(self, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def scan(self, **kwargs):
def scan(self, **kwargs) -> TableScan:

table: Table,
row_filter: BooleanExpression = _always_true,
partition_filter: BooleanExpression = _always_true,
selected_fields: tuple[str] = ("*",),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
selected_fields: tuple[str] = ("*",),
selected_fields: Tuple[str] = ("*",),


return self.table.current_snapshot()

def projection(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def projection(self):
def projection(self) -> Schema:

"""Creates a copy of this table scan with updated fields."""
return TableScan(**{**self.__dict__, **overrides})

def snapshot(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def snapshot(self):
def snapshot(self) -> Snapshot:

def select(self, *field_names: str) -> "TableScan":
if "*" in self.selected_fields:
return self.update(selected_fields=field_names)
return self.update(selected_fields=tuple(set(self.selected_fields).intersection(field_names)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small bug:

Suggested change
return self.update(selected_fields=tuple(set(self.selected_fields).intersection(field_names)))
return self.update(selected_fields=tuple(set(self.selected_fields).intersection(set(field_names))))

@rdblue rdblue closed this Nov 10, 2022
@rdblue
Copy link
Contributor Author

rdblue commented Nov 10, 2022

Closing in favor of #6145.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants