-
Notifications
You must be signed in to change notification settings - Fork 600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(api): avoid caching physical tables #9976
base: main
Are you sure you want to change the base?
Conversation
while isinstance(op, (ops.Project, ops.DropColumns)): | ||
if isinstance(op, ops.Project) and not all( | ||
isinstance(v, ops.Field) for v in op.values.values() | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most of this code is unrelated to the backend and is about some property of an expression.
Can we define this as an attribute on ops.Relation
subclasses? Seems like it'd be much less squishy that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meaning the decision is consistent and only based on the operation type (no query in the backend needed)? That's option 2 listed above.
@@ -1589,6 +1589,24 @@ def _get_schema_using_query(self, query: str) -> sch.Schema: | |||
} | |||
) | |||
|
|||
def _should_cache_physical_table(self, op: ops.PhysicalTable) -> bool: | |||
if isinstance(op, (ops.DatabaseTable, ops.UnboundTable)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like a property we should encode on the operation itself instead of leaving it up to the backend.
Is that not possible for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I covered this in the PR body above. I agree that that would be one potentially nicer option (that's bullet 2 above).
Fixes #6195.
This makes
table_expr.cache()
a no-op for physical tables and other expressions not worth caching (defined here as simple column subsets of physical tables).In the original issue (#6195) I wanted to only avoid caching things that were backed by an existing physical table (don't cache tables, do cache views, do cache expressions, ...). In practice writing
_is_this_a_real_database_table
for every backend would be annoying (maybe less annoying once the.ddl
accessor lands). I'm also not sure if avoiding caching views is that bad.We do make exceptions for two backends though -
duckdb
andpyspark
. Both of these backends have lazy-loading for file inputs (meaning creating theibis.Table
doesn't load the data yet), and loading the data can sometimes be expensive. In both cases the output ofread_parquet
et. al. is backed by aTEMP VIEW
. For these backends alone we also will cache temp views to work around this. This means that the pattern of:will work for both these backends still. Other backends where
read_parquet
reads immediately (not as a view) will now have.cache()
as a no-op (where before it would duplicate the data).All that said, I'm not happy with how squishy the definition of "do we cache this expression" is. I think the above is pragmatic and easy to do, but it's also hard to explain since there's so much "it depends" going on. Mostly pushing this up for discussion for now (I haven't added tests yet).
A few options that would make this simpler/more uniform/easier to explain:
ops.DatabaseTable
, but the uniformity there may be worth it.PhysicalTable
. This makes the choice much easier, since it requires no backend introspection. Forduckdb
/pyspark
/others add an optional kwarg to control whether the data is loaded immediately or lazily (feat(io): support way to ensureread_*
apis create physical tables in the backend #9931). I know I closed that one in favor of feat: makeTable.cache()
a no-op for tables that are already concrete in a backend #6195, but now I'm waffling. This moves the decision about how to load data to the caller rather than after the call, which feels better to me (more localized control). Could be a backend-specific kwarg (so not there for all our backends), or something more general. Possible names:cache=True
,view=False
,lazy=False
, idk..cache()
uniformly to ensure I was working on a physical table for efficiency, but now I'm not sure if the complications here are worth it.Currently leaning towards option 2, but could hear arguments for any of them.