[Lake] Cleanup Table - Remove cached lazy_polars DF. #737

Closed
4 tasks done
idiom-bytes opened this issue Mar 4, 2024 · 0 comments
Labels
Type: Enhancement New feature or request

idiom-bytes commented Mar 4, 2024

Motivation

@kdetry recently mistook the cached lazy_polars dataframe inside Table (Table.df) for a reflection of the database table/object, which it is not. The cache is an artifact of using parquet as a filesystem and needs to be updated or removed.

To mutate any DB or data store, you must call CSVDataStore.append, PersistentDataStore.append, or related functions. You cannot simply overwrite the polars dataframe, e.g.:

self.df = self.df.select([pl.col("timestamp")])

This only overwrites the cached dataframe. It does not change the DB or any persistent records.
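The difference between reassigning the cache and mutating the store can be sketched with the standard library alone. Everything here is an assumption made for illustration: `DemoStore` and `DemoTable` are hypothetical names, SQLite stands in for the real data store, and a plain list of tuples stands in for the polars dataframe.

```python
import sqlite3


class DemoStore:
    """Hypothetical stand-in for a persistent data store (in-memory SQLite)."""

    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE preds (timestamp INTEGER, value REAL)")

    def append(self, rows):
        # The only path in this sketch that changes persistent records.
        self.conn.executemany("INSERT INTO preds VALUES (?, ?)", rows)
        self.conn.commit()

    def row_count(self):
        return self.conn.execute("SELECT COUNT(*) FROM preds").fetchone()[0]

    def columns(self):
        cur = self.conn.execute("SELECT * FROM preds LIMIT 0")
        return [d[0] for d in cur.description]


class DemoTable:
    """Hypothetical Table holding a cached snapshot in self.df."""

    def __init__(self, store):
        self.store = store
        # Snapshot taken at construction time; a list here, a dataframe in the issue.
        self.df = store.conn.execute(
            "SELECT timestamp, value FROM preds ORDER BY timestamp"
        ).fetchall()


store = DemoStore()
store.append([(1, 0.5), (2, 0.7)])
table = DemoTable(store)

# Analogue of `self.df = self.df.select(...)`: only the cached snapshot changes.
table.df = [(ts,) for ts, _ in table.df]
columns_after_override = store.columns()
rows_after_override = store.row_count()

# Going through the store is what actually changes the data.
store.append([(3, 0.9)])
rows_after_append = store.row_count()
```

After the reassignment the store still has both columns and both rows; only the call to `append` adds a record.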

Candidate Solution

This cache was always meant to be a lazy_polars object: a low-memory pointer to the table definition, so that the table is easy to access & query. Therefore, it's important that we make the distinction between the data store and cached data more intuitive.

Candidate Solution A
Keep no cache for tables. DuckDB is designed to support beyond-memory compute. Simplify everything to just query whatever is needed from DuckDB.
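A minimal sketch of what a cache-free table could look like. `CachelessTable`, its `append` and `query` methods, and the `slots` table are hypothetical, and SQLite is swapped in for DuckDB so the sketch runs with only the standard library.

```python
import sqlite3


class CachelessTable:
    """Hypothetical Table with no self.df: every read goes to the store."""

    def __init__(self, conn, name):
        self.conn = conn
        # Assumed to be a trusted identifier, never user input.
        self.name = name

    def append(self, rows):
        self.conn.executemany(f"INSERT INTO {self.name} VALUES (?, ?)", rows)
        self.conn.commit()

    def query(self, sql, params=()):
        # No snapshot is kept, so results always reflect the current store.
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slots (timestamp INTEGER, value REAL)")
table = CachelessTable(conn, "slots")

table.append([(1, 0.5), (2, 0.7)])
count_before = table.query("SELECT COUNT(*) FROM slots")[0][0]

table.append([(3, 0.9)])
count_after = table.query("SELECT COUNT(*) FROM slots")[0][0]
recent = table.query(
    "SELECT timestamp FROM slots WHERE timestamp >= ? ORDER BY timestamp", (2,)
)
```

With no attribute to reassign, there is nothing a caller could mistake for the table itself. The trade-off is that every read pays for a query.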

Candidate Solution B
Rather than cache the dataframe inside the Table object, create a new helper util/object named Cache, and cache the dataframes we want to save there.

Perhaps the cache has a very limited memory budget to enforce usage constraints, since dataframes can be arbitrarily large.

Perhaps cached items are immutable and stored w/ a pointer (:shrug:) such that there is little memory footprint and there is no confusion as to what the cache does.
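One possible shape for such a helper, as a sketch only. The `Cache` API below (`put`, `get`, `max_items`) is hypothetical. The limit is counted in entries rather than bytes, and rows are frozen into tuples as a stand-in for an immutable dataframe; with polars, the cached value could instead be a lazy frame, which holds a query plan rather than data.

```python
from collections import OrderedDict


class Cache:
    """Hypothetical bounded cache of immutable snapshots, separate from Table."""

    def __init__(self, max_items=2):
        self.max_items = max_items
        self._items = OrderedDict()

    def put(self, key, rows):
        # Freeze into nested tuples so callers cannot mutate cached data.
        self._items[key] = tuple(tuple(r) for r in rows)
        self._items.move_to_end(key)
        # Evict least recently used entries once over the limit.
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]


cache = Cache(max_items=2)
cache.put("a", [[1, 0.5]])
cache.put("b", [[2, 0.7]])
cache.get("a")              # marks "a" as recently used
cache.put("c", [[3, 0.9]])  # evicts "b", the least recently used entry

try:
    cache.get("a")[0][0] = 99  # tuples reject item assignment
    mutation_blocked = False
except TypeError:
    mutation_blocked = True
```

Keeping the cache in its own object makes it explicit that reading from it is not the same as reading from the data store.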

Candidate Solution C
Use a Python cache that meets our requirements.
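As one candidate, the standard library's `functools.lru_cache` can memoize read queries. This is a sketch under assumptions: the `preds` table, `timestamps_since`, and `append` are hypothetical, and SQLite stands in for the real data store. Whether this meets the project's requirements is left open by the issue.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE preds (timestamp INTEGER, value REAL)")
conn.executemany("INSERT INTO preds VALUES (?, ?)", [(1, 0.5), (2, 0.7)])
conn.commit()


@lru_cache(maxsize=32)
def timestamps_since(start):
    """Read query whose results are memoized per `start` value."""
    rows = conn.execute(
        "SELECT timestamp FROM preds WHERE timestamp >= ? ORDER BY timestamp",
        (start,),
    ).fetchall()
    # Return a tuple so the cached result cannot be mutated by callers.
    return tuple(ts for (ts,) in rows)


def append(rows):
    """Write path: mutate the store, then invalidate cached reads."""
    conn.executemany("INSERT INTO preds VALUES (?, ?)", rows)
    conn.commit()
    timestamps_since.cache_clear()


first = timestamps_since(1)
second = timestamps_since(1)  # served from the cache
hits_before_append = timestamps_since.cache_info().hits

append([(3, 0.9)])
third = timestamps_since(1)   # recomputed after invalidation
```

The cache is cleared on every write here, which is the simplest invalidation policy. A finer-grained policy would need to know which cached queries a write affects.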

Candidate Solution D
A mixture of A, B, and C.

DoD

  • Table has been updated and the self.df attribute has been removed.
  • The rest of the code that relies on Table queries the table directly to get the results it needs.
  • All code that's expected to mutate the Table/Database is properly appending, upserting, updating, or doing CRUD operations, rather than overwriting table.df and expecting the database to change.
  • Decide whether caching will be maintained as is or needs to be re-implemented. Write a separate ticket. Do not handle caching directly in this ticket.
idiom-bytes added the "Type: Enhancement" label Mar 4, 2024
idiom-bytes changed the title from "[Lake] Cleanup Table - Cached df needs to be explicit, immutable." to "[Lake] Cleanup Table - Remove cached lazy_polars DF." Mar 4, 2024
kdetry assigned kdetry and unassigned kdetry Mar 6, 2024

2 participants