[Lake] Cleanup Table - Remove cached lazy_polars DF. #737

idiom-bytes · 2024-03-04T20:59:06Z

Motivation

@kdetry recently took the cached lazy_polars dataframe inside Table (Table.df) to be a reflection of the database table/object, which it is not. The cache is an artifact of using parquet as a filesystem and needs to be updated or removed.

To mutate any DB or data store, you must call CSVDataStore.append, PersistentDataStore.append, or related functions. You cannot simply overwrite the polars dataframe.

self.df = self.df.select([pl.col("timestamp")])

This only overwrites the cached dataframe; it does not change the DB or any persistent records.
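A minimal sketch of the pitfall, using the standard library's `sqlite3` as a stand-in for the real data store. `ToyTable`, its `append` method, and the `rows` table are hypothetical names for illustration, not the project's API. Reassigning the cached attribute leaves the persisted rows untouched.

```python
import sqlite3


class ToyTable:
    """Hypothetical stand-in for Table: a persistent store plus a cached snapshot."""

    def __init__(self):
        # Assumption: sqlite3 plays the role of the real persistent store.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE rows (timestamp INTEGER, value REAL)")
        self.df = []  # cached snapshot, analogous to Table.df

    def append(self, records):
        # The only path here that actually mutates persisted data.
        self.conn.executemany("INSERT INTO rows VALUES (?, ?)", records)
        self.conn.commit()
        self.df = self.read_store()

    def read_store(self):
        return self.conn.execute(
            "SELECT timestamp, value FROM rows ORDER BY timestamp"
        ).fetchall()


table = ToyTable()
table.append([(1, 0.5), (2, 0.7)])

# Overwrite the cache only, like `self.df = self.df.select(...)`.
table.df = [(ts,) for ts, _ in table.df]

cached = table.df               # only the timestamp column remains
persisted = table.read_store()  # both columns are still stored
```

After the reassignment, `cached` and `persisted` disagree, which is exactly the confusion described above.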

Candidate Solution

This cache was always meant to be a lazy_polars object: a low-memory pointer to the table definition that makes the table easy to access and query. It is therefore important that we make the distinction between the data store and cached data more intuitive.

Candidate Solution A
Have no cache for tables. DuckDB is designed to support beyond-memory compute. Simplify everything to just query whatever is needed from duckdb.
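One way Solution A could look, sketched with `sqlite3` from the standard library standing in for DuckDB. `DirectTable`, `query`, and the `predictions` table are hypothetical names, not existing project code. The object holds no dataframe, so every read goes to the store.

```python
import sqlite3


class DirectTable:
    """Hypothetical cache-free table: every read is a query against the store."""

    def __init__(self, conn, name):
        self.conn = conn
        self.name = name  # trusted identifier in this sketch, never user input

    def append(self, records):
        self.conn.executemany(f"INSERT INTO {self.name} VALUES (?, ?)", records)
        self.conn.commit()

    def query(self, where="1=1", params=()):
        sql = (
            f"SELECT timestamp, value FROM {self.name} "
            f"WHERE {where} ORDER BY timestamp"
        )
        return self.conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE predictions (timestamp INTEGER, value REAL)")

table = DirectTable(conn, "predictions")
table.append([(1, 0.5), (2, 0.7), (3, 0.9)])

# No cached state to drift out of sync: results always reflect the store.
recent = table.query("timestamp >= ?", (2,))
```

The trade-off is that repeated reads repeat the query, which is acceptable if the engine handles larger-than-memory workloads as described.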

Candidate Solution B
Rather than cache the dataframe inside Table object, create a new helper util/object named Cache, and cache the dataframes we want to save there.

Perhaps the cache has very limited memory to enforce usage constraints, since dataframes can be extremely large.

Perhaps cached items are immutable and stored w/ a pointer (:shrug:) such that there is little memory footprint and there is no confusion as to what the cache does.
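A rough sketch of what such a helper might look like, assuming a bound on the number of entries rather than on bytes, and tuples as the immutable snapshot type. The `Cache` class and its `put`/`get` methods are hypothetical, not an existing util.

```python
from collections import OrderedDict


class Cache:
    """Hypothetical bounded cache holding immutable snapshots (tuples of rows)."""

    def __init__(self, max_items=2):
        self.max_items = max_items
        self._items = OrderedDict()

    def put(self, key, rows):
        # Freeze rows so callers cannot mutate a cached entry in place.
        self._items[key] = tuple(tuple(row) for row in rows)
        self._items.move_to_end(key)
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as recently used
        return self._items[key]


cache = Cache(max_items=2)
cache.put("a", [[1, 0.5]])
cache.put("b", [[2, 0.7]])
cache.get("a")              # "a" becomes the most recently used entry
cache.put("c", [[3, 0.9]])  # exceeds the bound, so "b" is evicted
```

Keeping the cache in its own object makes it obvious that writing to it is not a write to the data store.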

Candidate Solution C
Use a python cache that solves our requirements.
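For instance, `functools.lru_cache` from the standard library could wrap a query function. This is only a sketch: `rows_since` and the `rows` table are hypothetical, and `sqlite3` stands in for the real store. It also shows the main caveat, which is that cached results go stale after a write unless the cache is cleared.

```python
import sqlite3
from functools import lru_cache

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rows (timestamp INTEGER, value REAL)")
conn.executemany("INSERT INTO rows VALUES (?, ?)", [(1, 0.5), (2, 0.7)])


@lru_cache(maxsize=32)
def rows_since(min_timestamp):
    # Return a tuple so the cached result is immutable.
    return tuple(
        conn.execute(
            "SELECT timestamp, value FROM rows "
            "WHERE timestamp >= ? ORDER BY timestamp",
            (min_timestamp,),
        )
    )


first = rows_since(2)
second = rows_since(2)    # served from the cache, no query issued

conn.execute("INSERT INTO rows VALUES (?, ?)", (3, 0.9))
stale = rows_since(2)     # still the cached result from before the insert

rows_since.cache_clear()  # explicit invalidation after a write
fresh = rows_since(2)
```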

Candidate Solution D
A mixture of A, B, and C.

DoD

Table has been updated and the self.df attribute has been removed.
The rest of the code that relies on Table queries the table directly to get the results it needs.
All code that is expected to mutate the Table/Database does so properly: appending, upserting, updating, or doing other CRUD operations, rather than overriding table.df and expecting the database to change.
Decide whether caching will be maintained as is or needs to be re-implemented. Write a separate ticket; do not handle caching directly in this ticket.


idiom-bytes added the Type: Enhancement New feature or request label Mar 4, 2024

idiom-bytes changed the title ~~[Lake] Cleanup Table - Cached df needs to be explicit, immutable.~~ [Lake] Cleanup Table - Remove cached lazy_polars DF. Mar 4, 2024

idiom-bytes mentioned this issue Mar 4, 2024

[Lake][ETL] DuckDB E2E - Ingestion -> Dashboards #685

Closed

50 tasks

kdetry assigned kdetry and unassigned kdetry Mar 6, 2024

idiom-bytes mentioned this issue Mar 6, 2024

[Freeze][Do Not Modify] Fix #685 - Integrate DuckDB - E2E #758

Closed

idiom-bytes closed this as completed Mar 19, 2024
