Motivation

Right now @kdetry got confused by the cached lazy_polars dataframe inside Table, mistaking Table.df for a reflection of the database table/object, which it is not. It is an artifact of using parquet as a filesystem and needs to be updated/removed.
To mutate any DB or data store, you must call CSVDataStore.append, PersistentDataStore.append, or related functions. You cannot simply overwrite the polars dataframe:
```python
self.df = self.df.select([pl.col("timestamp")])
```
This only overwrites the cached dataframe; it does not change the DB or any persistent records.
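To make the contrast concrete, here is a minimal sketch of the wrong vs. right pattern. The append(table_name, df) signature and table.table_name are assumptions for illustration; check the actual CSVDataStore / PersistentDataStore interfaces.

```python
import polars as pl

# WRONG: this only swaps out the in-memory cached LazyFrame.
# Nothing is written back to the parquet files or the database.
table.df = table.df.select([pl.col("timestamp")])

# RIGHT: go through the data-store layer so the change persists.
# (append signature and table.table_name are illustrative, not the actual API)
new_rows = pl.DataFrame({"timestamp": [1_700_000_000]})
csv_data_store.append(table.table_name, new_rows)
persistent_data_store.append(table.table_name, new_rows)
```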
Candidate Solutions

This cache was always meant to be a lazy_polars object: a low-memory pointer to the table definition, so it's easy to access & query. Therefore, it's important that we make the distinction between the data store and the cached data more intuitive.
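For reference, here is a minimal sketch of what that low-memory pointer looks like in polars: a LazyFrame is just a query plan over the parquet files and materializes nothing until collected (the path below is illustrative).

```python
import polars as pl

# A LazyFrame is a query plan pointing at the parquet files;
# no data is read until .collect() is called.
lazy_df = pl.scan_parquet("lake/table_name/*.parquet")

# Cheap to hold and to compose further queries on:
recent = lazy_df.filter(pl.col("timestamp") > 1_700_000_000)

# Materialization (and the memory cost) happens only here:
df = recent.collect()
```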
Candidate Solution A
Have no cache for tables. DuckDB is designed to support beyond-memory compute, so simplify everything to just query whatever is needed from DuckDB.
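A minimal sketch of what querying DuckDB directly could look like, using DuckDB's Python API (the database path and table name are illustrative):

```python
import duckdb

# Connect to the persistent database file (path is illustrative).
conn = duckdb.connect("lake/duckdb.db")

# Query only what is needed; DuckDB can spill to disk for
# larger-than-memory workloads instead of holding a cached dataframe.
df = conn.execute(
    "SELECT * FROM my_table WHERE timestamp > ?", [1_700_000_000]
).pl()  # fetch the result as a polars DataFrame
```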
Candidate Solution B
Rather than caching the dataframe inside the Table object, create a new helper util/object named Cache, and cache the dataframes we want to save there (see the sketch after this list).
Perhaps the cache has a very limited memory budget to enforce usage constraints, since dataframes can grow arbitrarily large.
Perhaps cached items are immutable and stored with a pointer (:shrug:) such that there is little memory footprint and no confusion about what the cache does.
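A minimal sketch of such a Cache helper, assuming a simple item-count cap and FIFO eviction; the names and the eviction policy are assumptions, not the final design.

```python
from __future__ import annotations

from typing import Optional

import polars as pl


class Cache:
    """Size-capped store for LazyFrames (sketch, not the final design)."""

    def __init__(self, max_items: int = 8):
        self._max_items = max_items
        self._items: dict[str, pl.LazyFrame] = {}

    def put(self, key: str, lf: pl.LazyFrame) -> None:
        # LazyFrames are cheap to hold: they are query plans, not data.
        if len(self._items) >= self._max_items:
            # dicts preserve insertion order, so this evicts the oldest (FIFO)
            self._items.pop(next(iter(self._items)))
        self._items[key] = lf

    def get(self, key: str) -> Optional[pl.LazyFrame]:
        return self._items.get(key)
```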
Candidate Solution C
Use an existing Python cache library that meets our requirements.
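For example, cachetools provides bounded mapping-style caches out of the box; a minimal sketch (the key and path below are illustrative):

```python
import polars as pl
from cachetools import LRUCache

# LRUCache evicts the least-recently-used entry once maxsize is hit.
cache: LRUCache = LRUCache(maxsize=8)

# Store a low-memory LazyFrame, not a materialized dataframe.
cache["pdr_predictions"] = pl.scan_parquet("lake/pdr_predictions/*.parquet")

lf = cache.get("pdr_predictions")  # None if evicted or never cached
if lf is not None:
    df = lf.collect()
```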
Candidate Solution D
A mixture of A, B, and C.
DoD
Table has been updated and the self.df attribute has been removed.
The rest of the code that relies on Table queries the table directly to get the results it needs.
All code that's expected to mutate the Table/Database properly appends, upserts, updates, or performs CRUD operations, rather than overwriting table.df and expecting the database to change.
Decide whether caching will be kept as-is or needs to be re-implemented, and write a separate ticket for it; do not handle caching directly in this ticket.