
Migrate db package to DuckDB-based SQL backend #19

Open
29 of 30 tasks
lorenzo-w opened this issue Nov 24, 2023 · 0 comments
lorenzo-w commented Nov 24, 2023

  • Schema should be a class entirely separate from Table, and database typing should be based on sets/dicts of Table types instead of Schemas
  • Table should be renamed to Record and Col to Attr
  • Change the >> operator to Record.rel() to make expressions more friendly and reusable
  • DBCol will be combined into DB
  • DB carries in its type args a backend literal + info on the current record selection type as a type union
  • When constructing a new DB object directly, Record types are only supplied for validation or substitution purposes and the Record selection types are set to Any.
  • DB pulls its index and value type info from its record type args.
  • A column can be represented as a DB with a blank Record type, which has no attributes, just an index and a value.
  • Selecting dynamic attrs can be done via the catch-all object a, which can be imported: db[a.any_name]
  • When selecting via .rel() operator, the DB Record type is always set to the tip of the path
  • Selection can be done arbitrarily deep into the relational tree, starting from the current selection
  • Filtering can only be done on attributes of the currently selected, single Record type for now (based on SQLAlchemy expressions)
  • Assignment via setitem can be done with a dataframe, a list of records, or a dict (or list of dicts)
  • Typing solution for now: Do not check arg types while selecting or filtering. It just cannot be done with nested attribute access. Only set the appropriate selection type arg on the result and use that to type the data download + type-check uploaded data.
  • Merging solution: Use Record.rel(Record | Record.fk) for filtering, selection, and inclusion of additional relations as extra tuple items, dict items, or dataframes when downloading data. SQL merging is done behind the scenes.
  • to_df(): operates on the current selection, takes a tuple of relational paths resolving to Record types, which must all originate from the current selection
  • to_records(): same as to_df(), but defined rels are loaded automatically and need not be listed explicitly
  • all export/download functions also take sorting and pagination arguments. Actually, if we drop the typing of downloaded dicts (which would probably not work anyway), Record instances can simply get a getitem/setitem, populated with implicit relations downloaded on demand; then there would be no need for an extra to_dict function. Taking this further, to_df can be eliminated as well by making dataframe output a keyword option instead of returning Record lists. The single remaining function can then just be called load().
  • load() on the entire, unselected database should return a paginated list of records by default
  • extract(): Operates on the current selection. By default, extracts the data in the current selection plus any related records (including implicit relations, e.g. backrefs) into an overlay. Allows for dynamic, directional aggregation. Not defining the aggregation computations declaratively in the Record class makes sense especially if you transfer the data after extraction and intend to keep only the result; hence you need to type the transferred records with a static class, not a computed one. To define dynamic, directional aggregations, simply supply a mapping of relational paths to a dynamic aggregation definition, which is itself a mapping of attributes of a Record class to scalar- or vector-valued expressions. Note that aggregation is only supported for implicit rels. extract() may also be configured not to resolve any implicit rels by default, with explicit opt-in to some aggregated or full implicit rels via the supplied mapping.
  • union (via "|") is a well-defined operation, based on the union of all records in two database selections. As a result of unioning the records, the type of the result will also be a union of the operand types, meaning the attribute sets are unioned as well
  • When selecting through a single-record relation, the resulting DB has the target record as type, but the index remains in its current state.
  • To get to a state indexed by the target record's default index, use DB.reindex() without args
  • When accessing attributes typed as Iterable[Record], the default index of the target Record class is automatically appended to the current indexing tuple
  • When accessing attributes typed as Mapping[Record], the mapping's key type is automatically appended to the current indexing tuple
  • setitem on DataBase can be used to replace entire, top-level datasets
  • setitem on DataSet can only be used with direct props for now, in which case it simply accepts any value of compatible type, which is broadcastable to the selection
  • setitem also works with index values / lists / slices. In that case the accepted value type is the Record type of the selection
  • setitem does not support arbitrary filters
  • write and pass unit tests for all new code
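
To make the selection semantics above concrete, here is a minimal, hypothetical sketch of the proposed `Record.rel()` path-building API. The names `Attr`, `RelPath`, and the `steps`/`tip` members are my own stand-ins, not part of the package; the point is only that each `.rel()` call appends one step and the tip of the path determines the resulting Record type.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """Stand-in for the proposed Attr (formerly Col) descriptor."""
    name: str


@dataclass(frozen=True)
class RelPath:
    """Stand-in for the result of Record.rel(): an immutable relational path."""
    steps: tuple[str, ...]

    def rel(self, attr: Attr) -> "RelPath":
        # Selection can go arbitrarily deep: each .rel() appends one step.
        return RelPath(self.steps + (attr.name,))

    @property
    def tip(self) -> str:
        # The DB Record type is always set to the tip of the path.
        return self.steps[-1]


class Record:
    @classmethod
    def rel(cls, attr: Attr) -> RelPath:
        return RelPath((cls.__name__, attr.name))


class Person(Record):
    friends = Attr("friends")


path = Person.rel(Person.friends).rel(Attr("address"))
print(path.steps)  # ('Person', 'friends', 'address')
print(path.tip)    # 'address'
```

Making the path an immutable value object (rather than overloading `>>`) is what makes these expressions reusable: the same `RelPath` can be passed to selection, filtering, and download functions.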
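
The single-`load()` idea (sorting and pagination arguments, with dataframe output as a keyword option instead of a separate `to_df`) could look roughly like this toy, in-memory version. The signature (`sort_by`, `offset`, `limit`, `as_table`) is illustrative only, and a plain dict-of-lists stands in for a real dataframe.

```python
def load(records, *, sort_by=None, offset=0, limit=None, as_table=False):
    """Toy sketch of a unified load(): paginated record dicts by default,
    columnar (dataframe-like) output when as_table=True."""
    rows = sorted(records, key=lambda r: r[sort_by]) if sort_by else list(records)
    rows = rows[offset:(offset + limit) if limit is not None else None]
    if as_table:
        # Columnar form, standing in for a real dataframe.
        return {k: [r[k] for r in rows] for k in rows[0]} if rows else {}
    return rows


data = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}, {"id": 3, "name": "c"}]
print(load(data, sort_by="id", limit=2))
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
print(load(data, sort_by="id", as_table=True))
# {'id': [1, 2, 3], 'name': ['a', 'b', 'c']}
```

Calling `load()` on the entire, unselected database would then default to `offset=0` with some finite `limit`, giving the paginated record list described above.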
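
Finally, the `"|"` union semantics (union of records implies union of attribute sets) can be sketched with a toy `Selection` class; the class name and the choice to fill missing attributes with `None` are assumptions for illustration.

```python
class Selection:
    """Toy stand-in for a DB selection: a bag of record dicts plus
    the set of attributes present across them."""

    def __init__(self, records):
        self.records = list(records)
        self.attrs = (
            set().union(*(r.keys() for r in self.records)) if self.records else set()
        )

    def __or__(self, other):
        # Union of records => union of the operands' attribute sets.
        attrs = self.attrs | other.attrs
        merged = [{a: r.get(a) for a in attrs} for r in self.records + other.records]
        return Selection(merged)


left = Selection([{"id": 1, "name": "a"}])
right = Selection([{"id": 2, "email": "b@x"}])
both = left | right
print(sorted(both.attrs))        # ['email', 'id', 'name']
print(both.records[0]["email"])  # None
```

In the real SQL backend this would presumably compile to a padded `UNION ALL` rather than materializing dicts, but the typing consequence is the same: the result type is a union of the operand Record types.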
@lorenzo-w lorenzo-w changed the title Generalize DFDB to other backends (SQL etc.) Migrate db package to DuckDB-based SQL backend Jan 26, 2024
@lorenzo-w lorenzo-w added the 1d label Jan 26, 2024