
Migrate db package to DuckDB-based SQL backend #19

Open
29 of 30 tasks
lorenzo-w opened this issue Nov 24, 2023 · 0 comments
lorenzo-w commented Nov 24, 2023

  • Schema should be a class entirely separate from Table, and database typing should be based on sets/dicts of Table types instead of Schemas
  • Table should be renamed to Record and Col to Attr
  • Change the >> operator to Record.rel() to make expressions more friendly and reusable
  • DBCol will be combined into DB
  • DB carries in its type args a backend literal + info on the current record selection type as a type union
  • When constructing a new DB object directly, Record types are only supplied for validation or substitution purposes and the Record selection types are set to Any.
  • DB pulls its index and value type info from its record type args.
  • A column can be represented as a DB with a blank Record type, which has no attributes, just an index and a value.
  • Selecting dynamic attrs can be done via the catch-all object a, which can be imported: db[a.any_name]
  • When selecting via .rel() operator, the DB Record type is always set to the tip of the path
  • Selection can be done arbitrarily deep into the relational tree, starting from the current selection
  • Filtering can only be done on attributes of the currently selected, single Record type for now (based on SQLAlchemy expressions)
  • Assignment via setitem can be done with a dataframe, a list of records, or a dict (or list of dicts)
  • Typing solution for now: Do not check arg types while selecting or filtering. It just cannot be done with nested attribute access. Only set the appropriate selection type arg on the result and use that to type the data download + type-check uploaded data.
  • Merging solution: Use Record.rel(Record | Record.fk) for filtering, selection, and inclusion of additional relations as extra tuple items, dict items, or dataframes when downloading data. SQL merging is done behind the scenes.
  • to_df(): operates on the current selection, takes a tuple of relational paths resolving to Record types, which must all originate from the current selection
  • to_records(): same as to_df(), but defined rels are loaded automatically and need not be listed explicitly
  • all export/download functions also take sorting and pagination arguments. Actually, if we drop the typing of downloaded dicts (which would probably not work anyway), Record instances can simply get a getitem/setitem, populated with implicit relations downloaded on demand; then there would be no need for an extra to_dict function. Taking this further, to_df can be eliminated as well by making dataframe output a keyword option instead of returning Record lists. The single remaining function can then just be called load().
  • load() on the entire, unselected database should return a paginated list of records by default
  • extract(): Operates on the current selection. By default, extracts the data in the current selection plus any related records (including implicit relations, e.g. backrefs) into an overlay. Allows for dynamic, directional aggregation. Not defining the aggregation computations declaratively in the Record class makes sense especially if you transfer the data after extraction and intend to keep only the result; hence you need to type the transferred records with a static class, not a computed one. To define dynamic, directional aggregations, simply supply a mapping of relational paths to a dynamic aggregation definition, which is itself a mapping of attributes of a Record class to scalar- or vector-valued expressions. Note that aggregation is only supported for implicit rels. extract() may also be configured not to resolve any implicit rels by default, with explicit opt-in to some aggregated or full implicit rels via the supplied mapping.
  • union (via "|") is a well-defined operation, based on the union of all records in two database selections. As a result of unioning the records, the type of the result will also be a union of the operand types, meaning the attribute sets are unioned as well
  • When selecting through a single-record relation, the resulting DB has the target record as type, but the index remains in its current state.
  • To get to a state indexed by the target record's default index, use DB.reindex() without args
  • When accessing attributes typed as Iterable[Record], the default index of the target Record class is automatically appended to the current indexing tuple
  • When accessing attributes typed as Mapping[Record], the mapping's key type is automatically appended to the current indexing tuple
  • setitem on DataBase can be used to replace entire, top-level datasets
  • setitem on DataSet can only be used with direct props for now, in which case it simply accepts any value of compatible type, which is broadcastable to the selection
  • setitem also works with index values / lists / slices. In that case the accepted value type is the Record type of the selection
  • setitem does not support arbitrary filters
  • write and pass unit tests for all new code
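
To make the selection semantics above concrete, here is a minimal, hypothetical sketch of the proposed `Record.rel()` path-building API. The names `Attr`, `RelPath`, and the `steps`/`tip` members are my own stand-ins, not part of the package; the point is only that each `.rel()` call appends one step and the tip of the path determines the resulting Record type.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    """Stand-in for the proposed Attr (formerly Col) descriptor."""
    name: str


@dataclass(frozen=True)
class RelPath:
    """Stand-in for the result of Record.rel(): an immutable relational path."""
    steps: tuple[str, ...]

    def rel(self, attr: Attr) -> "RelPath":
        # Selection can go arbitrarily deep: each .rel() appends one step.
        return RelPath(self.steps + (attr.name,))

    @property
    def tip(self) -> str:
        # The DB Record type is always set to the tip of the path.
        return self.steps[-1]


class Record:
    @classmethod
    def rel(cls, attr: Attr) -> RelPath:
        return RelPath((cls.__name__, attr.name))


class Person(Record):
    friends = Attr("friends")


path = Person.rel(Person.friends).rel(Attr("address"))
print(path.steps)  # ('Person', 'friends', 'address')
print(path.tip)    # 'address'
```

Making the path an immutable value object (rather than overloading `>>`) is what makes these expressions reusable: the same `RelPath` can be passed to selection, filtering, and download functions.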
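
The single-`load()` idea (sorting and pagination arguments, with dataframe output as a keyword option instead of a separate `to_df`) could look roughly like this toy, in-memory version. The signature (`sort_by`, `offset`, `limit`, `as_table`) is illustrative only, and a plain dict-of-lists stands in for a real dataframe.

```python
def load(records, *, sort_by=None, offset=0, limit=None, as_table=False):
    """Toy sketch of a unified load(): paginated record dicts by default,
    columnar (dataframe-like) output when as_table=True."""
    rows = sorted(records, key=lambda r: r[sort_by]) if sort_by else list(records)
    rows = rows[offset:(offset + limit) if limit is not None else None]
    if as_table:
        # Columnar form, standing in for a real dataframe.
        return {k: [r[k] for r in rows] for k in rows[0]} if rows else {}
    return rows


data = [{"id": 2, "name": "b"}, {"id": 1, "name": "a"}, {"id": 3, "name": "c"}]
print(load(data, sort_by="id", limit=2))
# [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
print(load(data, sort_by="id", as_table=True))
# {'id': [1, 2, 3], 'name': ['a', 'b', 'c']}
```

Calling `load()` on the entire, unselected database would then default to `offset=0` with some finite `limit`, giving the paginated record list described above.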
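
Finally, the `"|"` union semantics (union of records implies union of attribute sets) can be sketched with a toy `Selection` class; the class name and the choice to fill missing attributes with `None` are assumptions for illustration.

```python
class Selection:
    """Toy stand-in for a DB selection: a bag of record dicts plus
    the set of attributes present across them."""

    def __init__(self, records):
        self.records = list(records)
        self.attrs = (
            set().union(*(r.keys() for r in self.records)) if self.records else set()
        )

    def __or__(self, other):
        # Union of records => union of the operands' attribute sets.
        attrs = self.attrs | other.attrs
        merged = [{a: r.get(a) for a in attrs} for r in self.records + other.records]
        return Selection(merged)


left = Selection([{"id": 1, "name": "a"}])
right = Selection([{"id": 2, "email": "b@x"}])
both = left | right
print(sorted(both.attrs))        # ['email', 'id', 'name']
print(both.records[0]["email"])  # None
```

In the real SQL backend this would presumably compile to a padded `UNION ALL` rather than materializing dicts, but the typing consequence is the same: the result type is a union of the operand Record types.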
@lorenzo-w lorenzo-w changed the title Generalize DFDB to other backends (SQL etc.) Migrate db package to DuckDB-based SQL backend Jan 26, 2024
@lorenzo-w lorenzo-w added the 1d label Jan 26, 2024