[Lake][ETL] DuckDB E2E - Ingestion -> Dashboards #685

Closed
37 of 50 tasks
idiom-bytes opened this issue Feb 27, 2024 · 3 comments

@idiom-bytes
Member

idiom-bytes commented Feb 27, 2024

Motivation

We completed the tech spikes around our data infrastructure. As an outcome, we're going to implement our ETL pipeline in DuckDB so that we keep many of our existing constraints:

  • We can test everything end-to-end (pytest, duckdb, plots)
  • We can reduce requirements on servers/infra/ops by using an in-memory, embedded, on-disk db (duckdb)
  • We can continue doing distributed computing (ray)
  • The DB store can grow elastically w/ lvm, efs, filestore

Outline

Our first goal is to take the current ETL workflow and update it end-to-end.

Screenshot from 2024-02-29 13-51-46

Shelved Deliverables

CLOSED TICKET - Add ETL checkpoint to enforce SLAs, and process data incrementally. #694

Reason: We're going to instead implement a build step that leverages a simple SQL strategy w/ temp tables, such that we can enforce SLAs in a clean manner.
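
A minimal sketch of what such a build step could look like, assuming a staging temp table that is checked before anything reaches the live table. The table names (`build_predictions`, `live_predictions`), the columns, and the "all rows fall inside the build window" check are illustrative assumptions, not the project's actual schema or SLA rules. The `sqlite3` fallback is only there so the sketch runs without DuckDB installed; the SQL used is common to both engines.

```python
try:
    import duckdb  # the engine this issue targets

    def connect():
        return duckdb.connect(":memory:")
except ImportError:
    import sqlite3  # stdlib fallback so the sketch still runs without DuckDB

    def connect():
        # isolation_level=None gives autocommit, similar to DuckDB's default
        return sqlite3.connect(":memory:", isolation_level=None)


def build_and_promote(con, rows, st_ts, end_ts):
    """Stage rows in a temp table, check them, then append to the live table.

    Hypothetical helper: names and the window check are assumptions.
    """
    con.execute(
        "CREATE TEMP TABLE build_predictions (id VARCHAR, ts BIGINT, value DOUBLE)"
    )
    try:
        if rows:
            con.executemany("INSERT INTO build_predictions VALUES (?, ?, ?)", rows)
        # Example SLA-style check: every staged row must sit in [st_ts, end_ts)
        out_of_window = con.execute(
            "SELECT COUNT(*) FROM build_predictions WHERE ts < ? OR ts >= ?",
            [st_ts, end_ts],
        ).fetchone()[0]
        if out_of_window:
            raise ValueError(f"{out_of_window} staged rows fall outside the build window")
        # A single INSERT ... SELECT is atomic: the live table gets all rows or none
        con.execute(
            "INSERT INTO live_predictions SELECT id, ts, value FROM build_predictions"
        )
    finally:
        # The temp table is always cleaned up, so a failed build can be rerun
        con.execute("DROP TABLE build_predictions")
```

The live table is only touched after the staged batch passes its checks, so a failed build leaves it unchanged.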

DoD

[First Deliverable - Update Ingestion + Load]

  • Update GQL local/save-to-disk to use csv #681
  • Fetch + insert new records into db #682

[Core System Updates]

  • Improve ETL by using SQL build strategy & temp tables to enforce SLAs #771

[Update ETL Deliverables]

  • DuckDB - Update existing logic that queries data from parquet using polars, to query_data from PersistentDataStore using SQL.
  • DuckDB - Provide unifying etl_ view for ETL build steps. Downstream bronze and silver tables require data from both live_ and build_ tables. #810
  • DuckDB - Update ETL "bronze step" to use duckdb/sql #738
  • DuckDB - Clean up any final queries that fetch the whole database system. No select * from table_name anywhere. #809
  • DuckDB - Create logic to support an etl_view that joins both live + temp tables, making it easier to query data as it's being built #810
  • DuckDB - Port latest ETL "bronze slots" to use duckdb/sql #740
  • DuckDB - Extend test coverage to simulate multiple runs, and cover more edge cases. #811
  • DuckDB - Abstract [Prod->Temp->View] pattern #881
  • Accuracy - Update accuracy endpoint to use the ETL to get and store the data. #854
  • Streamlit - Integrate predictoor_income work and dashboards #612
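
As a rough illustration of the `etl_` view item (#810) and the "no select * from table_name" item (#809), the sketch below unions a live table and a build table behind one view and queries it with explicit columns and a time window. The `live_`/`build_`/`etl_` prefixes come from the list above; the column names, the helper names, and the use of regular tables for the build side are assumptions made for the example. The `sqlite3` fallback only exists so the sketch runs without DuckDB installed.

```python
try:
    import duckdb  # the engine this issue targets

    def connect():
        return duckdb.connect(":memory:")
except ImportError:
    import sqlite3  # stdlib fallback so the sketch still runs without DuckDB

    def connect():
        return sqlite3.connect(":memory:", isolation_level=None)


# Hypothetical schema, named explicitly instead of using SELECT *
COLUMNS = "id, ts, value"


def create_etl_view(con, name):
    """(Re)create etl_<name> as the union of live_<name> and build_<name>.

    `name` is assumed to be a trusted internal identifier, not user input.
    """
    con.execute(f"DROP VIEW IF EXISTS etl_{name}")
    con.execute(
        f"CREATE VIEW etl_{name} AS "
        f"SELECT {COLUMNS} FROM live_{name} "
        f"UNION ALL SELECT {COLUMNS} FROM build_{name}"
    )


def query_window(con, name, st_ts, end_ts):
    """Fetch only the rows in [st_ts, end_ts) rather than the whole table."""
    return con.execute(
        f"SELECT {COLUMNS} FROM etl_{name} WHERE ts >= ? AND ts < ? ORDER BY ts",
        [st_ts, end_ts],
    ).fetchall()
```

Downstream bronze and silver steps could then read from `etl_<name>` without caring whether a row has already been promoted.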

[ETL CLI Deliverables]

  • Update ETL CLI and provide a clean interface that helps enforce SLAs. #703
  • Create pdr analytics describe, query, validate, resume CLI command #883

[Cleanup Deliverables]

  • Cleanup DuckDB implementation, remove hanging dependencies, remove table.df caching, switching from parquet to CSV, cleanup tests. #737
  • Clean truevalue and predvalue #664
  • Improve CSVDataStore and PersistentDataStore instantiation inside Table #773
  • Cleanup Tables, query them directly through CSVDataStore + PersistentDataStore #772
  • Remove reference to parquet_dir use lake_dir instead #770
  • PDS thread tests are failing, although CI/CD passes. #941
  • OHLCV + CSVDS has been reconciled #858
  • Lake cli command to update is implemented #953
  • Fix issue where CSV data exists, but raw table won't be recreated #1038
  • Fix pdr-slots and query logic that's causing the pipeline to break #1036
  • Time functions are returning local timestamp and breaking the pipeline #1070
  • Use max(values) to get the max timestamp across all duckdb tables #1049
  • Expand validation tool to check/report duplicate rows #1058
  • Update GQLDF to use the same end_ts for all subgraph queries. #1068
  • Stabilize Lake Validation & CLI Commands #1036
  • Fix drop logic to use the proper timestamp, and not drop the whole table #1079
  • Verify lake behavior is working as expected #1000
  • DuckDB - Use bronze-predictions for checkpoint #982
  • Expand ETL coverage to have multiple days of data (1mb files) and multiple operations. Enforce a benchmark performance against this.
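
Three of the cleanup items above (#1049, #1058, #1079) reduce to small SQL queries. The sketch below shows one possible shape for each, assuming every table has an `id` column and an integer `ts` column; those column names and the helper names are hypothetical. The `sqlite3` fallback only exists so the sketch runs without DuckDB installed.

```python
try:
    import duckdb  # the engine this issue targets

    def connect():
        return duckdb.connect(":memory:")
except ImportError:
    import sqlite3  # stdlib fallback so the sketch still runs without DuckDB

    def connect():
        return sqlite3.connect(":memory:", isolation_level=None)


def max_timestamp(con, tables):
    """Largest ts across all given tables, or None if they are all empty."""
    per_table = " UNION ALL ".join(f"SELECT MAX(ts) AS ts FROM {t}" for t in tables)
    return con.execute(f"SELECT MAX(ts) FROM ({per_table}) AS per_table").fetchone()[0]


def find_duplicate_ids(con, table):
    """Return (id, count) pairs for ids that appear more than once."""
    return con.execute(
        f"SELECT id, COUNT(*) FROM {table} GROUP BY id HAVING COUNT(*) > 1 ORDER BY id"
    ).fetchall()


def drop_records_from(con, table, ts):
    """Delete only rows at or after ts, leaving earlier rows in place."""
    con.execute(f"DELETE FROM {table} WHERE ts >= ?", [ts])
```

Table names are interpolated directly, so they are assumed to be trusted internal identifiers.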

[Ratchet Integration]

[Post-DuckDB Merge - Core Functionality]

  • Verify incremental table updates are working as expected #1001
  • Readme for working with the lake is documented, and the UX works well end-to-end #1002
  • DuckDB - Implement "silver predictions" #665
  • DuckDB - Port latest ETL "silver predictions" to use duckdb/sql + close old PR #741
  • DuckDB - Silver predictions SQL PR #848

[Post-DuckDB - Peripheral Functionality]
These are frozen. Do not start/complete until DuckDB review/work is complete.

  • Update OHLCV Data Factory to use DataStores #769
  • TableRegistry looks redundant now. Deprecate it. #1088
  • PredictoorETL is handling st_ts and end_ts correctly #1086
  • Re-enable slots and subscription raw tables.
  • DuckDB - Re-enable subscription table #1085
  • DuckDB - Re-enable bronze_slots tables - #595
  • ETL - Cleanup payout, truevals, and revenue calculations - #1183
  • Data Store Objects - Please rename functions to use sql nomenclature: fill becomes insert, override becomes upsert
  • OHLCV + CSVDS will be updated after DuckDB has been updated #769
  • If you delete CSVs and rebuild RAW tables, you get duplicate data #1087
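
To make the proposed rename concrete (fill becomes insert, override becomes upsert), here is a small sketch of the two operations against a keyed table. The table name `records`, its columns, and the function signatures are assumptions for illustration, not the actual data store API. Keying on `id` is also one possible way to avoid duplicate rows when the same data is replayed, as in #1087. The `sqlite3` fallback only exists so the sketch runs without DuckDB installed.

```python
try:
    import duckdb  # the engine this issue targets

    def connect():
        return duckdb.connect(":memory:")
except ImportError:
    import sqlite3  # stdlib fallback so the sketch still runs without DuckDB

    def connect():
        return sqlite3.connect(":memory:", isolation_level=None)


def insert(con, rows):
    """Plain append (formerly "fill"); an existing id violates the primary key."""
    con.executemany("INSERT INTO records VALUES (?, ?)", rows)


def upsert(con, rows):
    """Insert new ids and overwrite existing ones (formerly "override")."""
    con.executemany("INSERT OR REPLACE INTO records VALUES (?, ?)", rows)
```

For example, with `CREATE TABLE records (id VARCHAR PRIMARY KEY, value DOUBLE)`, calling `upsert` twice with the same rows leaves one row per id.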
@KatunaNorbert
Member

Discovered some issues related to data fetching on the main branch. Because many things are being rewritten in the ETL flow, I'm leaving them here in case they got solved along the way; if not, maybe this is the place to solve them:

  1. Columns don't match: "timestamp" and "tx_id" - solved if the file is deleted

Screenshot 2024-04-02 at 10.05.49.png

  2. Error on saving data - solved if the gql command is rerun

Screenshot 2024-04-02 at 10.12.45.png

@idiom-bytes
Member Author

idiom-bytes commented May 30, 2024

Simply integrate the accuracy code when it's ready to go...

Do not modify the lake when using old code... the idea was to just use the old code, which is what we'll be doing...
#1102
#1103

Please do not develop these further.

@idiom-bytes
Member Author

Closing this ticket and moving the remaining items to another ticket so I can reconcile outstanding issues.

3 participants