Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Lake][ETL] Improve ETL step such that live + build tables have a unified view to query from #810

Closed
6 tasks done
idiom-bytes opened this issue Mar 18, 2024 · 0 comments · Fixed by #813
Closed
6 tasks done
Assignees
Labels
Type: Enhancement New feature or request

Comments

@idiom-bytes
Copy link
Member

idiom-bytes commented Mar 18, 2024

Background / motivation

As laid out in PR #794, we can further improve the compute query by:

  1. Creating a view that unions the live_table_name w/ build_table_name such that queries are able to process across datasets without iterim code

Problem

The issue right now is that all data in ETL is assumed to be in the build table

pdr_predictions_table_name = get_table_name("pdr_predictions", True)
pdr_truevals_table_name = get_table_name("pdr_truevals", True)
pdr_payouts_table_name = get_table_name("pdr_payouts", True)

Whereas in reality, some data might already be in live_tables, and needs to be processed against build_tables.

live_pdr_predictions_table_name = get_table_name("pdr_predictions")
live_pdr_truevals_table_name = get_table_name("pdr_truevals")
live_pdr_payouts_table_name = get_table_name("pdr_payouts")

build_pdr_predictions_table_name = get_table_name("pdr_predictions", True)
build_pdr_truevals_table_name = get_table_name("pdr_truevals", True)
build_pdr_payouts_table_name = get_table_name("pdr_payouts", True)

PersistentDataStore(path).query_data(...)

Approach

What we want to do, is to create a view that joins both live_ and build_ tables, such that you can do a single query across them. Here is how to approach this.

pseudocode 1 - Create view that unions both tables so all data sits together. Views should not take more storage/mem, it's "an alias for two different tables"

create view etl_bronze_pdr_predictions (
select *
from bronze_pdr_predictions
union etl_bronze_pdr_predictions
);

pseudocode 2 - query across both tables through a single view

SELECT MAX(timestamp) as max_timestamp FROM etl_bronze_pdr_predictions

DoD

  • Add view commands such that we can verify if they exist
  • TDD - Validate that views simplify querying across multiple tables.
  • Cleanup tables vs. view commands such that they are explicit
  • Use _etl prefix to provide unifying view for both live_ and _build tables
  • Update ETL/build step to query from _etl such that we can process all data together under a single query.
  • At the end of the ETL step, insert the data from _build into live_ and drop the _etl view
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Enhancement New feature or request
Projects
None yet
1 participant