[Lake][Config] PredictoorETL is handling st_ts and end_ts dates correctly. #1086

idiom-bytes · 2024-05-23T23:33:15Z

Background / motivation

This originates from scoping down & stabilizing the DuckDB/Raw/ETL data pipeline.
#1077

The problem exists because lake_ss.st_ts and lake_ss.end_ts are continually used as a time window, to match how OHLCV is typically consumed (with a lookback).

Example Time Window
st_ts: "1 month ago"
end_ts: "now"

If we want to support a relative date in ppss.yaml (rather than a fixed date for st_ts), then I propose we have different lake_ss strategies, so that we can separate the concern of the lake (to grow as big as possible) from the subsystems that consume/process/build off this data.

Challenge / Problem

The lake (and likely OHLCV) benefits from being "greedy" and always growing to obtain as much data as possible. So it's silly, for example, to have a time window of 1 month that would delete old data.

What would be best would be to separate these concerns into Lake<->System

Lake
|- Pond
|- Pond

[Lake]
Is always greedy, trying to grow as much as possible.

It should start from a fixed date (e.g. 01-01-2023)
It should end on the latest possible date (e.g. now)

[Pond]
Is a filter of Lake, trying to process whatever data from lake it's responsible for.

It could start from a relative date (e.g. 1 month ago)
It could end on a relative date (e.g. 1 day ago)

We generally do not want a lake with a moving tail.
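A minimal sketch of the Lake/Pond split described above, assuming the lake is just a list of (timestamp, value) rows. The names `pond_view`, `start_ago` and `end_ago` are hypothetical and not part of the codebase, and "1 month" is approximated as 30 days. The point is only that the pond is a read-only filter, so the lake's tail never moves.

```python
from datetime import datetime, timedelta, timezone


def pond_view(lake_rows, now, start_ago, end_ago):
    """Return the lake rows inside [now - start_ago, now - end_ago].

    The lake itself is never modified: the pond is only a filtered view.
    """
    lo = now - start_ago
    hi = now - end_ago
    return [row for row in lake_rows if lo <= row[0] <= hi]


# Hypothetical lake: one row per day, growing from a fixed start date.
lake_start = datetime(2023, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 5, 23, tzinfo=timezone.utc)
lake = []
t = lake_start
while t <= now:
    lake.append((t, 0.0))
    t += timedelta(days=1)

# Pond: relative start ("1 month ago", approximated as 30 days) and
# relative end ("1 day ago").
pond = pond_view(lake, now, start_ago=timedelta(days=30), end_ago=timedelta(days=1))
```

The lake keeps every row since its fixed start date; only the pond's window moves as `now` advances.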

Proposal A - Broad Lake - Narrow Pond - Improving ppss.yaml

Update lake_ss to be greedy.

No start relative dates
Fixed start dates
Relative end_dates are ok

Let model_ss use filtering

OHLCV/Model AI can have a relative start date
This data can be sampled from the larger lake

Example:

lake_ss:
  st_ts: "01-01-2023"
  end_ts: "now"
pdr_etl_ss:
  st_ts: "01-01-2023"
  end_ts: "now"
ai_model_ss:
  st_ts: "last 1 month"
  end_ts: "now"
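The rules in Proposal A could be checked at config-load time. The sketch below is hypothetical: `check_proposal_a` and `is_absolute` are illustrative names, the config is a plain dict mirroring the example above, and the absolute format is assumed to be day-month-year (the example date "01-01-2023" does not disambiguate). Anything that does not parse as an absolute date is treated as relative.

```python
from datetime import datetime

ABS_FORMAT = "%d-%m-%Y"  # assumption: day-month-year


def is_absolute(ts: str) -> bool:
    """True if ts parses as a fixed date; "now", "last 1 month", etc. do not."""
    try:
        datetime.strptime(ts, ABS_FORMAT)
        return True
    except ValueError:
        return False


def check_proposal_a(config: dict) -> list:
    """Return a list of rule violations (empty list means the config is fine)."""
    problems = []
    lake_start = config["lake_ss"]["st_ts"]
    if not is_absolute(lake_start):
        problems.append(f"lake_ss.st_ts must be a fixed date, got {lake_start!r}")
        return problems
    lake_dt = datetime.strptime(lake_start, ABS_FORMAT)
    # Consumers may use relative dates; a fixed start must not precede the lake.
    for name in ("pdr_etl_ss", "ai_model_ss"):
        section = config.get(name)
        if section and is_absolute(section["st_ts"]):
            if datetime.strptime(section["st_ts"], ABS_FORMAT) < lake_dt:
                problems.append(f"{name}.st_ts starts before lake_ss.st_ts")
    return problems


config = {
    "lake_ss": {"st_ts": "01-01-2023", "end_ts": "now"},
    "pdr_etl_ss": {"st_ts": "01-01-2023", "end_ts": "now"},
    "ai_model_ss": {"st_ts": "last 1 month", "end_ts": "now"},
}
problems = check_proposal_a(config)
```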

TODOs / DoD

Update ppss.yaml to support different lake/data strategies
lake_ss is responsible for owning/growing the lake
pdr_etl_ss and ai_model_ss then consume/use a subset of the lake

Tasks:

Update yaml to support pdr_etl vs. ai_model ss
Update lake to fill up based on top_level_ss rules

trentmc · 2024-05-24T04:57:54Z

If an absolute start date is given in the yaml file, and there are already OHLCV CSV files with earlier dates, the lake doesn't delete them. Why would it?

You can think of the start and end date values as saying "I want at least this data in the lake". They are instructions about what extra data, if any, to gather. They are explicitly NOT saying "I want only this data in the lake, therefore delete anything outside it".

This carries over to relative dates too.

Therefore you don't need to delete old data. That would be overkill and over-engineering. KISS.
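The "at least this data" reading can be sketched as a gap computation: given what is already stored and what the config asks for, return only the extra ranges to fetch and never anything to delete. This is an illustration, not the lake's actual logic; `ranges_to_fetch` is a hypothetical name, and it assumes the stored data is one contiguous range.

```python
from datetime import datetime, timezone


def ranges_to_fetch(have_start, have_end, want_start, want_end):
    """Return (start, end) ranges that are requested but not yet stored.

    Data outside the requested window is left alone: nothing is deleted.
    """
    if have_start is None or have_end is None:
        return [(want_start, want_end)]  # empty lake: fetch everything requested
    gaps = []
    if want_start < have_start:
        gaps.append((want_start, min(want_end, have_start)))
    if want_end > have_end:
        gaps.append((max(want_start, have_end), want_end))
    return gaps


def utc(y, m, d):
    return datetime(y, m, d, tzinfo=timezone.utc)


# Lake already holds 2023-01-01 .. 2024-05-01; config asks for the last month.
gaps = ranges_to_fetch(
    utc(2023, 1, 1), utc(2024, 5, 1),
    utc(2024, 4, 23), utc(2024, 5, 23),
)
```

Here only the tail from 2024-05-01 to 2024-05-23 is fetched; the rows from 2023 stay in place even though they fall outside the requested window.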

trentmc · 2024-05-24T04:58:37Z

(Therefore we should keep relative dates for start date too. It helps ux.)

KatunaNorbert · 2024-05-28T11:11:22Z

It makes sense to me, and it has also crossed my mind before to have different config values for different data types inside the lake. This would make fetching and storing more efficient and would also help with overall configuration: if you want to experiment with sim or predictions you update one specific configuration inside the lake, and if you want to play with analytics you change another parameter. That is much easier to keep track of than modifying a single start_ts.

Related to the more specific ETL + GQL issue, I think we can make some minor changes where "now" is read at the start of the fetching process and kept constant until the fetch has ended. After the first data fetch has gone through, GQL looks at the last saved data timestamp when fetching new values, so there shouldn't be issues.
I created an issue and a fix for the proposed solution here: #1095
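A minimal sketch of the "read now once" idea, assuming a tiny hypothetical parser that only understands "now" and "<N> days ago" (the real config may accept more forms). `resolve` and `freeze_window` are illustrative names, not functions from #1095 or #1106.

```python
from datetime import datetime, timedelta, timezone


def resolve(ts: str, now: datetime) -> datetime:
    """Resolve "now" or "<N> day(s) ago" against one fixed reference time."""
    if ts == "now":
        return now
    parts = ts.split()
    if len(parts) == 3 and parts[1] in ("day", "days") and parts[2] == "ago":
        return now - timedelta(days=int(parts[0]))
    raise ValueError(f"unsupported relative date: {ts!r}")


def freeze_window(st_ts: str, end_ts: str, now: datetime = None):
    """Read the clock once (in UTC) and resolve both bounds against it."""
    if now is None:
        now = datetime.now(timezone.utc)
    return resolve(st_ts, now), resolve(end_ts, now)


# Resolve once at the start of the run, then pass the fixed datetimes to
# every step so the window cannot drift while the fetch is in progress.
run_start, run_end = freeze_window("30 days ago", "now")
```

Because both bounds come from the same clock reading, the window width stays exactly 30 days no matter how long the run takes.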

idiom-bytes · 2024-05-31T00:52:23Z

thank you @KatunaNorbert! 👍
please review when possible

I have implemented the immediate solution I'm looking for here #1106 and closed #1095

this addresses the issue in one place, across all lake commands, so that GQL + ETL work as intended.

idiom-bytes · 2024-06-04T14:15:39Z

I have completed all the work required to address this for the ETL build. st_ts and end_ts are working as I expect and I believe I have a way to resolve this for my core requirements.

Dates are all showing up relative to UTC
Fuzzy dates are being temporarily cast to fixed dates while the pipeline runs
st_ts and end_ts look stable and working as expected

I'm closing this ticket as all dependencies have been addressed.

idiom-bytes added the "Type: Enhancement" label May 23, 2024

idiom-bytes changed the title ~~[Lake][Config] Lake is handling fixed and relative dates correctly.~~ [Lake][Config] PredictoorETL is handling st_ts and end_ts dates correctly. May 23, 2024

idiom-bytes mentioned this issue May 23, 2024

[Lake][ETL] DuckDB E2E - Ingestion -> Dashboards #685

Closed

50 tasks

idiom-bytes self-assigned this Jun 4, 2024

idiom-bytes closed this as completed Jun 4, 2024

idiom-bytes mentioned this issue Jun 25, 2024

[ETL] ETL & Analytics Backlog #1299

Open

35 tasks
