[Lake][Config] PredictoorETL is handling st_ts and end_ts dates correctly. #1086
Comments
If an absolute start date is given in the yaml file, and there are already ohlcv csv files with earlier dates, the lake doesn't delete them. Why would it? You can think of the start and end date values as saying "I want at least this data in the lake". They are instructions on what extra data, if any, to gather. They are explicitly NOT saying "I want only this data in the lake, therefore delete anything outside it". This carries over to relative dates too. Therefore you don't need to delete old data. That would be overkill and over-engineering. KISS.
(Therefore we should keep relative dates for start date too. It helps UX.)
This makes sense to me, and having different config values for different data types inside the lake has crossed my mind before. It would make fetching and storing more efficient and would also help with overall configuration: if you want to experiment with sim or predictions, you update one specific configuration inside the lake; if you want to play with analytics, you change another parameter. That is much easier to keep track of than modifying a single start_ts. Regarding the more specific ETL + GQL issue, I think we can make some minor changes where the
Thank you @KatunaNorbert! 👍 I have implemented the immediate solution I'm looking for in #1106 and closed #1095. This addresses the issue in one place, across all lake commands, as GQL + ETL expect in order to work correctly.
I have completed all the work required to address this for the ETL build. st_ts and end_ts are working as I expect, and I believe I have a way to resolve this for my core requirements.
I'm closing this ticket as all dependencies have been addressed.
Background / motivation
This originates from scoping down & stabilizing the DuckDB/Raw/ETL data pipeline.
#1077
The problem exists because lake_ss.st_ts and lake_ss.end_ts are consistently used as a time window, to match how OHLCV is typically used (with a lookback).
Example Time Window
st_ts: "1 month ago"
end_ts: "now"
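To make the moving-window behaviour concrete, here is a minimal Python sketch of how a relative value such as "1 month ago" could be resolved to an absolute time. The function name `resolve_ts`, the accepted formats, and the 30-day month are assumptions made for illustration; they are not taken from the project's code.

```python
import re
from datetime import datetime, timedelta, timezone

# Approximate unit lengths in days. Treating a month as 30 days and a
# year as 365 days is a simplifying assumption for this sketch.
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}


def resolve_ts(value: str, now: datetime) -> datetime:
    """Resolve 'now', 'N <unit>(s) ago', or 'YYYY-MM-DD' to a UTC datetime.

    Hypothetical helper: `now` is passed in explicitly so the result is
    reproducible and easy to test.
    """
    text = value.strip().lower()
    if text == "now":
        return now
    match = re.fullmatch(r"(\d+)\s+(day|week|month|year)s?\s+ago", text)
    if match:
        count, unit = int(match.group(1)), match.group(2)
        return now - timedelta(days=count * _UNIT_DAYS[unit])
    # Fall back to an absolute date, interpreted as midnight UTC.
    return datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)


# The same config value resolves to a different start each time it is
# evaluated, which is what makes the window's tail "move".
first_run = datetime(2024, 5, 31, tzinfo=timezone.utc)
second_run = datetime(2024, 6, 7, tzinfo=timezone.utc)
st_first = resolve_ts("1 month ago", first_run)
st_second = resolve_ts("1 month ago", second_run)
```

Because `st_second` is later than `st_first`, anything that treats st_ts as a hard lower bound would see a week of data fall out of the window between the two runs.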
If we want to support a relative date in ppss.yaml (rather than a fixed date for st_ts), then I propose we have different lake_ss strategies, so that we can separate the concern of the lake (to grow as big as possible) from the subsystems that consume, process, or build off this data.
Challenge / Problem
The lake (and likely OHLCV) benefits from being "greedy" and always growing to obtain as much data as possible. So it would be silly, for example, to have a Time Window of 1 month that deletes old data.
It would be best to separate these concerns into Lake<->System
Lake
|- Pond
|- Pond
[Lake]
Is always greedy, trying to grow as much as possible.
[Pond]
Is a filter of Lake, trying to process whatever data from lake it's responsible for.
We generally do not want a lake with a moving tail.
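The Lake/Pond split can be sketched in a few lines of Python. Everything here is hypothetical and only illustrates the idea: the class names `Lake` and `Pond`, the `ensure_covers` method, the `(timestamp, value)` row shape, and the `demo_fetch` stand-in are not the project's actual API. The lake only ever adds data for the ranges it is missing, and the pond is a read-only filter over it.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, float]  # (timestamp, value); hypothetical row shape


@dataclass
class Lake:
    """Greedy store: it grows to cover requested ranges and never deletes."""

    rows: Dict[int, float] = field(default_factory=dict)

    def ensure_covers(
        self, st_ts: int, end_ts: int, fetch: Callable[[int, int], List[Row]]
    ) -> None:
        """Fetch only what is missing so the lake holds at least [st_ts, end_ts]."""
        if not self.rows:
            missing = [(st_ts, end_ts)]
        else:
            lo, hi = min(self.rows), max(self.rows)
            missing = []
            if st_ts < lo:
                missing.append((st_ts, lo - 1))  # extend the head backwards
            if end_ts > hi:
                missing.append((hi + 1, end_ts))  # extend the tail forwards
        for a, b in missing:
            for ts, value in fetch(a, b):
                self.rows[ts] = value


@dataclass
class Pond:
    """Filter over a Lake: reads its own window, leaves the lake untouched."""

    lake: Lake
    st_ts: int
    end_ts: int

    def rows(self) -> List[Row]:
        return sorted(
            (ts, v) for ts, v in self.lake.rows.items()
            if self.st_ts <= ts <= self.end_ts
        )


def demo_fetch(a: int, b: int) -> List[Row]:
    """Stand-in for a real data source: one row per integer timestamp."""
    return [(ts, float(ts)) for ts in range(a, b + 1)]


lake = Lake()
lake.ensure_covers(10, 20, demo_fetch)  # initial fill
lake.ensure_covers(15, 25, demo_fetch)  # later start: nothing is deleted
pond = Pond(lake, st_ts=18, end_ts=22)  # a consumer's narrower window
```

With this split, moving a pond's window changes only what that consumer reads. The lake's earliest timestamp stays at 10 even though the second request started at 15.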
Proposal A - Broad Lake - Narrow Pond - Improving ppss.yaml
Update lake_ss to be greedy.
Let model_ss use filtering
Example:
TODOs / DoD
Tasks: