Background / motivation

Approach 0: each app writes to the lake in its own process. The predictoor bot updates the data lake, then uses the lake, in time to make its predictions. Same for the other apps in pdr-backend.
Approach 1: separate lake, started separately
The user starts a separate `pdr lake` process. It constantly writes to the data lake.
The predictoor bot reads from the lake but does not write (for safety). Same for the other apps.
This was the idea when we conceived of the lake.
But we can do better yet, leveraging the database concept of "locking", which enables >1 writers without hurting DB safety. Writers must handle contention due to locks, e.g. by waiting.
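To make the locking idea concrete, here's a minimal sketch using a simple file-based lock (the `filelock` package). `LAKE_LOCK_PATH` and `update_lake()` are hypothetical names for illustration, not existing pdr-backend code; a DB-level lock would play the same role.

```python
# Minimal sketch: one lock guards lake writes; contending writers wait and retry.
import time
from filelock import FileLock, Timeout

LAKE_LOCK_PATH = "lake_data/.lake.lock"  # hypothetical lock file next to the lake

def write_to_lake_safely(update_lake, max_tries=10):
    """Acquire the lake lock, then write. On contention, back off and retry."""
    lock = FileLock(LAKE_LOCK_PATH)
    for _ in range(max_tries):
        try:
            with lock.acquire(timeout=5):  # seconds
                update_lake()  # the actual lake write happens while holding the lock
                return
        except Timeout:
            time.sleep(1)  # another writer holds the lock; wait, then retry
    raise RuntimeError("could not acquire lake lock; too much contention")
```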
Approach 2: allow >1 writers.
We could have >1 writers: lake processes / threads, predictoor bots, or other apps.
Support three flows:
Flow 1: quickstart: start `pdr lake` inside the app. E.g. the user starts one `pdr predictoor` process (and nothing else). The predictoor bot will detect whether a lake process is already running, and start one if needed (see the sketch after this list).
Flow 2: power-predictoor usage: start `pdr lake` separately. E.g. a user starts `pdr lake`, then 20 `pdr predictoor` processes, one for each feed to predict.
Flow 3: power-lake usage: >1 lake processes / threads filling complementary parts of the lake (different pairs, different subgraph queries). E.g. a user starts 1 `pdr lake` process, and it starts >1 threads. E.g. a user starts 1 process, then later another one with different goals. E.g. >1 users start different processes.
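A rough sketch of the Flow 1 detection logic, assuming a pidfile convention. The pidfile path and the exact `pdr lake` CLI arguments are assumptions for illustration, not current pdr-backend behavior.

```python
# Sketch: the predictoor bot checks for a running lake process, starts one if needed.
import os
import subprocess

LAKE_PIDFILE = "lake_data/.lake.pid"  # hypothetical pidfile written by `pdr lake`

def lake_is_running() -> bool:
    if not os.path.exists(LAKE_PIDFILE):
        return False
    pid = int(open(LAKE_PIDFILE).read().strip())
    try:
        os.kill(pid, 0)  # signal 0 = "does this pid exist?"; it doesn't kill anything
        return True
    except OSError:
        return False  # stale pidfile, treat as not running

def ensure_lake_running():
    if lake_is_running():
        return
    # start a background lake process; these args are illustrative only
    proc = subprocess.Popen(["pdr", "lake", "ppss.yaml", "sapphire-mainnet"])
    with open(LAKE_PIDFILE, "w") as f:
        f.write(str(proc.pid))
```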
Benefits: (a) more convenient: users don't need to kick off the lake process themselves; (b) faster: because of parallel fill; (c) more flexible: users (or predictoor bots) can start more lake processes without worry.
Approach 2 is the endgame. The benefits compared to approach 1 are immense, let alone approach 0.
Q: Should we go from 0 -> 1 -> 2, or 0->2 directly?
I (Trent) recommend going 0->2 directly, because of the big benefits.
Doing 1 in between would force the user to change behavior. (And it's extra effort for us overall: much of the code we'd write for 1 would be thrown away for 2.)
TODOs
Locking core: update the lake to support the "locking" concept, such that I could run >1 different `pdr lake` processes against the same feed and they wouldn't fight with each other.
Parallel fill: update the lake to run >1 threads within a single `pdr lake` process, 1 thread per ohlcv pair or subgraph feed (see the sketch after this list).
Update predictoor bot: detect whether a lake process is running, and start one if needed.
Ensure READMEs are all updated accordingly. predictoor.md and trader.md should teach the user how to run `pdr lake` separately (at the end of each README).
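To make the "parallel fill" TODO concrete, a sketch with one worker per ohlcv pair / subgraph feed inside a single `pdr lake` process. `fetch_and_append` and the feed strings are placeholders, not real pdr-backend code.

```python
# One thread per feed, each filling a complementary part of the lake.
from concurrent.futures import ThreadPoolExecutor

def fetch_and_append(feed: str):
    # hypothetical: query the exchange / subgraph for `feed`, then append the new
    # rows to the lake (the write itself still goes through the lake lock)
    print(f"filling lake for {feed}")

def parallel_fill(feeds: list[str]):
    with ThreadPoolExecutor(max_workers=len(feeds) or 1) as ex:
        list(ex.map(fetch_and_append, feeds))

parallel_fill(["binance BTC/USDT 5m", "binance ETH/USDT 5m", "kraken BTC/USDT 1h"])
```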
I agree in general that all services (including the different agents/bots) would benefit from having a lake that's simply up-to-date. And having a process that's solely responsible for doing this is the way forward.
I think there might be other approaches, like "swapping tables" or updating a pointer to the latest table, that might be more productive to implement than locking.
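For illustration, a rough sketch of the pointer-swap idea under some assumptions: tables live as parquet files, and a small pointer file names the current one. `save_table` and the paths are hypothetical.

```python
# Writers build a new table off to the side, then atomically repoint "latest" at it,
# so readers never see a half-written table.
import os, json, time

LAKE_DIR = "lake_data"
POINTER = os.path.join(LAKE_DIR, "predictions.latest.json")  # hypothetical pointer file

def publish_new_table(df, save_table):
    version = int(time.time())
    new_path = os.path.join(LAKE_DIR, f"predictions.{version}.parquet")
    save_table(df, new_path)          # write the full new table first
    tmp = POINTER + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"path": new_path}, f)
    os.replace(tmp, POINTER)          # atomic: readers see old or new, never partial

def latest_table_path() -> str:
    with open(POINTER) as f:
        return json.load(f)["path"]
```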
What I originally considered was building a base table.py object that would abstract the schema, return the df, point to a file, etc. The basic structure can be found in table_pdr_predictions and table_pdr_subscriptions. Anything that reads from the lake would do so through the Table() interface, not DataFactory(). This way, DataFactories operate on their own, updating the lake, while components/users access data via the interface.
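Roughly, the shape of that interface. The real table_pdr_predictions / table_pdr_subscriptions code differs; class and method names here are illustrative.

```python
# Readers go through Table(), not DataFactory(); the factory only updates the file.
import polars as pl

class Table:
    def __init__(self, name: str, schema: dict, lake_dir: str):
        self.name = name          # e.g. "pdr_predictions"
        self.schema = schema      # column name -> dtype
        self.lake_dir = lake_dir

    @property
    def filepath(self) -> str:
        return f"{self.lake_dir}/{self.name}.parquet"

    def df(self) -> pl.DataFrame:
        # read-only view of the lake; a DataFactory keeps the file up to date
        return pl.read_parquet(self.filepath)
```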
DuckDB only lets you have 1 writer process at a time, which holds the db writer connection. Within this, you can have multiple threads operating on it. So we should make the duckdb "container/process/vm" as big as possible.
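For example, a minimal sketch of that pattern in DuckDB's Python API: one process holds the writer connection, and worker threads operate through cursors of that connection. The table and SQL are illustrative.

```python
# Single writer process, multiple worker threads via per-thread cursors.
import duckdb
from concurrent.futures import ThreadPoolExecutor

con = duckdb.connect("lake_data/lake.duckdb")  # the single writer connection
con.execute("CREATE TABLE IF NOT EXISTS ticks(feed VARCHAR, t BIGINT, price DOUBLE)")

def insert_rows(rows):
    cur = con.cursor()  # each thread gets its own cursor off the one writer connection
    cur.executemany("INSERT INTO ticks VALUES (?, ?, ?)", rows)

batches = [
    [("BTC/USDT", 1, 42000.0)],
    [("ETH/USDT", 1, 2500.0)],
]
with ThreadPoolExecutor(max_workers=2) as ex:
    list(ex.map(insert_rows, batches))
```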
There is now a task (#1107) for making sure that Lake/ETL has an "update process" that sits there, indefinitely looping and updating the lake.
Forward Looking:
We can have requests/queries/etc. pushed to the "duckdb writer/service" via a simple API.
Fully solving "Approach 2" / multiple writers would entail a clustered db (i.e. ClickHouse).