[EPIC, Sim] Integrate subgraph prediction into sim trading #1257
At the first step we will add the new chain feed prediction on top of the already existing code, and switch between the two prediction sources by using a new ppss config field for the prediction source.
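A minimal sketch of what such a switch could look like, assuming a hypothetical config field named `predictions_source`; the actual ppss field name and the sim engine wiring in pdr-backend may differ:

```python
# Hypothetical sketch: dispatch on a ppss-style config value to pick the
# prediction source. The field name and source labels are assumptions,
# not the actual pdr-backend API.
from typing import Callable, Dict


def _predict_with_own_model(ut: int) -> float:
    """Existing path: probability-up from the locally trained model (stub)."""
    return 0.5


def _predict_from_chain(ut: int) -> float:
    """New path: probability-up derived from on-chain prediction data (stub)."""
    return 0.5


PREDICTION_SOURCES: Dict[str, Callable[[int], float]] = {
    "own_model": _predict_with_own_model,
    "chain": _predict_from_chain,
}


def prob_up(source: str, ut: int) -> float:
    """Return prob-up from whichever source the config selects."""
    return PREDICTION_SOURCES[source](ut)
```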
Background: Slack DMs Trent <> Berkay on Jul 5, 2024 which led to Phase 2. Link

trentmc0: Hey - I was just looking at the commit bf0a07a#diff-e1434a39ebe0600cf431437fa2fc76a535ef5ef5c234c2cfc3d31e7a438cced3 to fix #1257: Use chain predictions inside trading sim (PR #1296). I'm leaning to removing it for now. With the [UPDATE], I do want to remove it.

Berkay Saglam: Just checked the PR, I see what you mean. This feature is super valuable for our users: for traders, to see how much money they would have made; for predictoors, it's a step towards having more accurate simulations, as this lays down the foundation for getting prediction data at any point. I agree with regards to the complexity it adds, but I think/hope we can reduce it down to just a few lines of change in simengine and keep the commands the same: remove testnet support, mock network calls in tests, leave it off by default, and keep the original values. Move SQL queries outside simengine, encapsulate the logic into a function, and get prediction data in a single line. Although I'm not familiar with the DuckDB stuff, I'm happy to give it a try and possibly iterate with Norbert 🙌 Note: for the conflicts with your PR, I think it's for the best to revert this now and work on the improvements in the original PR.

trentmc0: Thanks for the detailed thoughts. I think you make a great suggestion. Yes, please do give it a try. And also, yes, do revert this now and work on improvements in the original PR. Thank you. (And, pull in Norbert as needed.) (edited)

Berkay Saglam: Cool! Reverted, will work on the improvements, thanks!

Berkay Saglam: Hi, I managed to trim down complexity: total files changed 20 -> 8, no changes in sim tests, and one new function + an if statement + a new dict in simengine. Still got a few improvements to do. However, I noticed that this is fetching all predictions on chain instead of the slot data. Norbert confirmed this is a limitation of the data lake: it has to fetch predictions. On my PC, it takes 5 seconds to complete a query, and a single query can fetch 1000 predictions. With 100 users submitting predictions on every epoch, this seems unsustainable. We'd need to make 1 request to fetch 3 days' worth of data if we used slot data instead: a (20 * n users)x speed-up. I understand that the lake needs to fetch prediction data to show predictoor stats, but simengine doesn't need it. We could create a slot data fetcher module and manage the data in a parquet file easily; should I go for it?

trentmc0: Thanks for the update. Sounds like excellent work so far 👍 👍 "We could create a slot data fetcher module and manage the data in a parquet file easily, should I go for it?" I haven't worked with the lake yet, so I don't know for sure. But it seems super weird if it can't persist data from before. If it can't do that, then what's the point of it?

Berkay Saglam: Ah no, it can persist data. The lake uses prediction data, which is the most detailed data we have, and aggregates it to provide statistics etc., allowing profit queries per predictoor address, for example. Fetching prediction data for 1 day & 100 users takes 624 requests (approx 0.8 hours), whereas we can fetch 3 days' worth of slot data for a pair in a single request.

trentmc0: Can you clarify your first sentence please?

Berkay Saglam: Sure, so there's prediction data and slot data.
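For context, a rough back-of-envelope check of the request counts quoted above, using only the approximate figures from this thread (5-min epochs, ~100 predictoors, two-sided predictions, ~10 feeds, ~1000 predictions per query, ~5 s per query):

```python
# Back-of-envelope check of the numbers quoted in the thread; all inputs are
# the approximate figures from the messages above, not measured values.
SLOTS_PER_DAY = 24 * 60 // 5            # 288 five-minute epochs per day
PREDICTIONS_PER_SLOT = 100 * 2 * 10     # predictoors * sides * feeds = 2000
PREDICTIONS_PER_DAY = SLOTS_PER_DAY * PREDICTIONS_PER_SLOT  # 576,000

queries_per_day = PREDICTIONS_PER_DAY / 1000         # ~576 (thread quotes ~624)
hours_to_fetch_one_day = queries_per_day * 5 / 3600  # ~0.8 hours, as quoted

# Slot data for one pair covers ~3 days in a single request, which is roughly
# consistent with the "(20 * n users)x" speed-up estimate above.
print(round(queries_per_day), round(hours_to_fetch_one_day, 2))
```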
trentmc0: Thanks very much for the clarification. That helps a lot. Does lake persist the data once I've gathered it? That is, if I run lake a second time for the same data request, does it use data stored locally to disk, or does it re-query everything from chain?

Berkay Saglam: Yes, it persists data.

trentmc0: So once you have the data, it's not slow. I guess you're proposing:

Am I interpreting correctly?

Berkay Saglam: Yes! I haven't thought about that. That'd solve it. I remember Roberto mentioned to query the data and share it, so anyone can download and load it up to lake. Querying all the predictions takes too much time.

trentmc0: Thanks for the thoughts. Q's:

Berkay Saglam: The lake uses DuckDB, a SQL database. DuckDB can read and write CSV, Parquet, and JSON; however, I'm not sure where the CSVs are used. The lake stores all the predictions and aggregates them, so it needs the prediction data to work. Prediction data includes: address, stake_amt, direction per predictoor. To give an example: 100 predictoors, two-sided, 10 feeds = a total of 2000 predictions each slot that the lake has to process. This is why it takes long to fetch prediction data; however, the lake needs this data to provide detailed statistics, predictoor profits by address, etc. Confirmed this limitation with Norbert in this thread: https://oceanprotocol.slack.com/archives/C05P3TBHUAC/p1720176804737049 (edited)

trentmc0: It seems it's a good time to build a cloud-based caching system. I think github-based is a good idea. It's how Paradigm does it. It makes it highly accessible. Here's how it would work:
It feels like a first cut could get built very quickly. And it feels like the right time, since both you & Norbert are already feeling pain with this new feature.

Berkay Saglam: Sounds awesome. How about this (kinda a combo of a/b/c/d): we launch a private server and run lake. We can also access it to query lake data, but it's not available to the public. The server continuously fetches data and exports & uploads data to github or to a bucket every midnight. My only concern is how big this data will be; most likely it's gonna reach gigabytes. We can export only the last 15 days, or maybe 30.

trentmc0: Let's call my first proposal "Cand A", and yours "Cand B". Analysis: if Cand B, non-S3 people's scripts would still have to grab the cached csvs/parquets from the repo. So I lean to "Cand A". One less flow to support; and more importantly, hardening the same flows as our users. Re size of the data: if it hits GBs quickly, then maybe Github isn't the right place. Let's consider our options: (i) github repo (ii) Google Drive (iii) Google cloud bucket [added]. What else?

Berkay Saglam: Google cloud bucket is another option. The average cost is around $0.05/GB; it will cost us as people download the data. Google Drive or github is more cost efficient. Good point! Then we can just leave the lake running, upload data regularly, and block all outside access to the server. Question: why don't we allow anyone to access and query data as they wish?

trentmc0: TBH, I'm not against that; we should consider that as an option.

and it could consume a lot of resources if people get crazy

trentmc0: We're already going to be storing csv/parquet data in the cloud, to be uploaded by the daemon, and downloaded by local users (and the daemon). So why complicate by adding another flow for remote querying?

Berkay Saglam: Right, makes sense. We could maybe look into that if there's demand for it in the future.

trentmc0:

Perfect. Agree.

Yes. And we need to determine where to store the csvs / parquets:
trentmc0: I thought more about this.

Criteria:

Quick analysis:

Details on Github:

Details on Google Drive:

My recommendation:

Appendix:

Berkay Saglam: ++ great analysis! I agree, let's start with GitHub and see if it causes any problems; it's easy to switch providers anyway. I'll spin up a VM and start lake.
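A minimal sketch of the slot-data-in-parquet idea discussed above, assuming a hypothetical lake DuckDB file with a `slots` table; the path, table, and column names are illustrative, not the actual pdr-backend lake schema. The cloud daemon exports recent slot rows to parquet for the cache repo, and sim users just read the downloaded parquet:

```python
# Illustrative only: db path, table name ("slots"), and columns ("timestamp")
# are assumptions; the real lake schema may differ.
import duckdb


def export_recent_slots(lake_db_path: str, out_parquet: str, since_ts: int) -> None:
    """What the cloud daemon could run nightly: dump recent slot rows to parquet."""
    con = duckdb.connect(lake_db_path, read_only=True)
    con.execute(
        f"COPY (SELECT * FROM slots WHERE timestamp >= {int(since_ts)}) "
        f"TO '{out_parquet}' (FORMAT PARQUET)"
    )
    con.close()


def load_cached_slots(parquet_path: str):
    """What a sim user does after downloading the cached parquet file."""
    rel = duckdb.read_parquet(parquet_path)  # DuckDB relation over the file
    return rel.order("timestamp").df()       # pandas DataFrame (needs pandas)
```

Whether the export lands in a GitHub repo, Google Drive, or a bucket, the user-facing read path stays the same: download the parquet, load it, run the sim.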
Current open PRs related to this issue:
@trizin can one of these be closed, in lieu of the other one? If not, please clarify what's what here.
…#1341)
* Fix 1262: Get prediction data for sim (PR #1265)
* added new field to ppss for type of prediction source
* add network param to sim cli and make fetch work
* verify there is enough data in database and write test
---------
Co-authored-by: Mustafa Tuncay <[email protected]>
* Fix #1263: Transform lake data into prediction signals (#1297)
* issue-1263: sim_engine changes
* issue-1263: part 2
* test fixes
* removed unnecessary console logs
* fix model var imps plot on chain data
* fix mypy and pylint
* issue-1263: fixes
* issue-1263: tests will be fixed at the epic branch
* change ppss config to better fit the 2 predictions
* add try catch around gql update and fix pylint
---------
Co-authored-by: Norbert <[email protected]>
* get feed contract address from duckDB (#1311)
* Fix #1316: Update readme with chain data predictions signals inside trading (#1320)
* update readmes
* Fix #1313: mock GQLDataFactory for the sim engine (#1315)
* Stop tracking lake_data/ folder
* clear_test_db is added to test_get_predictions_signals_data, removed console logs
---------
Co-authored-by: Norbert <[email protected]>
* review fixes
* review fixes
* fixes
* moved ppss validation to ppss class
* use parameter names when calling sim test dict function
* removed print
* fix failint test in trader agent system
* Default to mainnet and revert changes
* Revert values
* remove redundant code and move functions
* update tests
* SimEngineChainPredictions class
* better name
* revert
* use dev
* format
* more improvements
* remove check and verify
* remove test
* verify_use_chain_data_in_syms_dependencies
* move tests
* verify_use_chain_data_in_syms_dependencies
* formatting
* make use_own_model optional
* remove param
* remove
* move import
* remove unused import
* revert changes
* remove unused
* revert
* empty_fig
* update imports
* linter
* Use dict
* update test
* linter
* mypy
* remove def
* fix insert to table db function is depricated
* move under _init_loop_attributes
* shorter error
* rename to prob up
* improvements to readme
* mypy
* separate source of predictions
---------
Co-authored-by: Norbert <[email protected]>
Co-authored-by: Mustafa Tuncay <[email protected]>
Co-authored-by: Norbert <[email protected]>
Background / motivation
Goal: Make money trading against previous prediction signals.
Target outcomes:
Details of targets in the TODOs below
TODOs: Phase 1: Norbert
Target outcome: first cut
TODOs: Phase 2: Berkay
Target outcome: reduce user friction, reduce sim_engine.py complexity
- `pdr sim` such that (a) doesn't require specifying the chain, (b) doesn't require getting chain data, (c) has 5000 five-min candles (18 days). Update CLI accordingly.
- `lake` daemon running in the cloud that populates github data repo `pdr-lake-cache` with csv/parquet files
- (i) first tries to grab from local, then (ii) gets remaining from `pdr-lake-cache`, then (iii) gets remaining from actual chain queries. Details in this comment below, in bottom 1/3. (See the sketch after this list.)
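A hedged sketch of that three-tier lookup. The helper names (`_download_from_cache_repo`, `_query_chain`), the `timestamp` column, and the parquet layout are hypothetical, not the actual pdr-backend API:

```python
# Illustration of the (i) local -> (ii) pdr-lake-cache -> (iii) chain fallback.
# All names, paths, and columns here are assumptions for the sketch.
import os
from typing import Optional

import pandas as pd


def _covers(df: pd.DataFrame, st_ts: int, fin_ts: int) -> bool:
    """True if df already spans the requested [st_ts, fin_ts] window."""
    if df.empty:
        return False
    return df["timestamp"].min() <= st_ts and df["timestamp"].max() >= fin_ts


def _slice(df: pd.DataFrame, st_ts: int, fin_ts: int) -> pd.DataFrame:
    return df[(df["timestamp"] >= st_ts) & (df["timestamp"] <= fin_ts)]


def _download_from_cache_repo(pair: str) -> Optional[pd.DataFrame]:
    """Placeholder: grab the pair's parquet from the shared pdr-lake-cache repo."""
    return None  # stubbed out for the sketch


def _query_chain(pair: str, st_ts: int, fin_ts: int) -> pd.DataFrame:
    """Placeholder: slow path that queries the chain / subgraph directly."""
    return pd.DataFrame({"timestamp": [], "prob_up": []})


def fetch_slots(pair: str, st_ts: int, fin_ts: int, cache_dir: str) -> pd.DataFrame:
    """(i) local disk, then (ii) shared cache repo, then (iii) chain queries."""
    local_path = os.path.join(cache_dir, f"{pair}_slots.parquet")

    # (i) reuse whatever is already on local disk, if it covers the window
    if os.path.exists(local_path):
        df = pd.read_parquet(local_path)
        if _covers(df, st_ts, fin_ts):
            return _slice(df, st_ts, fin_ts)

    # (ii) otherwise try the shared cache (csv/parquet files in pdr-lake-cache)
    df = _download_from_cache_repo(pair)
    if df is not None and _covers(df, st_ts, fin_ts):
        df.to_parquet(local_path)
        return _slice(df, st_ts, fin_ts)

    # (iii) finally fall back to actual chain queries, and cache the result
    df = _query_chain(pair, st_ts, fin_ts)
    df.to_parquet(local_path)
    return df
```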