[EPIC, Sim] Integrate subgraph prediction into sim trading #1257

Closed
11 tasks done
KatunaNorbert opened this issue Jun 20, 2024 · 3 comments · Fixed by #1296 or #1341
Assignees: KatunaNorbert
Labels: Epic, Type: Enhancement (New feature or request)

Comments

@KatunaNorbert
Member

KatunaNorbert commented Jun 20, 2024

Background / motivation

Goal: Make money trading against previous prediction signals.

Target outcomes:

  1. [Norbert] Do a first cut
  2. [Berkay] reduce user friction, reduce sim_engine.py complexity

Details of the targets are in the TODOs below.

TODOs: Phase 1: Norbert

Target outcome: first cut

  • make sims fetch the necessary prediction data to be used for trading simulation #1262
  • make it easy to specify inside ppss.yaml which prediction data the trader should use #1262
  • integrate prediction signals into the sim engine and make it possible to trade against them #1263
  • fix DuckDB tests failing in the GitHub Actions runs #1303
  • get contract IDs from DuckDB instead of the subgraph, to reduce subgraph load and allow the engine to run on existing data if the subgraph is unavailable #1310
  • mock subgraph data inside tests #1313
  • update the readme and docs accordingly #1316

TODOs: Phase 2: Berkay

Target outcome: reduce user friction, reduce sim_engine.py complexity

  • Refactor sim_engine.py so that its complexity is manageable again. (Phase 1 made sim_engine.py way more complicated.)
  • Restore use of pdr sim such that (a) it doesn't require specifying the chain, (b) it doesn't require getting chain data, and (c) it has 5000 five-min candles (~18 days). Update the CLI accordingly.
  • Make it "not painful" for simulator users to get subgraph prediction data. Where pain = having to wait around > 10 minutes. HOW:
    • (a) have a lake daemon running in the cloud that populates the github data repo pdr-lake-cache with csv/parquet files
    • (b) when a user runs a simulation including subgraph prediction data, it calls lake functionality, which
      (i) first tries to grab from local, then (ii) gets remaining from pdr-lake-cache, then (iii) gets remaining from actual chain queries. Details in this comment below, in the bottom 1/3; a sketch of this flow follows the list below.
  • With ^ done, is it now low-friction to use? Has sim_engine.py complexity been tamed? If yes, then we can call the epic complete.
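A minimal sketch of that (i)/(ii)/(iii) fallback, for illustration only. Every name here is a hypothetical placeholder (passed in as a callable), not actual pdr-backend code:

```python
from typing import Callable, List, Tuple

TimeRange = Tuple[int, int]          # (start_ms, end_ms)
Fetcher = Callable[[TimeRange], List[dict]]

def fetch_with_fallback(
    want: TimeRange,
    get_local: Fetcher,              # (i) rows already in the local lake
    get_lake_cache: Fetcher,         # (ii) csv/parquet grabbed from pdr-lake-cache
    query_chain: Fetcher,            # (iii) actual chain/subgraph queries
    missing_after: Callable[[List[dict], TimeRange], List[TimeRange]],
) -> List[dict]:
    """Fill `want` from the cheapest source first, falling back as needed."""
    rows = get_local(want)
    for rng in missing_after(rows, want):
        rows += get_lake_cache(rng)
    for rng in missing_after(rows, want):
        rows += query_chain(rng)
    return rows
```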
@KatunaNorbert KatunaNorbert added the Type: Enhancement New feature or request label Jun 20, 2024
@KatunaNorbert KatunaNorbert self-assigned this Jun 20, 2024
@KatunaNorbert
Member Author

KatunaNorbert commented Jun 20, 2024

As a first step, we will add the new chain feed prediction on top of the existing code and switch between the two prediction sources using a use_own_model flag inside ppss.yaml:
```python
# predict price direction
if ppss.sim_ss.use_own_model:
    prob_up: float = model.predict_ptrue(X_test)[0]  # in [0.0, 1.0]
else:
    prob_up: float = <<get chain feed prediction>>
```
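For illustration only, one way the `<<get chain feed prediction>>` placeholder could be filled in, assuming the sim can obtain the total up/down stake for the current slot (the function and field names are hypothetical, not the actual implementation):

```python
def prob_up_from_stakes(stake_up: float, stake_down: float) -> float:
    """Interpret the slot's stake-weighted consensus as the probability of 'up'."""
    total = stake_up + stake_down
    if total == 0.0:
        return 0.5  # no stake yet: no signal, stay neutral
    return stake_up / total

# e.g. prob_up: float = prob_up_from_stakes(slot.stake_up, slot.stake_down)
```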

@KatunaNorbert KatunaNorbert changed the title Integrate subgraph prediction into sim trading [Sim] Integrate subgraph prediction into sim trading Jun 20, 2024
@trentmc trentmc changed the title [Sim] Integrate subgraph prediction into sim trading [EPIC, Sim] Integrate subgraph prediction into sim trading Jun 20, 2024
trizin added a commit that referenced this issue Jul 4, 2024
@trentmc trentmc reopened this Jul 9, 2024
@trentmc
Member

trentmc commented Jul 9, 2024

Background: Slack DMs Trent <> Berkay on Jul 5, 2024 which led to Phase 2. Link

trentmc0

Hey - I was just looking at the commit bf0a07a#diff-e1434a39ebe0600cf431437fa2fc76a535ef5ef5c234c2cfc3d31e7a438cced3 to fix #1257: Use chain predictions inside trading sim (PR #1296).
This is a pretty huge complexity spike to sim_engine -- our most sensitive code for complexity.
While it's a nice idea, now that I see it,

  • It doesn't really help us go faster in "make $" because we want to use our own models / setup anyway
  • And it slows us down in "make $" because sim_engine just got a lot more complex, which is exactly where we want to iterate rapidly
  • [UPDATE] It also reduced the default # training points from 30 days to 10, which means we can only simulate 1800, not 5000, iterations. That's a step down.
  • [UPDATE] It forces 'pdr sim' to now specify network. Some unit tests now use sapphire-testnet not development. Yikes.

I'm leaning toward removing it for now. With the [UPDATE]s, I do want to remove it.
What do you think?

Berkay Saglam

Just checked the PR; I see what you mean.

This feature is super valuable for our users. For traders, it shows how much money they would have made. For predictoors, it's a step towards more accurate simulations, as it lays the foundation for getting prediction data at any point in time.

I agree with regards to the complexity it adds, but I think/hope we can reduce it down to just a few lines of change in sim_engine and keep the commands the same: remove testnet support and mock network calls in tests, leave it off by default, and keep the original values. Move SQL queries outside sim_engine, encapsulate the logic into a function, and get prediction data in a single line. Although I'm not familiar with the DuckDB stuff, I'm happy to give it a try and possibly iterate with Norbert 🙌

Note: given the conflicts with your PR, I think it's best to revert this now and work on the improvements in the original PR.

trentmc0

Thanks for the detailed thoughts.

I think you make a great suggestion. Yes, please do give it a try. And also, yes, do revert this now and work on improvements in the original pr. Thank you. (And, pull in Norbert as needed.) (edited)

Berkay Saglam

Cool! Reverted, will work on the improvements, thanks!

Berkay Saglam

#1341

Hi, I managed to trim down the complexity: total files changed went from 20 -> 8, with no changes in sim tests, and just one new function + an if statement + a new dict in sim_engine.

I still have a few improvements to do. However, I noticed that this is fetching all predictions on chain instead of the slot data. Norbert confirmed this is a limitation of the data lake: it has to fetch predictions.
The problem is that there are a lot of predictions, and sim_engine only needs slot data (total stake amount on each side).

On my PC, it takes 5 seconds to complete a query, and a single query can fetch 1000 predictions. With 100 users submitting predictions on every epoch:
100 * 20 * 13 * 24 / 1000 = 624 requests, so 1 day's worth of data takes ~0.87 hours to fetch.

This seems unsustainable. If we used slot data instead, we'd need only 1 request to fetch 3 days' worth of data: a (20 * n users)x speed-up.
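For reference, a quick sketch of that arithmetic. The grouping of the numbers (users x feeds x slots/hour x hours) is my reading of the thread's figures, not measured constants:

```python
# Assumed figures from the thread: 100 users, 20 feeds, ~13 slots/hour, 24 hours,
# 1000 predictions per query, 5 seconds per query.
USERS, FEEDS, SLOTS_PER_HOUR, HOURS_PER_DAY = 100, 20, 13, 24
PREDS_PER_QUERY, SECS_PER_QUERY = 1000, 5

preds_per_day = USERS * FEEDS * SLOTS_PER_HOUR * HOURS_PER_DAY   # 624,000 predictions
queries_per_day = preds_per_day / PREDS_PER_QUERY                # 624 queries
hours_to_fetch = queries_per_day * SECS_PER_QUERY / 3600         # ~0.87 hours
print(f"{queries_per_day:.0f} queries -> {hours_to_fetch:.2f} h per day of data")
```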

I understand that the lake needs to fetch prediction data to show predictoor stats, but sim_engine doesn't need it. We could create a slot data fetcher module and manage the data in a parquet file easily; should I go for it?

trentmc0

Thanks for the update. Sounds like excellent work so far 👍 👍

We could create a slot data fetcher module and manage the data in a parquet file easily; should I go for it?
Isn't this what the data lake is supposed to do?

I haven't worked with the lake yet, so I don't know for sure

But it seems super weird if it can't persist data from before. If it can't do that, then what's the point of it?

Berkay Saglam

Ah no, it can persist data. The lake uses prediction data; it is the most detailed data we have, and the lake aggregates it to provide statistics etc., allowing profit queries per predictoor address, for example.
But it works with prediction data, meaning it needs to fetch all the individual predictions there are. For the sim engine, slot data is sufficient, which includes total stake up and total stake down.

Fetching prediction data for 1 day & 100 users takes 624 requests (approx. 0.8 hours), whereas we can fetch 3 days' worth of slot data for a pair in a single request.

trentmc0

Can you clarify your first sentence please?

Berkay Saglam

sure, so there's prediction data and slot data

  • prediction data: contains prediction direction, stake, and slot, per prediction
  • slot data: contains total stake up and down, per slot
  • sim_engine requires slot data, not prediction data. Right now the implementation uses the lake to retrieve prediction data (which takes too long) and then uses the lake to derive slot data from that prediction data. (A sketch of the two shapes follows below.)
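To make the distinction concrete, a sketch of the two record shapes. Field names are illustrative assumptions, not the lake's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Prediction:        # one row per (predictoor, feed, slot): what the lake fetches
    contract: str
    slot: int            # epoch start timestamp
    user_addr: str
    predicted_up: bool
    stake: float

@dataclass
class SlotData:          # one row per (feed, slot): all that sim_engine needs
    contract: str
    slot: int
    stake_up: float      # total stake predicting "up"
    stake_down: float    # total stake predicting "down"
```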

trentmc0

Thanks very much for the clarification. That helps a lot.

Does the lake persist the data once I've gathered it? That is, if I run the lake a second time for the same data request, does it use data stored locally on disk, or does it re-query everything from chain?

Berkay Saglam

Yes it persists data

trentmc0

So once you have the data it's not slow

I guess you're proposing:

  • to be able to download data from somewhere else, because downloading is faster than querying from chain
  • And in a format different than what the lake stores

Am I interpreting correctly?

Berkay Saglam

Yes! I hadn't thought about that. That'd solve it. I remember Roberto mentioned querying the data and sharing it, so anyone can download it and load it into the lake. Querying all the predictions takes too much time.
The other solution is creating a simple module that queries and persists slot data. Slot data is less granular, only includes information per slot, and it takes 1 query to fetch 3 days' worth of data. However, this adds complexity and an extra module to the repo. Considering that, downloading might be a better solution, although such data is not available at present.

trentmc0

Thanks for the thoughts.

Q's:

  • what format does the lake currently store in?
  • And, I know there was CSV stuff; where does it play a role?

Berkay Saglam

The lake uses DuckDB, an SQL database. DuckDB can read and write CSV, Parquet, and JSON. However, I'm not sure where the CSVs are used.

The lake stores all the predictions and aggregates them, so it needs the prediction data to work. Prediction data includes address, stake_amt, and direction per predictoor. To give an example: 100 predictoors, two-sided, 10 feeds = a total of 2000 predictions per slot that the lake has to process. This is why it takes so long to fetch prediction data; however, the lake needs this data to provide detailed statistics, predictoor profits by address, etc.

Confirmed this limitation with Norbert in this thread: https://oceanprotocol.slack.com/archives/C05P3TBHUAC/p1720176804737049 (edited)
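As a sketch of the aggregation being described (table and column names are hypothetical; the real lake schema may differ), the many per-predictoor rows collapse into per-slot totals like this:

```python
import duckdb

con = duckdb.connect()  # in-memory here; the lake uses its own database file
con.execute("""
    CREATE TABLE predictions (
        contract TEXT, slot BIGINT, user_addr TEXT,
        predicted_up BOOLEAN, stake DOUBLE
    )
""")
# ... the fetch step would insert the (many) prediction rows here ...

slot_rows = con.execute("""
    SELECT
        contract,
        slot,
        SUM(CASE WHEN predicted_up THEN stake ELSE 0 END)     AS stake_up,
        SUM(CASE WHEN NOT predicted_up THEN stake ELSE 0 END) AS stake_down
    FROM predictions
    GROUP BY contract, slot
    ORDER BY contract, slot
""").fetchall()
```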

trentmc0

It seems like a good time to build a cloud-based caching system.

I think a GitHub-based approach is a good idea. It's how Paradigm does it. It makes it highly accessible.

Here's how it would work:

  • We have a github repo called pdr-lake-cache
  • Whenever the lake is about to do chain queries, it gets the info as follows:
    • whenever possible, it uses the data it already has locally
    • for any missing data, it grabs what it can from the pdr-lake-cache github repo
    • for any missing data, it does the actual chain queries
  • We need a means to update data in the pdr-lake-cache github repo. Ideas: (a) a new daemon on Azure, (b) a GitHub worker (but how do we trigger it?), (c) p2p: right after getting the chain data, if it has write permission to the repo, it sends all the missing data to the repo, (d) some combo of a/b/c.

It feels like a first cut could get built very quickly.

And it feels like the right time, since both you & Norbert are already feeling pain with this new feature.
Thoughts?

Berkay Saglam

Sounds awesome. How about this (kind of a combo of a/b/c/d):

We launch a private server and run the lake. We can also access it to query lake data, but it's not available to the public.

The server continuously fetches data and exports & uploads data to github or to a bucket every midnight.

My only concern is how big this data will be; it will most likely reach gigabytes. We could export only the last 15 or 30 days, maybe.
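A minimal sketch of that nightly export-and-upload step, assuming DuckDB's COPY ... TO ... (FORMAT PARQUET) plus a plain git push into a local clone of the (not-yet-existing) pdr-lake-cache repo. Paths, table name, and retention window are placeholders:

```python
import subprocess
from datetime import datetime, timedelta, timezone

import duckdb

RETENTION_DAYS = 30
CACHE_REPO_DIR = "/srv/pdr-lake-cache"     # local clone of the cache repo (assumed)
LAKE_DB_PATH = "/srv/lake/lake.duckdb"     # the lake's DuckDB file (assumed)

def export_and_push() -> None:
    cutoff_ms = int((datetime.now(timezone.utc)
                     - timedelta(days=RETENTION_DAYS)).timestamp() * 1000)
    con = duckdb.connect(LAKE_DB_PATH, read_only=True)
    con.execute(
        f"COPY (SELECT * FROM predictions WHERE timestamp >= {cutoff_ms}) "
        f"TO '{CACHE_REPO_DIR}/predictions.parquet' (FORMAT PARQUET)"
    )
    for args in (["add", "-A"],
                 ["commit", "--allow-empty", "-m", "nightly lake export"],
                 ["push"]):
        subprocess.run(["git", "-C", CACHE_REPO_DIR, *args], check=True)
```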

trentmc0

Let's call my first proposal "Cand A", and yours "Cand B".

Analysis: with Cand B, non-S3 people's scripts would still have to grab the cached csvs/parquets from the repo.

  • Therefore they'd still have to run their DuckDB server locally.
  • Therefore Cand B would need to support two DuckDB query flows: one local, one remote. Compare this to Cand A, which needs just local DuckDB queries.
  • Perhaps more importantly, Cand B would have S3 following a different flow than our users. They might hit bugs that we never see

So I lean toward "Cand A": one less flow to support and, more importantly, it hardens the same flows our users follow.

Re size of the data: if it hits GBs quickly, then maybe Github isn't the right place. Let's consider our options: (i) github repo (ii) Google Drive (iii) Google cloud bucket [added]. What else?

Berkay Saglam

A Google Cloud bucket is another option. The average cost is around $0.05/GB, and it will cost us as people download the data. Google Drive or GitHub is more cost-efficient.

Good point! Then we can just leave the lake running, upload data regularly, and block all outside access to the server.

Question, why don’t we allow anyone to access and query data as they wish?

trentmc0

Question, why don’t we allow anyone to access and query data as they wish?

TBH, I'm not against that, we should consider that as an option.

Why we haven't been discussing this as much:

  1. we wanted people to be able to run things on their own, so that they can add their own queries, for example
  2. it could consume a lot of resources if people get crazy

And thinking out loud, (2) was the main thing that we'd wanted to avoid. We want people to get used to running their own infrastructure, and we don't want to bear large costs for it.

trentmc0

We're already going to be storing csv/parquet data in the cloud, to be uploaded by the daemon, and downloaded by local users (and the daemon).

So why complicate by adding another flow for remote querying?

Berkay Saglam

Right, makes sense. We could maybe look into that if there's demand for it in the future.
Is it then: a private lake running on a server, which exports and uploads the last x days of data to GitHub (or alternatively Google Drive) every midnight?

trentmc0

Right, makes sense. We could maybe look into that if there's demand for it in the future.

Perfect. Agree

Is it then: a private lake running on a server, which exports and uploads the last x days of data ...

Yes

And we need to determine where to store the csvs / parquets:

options: (i) github repo (ii) Google Drive (iii) Google cloud bucket (iv) AWS S3 (v) Azure storage

trentmc0

I thought more about this.

Criteria:

  1. OPF upload bandwidth limit ok for us?
  2. Users' download bandwidth limit ok?
  3. Cost to OPF ok?
  4. Cost to users zero or near-zero?
  5. Low-friction for users to download while running pdr scripts?
  6. Low-friction for users to inspect the files in the repo?
  7. Possible for users to download manually, easily?

Quick analysis:

  • (i) github repo and (ii) Google Drive do ok on (1-7).
  • (iii) Google cloud bucket (iv) AWS S3 (v) Azure storage are ok on (1)-(5), but do badly on (6)(7)
  • So from here on, let's focus on (i) github repo vs (ii) Google Drive

Details on Github

  • On criteria (1)(2)(3)(4):
    • Github bandwidth limits: No limit per se. We just have to pay beyond 1 GB of data downloaded by others. That's fine.
    • Github rate limits: 5000 http queries per hour. Assuming files are large enough so that # queries are <<5000, that's fine.
  • On criteria (5)(6)(7):
    • Github is the same system users use to get pdr-backend. (Though not a huge "plus" because we'd have the queries built into the code)
    • People can easily inspect all the files

Details on Google Drive:

  • About the same as github.
  • Except not "the same system users use to get pdr-backend"
  • And some users may not be using GDrive. We use it, but maybe less-so others?

My recommendation:

  • Start with Github. If it looks problematic, play with GDrive to see if it will be better

Appendix:

Berkay Saglam

++ Great analysis! I agree, let's start with GitHub and see if it causes any problems; it's easy to switch providers anyway. I'll spin up a VM and start the lake.

@trentmc
Member

trentmc commented Jul 9, 2024

Current open PRs related to this issue:

@trizin can one of these be closed, in lieu of the other one? If not, please clarify what's what here.

trizin added a commit that referenced this issue Jul 18, 2024
…#1341)

* Fix 1262: Get prediction data for sim (PR #1265)

* added new field to ppss for type of prediction source

* add network param to sim cli and make fetch work

* verify there is enough data in database and write test

---------

Co-authored-by: Mustafa Tuncay <[email protected]>

* Fix #1263: Transform lake data into prediction signals (#1297)

* issue-1263: sim_engine changes

* issue-1263: part 2

* test fixes

* removed unnecessary console logs

* fix model var imps plot on chain data

* fix mypy and pylint

* issue-1263: fixes

* issue-1263: tests will be fixed at the epic branch

* change ppss config to better fit the 2 predictions

* add try catch around gql update and fix pylint

---------

Co-authored-by: Norbert <[email protected]>

* get feed contract address from duckDB (#1311)

* Fix #1316: Update readme with chain data predictions signals inside trading (#1320)

* update readmes

* Fix #1313: mock GQLDataFactory for the sim engine (#1315)

* Stop tracking lake_data/ folder

* clear_test_db is added to test_get_predictions_signals_data, removed console logs

---------

Co-authored-by: Norbert <[email protected]>

* review fixes

* review fixes

* fixes

* moved ppss validation to ppss class

* use parameter names when calling sim test dict function

* removed print

* fix failing test in trader agent system

* Default to mainnet and revert changes

* Revert values

* remove redundant code and move functions

* update tests

* SimEngineChainPredictions class

* better name

* revert

* use dev

* format

* more improvements

* remove check and verify

* remove test

* verify_use_chain_data_in_syms_dependencies

* move tests

* verify_use_chain_data_in_syms_dependencies

* formatting

* make use_own_model optional

* remove param

* remove

* move import

* remove unused import

* revert changes

* remove unused

* revert

* empty_fig

* update imports

* linter

* Use dict

* update test

* linter

* mypy

* remove def

* fix insert to table db function is deprecated

* move under _init_loop_attributes

* shorter error

* rename to prob up

* improvements to readme

* mypy

* separate source of predictions

---------

Co-authored-by: Norbert <[email protected]>
Co-authored-by: Mustafa Tuncay <[email protected]>
Co-authored-by: Norbert <[email protected]>