
chore: Investigate auction propagation delay between driver <> solver engine #2133

Closed
fleupold opened this issue Dec 7, 2023 · 2 comments · Fixed by #2134

fleupold commented Dec 7, 2023

Background

Similar to #2095, there is another 1.7s propagation delay between the driver and the solver engine (cf. https://production-6de61f.kb.eu-central-1.aws.cloud.es.io/app/r/s/U1XTl)

2023-12-07T08:38:34.078Z DEBUG request{id="17245"}:/solve:auction{id=8122282}: shared::http_solver: http solver instance name is 2023-12-07_08:38:34.078506830_UTC_Ethereum___Mainnet_1_8122282
2023-12-07T08:38:32.337Z DEBUG request{id="17245"}:/solve{solver=otex-solve auction_id=8122282}: driver::infra::api::routes::solve: auction task execution time elapsed=556.865172ms auction_id=8122282

It would be good to see if we can reduce this as well, since this delay eats into the time external solvers could actually spend finding a solution.

@fleupold fleupold added the E:3.1 Driver Colocation See https://github.com/cowprotocol/pm/issues/14 for details label Dec 7, 2023

MartinquaXD commented Dec 7, 2023

There are only two async operations happening in the measured code path:

  1. fetching the current gas price
  2. fetching token metadata (symbol, decimals, contract balance)

Operation 2 is heavily cached, but if we encounter a token that is not cached, all solvers would send 3 RPC requests per uncached token. If this is the culprit, we could use the RequestSharing component here to ensure they don't send duplicate requests (see the sketch below).

Edit: It looks like the culprit could actually be lock contention. If you configure enough solvers, you can reproduce slow times even without cache misses.
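
For illustration, here is a minimal sketch of what such request sharing could look like. The names and signatures are hypothetical, not the actual `RequestSharing` API from the `shared` crate:

```rust
use std::{collections::HashMap, hash::Hash, sync::Mutex};

use futures::future::{BoxFuture, FutureExt, Shared};

/// Sketch of in-flight request sharing: concurrent callers asking for the
/// same key await one shared future instead of each firing their own RPC
/// request. (Completed entries are never evicted here; a real implementation
/// would clean them up.)
struct InFlightRequests<K, V> {
    in_flight: Mutex<HashMap<K, Shared<BoxFuture<'static, V>>>>,
}

impl<K, V> InFlightRequests<K, V>
where
    K: Eq + Hash,
    V: Clone,
{
    fn new() -> Self {
        Self { in_flight: Mutex::new(HashMap::new()) }
    }

    /// Returns the future already in flight for `key`, or registers `fetch`
    /// as the single shared request for it.
    fn shared(
        &self,
        key: K,
        fetch: impl std::future::Future<Output = V> + Send + 'static,
    ) -> Shared<BoxFuture<'static, V>> {
        let mut in_flight = self.in_flight.lock().unwrap();
        in_flight
            .entry(key)
            .or_insert_with(|| fetch.boxed().shared())
            .clone()
    }
}
```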

@MartinquaXD MartinquaXD self-assigned this Dec 7, 2023
MartinquaXD added a commit that referenced this issue Dec 7, 2023
# Description
We've noticed weird delays between the autopilot sending an auction and
the driver being ready to start working on it.
The linked issue suggests that we spend a lot of time converting the
auction and enriching it with token metadata.
This should be super fast because all the data should be cached at all
times.
However, there was one token that caused it all to break down: `0xeeeee...`
(the address we use to denote the chain's native token).

This is roughly what happened:
1. every driver receives the auction
2. every driver collects all the addresses for which we don't have cached metadata
3. this always yielded `0xeeee...` (it's not a real token, so we never get data when calling ERC20 functions on it)
4. every driver wasted some time sending RPC requests for `0xeeee`
5. every driver tried to take an exclusive lock to write no data (an empty vec) to the cache

Overall this wasted time on unnecessary network requests and on every driver serially taking an exclusive lock.
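
For illustration, a minimal sketch of the problematic flow, using hypothetical types (`MetadataCache`, `Metadata`, `fetch_metadata`) rather than the actual driver code:

```rust
use std::{collections::HashMap, sync::RwLock};

type Address = String;

#[derive(Clone)]
struct Metadata {
    symbol: Option<String>,
    decimals: Option<u8>,
}

struct MetadataCache(RwLock<HashMap<Address, Metadata>>);

impl MetadataCache {
    /// Old flow: the native token placeholder can never be cached, so every
    /// auction it shows up as "missing", triggers RPC calls that return
    /// nothing, and every driver then serially takes the exclusive write
    /// lock just to store an empty result.
    async fn enrich(&self, tokens: &[Address]) {
        let missing: Vec<Address> = tokens
            .iter()
            .filter(|token| !self.0.read().unwrap().contains_key(*token))
            .cloned()
            .collect();
        // For the native token placeholder this always comes back empty.
        let fetched = fetch_metadata(&missing).await;
        // The exclusive lock is taken even when there is nothing to write.
        let mut cache = self.0.write().unwrap();
        for (token, metadata) in fetched {
            cache.insert(token, metadata);
        }
    }
}

async fn fetch_metadata(_tokens: &[Address]) -> Vec<(Address, Metadata)> {
    // Stand-in for the ERC20 symbol/decimals/balance RPC calls.
    Vec::new()
}
```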

# Changes
* most importantly, filter out `0xeee` (see the sketch after this list)
* avoid sending duplicate RPC requests by using `RequestSharing`
* don't take an exclusive lock if some other driver cached the same data in the meantime
* move logging of the token balance update task out of the critical section
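
Reusing the hypothetical types from the sketch above (again, not the actual driver code), the fixed flow looks roughly like this:

```rust
impl MetadataCache {
    /// New flow: skip the native token placeholder entirely, only hit the
    /// network for tokens that are really missing, and re-check the cache
    /// under the write lock so data another driver cached in the meantime
    /// is kept and the exclusive lock does as little work as possible.
    async fn enrich_fixed(&self, tokens: &[Address]) {
        // Placeholder address used to denote the chain's native token; it is
        // not a real ERC20 contract, so querying it never yields metadata.
        const NATIVE_TOKEN: &str = "0xeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee";

        let missing: Vec<Address> = tokens
            .iter()
            .filter(|token| token.as_str() != NATIVE_TOKEN)
            .filter(|token| !self.0.read().unwrap().contains_key(*token))
            .cloned()
            .collect();
        if missing.is_empty() {
            return;
        }

        // In the actual change, duplicate RPC requests are additionally
        // deduplicated across solvers via `RequestSharing`.
        let fetched = fetch_metadata(&missing).await;
        if fetched.is_empty() {
            // Nothing to write, so don't touch the exclusive lock at all.
            return;
        }

        let mut cache = self.0.write().unwrap();
        for (token, metadata) in fetched {
            // Keep whatever another driver cached while we were fetching.
            cache.entry(token).or_insert(metadata);
        }
    }
}
```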

All changes combined result in us now spending a couple of microseconds on
getting cached balances instead of seconds.
And because this was the bulk of the work captured by the `auction task
execution time` log, we reduced that time to ~11ms.

## How to test
e2e tests should still pass, and a manual test confirmed the reduced latency.

Fixes #2133
MartinquaXD commented

Follow-up on this: each driver now spends 2-3 ms on this part most of the time.
However, there can still be unfortunate timings when we see a new block right before the auction or even during the auction.
For example, here one driver was able to finish the task in 3 ms while all subsequent drivers needed several hundred milliseconds because we saw a new block.
But I think there is nothing that can reasonably be done about this.
