This repository has been archived by the owner on Apr 18, 2024. It is now read-only.

Use a more advanced weighting / fail-over strategy #19

Merged
willscott merged 25 commits into main from weighted on Feb 18, 2023

Conversation

@willscott (Contributor) commented Feb 9, 2023

fix #1

this is a second pass at the consistent hash:

  • it adds logic to decrease the weight of provided backends each time they fail a request (a rough sketch of the idea follows below)

As-is, this should noticeably reduce the prevalence of 503s.
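
As a rough, self-contained illustration of that downvoting idea (a sketch only, not this PR's actual code; the weightedPool type, maxWeight constant, and downvote method are hypothetical names):

```go
package pool

import "sync"

// maxWeight is the starting weight for a healthy backend (hypothetical value).
const maxWeight = 20

// weightedPool tracks a weight per backend; when the consistent hash is
// rebuilt, lower-weight backends receive proportionally fewer keys.
type weightedPool struct {
	lk      sync.Mutex
	weights map[string]int // backend host -> current weight
}

// downvote halves a backend's weight each time a request to it fails,
// flooring at 1 so the backend is never dropped entirely.
func (p *weightedPool) downvote(backend string) {
	p.lk.Lock()
	defer p.lk.Unlock()
	w, ok := p.weights[backend]
	if !ok {
		w = maxWeight
	}
	if w = w / 2; w < 1 {
		w = 1
	}
	p.weights[backend] = w
}
```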

Todo before done:

@willscott requested a review from hacdias on February 9, 2023 at 22:52
@lidel (Contributor) left a comment

Over time, this helps a bit with the 60s timeout discussed on Slack.

But the retry logic is dead code (see comment below).

p.lk.RUnlock()
root := member.String()
return nil, ErrNoBackend

👍 (we were missing this termination before)


blk, err = p.doFetch(ctx, root, c)
for i := 0; i < len(nodes); i++ {
blk, err = p.doFetch(ctx, nodes[i], c)

If you have no timeout for doFetch here, you effectively have no retries.

In my tests, either the Saturn PoP replies quickly with the block, or it hangs until the 60s timeout in bifrost-gateway hits and cancels the entire context. There is never a scenario in which the MaxRetries logic gets executed here.

Some ideas:

  • introduce a timeout per doFetch attempt, maybe 19s so that 3 retries take <60s (we already downvote on each failure); a sketch of this follows below
  • try at least 2 PoPs in parallel, use the fastest response, and cancel the remaining ones (+ boost the weight of the fast one)
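
A minimal sketch of the per-attempt timeout idea, assuming the existing pool type and the doFetch(ctx, node, c) call quoted in the diff above; the 19-second budget and the fetchWithTimeout / fetchBlock helper names are illustrative, not this PR's actual implementation:

```go
package pool

import (
	"context"
	"time"

	"github.com/ipfs/go-cid"
	blocks "github.com/ipfs/go-libipfs/blocks"
)

// fetchWithTimeout bounds a single doFetch attempt so that three retries
// still fit comfortably under the gateway's 60s request budget.
func (p *pool) fetchWithTimeout(ctx context.Context, node string, c cid.Cid) (blocks.Block, error) {
	attemptCtx, cancel := context.WithTimeout(ctx, 19*time.Second)
	defer cancel() // release the timer even on the fast path
	return p.doFetch(attemptCtx, node, c)
}

// fetchBlock walks the candidate nodes; with bounded attempts, the fallback
// nodes actually get a chance to run before the outer context is cancelled.
func (p *pool) fetchBlock(ctx context.Context, nodes []string, c cid.Cid) (blocks.Block, error) {
	var (
		blk blocks.Block
		err error
	)
	for i := 0; i < len(nodes); i++ {
		blk, err = p.fetchWithTimeout(ctx, nodes[i], c)
		if err == nil {
			return blk, nil
		}
		// a failure here is also where the weighting logic in this PR
		// would downvote nodes[i] before moving on to the next node
	}
	return nil, err
}
```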


Addressed in c13c013.

@aarshkshah1992 (Contributor) commented Feb 15, 2023

I am taking ownership of landing this. The follow-up work is in #26 and is ready for review.

@willscott (Contributor, Author):

The open bullet, "don't downweight for 404s", is captured in #28 as a follow-on item.

@willscott (Contributor, Author):

we should try to fix the flaky test here. I believe it has to do with potential non-determinism in how the consistent hash re-shards when node weights change; a toy illustration of that key movement follows below.
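
A toy, self-contained illustration of the key movement (not caboose's actual hashing code; the node names, weights, and sample keys are made up):

```go
// Toy weighted hash ring: each node gets weight-many virtual points; a key is
// served by the first point at or after its hash (wrapping around the ring).
package main

import (
	"fmt"
	"hash/crc32"
	"sort"
)

type point struct {
	pos  uint32
	node string
}

func buildRing(weights map[string]int) []point {
	var ring []point
	for node, w := range weights {
		for i := 0; i < w; i++ {
			ring = append(ring, point{crc32.ChecksumIEEE([]byte(fmt.Sprintf("%s-%d", node, i))), node})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })
	return ring
}

func lookup(ring []point, key string) string {
	h := crc32.ChecksumIEEE([]byte(key))
	i := sort.Search(len(ring), func(i int) bool { return ring[i].pos >= h })
	return ring[i%len(ring)].node // wrap past the last point to the first
}

func main() {
	before := buildRing(map[string]int{"pop-a": 20, "pop-b": 20})
	after := buildRing(map[string]int{"pop-a": 20, "pop-b": 10}) // pop-b downvoted

	// Any key whose owning point was one of pop-b's removed replicas now falls
	// through to the next point on the ring (possibly a different node); a test
	// that pins exact key->node assignments can trip over this silent migration.
	for _, k := range []string{"cid-1", "cid-2", "cid-3", "cid-4"} {
		fmt.Printf("%s: before=%s after=%s\n", k, lookup(before, k), lookup(after, k))
	}
}
```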

@tchardin:

@willscott is the bug on this branch or on master?

@tchardin:

Sorry, I mean: is the gateway using this branch or master?

@willscott (Contributor, Author):

The gateway is using this branch, and the hanging bug is on this branch.

@willscott (Contributor, Author):

Noting that we have made the change, such that we can now say this branch will:

fix #27

lidel added commits to ipfs-inactive/bifrost-gateway that referenced this pull request on Feb 18, 2023
@willscott (Contributor, Author):

fix #16
fix #25

@lidel (Contributor) left a comment

This shipped to bifrost-stage1-ny, lgtm.

P.S. @willscott: automatic closing of issues works only if you add "fix #16" to the top comment, by editing it.

@willscott merged commit adbe481 into main on Feb 18, 2023
@willscott deleted the weighted branch on February 18, 2023 at 04:04

Successfully merging this pull request may close these issues: Dynamic Load.