Use a more advanced weighting / fail-over strategy #19
Conversation
Over time, this helps a bit with the 60s timeout discussed on Slack.
But the retry logic is dead code (see comment below).
p.lk.RUnlock()
root := member.String()
return nil, ErrNoBackend
👍 (we were missing this termination before)
blk, err = p.doFetch(ctx, root, c)
for i := 0; i < len(nodes); i++ {
	blk, err = p.doFetch(ctx, nodes[i], c)
If you have no timeout for doFetch here, you effectively have no retries. In my tests, either a Saturn PoP replies fast with the block, or it hangs until the 60s timeout in bifrost-gateway hits and cancels the entire context. There is never a scenario where the MaxRetries logic gets executed here.
Some ideas (sketches of both below):
- if you introduce a timeout per doFetch attempt, maybe 19s (so 3 retries take <60s); we already downvote on each failure
- try at least 2 PoPs in parallel, use the fastest response, and cancel the remaining ones (+ boost the weight of the fast one)
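A minimal sketch of the first idea, assuming a per-node fetch like the `p.doFetch` in this diff. The helper names and the generic `fetch` parameter are illustrative, not the actual caboose API; the 19s figure is the one suggested above:

```go
package caboose

import (
	"context"
	"errors"
	"time"
)

// fetchWithRetry gives each attempt its own deadline, so a hung PoP fails
// fast and the next node actually gets tried. 19s is chosen so that three
// attempts fit under the 60s bifrost-gateway budget.
func fetchWithRetry[T any](ctx context.Context, nodes []string,
	fetch func(ctx context.Context, node string) (T, error)) (T, error) {
	const attemptTimeout = 19 * time.Second
	var zero T
	var lastErr error
	for _, node := range nodes {
		attemptCtx, cancel := context.WithTimeout(ctx, attemptTimeout)
		v, err := fetch(attemptCtx, node)
		cancel()
		if err == nil {
			return v, nil
		}
		lastErr = err // this is also where the per-failure downvote would go
		if ctx.Err() != nil {
			return zero, ctx.Err() // caller's context is done; stop retrying
		}
	}
	if lastErr == nil {
		lastErr = errors.New("no nodes to try")
	}
	return zero, lastErr
}
```

And a sketch of the second idea, racing two PoPs and cancelling the loser; the winner's weight boost would go where the comment marks it:

```go
// (same package and imports as the sketch above)

// fetchRaced queries two nodes concurrently and returns the first success,
// cancelling the slower attempt once a winner is in.
func fetchRaced[T any](ctx context.Context, a, b string,
	fetch func(ctx context.Context, node string) (T, error)) (T, error) {
	raceCtx, cancel := context.WithCancel(ctx)
	defer cancel() // stops the losing attempt when we return

	type result struct {
		v    T
		err  error
		node string
	}
	ch := make(chan result, 2)
	for _, node := range []string{a, b} {
		node := node
		go func() {
			v, err := fetch(raceCtx, node)
			ch <- result{v, err, node}
		}()
	}

	var lastErr error
	for i := 0; i < 2; i++ {
		r := <-ch
		if r.err == nil {
			return r.v, nil // winner: boost r.node's weight here
		}
		lastErr = r.err
	}
	var zero T
	return zero, lastErr
}
```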
Addressed in c13c013.
I am taking ownership of landing this. The follow-up work is in #26 and is ready for review.
Co-authored-by: Marcin Rataj <[email protected]>
re-find idx in lock
Fix bugs, refactor, document and land the weighted consistent hashing branch
The open bullet of "don't downweight for 404s" is captured in #28 as a follow-on item.
We should try to fix the flaky test here - I believe it has to do with potential non-determinism in how the consistent hash re-shards on node re-weighting.
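For context on what deterministic re-sharding looks like: with weighted rendezvous hashing (one common way to implement weighted consistent hashing; caboose's actual scheme may differ), the owner of a key is a pure function of the key, the member set, and the weights, so two processes with the same inputs always re-shard identically. A minimal sketch, with illustrative names:

```go
import (
	"hash/fnv"
	"math"
)

// owner picks the member with the highest weighted score for key.
// It is deterministic in (key, members, weight): re-weighting re-shards
// the same way every time for the same inputs.
func owner(key string, members []string, weight map[string]float64) string {
	best, bestScore := "", math.Inf(-1)
	for _, m := range members {
		h := fnv.New64a()
		h.Write([]byte(m))
		h.Write([]byte{0xff}) // separator so (member, key) pairs can't collide
		h.Write([]byte(key))
		// map the 64-bit hash into (0,1), then apply the standard
		// weighted-rendezvous score: -w / ln(u)
		u := (float64(h.Sum64()>>11) + 0.5) / float64(uint64(1)<<53)
		score := -weight[m] / math.Log(u)
		if score > bestScore {
			best, bestScore = m, score
		}
	}
	return best
}
```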
@willscott is the bug on this branch or on master?
Sorry, I mean: is the gateway using this branch or master?
The gateway is using this branch / the hanging bug is on this branch.
Noting that we have made the change such that we can say this branch will: fix #27
Closes #41 by applying fixes from filecoin-saturn/caboose#19
This shipped to bifrost-stage1-ny, lgtm.
ps. @willscott automatic closing of issues works only if you add "fix #16" to the top comment, by editing it.
fix #1
This is a second pass at the consistent hash.
As-is, this should reduce the prevalence of 503s quite a bit.
Todo before done: