Fix RPC retry logic in nomad client's rpc.go for blocking queries #9266
Conversation
Tested by:
1. Ran this bash script for 8 minutes. It did eventually fail due to "no cluster leader" but everything retried appropriately:
   ```bash
   while :; do sudo docker restart nomad-server-$(($RANDOM % 5)) && sleep $((5 + $RANDOM % 10)); done
   ```
2. Changed default blocking time and logged how long blocking queries took using the above script. They took 30-40s each, which is expected when you add a default RPCHoldtime of 5s which can be applied twice for blocking queries. This seems sane to me.
3. Added test code to force an EOF error every 10 seconds (separate from above tests) and verified it always retries.
Hi @benbuzbee! This looks great! I've left a question about the reflection approach, but it's possible I'm just missing something there.
client/rpc.go (Outdated)
```
@@ -46,19 +47,53 @@ func (c *Client) StreamingRpcHandler(method string) (structs.StreamingRpcHandler
	return c.streamingRpcs.GetHandler(method)
}

// Given a type that is or eventually points to a concrete type with an embedded QueryOptions
```
These reflection-based functions are really here because we've removed the `HasTimeout` from the `RPCInfo` interface, which was implemented by both the `QueryOptions` and the `WriteRequest`. It's been replaced only on the `QueryOptions` side by `TimeToBlock`. But it looks to me like all cases where we call `TimeToBlock`, we are testing for a non-0 result.

So couldn't we avoid this by having `TimeToBlock` on the `RPCInfo` interface, having `WriteRequest` implement it, but have `WriteRequest` always return 0? That seems cheaper to me if we can make it work.
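For illustration, a minimal sketch of that shape (simplified stand-ins only; Nomad's actual structs package defines `RPCInfo`, `QueryOptions`, and `WriteRequest` with many more methods and fields):

```go
// Sketch only: simplified stand-ins for the types discussed above.
package structs

import "time"

// RPCInfo is the interface both request kinds implement.
type RPCInfo interface {
	// TimeToBlock reports how long the request is willing to block;
	// callers only test whether the result is non-zero.
	TimeToBlock() time.Duration
}

// QueryOptions is embedded in blocking (read) requests.
type QueryOptions struct {
	MaxQueryTime time.Duration
}

func (q *QueryOptions) TimeToBlock() time.Duration {
	return q.MaxQueryTime
}

// WriteRequest is embedded in write requests, which never block.
type WriteRequest struct{}

func (w *WriteRequest) TimeToBlock() time.Duration {
	return 0
}
```

With that shape, the retry path can ask any `RPCInfo` how long it is willing to block without resorting to reflection.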
I originally thought about this but for some reason decided reflection was better; maybe that was early in my thinking before it evolved a bit. Either way, I don't see any reason why not now, so I updated the PR.
For posterity, here is a simple Go app I used to test the code by running this and killing Nomad servers in a cluster:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"os"
	"os/signal"
	"sync/atomic"
	"time"

	"github.com/hashicorp/nomad/api"
)

// How many blocking requests to run in parallel
const numBlockingRequests int = 1000

// When asserting timings, how much grace to give in the calculation for
// rpc hold time and jitter
const timingGrace = time.Minute

var successfulUpdates uint64 = 0
var successfulWaits uint64 = 0

func main() {
	cfg := api.DefaultConfig()
	c, err := api.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	startTime := time.Now()
	printSummary := func() {
		fmt.Printf("We lasted for %s\n", time.Now().Sub(startTime))
		fmt.Printf("We successfully got updated indexes %d times\n", atomic.LoadUint64(&successfulUpdates))
		fmt.Printf("We successfully got waited max block duration %d times\n", atomic.LoadUint64(&successfulWaits))
	}

	for i := 0; i < numBlockingRequests; i++ {
		go func() {
			// In case panic
			defer printSummary()
			BlockRandomLoop(c)
		}()
	}

	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, os.Interrupt, os.Kill)
	<-sigChan
	printSummary()
}

// BlockRandomLoop creates a blocking long-poll to a Nomad endpoint and asserts there are no errors
func BlockRandomLoop(c *api.Client) {
	time.Sleep(time.Duration(rand.Int63()) % (time.Second * 1))
	i := uint64(1)
	for {
		// Random time between 0 and 5 minutes
		blockTime := (time.Duration(rand.Int63()) % (time.Minute * 5))

		startTime := time.Now()
		_, meta, err := c.Jobs().List(&api.QueryOptions{
			WaitIndex: i,
			WaitTime:  blockTime,
		})
		elapsedDuration := time.Now().Sub(startTime)
		if err != nil {
			panic(err)
		}

		if meta.LastIndex != i {
			// Unblocked because stuff changed, should be < requested blocked time
			if elapsedDuration > (blockTime + timingGrace) {
				panic(fmt.Errorf("%s > %s", elapsedDuration, blockTime))
			}
			i = meta.LastIndex
			atomic.AddUint64(&successfulUpdates, 1)
			continue
		}

		// Should have blocked for "exactly" WaitTime
		e := time.Duration(math.Abs(float64(elapsedDuration - blockTime)))
		if e > timingGrace {
			panic(fmt.Errorf("block error too high between expected '%s' and actual '%s'", blockTime, elapsedDuration))
		}
		atomic.AddUint64(&successfulWaits, 1)
	}
}
```
LGTM 👍
Sorry for the delay on merging this @benbuzbee. I want to get one more Nomad engineer's eyes on this, but we're heading into the holiday weekend here in the US. I'm going to make sure this lands in Nomad 1.0.0, so it'll get merged sometime next week I'm sure.
This will ship in 1.0.0. The changelog already includes the original #8921 so nothing to add there. Thanks again @benbuzbee!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
Overhaul the RPC retry logic based on recent learnings about how it can fail. For the full detail on those learnings and the care we put into this, please see #9265.
I realize this probably looks scary, so I am also available for a voice chat, and I hope the details in #9265 help.
Description
The fix we merged into Nomad for the above issue is a good one: without it, retries cannot possibly work. But it does have a flaw.
Consider:
1. The client makes a blocking query with a timeout of 5 minutes.
2. At the 4-minute mark, the request fails with a retriable error (for example, the server it was talking to restarts and the connection returns an EOF).
3. The client's RPC helper retries the request.
What you would expect is that the second request, #3, has a timeout of 1 minute. Unfortunately, that is not what happens. The request is made again from the top, with a timeout of 5 minutes just like the first, so the entire request could take as much as 9 minutes even though the client asked for 5 minutes.
What is even worse is that if another retriable error occurs at the 7-minute mark, the code will not retry: it sees that 7 minutes have passed against a 5-minute timeout, so the client gets an EOF. What the client should have seen instead was a timeout at the 5-minute mark.
The correct fix for this is clear: step 3 should make a request to the server with a timeout of 1 minute (the original timeout, minus time already elapsed). However, how to implement that is not so straightforward.
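For illustration only, here is the deadline arithmetic that step 3 calls for; the function and helper names (`rpcWithRetry`, `doBlockingRPC`, `isRetriable`) are hypothetical stand-ins, not Nomad's actual client code:

```go
// Sketch only: hypothetical retry loop showing the deadline math the
// description asks for; not the actual implementation in client/rpc.go.
package rpcretry

import (
	"errors"
	"time"
)

var errTimeout = errors.New("rpc: blocking query timed out")

func rpcWithRetry(
	origTimeout time.Duration, // e.g. 5 minutes
	doBlockingRPC func(timeout time.Duration) error,
	isRetriable func(error) bool,
) error {
	deadline := time.Now().Add(origTimeout)
	for {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			// Budget exhausted: surface a timeout, not the last transport error.
			return errTimeout
		}
		// Each retry only asks for the time that is left (1 minute in the
		// example above), never the full original timeout.
		err := doBlockingRPC(remaining)
		if err == nil || !isRetriable(err) {
			return err
		}
		// Retriable error (e.g. EOF after a server restart): loop and retry.
	}
}
```

The hard part, described next, is that the real code cannot simply pass `remaining` down, because the request's wait time is buried inside the typed request struct.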
Because the RPC helpers bury the request time in interfaces and re-infer defaults on the server, there is no easy way for the RPC helper to change the request time to reflect the elapsed time.
One way to do that is with reflection. This is an inelegant solution that fixes the bug with a hammer. To fix it correctly, the client needs a way to tell the server the timeout in a way that is not invisible and type-lost to the RPC function; however, the complexity and risk of such a change do not seem appropriate.
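For illustration, here is roughly what that reflection hammer could look like. This is a hypothetical sketch (the `QueryOptions` below stands in for Nomad's `structs.QueryOptions`), not the change that was ultimately merged, which took the interface approach discussed in the review thread above:

```go
// Sketch only: a hypothetical reflection-based way to rewrite the wait
// time buried inside an arbitrary request struct. QueryOptions here is a
// stand-in for Nomad's structs.QueryOptions.
package rpcretry

import (
	"reflect"
	"time"
)

type QueryOptions struct {
	MaxQueryTime time.Duration
}

// setRemainingBlockTime walks pointers on args until it reaches a struct,
// looks for an embedded QueryOptions field, and overwrites its
// MaxQueryTime so a retried request only asks for the time that is left.
func setRemainingBlockTime(args interface{}, remaining time.Duration) bool {
	v := reflect.ValueOf(args)
	for v.Kind() == reflect.Ptr && !v.IsNil() {
		v = v.Elem()
	}
	if v.Kind() != reflect.Struct {
		return false
	}
	f := v.FieldByName("QueryOptions")
	if !f.IsValid() || !f.CanSet() || f.Type() != reflect.TypeOf(QueryOptions{}) {
		return false
	}
	qo := f.Interface().(QueryOptions)
	qo.MaxQueryTime = remaining
	f.Set(reflect.ValueOf(qo))
	return true
}
```

It works, but every caller's request type has to happen to embed `QueryOptions` by value, which is exactly the kind of invisible, type-lost coupling described above.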
Tested by
The following cases were tested. As a note, we have been using this patch at Cloudflare on thousands of nodes for the past two weeks and have noticed only improvements.
1. Ran this bash script for 8 minutes. It did eventually fail due to "no cluster leader" but everything retried appropriately:
   ```bash
   while :; do sudo docker restart nomad-server-$(($RANDOM % 5)) && sleep $((5 + $RANDOM % 10)); done
   ```
2. Changed default blocking time and logged how long blocking queries took using the above script. They took 30-40s each, which is expected when you add a default RPCHoldtime of 5s which can be applied twice for blocking queries. This seems sane to me.
3. Added test code to force an EOF error every 10 seconds (separate from above tests) and verified it always retries.