Fix RPC retry logic in nomad client's rpc.go for blocking queries #9266

Merged: 2 commits merged into hashicorp:master on Nov 30, 2020

Conversation

@benbuzbee (Contributor) commented:

This overhauls the RPC retry logic based on what we recently learned about how it can fail. For a detailed account of those findings and the care we put into this change, please see #9265.

I realize this probably looks scary, so I am also available for a voice chat, and I hope the details in #9265 help.

Description

The fix we merged into Nomad for the above issue is a good one; without it, retries cannot possibly work. It does, however, have a flaw.

Consider:

  1. A long poll is made with a timeout of 5 minutes
  2. A retriable error occurs at the 4 minute mark
  3. We make a new request to the server

What you would expect is that the second request (step 3) has a timeout of 1 minute. Unfortunately, that is not what happens. The request is made again from the top, with a timeout of 5 minutes just like the first, so the overall call can take as long as 9 minutes even though the client asked for 5.

Worse, if another retriable error occurs at the 7-minute mark, the code will not retry, because it sees that 7 minutes have elapsed against a 5-minute timeout, and the client gets an EOF. What the client should have seen instead is a timeout at the 5-minute mark.

The correct fix is clear: step 3 should make a request to the server with a timeout of 1 minute (the original timeout minus the time already elapsed). How to implement that, however, is not so straightforward.

Because the RPC helpers bury the request time inside interfaces and the server re-infers defaults, there is no easy way for the RPC helper to adjust the request time to account for the time already elapsed.

One way to do that is with reflection. It is an inelegant solution that fixes the bug with a hammer. To fix it properly, the client would need a way to tell the server the timeout that is not invisible and type-lost to the RPC function, but the complexity and risk of such a change do not seem appropriate here.
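For illustration only, here is a minimal sketch of the deadline-shrinking idea described above. This is not the actual Nomad code: `rpcWithRetry`, `queryReq`, and `isRetriable` are hypothetical stand-ins invented for this example. The point is simply that the deadline is fixed once from the caller's original timeout, and each retry attempt only gets the time that remains.

```go
// Hypothetical, simplified sketch of deadline-aware retries for a blocking
// query; not Nomad's real API.
package main

import (
	"errors"
	"fmt"
	"time"
)

type queryReq struct {
	// MaxQueryTime is how long the server may block this attempt.
	MaxQueryTime time.Duration
}

// rpcWithRetry retries a blocking query but never extends the caller's
// original timeout: each attempt only gets the time that remains.
func rpcWithRetry(rpcFn func(queryReq) error, req queryReq) error {
	// Fix the deadline once, from the caller's original timeout.
	deadline := time.Now().Add(req.MaxQueryTime)

	for {
		remaining := time.Until(deadline)
		if remaining <= 0 {
			return errors.New("blocking query timed out")
		}
		// Shrink the blocking window for this attempt.
		req.MaxQueryTime = remaining

		err := rpcFn(req)
		if err == nil || !isRetriable(err) {
			return err
		}
		// Retriable error (e.g. the server went away): loop and retry
		// with the shrunken timeout.
	}
}

func isRetriable(err error) bool {
	// Placeholder: real code would check for EOF/shutdown-style errors.
	return true
}

func main() {
	start := time.Now()
	err := rpcWithRetry(func(q queryReq) error {
		// Simulate a server that fails partway through the blocking window.
		time.Sleep(q.MaxQueryTime / 2)
		return errors.New("EOF")
	}, queryReq{MaxQueryTime: 2 * time.Second})
	fmt.Printf("err=%v, total elapsed: %s (bounded by the original 2s)\n", err, time.Since(start))
}
```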

Tested by

The following cases were tested. As a note, we have been running this patch at Cloudflare on thousands of nodes for the past two weeks and have seen only improvement.

1. Ran this bash script for 8 minutes. It did eventually fail due to "no cluster leader", but everything retried appropriately:

```
while :; do sudo docker restart nomad-server-$(($RANDOM % 5)) && sleep $((5 + $RANDOM % 10)); done
```

2. Changed the default blocking time and logged how long blocking queries took while running the script above. They took 30-40s each, which is expected when you add a default RPCHoldtime of 5s that can be applied twice for blocking queries. This seems sane to me.

3. Added test code to force an EOF error every 10 seconds (separate from the tests above) and verified that the client always retries.
@tgross (Member) left a comment:

Hi @benbuzbee! This looks great! I've left a question about the reflection approach, but it's possible I'm just missing something there.

client/rpc.go (outdated):

```
@@ -46,19 +47,53 @@ func (c *Client) StreamingRpcHandler(method string) (structs.StreamingRpcHandler
	return c.streamingRpcs.GetHandler(method)
}

// Given a type that is or eventually points to a concrete type with an embedded QueryOptions
```
@tgross (Member):

These reflection-based functions are really here because we've removed HasTimeout from the RPCInfo interface, which was implemented by both QueryOptions and WriteRequest. It has been replaced only on the QueryOptions side by TimeToBlock. But it looks to me like everywhere we call TimeToBlock, we are testing for a non-zero result.

So couldn't we avoid this by having TimeToBlock on the RPCInfo interface, having WriteRequest implement it, but have WriteRequest always return 0? That seems cheaper to me if we can make it work.
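For clarity, a minimal sketch of the suggested shape might look like the following. These are hypothetical, simplified types: the real RPCInfo, QueryOptions, and WriteRequest in nomad/structs carry many more fields and methods, and the helper `retryableBlocking` is invented here only to show the non-zero check.

```go
// Simplified sketch of the suggestion, not the real nomad/structs package.
package structs

import "time"

// RPCInfo is the interface both request types would satisfy.
type RPCInfo interface {
	// TimeToBlock reports how long the request is allowed to block.
	TimeToBlock() time.Duration
}

type QueryOptions struct {
	MaxQueryTime time.Duration
}

// Blocking queries report their configured wait time.
func (q *QueryOptions) TimeToBlock() time.Duration {
	return q.MaxQueryTime
}

type WriteRequest struct{}

// Writes never block, so they always report zero; callers that test for a
// non-zero result then skip the blocking-query bookkeeping entirely.
func (w *WriteRequest) TimeToBlock() time.Duration {
	return 0
}

// retryableBlocking reports whether a request should use the blocking-query
// retry path.
func retryableBlocking(info RPCInfo) bool {
	return info.TimeToBlock() != 0
}
```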

@benbuzbee (Contributor, Author) replied:

I originally considered this but, for some reason, thought it was better to use reflection; maybe that was early in my thinking before it evolved a bit. Either way, I don't see any reason not to do it this way now, so I updated the PR.

@benbuzbee (Contributor, Author) commented:

For posterity, here is a simple Go app I used to test the code, by running it while killing Nomad servers in the cluster:

```go
package main

import (
	"fmt"
	"math"
	"os"
	"os/signal"
	"sync/atomic"

	"math/rand"
	"time"

	"github.com/hashicorp/nomad/api"
)

// How many blocking requests to run in parallel
const numBlockingRequests int = 1000

// When asserting timings, how much grace to give in the calculation for
// rpc hold time and jitter
const timingGrace = time.Minute

var successfulUpdates uint64 = 0
var successfulWaits uint64 = 0

func main() {
	cfg := api.DefaultConfig()
	c, err := api.NewClient(cfg)
	if err != nil {
		panic(err)
	}

	startTime := time.Now()
	printSummary := func() {
		fmt.Printf("We lasted for %s\n", time.Now().Sub(startTime))
		fmt.Printf("We successfully got updated indexes %d times\n", atomic.LoadUint64(&successfulUpdates))
		fmt.Printf("We successfully got waited max block duration %d times\n", atomic.LoadUint64(&successfulWaits))
	}
	for i := 0; i < numBlockingRequests; i++ {
		go func() {
			// In case panic
			defer printSummary()
			BlockRandomLoop(c)
		}()
	}

	sigChan := make(chan os.Signal, 1)
	signal.Notify(sigChan, os.Interrupt, os.Kill)
	<-sigChan
	printSummary()
}

// BlockRandomLoop creates a blocking long-poll to a Nomad endpoint and asserts there are no errors
func BlockRandomLoop(c *api.Client) {
	time.Sleep(time.Duration(rand.Int63()) % (time.Second * 1))
	i := uint64(1)
	for {
		// Random time between 0 and 5 minutes
		blockTime := (time.Duration(rand.Int63()) % (time.Minute * 5))

		startTime := time.Now()
		_, meta, err := c.Jobs().List(&api.QueryOptions{
			WaitIndex: i,
			WaitTime:  blockTime,
		})
		elapsedDuration := time.Now().Sub(startTime)

		if err != nil {
			panic(err)
		}

		if meta.LastIndex != i {
			// Unblocked because stuff changed, should be < requested blocked time
			if elapsedDuration > (blockTime + timingGrace) {
				panic(fmt.Errorf("%s > %s", elapsedDuration, blockTime))
			}
			i = meta.LastIndex
			atomic.AddUint64(&successfulUpdates, 1)
			continue
		}

		// Should have blocked for "exactly" WaitTime
		e := time.Duration(math.Abs(float64(elapsedDuration - blockTime)))
		if e > timingGrace {
			panic(fmt.Errorf("block error too high between expected '%s' and actual '%s'", blockTime, elapsedDuration))
		}
		atomic.AddUint64(&successfulWaits, 1)
	}
}
```

@tgross (Member) left a comment:

LGTM 👍

@tgross added this to the 1.0 milestone on Nov 25, 2020.
@tgross (Member) commented on Nov 25, 2020:

Sorry for the delay on merging this @benbuzbee. I want to get one more Nomad engineer's eyes on this, but we're heading into the holiday weekend here in the US. I'm going to make sure this lands in Nomad 1.0.0, so it'll get merged sometime next week I'm sure.

@tgross merged commit 6a6547b into hashicorp:master on Nov 30, 2020.
@tgross (Member) commented on Nov 30, 2020:

This will ship in 1.0.0. The changelog already includes the original #8921 so nothing to add there. Thanks again @benbuzbee!

@github-actions bot commented on Dec 8, 2022:

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions bot locked the conversation as resolved and limited it to collaborators on Dec 8, 2022.