streaming: replace agent/cache with submatview.Store #10112
Conversation
Force-pushed from 47172b7 to fc5cd56
@@ -15,5 +16,9 @@ type Deps struct {
 	Tokens *token.Store
 	Router *router.Router
 	ConnPool *pool.ConnPool
-	GRPCConnPool *grpc.ClientConnPool
+	GRPCConnPool GRPCClientConner
This was done so that tests which use `Agent.New` but don't actually need a gRPC connection pool can use a no-op fake.
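As a rough sketch of the pattern being described (the interface and method names here are illustrative stand-ins, not the exact types in the Consul tree), narrowing the dependency to a small interface is what lets tests plug in a no-op fake:

```go
package deps

import "google.golang.org/grpc"

// GRPCClientConner is a hypothetical stand-in for the small interface the
// agent depends on instead of the concrete connection pool type.
type GRPCClientConner interface {
	ClientConn(datacenter string) (*grpc.ClientConn, error)
}

// fakeGRPCConnPool is a no-op implementation for tests that construct the
// agent but never dial a gRPC connection.
type fakeGRPCConnPool struct{}

func (fakeGRPCConnPool) ClientConn(string) (*grpc.ClientConn, error) { return nil, nil }
```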
useStreaming := s.agent.config.UseStreamingBackend && args.MinQueryIndex > 0 && !args.Ingress
args.QueryOptions.UseCache = s.agent.config.HTTPUseCache && (args.QueryOptions.UseCache || useStreaming)

if args.QueryOptions.UseCache && useStreaming && args.Source.Node != "" {
	return nil, BadRequestError{Reason: "'near' query param can not be used with streaming"}
}
This logic has moved into `rpcclient/health.Client.useStreaming` so that all of the logic to determine which backend to use is in a single place.
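A hedged sketch of what centralizing that decision could look like (names and types are approximate, not the actual rpcclient/health code):

```go
package health

import "errors"

// queryOptions captures just the fields the decision needs; in the real code
// these values come from the agent config and the request.
type queryOptions struct {
	UseStreamingBackend bool
	MinQueryIndex       uint64
	Ingress             bool
	NearNode            string
}

var errNearWithStreaming = errors.New("'near' query param can not be used with streaming")

// useStreaming mirrors the conditions quoted above: streaming only applies to
// blocking queries (MinQueryIndex > 0) and not to ingress requests, and it is
// incompatible with the 'near' sort option.
func useStreaming(opts queryOptions) (bool, error) {
	use := opts.UseStreamingBackend && opts.MinQueryIndex > 0 && !opts.Ingress
	if use && opts.NearNode != "" {
		return false, errNearWithStreaming
	}
	return use, nil
}
```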
Force-pushed from fc5cd56 to ab169bb
@@ -123,6 +126,10 @@ func NewBaseDeps(configLoader ConfigLoader, logOut io.Writer) (BaseDeps, error)
 	return d, nil
 }

+// grpcLogInitOnce ensures the gRPC logger is only initialized once; the test
+// suite calls NewBaseDeps in many tests, and re-initializing it causes data races.
+var grpcLogInitOnce sync.Once
Doesn't this mean that these logs are useless during tests?
I guess it depends on how the logs are set up. If the logs are set up to go to stdout/stderr then it should work fine; they'll be attributed to the active test. If the logs are sent via `t.Log` it might hide some.
Although not doing this doesn't really fix the issue either, because the gRPC logger is a global. So if multiple tests are run in parallel, we would send all of the logs to one test logger, and none to the other.
This is only the logs for the grpc client and server though, so probably not important for any tests of Consul behaviour.
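For context, the guard under discussion is roughly this shape; the sketch below uses the real grpclog package, but the function name and logger wiring are simplified assumptions:

```go
package agent

import (
	"io"
	"sync"

	"google.golang.org/grpc/grpclog"
)

// grpcLogInitOnce guards the global gRPC logger, which may only be set once
// per process; tests call NewBaseDeps repeatedly and would otherwise race.
var grpcLogInitOnce sync.Once

func initGRPCLogger(out io.Writer) {
	grpcLogInitOnce.Do(func() {
		// grpclog keeps a single process-wide logger, so whichever test
		// initializes it first "wins" -- exactly the trade-off discussed above.
		grpclog.SetLoggerV2(grpclog.NewLoggerV2(out, out, out))
	})
}
```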
Force-pushed from 7dc881d to cf4e666
"error", err, | ||
"request-type", req.Type(), | ||
"index", index) | ||
continue |
Wouldn't this busy-loop since the `index` value hasn't been bumped?
I don't believe so. This is an error case, so I think it's expected that `index` doesn't change. `getFromView` will block when it gets called again, waiting on either a new value, another error, or the timeout.
In `Materializer` there is a retry with backoff on an error, so the `updateCh` that `getFromView` blocks on shouldn't get updated until that reports another error.
I think that sounds right. Perhaps a comment here clarifying the assumption that `getFromView` will never return repeated errors without a delay/backoff, and that this is therefore safe, would be useful?
It's not needed here, but I was thinking that if we did need to be defensive, adding a rate limit in this error case of the loop, rather than exponentially backing off errors again, might be a nice approach. That way, if something else is already doing a backoff we won't add additional compounded delays here, but if a condition is introduced later where the materializer could return errors without backing off, the rate limit will kick in to prevent this from becoming a busy loop.
I don't think we need to do that now, just thinking aloud.
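For what that rate-limit idea could look like, here is a minimal sketch (not the PR's code) using golang.org/x/time/rate; `getResult` stands in for `getFromView`:

```go
package submatview

import (
	"context"
	"log"

	"golang.org/x/time/rate"
)

// notifyLoop sketches a Notify-style loop that tolerates repeated errors.
// getResult blocks until a new value, an error, or a timeout.
func notifyLoop(ctx context.Context, getResult func(ctx context.Context, minIndex uint64) (uint64, error)) {
	// Allow at most one retry per second in the error path. If the
	// materializer already backs off, this limiter stays idle; if it ever
	// stops backing off, the limiter prevents a busy loop.
	limiter := rate.NewLimiter(rate.Limit(1), 1)

	var index uint64
	for {
		newIndex, err := getResult(ctx, index)
		if ctx.Err() != nil {
			return
		}
		if err != nil {
			log.Printf("error getting result: %v (index=%d)", err, index)
			if err := limiter.Wait(ctx); err != nil {
				return
			}
			continue
		}
		index = newIndex
	}
}
```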
ctx, cancel := context.WithCancel(context.Background())
go mat.Run(ctx)
Should `getEntry` remove the entry from the expiryHeap?
It could, but it's not necessary the way it is written now. `getEntry` increments the count of requests, and the expiry is ignored if `requests > 0`. Alternatively we could remove things from the expiry heap instead of tracking requests. I wonder if that would be more expensive because of the need to modify the heap more often, whereas incrementing the request count is pretty cheap since we have a reference to it from the map.
And I guess in many cases there won't be an entry in the heap, because there are already active requests.
Yeah I like this approach where the expiry heap only contains inactive entries rather than needing to maintain a heap over all entries and constantly update it to keep nudging back expiry on the active ones.
99% of the time I suspect all entries will be active so not needing to maintain a heap in steady state seems like a cleaner approach.
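A simplified sketch of the scheme being discussed (the real Store uses its own ttlcache package; here the bookkeeping is reduced to a counter and a timestamp):

```go
package submatview

import (
	"sync"
	"time"
)

// entry tracks one materialized view plus how many requests are using it.
type entry struct {
	requests  int       // active Get/Notify calls
	expiresAt time.Time // only meaningful while requests == 0
}

type store struct {
	lock  sync.Mutex
	byKey map[string]*entry
	ttl   time.Duration
}

// getEntry marks the entry active. While requests > 0 the expiry loop skips
// it, so nothing needs to be removed from (or re-pushed onto) an expiry heap.
func (s *store) getEntry(key string) *entry {
	s.lock.Lock()
	defer s.lock.Unlock()
	e, ok := s.byKey[key]
	if !ok {
		e = &entry{}
		s.byKey[key] = e
	}
	e.requests++
	return e
}

// releaseEntry is the counterpart: when the last request finishes, the entry
// becomes eligible for expiry again.
func (s *store) releaseEntry(key string) {
	s.lock.Lock()
	defer s.lock.Unlock()
	e := s.byKey[key]
	e.requests--
	if e.requests == 0 {
		e.expiresAt = time.Now().Add(s.ttl)
	}
}
```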
Force-pushed from cf4e666 to 21034cc
Also fixes a minor data race in Materializer. Capture the error before releasing the lock.
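As a tiny illustration of that kind of fix (the struct and field names here are assumed, not the real Materializer):

```go
package submatview

import "sync"

// materializer is a minimal stand-in with just the fields the fix touches.
type materializer struct {
	lock sync.Mutex
	err  error
}

// currentError copies the shared error while still holding the lock; reading
// m.err after Unlock would race with the goroutine that sets it.
func (m *materializer) currentError() error {
	m.lock.Lock()
	err := m.err
	m.lock.Unlock()
	return err
}
```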
Also rename it to `readEntry` now that it doesn't return the entire entry. Based on feedback in PR review, the full entry is not used by the caller, and accessing the fields wouldn't be safe outside the lock, so it is safer to return only the `Materializer`.
Previously, `getFromView` would call `view.Result` when the result might not be returned (because the index had not yet advanced past the `minIndex`). This would allocate a slice and sort it for no reason, because the values would never be returned. Fix this by re-ordering the operations in `getFromView`. The test changes in this commit were an attempt to cover the case where an update is received but the index does not exceed the `minIndex`.
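The reordering amounts to something like this sketch (hypothetical helper names; the point is that `Result`, which allocates and sorts, only runs once we know the result will be returned):

```go
package submatview

// view is a minimal stand-in for the View interface used here.
type view interface {
	Result(index uint64) interface{}
}

// resultIfNew only materializes a result when the view's index has moved past
// the caller's minIndex; otherwise the caller goes back to blocking on the
// update channel without paying for the allocation and sort in Result.
func resultIfNew(v view, index, minIndex uint64) (interface{}, bool) {
	if index <= minIndex {
		return nil, false
	}
	return v.Result(index), true
}
```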
Force-pushed from a07a43d to 31cd580
Looks awesome @dnephin 👏
I have a few minor comments inline, none are blocking though so see what you think!
@@ -0,0 +1,3 @@
+```release-note:bug
+streaming: fixes a bug that would cause context cancellation errors when a cache entry expired while requests were active.
+```
I wonder if it's worth adding, either in this entry or a separate one, an explicit reference/link to the scalability challenge and the reduction in the long tail of deliveries at scale? Not important though, just a thought.
run(t, tc)
})
}
}
Great testing pattern. I found this really easy to read and validate that we covered and asserted the right behaviour/cases!
@@ -32,7 +31,7 @@ type View interface {
 	// separately and passed in in case the return type needs an Index field
 	// populating. This allows implementations to not worry about maintaining
 	// indexes seen during Update.
-	Result(index uint64) (interface{}, error)
+	Result(index uint64) interface{}
I guess you removed the error as we were never using it, but I wonder how we handle propagating background fetching errors without it?
For example if the streaming connection can't be established, I'd assume that we'd want to 500 the HTTP request to the user rather than send them a blank 200 as if there was no data on the servers? How does the caller know the difference?
This interface is only for the view, which is effectively just a map with some filtering and sorting. There's no IO or network operations here. It seems unlikely that other `View` implementations would need an error return for this, but we can always add it back if we do need it.
Even the `error` return from `View.Update` is arguably not necessary. The one place we error there would be a logic bug, and could panic.
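For reference, a toy map-backed implementation (not one of the real views) shows why `Result` has no failure mode of its own; the real interface also has an `Update` method, as discussed above, but only `Result` matters here:

```go
package submatview

// View mirrors the shape shown in the diff above: an in-memory projection
// that the Materializer keeps up to date from the event stream.
type View interface {
	// Result returns the current result, stamped with the given index. It is
	// pure in-memory work (copying, filtering, sorting), which is why it
	// needs no error return.
	Result(index uint64) interface{}
}

// mapView is a toy implementation: a map plus a snapshot operation.
type mapView struct {
	items map[string]string
}

type result struct {
	Index uint64
	Items map[string]string
}

func (v *mapView) Result(index uint64) interface{} {
	out := make(map[string]string, len(v.items))
	for k, val := range v.items {
		out[k] = val
	}
	return result{Index: index, Items: out}
}
```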
// context.DeadlineExceeded is translated to nil to match the behaviour of
// agent/cache.Cache.Get.
I wonder if this could mask underlying network timeouts too?
I assume gRPC client code also uses contexts for things like connection timeouts; if it did, and there was an actual error like being unable to connect to servers which caused a timeout, could we mistakenly mask that error?
I wonder if there is a way to be sure this is the timeout based on the user's requested blocking timeout vs a downstream network timeout that's propagated directly back through the calls? Right now this might never be possible but it makes me wonder if this could ever result in a subtle bug later?
From what I can tell, request timeouts in gRPC are always set by the context, so there's no difference. Connect timeouts are reported using this error: https://github.com/grpc/grpc-go/blob/24d03d9f769106b3c96b4145244ce682999d3d88/internal/transport/transport.go#L713, so they would not be `context.DeadlineExceeded`.
I believe this is safe. I do think it is unfortunate, and it would be better to report this timeout to the user, but agent/cache masks these kinds of errors, so I thought for now it would be better to match the behaviour.
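The translation being discussed is essentially this (a sketch; the surrounding request plumbing is omitted):

```go
package submatview

import (
	"context"
	"errors"
)

// translateTimeout mirrors agent/cache.Cache.Get's behaviour: a blocking
// query that simply ran out its wait time is not an error, it just returns
// the current (possibly unchanged) result. Any other error is surfaced.
func translateTimeout(err error) error {
	if errors.Is(err, context.DeadlineExceeded) {
		return nil
	}
	return err
}
```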
"error", err, | ||
"request-type", req.Type(), | ||
"index", index) | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that sounds right, perhaps a comment here could clarify that assumption that getFromView
will never return repeated errors without a delay/backoff so this is safe would be useful?
It's not needed here, but I was thinking if we did need to be defensive, adding a rate limit in this error case of the loop rather than exponentially backing off errors again might be a nice approach. That way if something else is already doing a backoff we won't add additional compounded delays here, but if there is a condition introduced later where the materializer could return errors without backing of then the rate limit will kick in to prevent this from becoming a busy loop?
I don't think we need to do that now, just thinking aloud.
|
||
ctx, cancel := context.WithCancel(context.Background()) | ||
go mat.Run(ctx) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I like this approach where the expiry heap only contains inactive entries rather than needing to maintain a heap over all entries and constantly update it to keep nudging back expiry on the active ones.
99% of the time I suspect all entries will be active so not needing to maintain a heap in steady state seems like a cleaner approach.
defer store.lock.Unlock()
require.Len(t, store.byKey, 0)
require.Equal(t, ttlcache.NotIndexed, e.expiry.Index())
}
I think these tests look great.
Possibly a TODO for later, but what do you think about adding a slightly "fuzzy" test that starts up a bunch of concurrent `Notify` calls and asserts reasonable behaviour for each one. The value in the test isn't so much covering specific anomalies or cases but just verifying the concurrent behaviour works (i.e. all clients see the update), and when run with `-race` we have a better chance of catching any data race issues?
Ya, that is an option. In my mind, if we were to do that, we should use a real backend, not a fake one. A fake will never be close enough to the real behaviour, so such a test wouldn't be meaningful with a fake.
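For the record, the shape of such a test could look like the sketch below; it uses an inline fake purely so the snippet compiles on its own, whereas the point above is that a real version should exercise the real Store and backend:

```go
package submatview

import (
	"context"
	"sync"
	"testing"
	"time"
)

// fakeStore is a stand-in so this sketch is self-contained; a real version of
// this test would use the real Store wired to a real backend.
type fakeStore struct {
	mu      sync.Mutex
	watches []chan struct{}
}

func (s *fakeStore) Notify(ctx context.Context, ch chan struct{}) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.watches = append(s.watches, ch)
	return nil
}

func (s *fakeStore) publish() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ch := range s.watches {
		ch <- struct{}{} // each watch channel is buffered, so this never blocks
	}
}

// TestNotify_Concurrent sketches the "fuzzy" test idea: many concurrent
// watchers, one update, and every watcher must observe it; run with -race to
// give the race detector something to chew on.
func TestNotify_Concurrent(t *testing.T) {
	store := &fakeStore{}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	const watchers = 50
	var registered, done sync.WaitGroup
	for i := 0; i < watchers; i++ {
		registered.Add(1)
		done.Add(1)
		go func() {
			defer done.Done()
			ch := make(chan struct{}, 1)
			err := store.Notify(ctx, ch)
			registered.Done()
			if err != nil {
				t.Error(err)
				return
			}
			select {
			case <-ch:
			case <-ctx.Done():
				t.Error("timed out waiting for an update")
			}
		}()
	}
	registered.Wait() // all watchers registered concurrently
	store.publish()   // one update; every watcher should see it
	done.Wait()
}
```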
Co-authored-by: Paul Banks <[email protected]>
Force-pushed from 674e2e4 to 3a27fce
LGTM
The failing tests pass locally, so the failures are likely flakes.
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/358627.
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/358712.
🍒✅ Cherry pick of commit 9b344b3 onto
…m-cache streaming: replace agent/cache with submatview.Store
This PR replaces `agent/cache.Cache` in the streaming flow with a new `submatview.Store`. The new `Store` implements a similar interface, and consumes similar interfaces, but removes all the logic about fetching results, which is no longer necessary with streaming.

Some notable differences:

- A `Request` specifies the details about the type and how to initialize the `Materializer` if there is no existing entry.
- Stopping the `Materializer` (which shuts down the stream) is done synchronously with expiring the entry from the store. This prevents issues that we encountered in the cache where expiration would happen while background goroutines were still running.

This PR is unfortunately a bit large, but hopefully it is mostly tests. This PR includes #10110 (which can be rebased out if that merges), and #10068, because it was not really possible to test those changes without first making this change.
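A minimal sketch of the synchronous expiry described in the second bullet (type and field names are illustrative, not the real Store):

```go
package submatview

import (
	"context"
	"sync"
)

// storeEntry pairs a materializer's cancel func with a usage count.
type storeEntry struct {
	requests int
	cancel   context.CancelFunc // stops the materializer and its stream
}

type viewStore struct {
	lock  sync.Mutex
	byKey map[string]*storeEntry
}

// expire shuts the materializer down in the same critical section that
// removes the entry, so no background goroutine keeps running against an
// evicted entry.
func (s *viewStore) expire(key string) {
	s.lock.Lock()
	defer s.lock.Unlock()
	e, ok := s.byKey[key]
	if !ok || e.requests > 0 {
		return // still in use by active requests
	}
	e.cancel()
	delete(s.byKey, key)
}
```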
TODO:

- `cache-type/streaming_health_services_test.go` to `submatview/store_test.go`
- `Agent.New`
- `Store.Get` and `Store.Notify`