agent: add an inflight cache for better concurrent request handling #10705
Conversation
Just to confirm my understanding: let's assume we have a steady stream of non-cacheable requests coming in for the exact same URL, say 1/ms, and they take 20ms each for Vault to process. The first request to hit Agent will get added to the inflight cache. The next 19 requests will queue up behind it. After 20ms the first request is done and Agent will then immediately forward all 19 other requests that came in while we were waiting. As new requests come in every 1ms, each of them will be forwarded immediately since the inflight request channel is closed. If there were an interruption in the flow of incoming requests long enough for the cache entry to be removed, the whole sequence would restart, with an initial delay, then all pending requests released at once, then no more delays. This approach is motivated by the fact that we don't cache some requests, and we don't know at the outset whether a request will be cacheable. I don't think it's bad, it's certainly an improvement over what we have now, and I don't have a better idea right now, but it does lead to some burstiness as well as somewhat unpredictable latency.
Yes, that's correct.
Yes, that's the case as well. Although, if there is no interruption and this is a continuous steady stream of requests, the inflight cache entry might never get cleaned up. On the bright side, the delta here is just on the first request.

Good point on the burstiness aspect of this logic. I don't have a straight answer to this other than introducing a jitter, but that would also add latency. Maybe that's a fine trade-off to avoid overwhelming the Vault server.
I wasn't asking for any changes, just making sure we're clear on the new behaviour. Like I say, I don't think it's particularly bad, it's just potentially a bit surprising. I vote against adding more complexity to the solution unless it improves the user experience. I don't think there's a straightforward way to stagger the backlogged requests; at least, I don't see a way to improve things in general - it feels like any attempt we make could help in some cases and hurt in others.
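To make the burst behaviour concrete, here is a toy, self-contained reproduction of the timing described above. It is not the actual Agent code; the `gate` channel stands in for `inflight.ch`, and the sleeps stand in for request arrival and Vault's processing time. Followers that arrive while the winner is in flight all block on the channel and are released at the same instant it is closed:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	gate := make(chan struct{}) // stands in for inflight.ch
	start := time.Now()
	var wg sync.WaitGroup

	// Winner: takes ~20ms to get an answer from "Vault", then opens the gate.
	go func() {
		time.Sleep(20 * time.Millisecond)
		close(gate)
	}()

	// Followers arriving every 1ms while the winner is still in flight.
	for i := 1; i <= 5; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			time.Sleep(time.Duration(n) * time.Millisecond) // staggered arrival
			<-gate // blocks only until the close; never blocks afterwards
			fmt.Printf("request %d forwarded at %v\n", n, time.Since(start).Round(time.Millisecond))
		}(i)
	}
	wg.Wait() // all five print ~20ms: the backlog is released as a single burst
}
```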
defer func() {
	// Cleanup on the cache if there are no remaining inflight requests.
	// This is the last step, so we defer the call first
	if inflight != nil && inflight.remaining.Load() == 0 {
When would `inflight` be `nil` at this point?
It shouldn't be nil because the conditional further down always assigns this value, but I'm nil-checking just in case since this is within a defer that's registered right after the variable is declared but before the value is assigned.
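A minimal sketch of the shape being described, with hypothetical names (`send`, `inflightRequest`): the deferred function is registered while `inflight` is still `nil`, so any early return before the assignment would reach the defer with a nil pointer, and the guard keeps that path safe:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

type inflightRequest struct{ remaining atomic.Int64 }

func send(fail bool) error {
	var inflight *inflightRequest
	defer func() {
		// Registered while inflight is still nil; the nil check guards the
		// early-return path below.
		if inflight != nil && inflight.remaining.Load() == 0 {
			fmt.Println("cleanup ran with an assigned inflight")
		}
	}()
	if fail {
		return errors.New("early return before inflight is assigned")
	}
	inflight = &inflightRequest{}
	return nil
}

func main() {
	fmt.Println(send(true))  // defer sees inflight == nil; no panic thanks to the check
	fmt.Println(send(false)) // defer sees the assigned value and runs the cleanup
}
```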
// Cleanup on the cache if there are no remaining inflight requests.
// This is the last step, so we defer the call first
if inflight != nil && inflight.remaining.Load() == 0 {
	c.inflightCache.Delete(id)
This felt a little racy since we can be here multiple times for the same id if a request comes in right after the condition `Load() == 0` is checked. But I've gone through it and don't think there is any harmful behavior. The `inflight` object still exists even if it's deleted from the cache, so the final request can complete, and calling `Delete()` on an id not present is a no-op.
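A small illustration of why the race is benign, assuming the inflight cache is a `sync.Map` (the names here are stand-ins): `Delete` on an absent key is a no-op, and a goroutine that still holds a pointer to the entry can keep using it after it has been removed from the map:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var m sync.Map
	type entry struct{ name string }
	e := &entry{name: "inflight-abc"}
	m.Store("abc", e)

	// Two concurrent cleanups that both observed remaining == 0: whichever
	// Delete runs second finds nothing and is a harmless no-op.
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); m.Delete("abc") }()
	}
	wg.Wait()

	// The entry itself outlives its map membership, so a request that still
	// holds the pointer can finish normally.
	fmt.Println(e.name, "is still usable after Delete")
}
```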
@ncabatoff @kalafut @briankassouf thoughts on having this backported to 1.6.x? It's technically a bug fix, but at the same time it's not a common case and isn't noticeable except under a specific scenario. We're also tight on time for the upcoming patch release.
select {
case <-ctx.Done():
	return nil, ctx.Err()
case <-inflight.ch:
So right now, if we detect here that the thread processing the request has completed (the channel has been closed), we simply continue. But once we get down to:

cachedResp, err := c.checkCacheForRequest(id)

we'd see a nil cachedResp, since that is still only going to cache leased values. Then, I think, we'd simply re-send the request to the Vault server. I think this fix is missing a step where we store the resulting response in the `inflightRequest` object and access it here when the channel is closed. Thoughts?
The winner that closed the channel will have cached the response before this thread gets to call `c.checkCacheForRequest(id)`, so it will result in a cache hit. In the case that the request resulted in a non-cacheable response, it would proxy to Vault as it should.

The changes in the PR don't actually prevent identical non-cacheable requests from being proxied to Vault; they simply allow one of the requests to be processed first (since we don't know whether it's cacheable) before opening the floodgate to let the other identical requests be processed concurrently. I don't think there's a need to store the actual request/response object in the `inflightRequest`.
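A compact sketch of that ordering argument (hypothetical names; `done` stands in for `inflight.ch`): because the winner stores the response before closing the channel, the Go memory model guarantees that any follower woken by the close observes the cached value:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var leaseCache sync.Map // id -> cached response (leased responses only)
	done := make(chan struct{})
	var wg sync.WaitGroup

	// Followers wake only after the winner has closed the channel; the close
	// happens after the Store, so the Load below is guaranteed to see it.
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-done
			if cached, ok := leaseCache.Load("req-1"); ok {
				fmt.Println("cache hit:", cached)
				return
			}
			fmt.Println("non-cacheable response; proxying to Vault as usual")
		}()
	}

	// Winner: cache the (cacheable) response first, then open the gate.
	leaseCache.Store("req-1", "response-from-vault")
	close(done)
	wg.Wait()
}
```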
This PR refactors `LeaseCache.Send` to better handle identical concurrent requests that are processed at this layer.

The original code relied solely on an `idLock`, which would get upgraded to a write lock if the request resulted in a cache miss, meaning that non-cacheable identical concurrent requests (e.g. Vault KV reads) would be processed serially. It also was not properly protecting identical concurrent cacheable requests on a clean cache from being proxied more than once.

This PR introduces an inflight cache that allows one of these requests to win and proceed with processing before the others are handled. The implementation takes a fan-out approach to ensure that a request is processed at least once (and that only one of them is) so that `LeaseCache` has a chance to cache its response.

The changes in this PR also ensure that identical concurrent cacheable requests whose response is not yet cached will only ever have the request proxied once, since the inflight cache only allows one winner to fully process the request before the others can proceed.