
Race Condition in Agent Cache #5793

Closed
mkeeler opened this issue May 6, 2019 · 0 comments · Fixed by #5796
mkeeler commented May 6, 2019

We were running into some intermittent test failures when running the following:

make test-envoy-integ ENVOY_VERSIONS="1.9.1" FILTER_TESTS="centralconf" STOP_ON_FAIL=1

Sometimes we would fail to provide Envoy with a proxy configuration in a timely manner. After an extensive search I believe the problem resides in the agent cache, and in particular in the interaction between the Cache.getWithIndex and Cache.fetch functions.

Race Condition in Agent Cache

  • getWithIndex

    1. validate request
    2. Get a read lock on the cache entries
    3. Get current cache entry
    4. Relinquish the read lock
    5. Check whether the entry is valid, along with other conditions, to determine whether we need to refetch
    6. Issue a fetch for the new value (do not pass through the minIndex)
  • fetch

    1. Look up the type / validate the request
    2. gain a write lock on the cache entries
    3. defer unlocking until the end of the func
    4. lookup the current entry
    5. If the entry is already being fetched, return the current waiter
    6. If no entry and we are allowing new entries then create an invalid entry
    7. Set the entry to Fetching = true
    8. Put entry back into the entries map
    9. Spawn a goroutine to perform the real cache request
    10. Return the waiter chan
  • goroutine performing the real request

    1. Set up a timer to zero out the RefreshLostContact time after 31 seconds, because that's when we assume yamux would have killed our connection (via keep-alives) if we actually weren't in contact.
    2. Set up the blocking query, using the current entry's index as the MinIndex
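The lock hand-off described above can be sketched as follows. This is a simplified, hypothetical reconstruction for illustration only, not Consul's actual code; the type and field names (cacheEntry, Waiter, realFetch) are assumptions. The key structural point is that getWithIndex snapshots the entry under a read lock, releases it, and only later does fetch acquire the write lock:

```go
package main

import (
	"fmt"
	"sync"
)

// cacheEntry is a simplified stand-in for Consul's agent cache entry
// (hypothetical field names for illustration).
type cacheEntry struct {
	Value    interface{}
	Index    uint64
	Valid    bool
	Fetching bool
	Waiter   chan struct{}
}

type Cache struct {
	lock    sync.RWMutex
	entries map[string]cacheEntry
}

// getWithIndex mirrors the steps above: read-lock, snapshot the entry,
// unlock, and only then decide whether to fetch. The race lives in the
// window between the RUnlock here and fetch's Lock below.
func (c *Cache) getWithIndex(key string, minIndex uint64) (interface{}, error) {
	c.lock.RLock()
	entry := c.entries[key]
	c.lock.RUnlock()

	// Entry is valid and newer than what the caller last saw: return it.
	if entry.Valid && entry.Index > minIndex {
		return entry.Value, nil
	}

	// Otherwise block on a fetch. Note: minIndex is NOT passed through,
	// which is part of the bug described in this issue.
	waiter := c.fetch(key)
	<-waiter

	c.lock.RLock()
	entry = c.entries[key]
	c.lock.RUnlock()
	return entry.Value, nil
}

// fetch write-locks the entries, marks the entry as Fetching, and spawns
// the real request.
func (c *Cache) fetch(key string) <-chan struct{} {
	c.lock.Lock()
	defer c.lock.Unlock()

	entry, ok := c.entries[key]
	if ok && entry.Fetching {
		return entry.Waiter // someone is already fetching; share their waiter
	}
	if !ok {
		entry = cacheEntry{Valid: false} // new, invalid entry
	}
	entry.Fetching = true
	entry.Waiter = make(chan struct{})
	c.entries[key] = entry

	// Blocking query uses the entry's own index as MinIndex.
	go c.realFetch(key, entry.Index)

	return entry.Waiter
}

func (c *Cache) realFetch(key string, minIndex uint64) {
	// In Consul this issues a blocking query that waits for a Raft index
	// greater than minIndex; stubbed out here.
	c.lock.Lock()
	defer c.lock.Unlock()
	entry := c.entries[key]
	entry.Fetching = false
	entry.Index = minIndex + 1
	entry.Valid = true
	entry.Value = fmt.Sprintf("value@%d", entry.Index)
	close(entry.Waiter)
	c.entries[key] = entry
}

func main() {
	c := &Cache{entries: map[string]cacheEntry{}}
	v, _ := c.getWithIndex("proxy-config", 0)
	fmt.Println(v) // prints "value@1"
}
```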

How things go wrong.

If a call to fetch finishes after the read lock on the entries is relinquished, but before the next invocation of fetch gains the write lock, we will miss the update. Not only do we miss the update, but because we are not passing the minIndex from getWithIndex into fetch, we will wait until there is yet another update even though the currently cached value would already be new enough.
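One possible shape of a fix (a sketch of the idea only, under my own assumptions; the actual change in #5796 may differ) is to pass the caller's minIndex into fetch and re-check the entry under the write lock, so a fetch that completed in the unlocked window is noticed instead of skipped:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal, hypothetical types just to illustrate the re-check.
type entry struct {
	Valid bool
	Index uint64
	Value string
}

type cache struct {
	lock    sync.Mutex
	entries map[string]entry
}

// fetch re-checks the entry under the write lock against the caller's
// minIndex. If a concurrent fetch already produced a new-enough value,
// it returns a closed channel so the caller proceeds immediately instead
// of blocking until the *next* update.
func (c *cache) fetch(key string, minIndex uint64) <-chan struct{} {
	c.lock.Lock()
	defer c.lock.Unlock()

	e := c.entries[key]
	if e.Valid && e.Index > minIndex {
		done := make(chan struct{})
		close(done) // cached value is already new enough; don't wait
		return done
	}

	// ...otherwise mark the entry as fetching and spawn the real
	// blocking query, as in the steps above (stubbed for this sketch).
	done := make(chan struct{})
	close(done)
	return done
}

func main() {
	c := &cache{entries: map[string]entry{
		"proxy-config": {Valid: true, Index: 7, Value: "cfg@7"},
	}}
	// The caller last saw index 3; the cached index 7 already satisfies
	// it, so fetch returns immediately instead of blocking.
	<-c.fetch("proxy-config", 3)
	fmt.Println("returned without blocking")
}
```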

I will be considering how to fix this but wanted to track it here in case someone else from the team picks it up.

@mkeeler mkeeler added this to the 1.5.0 milestone May 6, 2019
@mkeeler mkeeler self-assigned this May 6, 2019