
Race Condition in Agent Cache #5793

Closed
mkeeler opened this issue May 6, 2019 · 0 comments · Fixed by #5796
mkeeler commented May 6, 2019

We were running into some intermittent test failures when running the following:

make test-envoy-integ ENVOY_VERSIONS="1.9.1" FILTER_TESTS="centralconf" STOP_ON_FAIL=1

Sometimes we would fail to provide Envoy with a proxy configuration in a timely manner. After an extensive search I believe the problem resides in the agent cache, and in particular in the interaction between the Cache.getWithIndex and Cache.fetch functions.

Race Condition in Agent Cache

  • getWithIndex

    1. validate request
    2. Get a read lock on the cache entries
    3. Get current cache entry
    4. Relinquish the read lock
    5. Check whether the entry is valid, along with other conditions, to determine whether we need to refetch
    6. Issue a fetch for the new value (do not pass through the minIndex)
  • fetch

    1. Look up the type / validate the request
    2. gain a write lock on the cache entries
    3. defer unlocking until the end of the func
    4. lookup the current entry
    5. If the entry is already being fetched, return the current waiter
    6. If no entry and we are allowing new entries then create an invalid entry
    7. Set the entry to Fetching = true
    8. Put entry back into the entries map
    9. Spawn a goroutine to perform the real cache request
    10. Return the waiter chan
  • goroutine performing the real request

    1. Set up a timer to zero out the RefreshLostContact time after 31 seconds, because that's when we assume yamux would have killed our connection (via keep-alives) if we actually weren't in contact.
    2. Set up the blocking query, using the current entry's index as the MinIndex
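The lock hand-off described above can be sketched as follows. This is a simplified, hypothetical reconstruction for illustration only, not Consul's actual code; the type and field names (cacheEntry, Waiter, realFetch) are assumptions. The key structural point is that getWithIndex snapshots the entry under a read lock, releases it, and only later does fetch acquire the write lock:

```go
package main

import (
	"fmt"
	"sync"
)

// cacheEntry is a simplified stand-in for Consul's agent cache entry
// (hypothetical field names for illustration).
type cacheEntry struct {
	Value    interface{}
	Index    uint64
	Valid    bool
	Fetching bool
	Waiter   chan struct{}
}

type Cache struct {
	lock    sync.RWMutex
	entries map[string]cacheEntry
}

// getWithIndex mirrors the steps above: read-lock, snapshot the entry,
// unlock, and only then decide whether to fetch. The race lives in the
// window between the RUnlock here and fetch's Lock below.
func (c *Cache) getWithIndex(key string, minIndex uint64) (interface{}, error) {
	c.lock.RLock()
	entry := c.entries[key]
	c.lock.RUnlock()

	// Entry is valid and newer than what the caller last saw: return it.
	if entry.Valid && entry.Index > minIndex {
		return entry.Value, nil
	}

	// Otherwise block on a fetch. Note: minIndex is NOT passed through,
	// which is part of the bug described in this issue.
	waiter := c.fetch(key)
	<-waiter

	c.lock.RLock()
	entry = c.entries[key]
	c.lock.RUnlock()
	return entry.Value, nil
}

// fetch write-locks the entries, marks the entry as Fetching, and spawns
// the real request.
func (c *Cache) fetch(key string) <-chan struct{} {
	c.lock.Lock()
	defer c.lock.Unlock()

	entry, ok := c.entries[key]
	if ok && entry.Fetching {
		return entry.Waiter // someone is already fetching; share their waiter
	}
	if !ok {
		entry = cacheEntry{Valid: false} // new, invalid entry
	}
	entry.Fetching = true
	entry.Waiter = make(chan struct{})
	c.entries[key] = entry

	// Blocking query uses the entry's own index as MinIndex.
	go c.realFetch(key, entry.Index)

	return entry.Waiter
}

func (c *Cache) realFetch(key string, minIndex uint64) {
	// In Consul this issues a blocking query that waits for a Raft index
	// greater than minIndex; stubbed out here.
	c.lock.Lock()
	defer c.lock.Unlock()
	entry := c.entries[key]
	entry.Fetching = false
	entry.Index = minIndex + 1
	entry.Valid = true
	entry.Value = fmt.Sprintf("value@%d", entry.Index)
	close(entry.Waiter)
	c.entries[key] = entry
}

func main() {
	c := &Cache{entries: map[string]cacheEntry{}}
	v, _ := c.getWithIndex("proxy-config", 0)
	fmt.Println(v) // prints "value@1"
}
```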

How things go wrong.

If a call to fetch finishes after the read lock on the entries is relinquished, but before the next invocation of fetch gains the write lock, we will miss the update. Not only do we miss the update, but because we are not passing the minIndex from getWithIndex into fetch, we will wait until there is yet another update even though the currently cached value would already be new enough.
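One possible shape of a fix (a sketch of the idea only, under my own assumptions; the actual change in #5796 may differ) is to pass the caller's minIndex into fetch and re-check the entry under the write lock, so a fetch that completed in the unlocked window is noticed instead of skipped:

```go
package main

import (
	"fmt"
	"sync"
)

// Minimal, hypothetical types just to illustrate the re-check.
type entry struct {
	Valid bool
	Index uint64
	Value string
}

type cache struct {
	lock    sync.Mutex
	entries map[string]entry
}

// fetch re-checks the entry under the write lock against the caller's
// minIndex. If a concurrent fetch already produced a new-enough value,
// it returns a closed channel so the caller proceeds immediately instead
// of blocking until the *next* update.
func (c *cache) fetch(key string, minIndex uint64) <-chan struct{} {
	c.lock.Lock()
	defer c.lock.Unlock()

	e := c.entries[key]
	if e.Valid && e.Index > minIndex {
		done := make(chan struct{})
		close(done) // cached value is already new enough; don't wait
		return done
	}

	// ...otherwise mark the entry as fetching and spawn the real
	// blocking query, as in the steps above (stubbed for this sketch).
	done := make(chan struct{})
	close(done)
	return done
}

func main() {
	c := &cache{entries: map[string]entry{
		"proxy-config": {Valid: true, Index: 7, Value: "cfg@7"},
	}}
	// The caller last saw index 3; the cached index 7 already satisfies
	// it, so fetch returns immediately instead of blocking.
	<-c.fetch("proxy-config", 3)
	fmt.Println("returned without blocking")
}
```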

I will be considering how to fix this but wanted to track it here in case someone else from the team picks it up.

@mkeeler mkeeler added this to the 1.5.0 milestone May 6, 2019
@mkeeler mkeeler self-assigned this May 6, 2019