Fix deadlock between Cache.put and invalidateAll #99480
Pinging @elastic/es-core-infra (Team:Core/Infra)
Hi @thecoop, I've created a changelog YAML for you.
This is probably worth backporting, maybe even to 7.17.
LGTM. This does seem worth backporting.
LGTM. I checked the other methods and indeed invalidateAll is the only one acquiring locks in the wrong order.
The test added here doesn't actually find the original bug. I'm not sure how to even create a test for this case - in which case I might as well just remove it...
I think it is extremely hard to find. I tried using |
I've removed the test, given the problem was quite a basic one (incorrect lock acquisition order) and the fix is clear.
@elasticmachine update branch
@elasticsearchmachine rerun elasticsearch-ci/part-1
This still looks fine, but even if the situation is rare, ensuring the order of releasing locks seems a worthwhile thing to test. I don't feel that strongly about it (merge if you like), but I have the following suggestion for how to test:
The CacheSegment is what holds the lock objects which we would want to mock. I think a package private ctor for Cache could take the ctor for CacheSegment. Then move the construction of the read/write lock to an overridable method, remove the final (no reason for it to be final really anyways since it is private). Have a TestCacheSegment which subclasses and creates a delegate, so that you can hook into when locking/unlocking happens, and then assert on the order. Again, this is just an idea, I realize it is a bit of a change, but IMO not too much to ensure something that can result in a deadlock.
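A rough sketch of the delegate idea suggested above, to make it concrete. The class name RecordingLock, the lock names "lru" and "segment", and the demo method are all hypothetical and not part of the Elasticsearch Cache code; the acquisition order shown is only an illustration, since the real order is whatever put uses. The wrapper records every lock and unlock into a shared list so a test can assert on the sequence.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical test helper: wraps a real lock and records each
// successful acquire and each release in a shared event list.
final class RecordingLock implements Lock {
    private final String name;
    private final Lock delegate;
    private final List<String> events;

    RecordingLock(String name, Lock delegate, List<String> events) {
        this.name = name;
        this.delegate = delegate;
        this.events = events;
    }

    @Override
    public void lock() {
        delegate.lock();
        events.add("lock:" + name); // recorded only once the lock is held
    }

    @Override
    public void lockInterruptibly() throws InterruptedException {
        delegate.lockInterruptibly();
        events.add("lock:" + name);
    }

    @Override
    public boolean tryLock() {
        boolean acquired = delegate.tryLock();
        if (acquired) {
            events.add("lock:" + name);
        }
        return acquired;
    }

    @Override
    public boolean tryLock(long time, TimeUnit unit) throws InterruptedException {
        boolean acquired = delegate.tryLock(time, unit);
        if (acquired) {
            events.add("lock:" + name);
        }
        return acquired;
    }

    @Override
    public void unlock() {
        events.add("unlock:" + name); // recorded while the lock is still held
        delegate.unlock();
    }

    @Override
    public Condition newCondition() {
        return delegate.newCondition();
    }

    // Stand-in for a cache operation: takes two locks in a fixed
    // (assumed) order and returns the recorded sequence.
    static List<String> demo() {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        Lock lru = new RecordingLock("lru", new ReentrantLock(), events);
        Lock segment = new RecordingLock("segment", new ReentrantLock(), events);
        lru.lock();
        try {
            segment.lock();
            try {
                // the cache mutation would happen here
            } finally {
                segment.unlock();
            }
        } finally {
            lru.unlock();
        }
        return new ArrayList<>(events);
    }
}
```

A test built on this would run put and invalidateAll against a cache wired with such locks and assert that both produce the same relative order of acquisitions, which does not depend on actually provoking the race.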
The invalidateAll method is taking out the lru lock and segment locks in a different order to the put method, when the put is replacing an existing value. This results in a deadlock between the two methods as they try to swap locks. This fixes it by making sure invalidateAll takes out locks in the same order as put. This is difficult to test because the put needs to be replacing an existing value, and invalidateAll clears the cache, resulting in subsequent puts not hitting the deadlock condition. A test that overrides some internal implementations to expose this particular deadlock will be coming later.
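For illustration, a minimal sketch of the rule the fix relies on: two methods that need the same pair of locks must acquire them in the same order. This is not the Elasticsearch Cache implementation. The class LockOrderSketch, its fields, and the choice of which lock comes first are assumptions; the commit message only states that invalidateAll now matches the order used by put.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for a cache with one LRU lock and one segment lock.
final class LockOrderSketch {
    private final ReentrantLock lruLock = new ReentrantLock();
    private final ReentrantLock segmentLock = new ReentrantLock();
    private int size = 0;

    // Assumed order for this sketch: lruLock first, then segmentLock.
    void put() {
        lruLock.lock();
        try {
            segmentLock.lock();
            try {
                size++;
            } finally {
                segmentLock.unlock();
            }
        } finally {
            lruLock.unlock();
        }
    }

    // Same order as put. If this method took segmentLock first, each
    // thread could end up holding one lock while waiting for the other.
    void invalidateAll() {
        lruLock.lock();
        try {
            segmentLock.lock();
            try {
                size = 0;
            } finally {
                segmentLock.unlock();
            }
        } finally {
            lruLock.unlock();
        }
    }

    // Runs put and invalidateAll on two threads and reports whether both
    // finished before the join timeout.
    static boolean runConcurrently(int iterations) {
        LockOrderSketch cache = new LockOrderSketch();
        Thread putter = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                cache.put();
            }
        });
        Thread invalidator = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                cache.invalidateAll();
            }
        });
        // Daemon threads so a hang cannot keep the JVM alive.
        putter.setDaemon(true);
        invalidator.setDaemon(true);
        putter.start();
        invalidator.start();
        try {
            putter.join(5000);
            invalidator.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !putter.isAlive() && !invalidator.isAlive();
    }
}
```

With a consistent order a circular wait between the two methods cannot form, so the concurrent run always completes. As discussed in this PR, a stress loop like this is not a reliable way to detect the opposite case, because the inverted order only deadlocks under a narrow interleaving.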
This fixes #99326
Unfortunately testing this is exceptionally difficult - hitting the second lock in put requires the item to be there already, but invalidateAll removes all items from the cache, meaning the put doesn't hit the second lock. The test I've added doesn't trigger the deadlock condition on the old code after 2000 runs.