Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlock between Cache.put and invalidateAll #99480

Merged
merged 6 commits into from
Sep 14, 2023

Conversation

thecoop
Copy link
Member

@thecoop thecoop commented Sep 12, 2023

This fixes #99326

Unfortunately testing this is exceptionally difficult - hitting the second lock in put requires the item to be there already, but invalidateAll removes all items from the cache, meaning the put doesn't hit the second lock. The test I've added doesn't trigger the deadlock condition on the old code after 2000 runs

@thecoop thecoop added >bug :Core/Infra/Core Core issues without another label labels Sep 12, 2023
@elasticsearchmachine elasticsearchmachine added Team:Core/Infra Meta label for core/infra team v8.11.0 labels Sep 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Copy link
Collaborator

Hi @thecoop, I've created a changelog YAML for you.

@thecoop
Copy link
Member Author

thecoop commented Sep 12, 2023

This is probably worth backporting, maybe even to 7.17

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. This does seem worth backporting.

Copy link
Contributor

@ldematte ldematte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I checked the other methods and indeed invalidateAll is the only one acquiring locks in the wrong order.

@thecoop
Copy link
Member Author

thecoop commented Sep 13, 2023

The test added here doesn't actually find the original bug. I'm not sure how to even create a test for this case - in which case I might as well just remove it...

@ldematte
Copy link
Contributor

ldematte commented Sep 13, 2023

I think it is extremely hard to find. I tried using jcstress and cook up a quick test using the 8.9.2 artifacts, and after 680 iterations it does not find a issues. Now, jcstress is not particularly well versed in finding deadlocks, but if it does not find anything the chances of finding it with a normal test are slim.
The only thing I can think about is faking the locks, but that would require changes to Cache that I think are not worth it.

@thecoop
Copy link
Member Author

thecoop commented Sep 14, 2023

I've removed the test, given the problem was quite a basic one (incorrect lock acquisition order) and the fix is clear

@thecoop thecoop requested a review from rjernst September 14, 2023 09:55
@thecoop
Copy link
Member Author

thecoop commented Sep 14, 2023

@elasticmachine update branch

@thecoop
Copy link
Member Author

thecoop commented Sep 14, 2023

@elasticsearchmachine rerun elasticsearch-ci/part-1

Copy link
Member

@rjernst rjernst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still looks fine, but even if the situation is rare, ensuring the order of releasing locks seems a worthwhile thing to test. I don't feel that strongly about it (merge if you like), but I have the following suggestion for how to test:

The CacheSegment is what holds the lock objects which we would want to mock. I think a package private ctor for Cache could take the ctor for CacheSegment. Then move the construction of the read/write lock to an overridable method, remove the final (no reason for it to be final really anyways since it is private). Have a TestCacheSegment which subclasses and creates a delegate, so that you can hook into when locking/unlocking happens, and then assert on the order. Again, this is just an idea, I realize it is a bit of a change, but IMO not too much to ensure something that can result in a deadlock.

@thecoop thecoop merged commit 7ad521f into elastic:main Sep 14, 2023
thecoop added a commit to thecoop/elasticsearch that referenced this pull request Sep 14, 2023
The invalidateAll method is taking out the lru lock and segment locks in a different order to the put method, when the put is replacing an existing value. This results in a deadlock between the two methods as they try to swap locks. This fixes it by making sure invalidateAll takes out locks in the same order as put.

This is difficult to test because the put needs to be replacing an existing value, and invalidateAll clears the cache, resulting in subsequent puts not hitting the deadlock condition. A test that overrides some internal implementations to expose this particular deadlock will be coming later.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.10
7.17

@thecoop thecoop deleted the fix-invalidate-put-deadlock branch September 14, 2023 14:38
elasticsearchmachine pushed a commit that referenced this pull request Sep 14, 2023
The invalidateAll method is taking out the lru lock and segment locks in a different order to the put method, when the put is replacing an existing value. This results in a deadlock between the two methods as they try to swap locks. This fixes it by making sure invalidateAll takes out locks in the same order as put.

This is difficult to test because the put needs to be replacing an existing value, and invalidateAll clears the cache, resulting in subsequent puts not hitting the deadlock condition. A test that overrides some internal implementations to expose this particular deadlock will be coming later.
elasticsearchmachine pushed a commit that referenced this pull request Sep 14, 2023
The invalidateAll method is taking out the lru lock and segment locks in a different order to the put method, when the put is replacing an existing value. This results in a deadlock between the two methods as they try to swap locks. This fixes it by making sure invalidateAll takes out locks in the same order as put.

This is difficult to test because the put needs to be replacing an existing value, and invalidateAll clears the cache, resulting in subsequent puts not hitting the deadlock condition. A test that overrides some internal implementations to expose this particular deadlock will be coming later.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v7.17.14 v8.10.1 v8.11.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Potential deadlock via the Cache implementation
5 participants