POC: Cap shard failure lists to a fixed small size (March 2024) #106135
This is an exploratory POC to address #103708 and #99220.
After some earlier exploratory code, I decided not to change the AtomicArray of ShardSearchFailures in `AbstractSearchAsyncAction`. Changing it really messes up the lock-free thread-safety model of that class. In addition, other classes keep AtomicArrays of all shard results, so this is not the only offender.

Instead, I focused on reducing the number of failures reported in the SearchResponse. The SearchResponse does not track the failed shard count independently of the ShardSearchFailure array, so a new field had to be added for it.
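To illustrate the idea, here is a minimal sketch of capping the reported failures while keeping the true failure count in a separate field. It is not the code in this PR: `CappedFailureReport`, `failedShards`, `reportedFailures`, and `maxReported` are hypothetical names, and plain strings stand in for ShardSearchFailure objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for the response-side view of shard failures.
// This is NOT an Elasticsearch class; it only illustrates the approach.
final class CappedFailureReport {
    // Total number of failed shards, tracked independently of the list below.
    final int failedShards;
    // At most maxReported failures, in their original order.
    final List<String> reportedFailures;

    private CappedFailureReport(int failedShards, List<String> reportedFailures) {
        this.failedShards = failedShards;
        this.reportedFailures = reportedFailures;
    }

    // Builds a report from the full failure list without modifying it,
    // so the collector that owns the full list is left untouched.
    static CappedFailureReport of(List<String> allFailures, int maxReported) {
        if (maxReported < 0) {
            throw new IllegalArgumentException("maxReported must be >= 0");
        }
        int keep = Math.min(allFailures.size(), maxReported);
        List<String> kept = Collections.unmodifiableList(
            new ArrayList<>(allFailures.subList(0, keep)));
        return new CappedFailureReport(allFailures.size(), kept);
    }
}
```

Because the count no longer comes from the length of the failure list, truncating the list does not change the reported number of failed shards.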
Most tests are passing, but I need to do further work on the remaining ones. Also, CCS with MRT=false is not yet truncating the number of failures in the _cluster/details/failures section, so I need to track down where that occurs.