Loki: query scheduler should send shutdown to frontends when ReplicationSet changes #4614
What this PR does / why we need it:
This PR attempts to mitigate a race that occurs when instances are added to or removed from the scheduler ring, where requests could end up stuck in a scheduler's queue with no queriers coming back to process them.
Currently, when an elected scheduler is no longer "elected", everyone finds out through the ring, but the scheduler itself does nothing and technically stays active. The frontends and queriers poll the ring and, when they find out about the scheduler change, disconnect from the old scheduler and connect to the new one.
Because this is distributed, it is possible for all the queriers to find out about a scheduler change and disconnect from the scheduler while the frontend is still connected. The frontend could then be stuck waiting for inflight queries to be processed (which will never happen, because all the queriers have moved on), ultimately leading to a timeout or failure of the query.
To mitigate this, as soon as a scheduler knows it is no longer in the ReplicationSet of instances that should act as schedulers, it sends a shutdown message to its connected frontends so they cancel their requests and retry them on other schedulers.
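The behavior described above can be illustrated with a minimal, self-contained Go sketch. It is not the code from this PR: the `scheduler` and `frontendConn` types, the `onReplicationSetChange` and `connectFrontend` methods, and the use of a closed channel as the "shutdown message" are all hypothetical stand-ins chosen for illustration, and the ReplicationSet is modeled as a plain list of addresses.

```go
package main

import (
	"fmt"
	"sync"
)

// frontendConn is a hypothetical stand-in for a scheduler-to-frontend stream.
// Closing shutdown models the "shutdown" message sent to the frontend.
type frontendConn struct {
	addr     string
	shutdown chan struct{}
}

// scheduler is a simplified model, not Loki's actual scheduler type.
type scheduler struct {
	addr      string
	mu        sync.Mutex
	active    bool // assumption: a scheduler starts out active
	frontends map[string]*frontendConn
}

func newScheduler(addr string) *scheduler {
	return &scheduler{addr: addr, active: true, frontends: map[string]*frontendConn{}}
}

// connectFrontend registers a frontend; it is rejected while the scheduler
// is outside the ReplicationSet so the frontend can try another scheduler.
func (s *scheduler) connectFrontend(addr string) (*frontendConn, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.active {
		return nil, false
	}
	fc := &frontendConn{addr: addr, shutdown: make(chan struct{})}
	s.frontends[addr] = fc
	return fc, true
}

// inReplicationSet reports whether self is among the given addresses.
func inReplicationSet(addrs []string, self string) bool {
	for _, a := range addrs {
		if a == self {
			return true
		}
	}
	return false
}

// onReplicationSetChange is called with the current ReplicationSet. If this
// scheduler is no longer a member, every connected frontend is told to shut
// down its stream. It returns the number of frontends notified.
func (s *scheduler) onReplicationSetChange(addrs []string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.active = inReplicationSet(addrs, s.addr)
	if s.active {
		return 0
	}
	notified := 0
	for addr, fc := range s.frontends {
		close(fc.shutdown) // frontend cancels inflight requests and retries elsewhere
		delete(s.frontends, addr)
		notified++
	}
	return notified
}

func main() {
	s := newScheduler("scheduler-a")
	fc, _ := s.connectFrontend("frontend-1")

	// scheduler-a drops out of the ReplicationSet.
	n := s.onReplicationSetChange([]string{"scheduler-b"})
	fmt.Println("frontends notified:", n)

	select {
	case <-fc.shutdown:
		fmt.Println(fc.addr, "received shutdown; retrying on another scheduler")
	default:
		fmt.Println(fc.addr, "still connected")
	}
}
```

The point of the sketch is that the scheduler acts on the ring change itself rather than waiting for each frontend to notice it by polling, which closes the window in which queriers have left but a frontend is still waiting.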
Special notes for your reviewer:
Checklist
Add an entry in the CHANGELOG.md about the changes.