
Fixed node controller assignments update after pods restart #368

Merged
merged 1 commit into from
Jul 22, 2023

Conversation

merlimat
Collaborator

If a storage node restarts and its IP stops responding, the coordinator health check will detect it within ~2 seconds, but the component that sends the cluster assignment updates will not detect the issue.

One reason is that it uses a gRPC stream, so there is no timeout on the send operation (a timeout would apply to the lifetime of the stream itself, which needs to be long-lived).

The storage servers need to receive the new cluster assignments after a restart; otherwise they cannot pass them back to their clients.

To fix the issue, we close the assignments stream whenever a health-check failure is triggered. This ensures that the coordinator will retry after the restart.

@merlimat merlimat merged commit ea7a70f into streamnative:main Jul 22, 2023