support multiple replicas for linstor-controller #73

Merged
merged 1 commit into piraeusdatastore:master from controller-replicas
Sep 30, 2020

Conversation

WanzenBug
Member

@WanzenBug WanzenBug commented Aug 4, 2020

Running multiple replicas requires special support from the linstor
controller. The controller container will start a leader election
process when it detects the presence of the K8S_AWAIT_ELECTION_*
variables.
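
As a rough sketch of what opting in to the election could look like on the container spec (the PR only names the K8S_AWAIT_ELECTION_* prefix, so the concrete variable names and the downward-API identity below are assumptions, not the operator's actual values):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// electionEnv returns hypothetical K8S_AWAIT_ELECTION_* variables that would
// tell the controller container to run leader election before starting the
// linstor-controller process. Names and values are illustrative only.
func electionEnv(namespace string) []corev1.EnvVar {
	return []corev1.EnvVar{
		{Name: "K8S_AWAIT_ELECTION_ENABLED", Value: "1"},
		{Name: "K8S_AWAIT_ELECTION_NAME", Value: "linstor-controller"},
		{Name: "K8S_AWAIT_ELECTION_LOCK_NAME", Value: "linstor-controller"},
		{Name: "K8S_AWAIT_ELECTION_LOCK_NAMESPACE", Value: namespace},
		// Each replica identifies itself by its pod name via the downward API.
		{
			Name: "K8S_AWAIT_ELECTION_IDENTITY",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{FieldPath: "metadata.name"},
			},
		},
	}
}
```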

The election process determines which pod is allowed to start the
linstor-controller process. Only this pod will be marked as ready
and will receive traffic from the k8s service object.

Should the leader crash or the node it is running on go offline,
a new leader will be elected and allowed to start the controller process.
Note: in case the full node goes offline, the old pod will still be marked
as ready. By using ClusterIP: "" on our service, we ensure we create an actual
proxy (which automatically chooses a responding pod) instead of each client
having to deal with multiple DNS responses pointing at pods which may or may not respond.
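
A minimal sketch of the service shape described above, using the usual corev1 types; the service name, selector and port are illustrative, not necessarily what the operator creates:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// controllerService leaves ClusterIP empty ("" instead of "None"), so the
// service gets a virtual IP and kube-proxy forwards traffic to ready
// endpoints only -- i.e. the elected leader. A headless service ("None")
// would instead hand every pod IP to clients via DNS.
func controllerService(namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "linstor-controller", Namespace: namespace},
		Spec: corev1.ServiceSpec{
			ClusterIP: "", // empty, not "None": a real ClusterIP is allocated
			Selector:  map[string]string{"app": "linstor-controller"},
			Ports: []corev1.ServicePort{{
				Name:       "linstor-api",
				Port:       3370,
				TargetPort: intstr.FromInt(3370),
			}},
		},
	}
}
```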

See also:

@WanzenBug WanzenBug force-pushed the controller-replicas branch 2 times, most recently from 2a9141e to 4c8a83a on August 4, 2020 09:59
@WanzenBug WanzenBug marked this pull request as ready for review August 4, 2020 11:14
@JoelColledge
Collaborator

We can get this merged now that we are no longer in an RC phase. Rebase needed, and the Helm keys need to be added to the new docs.

@WanzenBug WanzenBug force-pushed the controller-replicas branch from 4c8a83a to 627cb7c on August 7, 2020 10:01
@WanzenBug WanzenBug requested a review from JoelColledge August 7, 2020 10:03
@WanzenBug WanzenBug force-pushed the controller-replicas branch from 627cb7c to 707a117 on August 7, 2020 10:08
JoelColledge previously approved these changes Aug 7, 2020
Collaborator

@JoelColledge JoelColledge left a comment

I'm a little unsure about dumping a shell script in the middle of a long Go function, but separating it out would be messy because it is a format string...
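
For readers unfamiliar with the pattern being discussed, here is a minimal, made-up example of a shell script embedded as a Go format string; the actual script and the values interpolated into it are the operator's own, not what is shown here:

```go
package sketch

import "fmt"

// readinessScript is a shell script kept inline as a Go format string so the
// surrounding function can substitute values into it before handing it to a
// probe or command. The script body is purely illustrative.
const readinessScript = `
set -e
# %[1]s is replaced with the controller endpoint at template time.
curl --fail --silent --max-time 2 "%[1]s/v1/controller/version" > /dev/null
`

// readinessCommand renders the script for use as an exec command.
func readinessCommand(endpoint string) []string {
	return []string{"sh", "-c", fmt.Sprintf(readinessScript, endpoint)}
}
```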

@WanzenBug
Member Author

I need to check if environment variables are passed to the ExecProbe. Then we could maybe find a cleaner solution

@WanzenBug WanzenBug force-pushed the controller-replicas branch 2 times, most recently from 69802cb to 76b0b9f on August 7, 2020 12:30
@WanzenBug
Member Author

So environment variables are available in the ExecProbe.
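
That opens up a cleaner shape: the probe script can read its configuration from the container environment at probe time instead of having values baked in via a format string. A sketch, assuming client-go releases current at the time of this PR (embedded Handler field) and a hypothetical CONTROLLER_ENDPOINT variable, neither of which is taken from this PR:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// readinessProbe builds an exec probe whose script resolves
// $CONTROLLER_ENDPOINT (a hypothetical variable) from the container's
// environment when the kubelet runs it, so no fmt.Sprintf is needed.
func readinessProbe() *corev1.Probe {
	return &corev1.Probe{
		// In 2020-era client-go the handler is the embedded Handler field;
		// newer releases call it ProbeHandler.
		Handler: corev1.Handler{
			Exec: &corev1.ExecAction{
				Command: []string{"sh", "-c",
					`curl --fail --silent --max-time 2 "$CONTROLLER_ENDPOINT/v1/controller/version"`},
			},
		},
		PeriodSeconds:    10,
		FailureThreshold: 3,
	}
}
```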

During testing I noticed a few additional issues:

  • The controller could not handle multiple running controllers at once. I just removed that check, as in this new version multiple controllers are expected.
  • Rolling upgrades are not really working. Probably because the new container won't be Ready until it is elected, but the old leader is still running, so the upgrade process basically gets stuck.

The second issue can be traced to the way we (ab-)use the Ready mechanism. Not sure how we should deal with this yet (the deadlock is sketched below).
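
To make the rollout deadlock concrete, assuming the controller pods are managed by a Deployment (which this PR does not state explicitly), here is a sketch contrasting the default strategy with one possible way out; the Recreate alternative is only an illustration, not the solution adopted here:

```go
package sketch

import appsv1 "k8s.io/api/apps/v1"

// With RollingUpdate, the old (elected) pod is only terminated once a
// replacement reports Ready; but a replacement only becomes Ready after
// winning the election, which it cannot do while the old leader holds the
// lock -- so the rollout waits forever. Recreate stops the old pods first,
// freeing the lock, at the cost of a short outage.
func upgradeStrategies() (stuck, alternative appsv1.DeploymentStrategy) {
	stuck = appsv1.DeploymentStrategy{Type: appsv1.RollingUpdateDeploymentStrategyType}
	alternative = appsv1.DeploymentStrategy{Type: appsv1.RecreateDeploymentStrategyType}
	return stuck, alternative
}
```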

JoelColledge previously approved these changes Aug 20, 2020
Collaborator

@JoelColledge JoelColledge left a comment

Looks good. Just waiting for internal tests to succeed.

@WanzenBug WanzenBug marked this pull request as draft August 20, 2020 12:56
@WanzenBug
Member Author

I want to polish our documentation a bit

@WanzenBug WanzenBug force-pushed the controller-replicas branch from 066a083 to 092803b on August 20, 2020 12:58
@WanzenBug WanzenBug marked this pull request as ready for review August 20, 2020 13:16
@WanzenBug WanzenBug force-pushed the controller-replicas branch from 092803b to 2df767c on August 20, 2020 15:31
@WanzenBug WanzenBug force-pushed the controller-replicas branch from 2df767c to 67600ca on August 31, 2020 08:01
Running multiple replicas requires special support from the linstor
controller. The controller container will start a leader election
process when it detects the presence of the K8S_AWAIT_ELECTION_*
variables.

The election process determines which pod is allowed to start the
linstor-controller process. Only this pod will be added as endpoint for the
controller service.

Should the leader crash or the node it is running on go offline,
a new leader will be elected and allowed to start the controller process.
Note: in case the full node goes offline, the old pod will still be marked
as ready. By using ClusterIP: "" on our service, we ensure we create an actual
proxy (which automatically chooses the responding pod) instead of each client
having to try multiple DNS responses.
@WanzenBug
Member Author

@JoelColledge I think this is finally ready

Collaborator

@JoelColledge JoelColledge left a comment

Looks good to me. Just waiting for internal tests to pass before merging.

@JoelColledge JoelColledge merged commit 53688fe into piraeusdatastore:master Sep 30, 2020
@JoelColledge
Collaborator

🎉

Development

Successfully merging this pull request may close these issues.

Enable linstor-controller to run with multiple replicas