Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle lock release with SIGHUP in VTGR #8472

Merged
merged 2 commits into from
Jul 23, 2021
Merged

Conversation

5antelope
Copy link
Member

@5antelope 5antelope commented Jul 14, 2021

Signed-off-by: crowu [email protected]

Description

VTGR relies on the lock from topo server. Most of the topo lock uses health-check of session to determine if a process is still holding the lock. This is not ideal for lock release process during a restart with SIGHUP. Take consul as an example, the session check by default uses serfHealth, which is a health check on the node level - even if the process restart because the node is healthy, consul will think the lock is still being held until the TTL. As a result, during a deploy, there could be TTL period of time that VTGR cannot grab the lock and fix the cluster.

This PR handle the SIGHUP signal by explicitly release the lock that was held to avoid the situation above.

Related Issue(s)

#8386

Checklist

  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@5antelope 5antelope added Type: Enhancement Logical improvement (somewhere between a bug and feature) Component: Cluster management labels Jul 14, 2021
Signed-off-by: crowu <[email protected]>
Copy link
Member

@deepthi deepthi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine, but it seems to be handling SIGHUP, not SIGTERM.
Which one should it be? or both?

@5antelope 5antelope changed the title Handle lock release with SIGTERM in VTGR Handle lock release with SIGHUP in VTGR Jul 16, 2021
@5antelope
Copy link
Member Author

Code looks fine, but it seems to be handling SIGHUP, not SIGTERM.
Which one should it be? or both?

It depends on what signal ppl use for restart, I think we can start with SIGHUP and call it out in the document. Later add supports for SIGTERM if required. wdyt?

@deepthi
Copy link
Member

deepthi commented Jul 19, 2021

Code looks fine, but it seems to be handling SIGHUP, not SIGTERM.
Which one should it be? or both?

It depends on what signal ppl use for restart, I think we can start with SIGHUP and call it out in the document. Later add supports for SIGTERM if required. wdyt?

For safety it is better to handle SIGTERM as well. I don't have a preference on when this should be added, I will leave that up to you.
LMK if that is planned to be a separate PR so that I can approve and merge this one.

@5antelope
Copy link
Member Author

Thanks @deepthi, I will try to address SIGTERM separately (hopefully soon :-))

@5antelope
Copy link
Member Author

Gentle nudge

@deepthi deepthi merged commit 584f829 into vitessio:main Jul 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants