Enhance /debug/health to be usable as a liveness probe #3139
I've tested this out and it works as expected in our environment. As I anticipated in my original description, there is a period of 5s or so where the state is transitioning and it returns an error. We could whitelist that state, but I'm just tuning my probes to be able to ride over it.
tabletType := tsv.target.TabletType
tsv.mu.Unlock()
switch tabletType {
case topodatapb.TabletType_MASTER, topodatapb.TabletType_REPLICA, topodatapb.TabletType_BATCH:
To be future-proof, let's add experimental, which is considered a serving type.
Oh. And rdonly also.
BATCH and RDONLY are the same value, so the switch complains if I add that. I'll add experimental, though.
Of course :). Just experimental then.
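For readers unfamiliar with why the switch rejects both names, here is a minimal sketch. The `TabletType` type and its numeric values below are hypothetical stand-ins for the generated `topodatapb` enum, not copied from it; the point is only that two constants sharing one value cannot both appear as cases in the same Go switch.

```go
package main

import "fmt"

// TabletType is a hypothetical stand-in for the generated topodatapb.TabletType enum.
type TabletType int32

// The numeric values are illustrative assumptions.
const (
	MASTER       TabletType = 1
	REPLICA      TabletType = 2
	RDONLY       TabletType = 3
	BATCH        TabletType = RDONLY // alias: same underlying value as RDONLY
	SPARE        TabletType = 4
	EXPERIMENTAL TabletType = 5
)

// isServing reports whether the tablet type is one of the serving types.
func isServing(t TabletType) bool {
	switch t {
	// Listing RDONLY next to BATCH here would not compile, because Go
	// rejects duplicate constant values among the cases of one switch.
	case MASTER, REPLICA, BATCH, EXPERIMENTAL:
		return true
	default:
		return false
	}
}

func main() {
	// RDONLY matches the BATCH case because both names denote the same value.
	fmt.Println(isServing(RDONLY), isServing(SPARE))
}
```

Listing BATCH alone therefore already covers RDONLY.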
Added experimental. I think shard 1 is never going to go green; it's been failing all morning for multiple PRs.
Only report health for serving types, namely master, replica, and batch. Fix the API to return 500 when in an error state.
This should be useful for those running in Kubernetes or who otherwise want to use a probe to determine whether the vttablet process is healthy. This is distinct from /healthz, which is used for load balancers and returns unhealthy in more cases.
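The behavior described above might be sketched roughly as follows. This is not the code in this PR: `TabletType`, `livenessStatus`, `healthHandler`, and `errUnhealthy` are hypothetical names, and the sketch assumes that non-serving types always report OK because their health is not checked.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// TabletType is a hypothetical stand-in for topodatapb.TabletType.
type TabletType int

const (
	Master TabletType = iota + 1
	Replica
	Batch // rdonly shares this value, per the review discussion
	Spare
	Experimental
)

// errUnhealthy is a hypothetical sentinel error used for illustration.
var errUnhealthy = errors.New("tablet is unhealthy")

// isServingType reports whether health should be checked for this type.
func isServingType(t TabletType) bool {
	switch t {
	case Master, Replica, Batch, Experimental:
		return true
	}
	return false
}

// livenessStatus maps a tablet type and health error to an HTTP status code.
// Assumption: only serving types can fail the probe.
func livenessStatus(t TabletType, healthErr error) int {
	if isServingType(t) && healthErr != nil {
		return http.StatusInternalServerError
	}
	return http.StatusOK
}

// healthHandler wires the decision into an HTTP handler. Both callbacks are
// hypothetical hooks into the tablet server.
func healthHandler(tabletType func() TabletType, check func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		err := check()
		if code := livenessStatus(tabletType(), err); code != http.StatusOK {
			http.Error(w, err.Error(), code)
			return
		}
		fmt.Fprintln(w, "ok")
	}
}

func main() {
	fmt.Println(livenessStatus(Replica, errUnhealthy)) // serving type in error
	fmt.Println(livenessStatus(Spare, errUnhealthy))   // non-serving type
}
```

Keeping the status decision in a pure function makes it easy to check without starting a server.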
Open question: should we also whitelist the StateTransitioning state? Maybe not, as this should probably be called with a heuristic that allows for a certain number of failures, like many probe implementations do. @sougou @adkhare
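As a rough illustration of that heuristic, the sketch below declares the process dead only after a number of consecutive failed checks, similar in spirit to the failureThreshold setting on Kubernetes probes. The `failureTracker` type and the threshold of 3 are assumptions made for the example, not part of Vitess.

```go
package main

import "fmt"

// failureTracker is a hypothetical helper that tolerates short runs of
// failed health checks, such as a brief transitioning period.
type failureTracker struct {
	threshold   int // consecutive failures required before giving up
	consecutive int // failures seen since the last success
}

// observe records one probe result and reports whether the threshold of
// consecutive failures has been reached.
func (f *failureTracker) observe(healthy bool) bool {
	if healthy {
		f.consecutive = 0
		return false
	}
	f.consecutive++
	return f.consecutive >= f.threshold
}

func main() {
	t := &failureTracker{threshold: 3}
	// Two failures, a recovery, then three failures in a row.
	for _, ok := range []bool{false, false, true, false, false, false} {
		fmt.Println(t.observe(ok))
	}
}
```

With probes spaced a few seconds apart, a threshold like this would ride over a transition of about 5s without the endpoint having to whitelist the transitioning state.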