server: split health endpoint into health and readiness endpoints #22911
I am planning on cherry-picking this into 2.0 and 1.1.
LGTM
pkg/server/admin.go
Outdated
@@ -1040,14 +1041,44 @@ func (s *adminServer) Cluster(
func (s *adminServer) Health(
	ctx context.Context, req *serverpb.HealthRequest,
) (*serverpb.HealthResponse, error) {
	isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())
For what it's worth, comments more clearly explaining the reason for using IsLive vs IsHealthy here or in node_liveness.go would be nice. To really be sure I had to look at the commit message of bcc6d0b.
Done (added to node_liveness.go); let me know if this is what you had in mind.
pkg/server/admin.go
Outdated
@@ -1040,14 +1041,44 @@ func (s *adminServer) Cluster(
func (s *adminServer) Health(
	ctx context.Context, req *serverpb.HealthRequest,
) (*serverpb.HealthResponse, error) {
	isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())
	if err != nil {
		return nil, status.Errorf(codes.Internal, "node is not live")
Why do we return "node is not live" here but the body of the error in the analogous case of the Ready() function?
Oversight, thanks for catching! Also changed the Errorf calls to just Error.
pkg/server/admin.go
Outdated
	}
	return &serverpb.HealthResponse{}, nil

	// Attempt to execute a SQL query with a timeout.
Can you explain for posterity which situations this will catch?
I changed this to observe the serveMode on the server, which is set after startup. I think this is the best way to check for readiness, since more involved checks are not necessarily what we're going for here. That is, we're not trying to answer whether we can execute SQL queries at the time of the check; we're trying to answer whether the server should be able to accept connections.
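To make the idea above concrete, here is a minimal sketch of a readiness check that only inspects a serve-mode flag instead of running a query. It is not the code from this PR: the type `server`, the constants `modeInitializing`, `modeOperational`, and `modeDraining`, and the helpers `setMode` and `checkReady` are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// serveMode is a hypothetical stand-in for the server's connection-handling
// state; the real field in pkg/server may look different.
type serveMode int32

const (
	modeInitializing serveMode = iota // startup has not finished yet
	modeOperational                   // the server accepts client connections
	modeDraining                      // the server is shutting down gracefully
)

// server holds only the state needed for this sketch.
type server struct {
	mode atomic.Int32 // zero value corresponds to modeInitializing
}

// setMode records a serve-mode transition (for example, at the end of startup).
func (s *server) setMode(m serveMode) { s.mode.Store(int32(m)) }

var errNotReady = errors.New("node is not ready")

// checkReady reports whether the server should be able to accept
// connections. It deliberately does not execute a SQL query.
func (s *server) checkReady() error {
	if serveMode(s.mode.Load()) != modeOperational {
		return errNotReady
	}
	return nil
}
```

With this shape, a node that is still starting up or is draining reports not ready without any additional I/O, which keeps the probe cheap.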
Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.
pkg/server/serverpb/admin.proto, line 415 at r2 (raw file):
I'd rather add parameters to HealthRequest than an entirely different method. Something like
Something I'm confused about: we have another endpoint at
Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions.
pkg/server/serverpb/admin.proto, line 415 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.
Reviewed 5 of 5 files at r3.
pkg/server/admin.go, line 1045 at r3 (raw file):
"Ready" checks are a superset of regular checks, so I'd like to see this structured as
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.
I'm arguing that we should make the endpoint used by load balancers
Review status: 4 of 5 files reviewed at latest revision, 4 unresolved discussions.
pkg/server/admin.go, line 1045 at r3 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.
Would we then make
I think
It's very weird from an API design perspective that
Just to make sure we're on the same page, the
cockroach/pkg/server/status.go
Lines 496 to 519 in e36099f
True, but it's very weird that we've ended up with two endpoints called "health" that do such different things. I'd rather consolidate all variants of health checking onto
Alright, I can live with that. Sounds like @asubiotto has some more work to do then.
This reverts bcc6d0b, as the load balancer endpoint used is /health, which hits the Details endpoint. Additionally, if liveness checks fail, a node is usually restarted. We don't want a draining node to be restarted while it is draining.
Release note: None
The readiness endpoint (/health?ready=1) returns whether the node is ready to receive client traffic. This is necessary for load balancers.
Addresses #22424
Addresses #22468
Fixes #cockroachdb-cloudformation/issues/13
Release note (bug fix): readiness endpoint added (/health?ready=1) for better integration with load balancers.
Added the readiness check to the
Reviewed 1 of 1 files at r4, 2 of 5 files at r5, 4 of 4 files at r6.
pkg/server/admin.go, line 1043 at r6 (raw file):
Is this change from IsHealthy to IsLive still appropriate? This change should at least be called out in the commit message.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.
pkg/server/admin.go, line 1043 at r6 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
The first commit mentions it. I think it is appropriate because the change from
The health endpoint reverts to returning whether the node liveness is
expired or not. The readiness endpoint returns whether the node is ready
to receive SQL traffic. This is an important distinction needed by load
balancers as an unhealthy node is restarted by force and an unready node
simply does not get traffic sent to it. The new readiness endpoint is
accessed through "_admin/v1/ready".
Addresses #22424
Addresses #22468
Fixes #cockroachdb/cockroachdb-cloudformation#13
Release note (bug fix): Health endpoint has been split into a simpler
health endpoint and a readiness endpoint to integrate better with load
balancers.
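The distinction described above, seen from the consumer side, amounts to a small decision table. The sketch below is only an illustration of that reasoning; `action`, its constants, and `decide` are hypothetical names, and real load balancers or orchestrators apply their own thresholds and retry policies.

```go
package main

// action is a hypothetical label for what an orchestrator or load
// balancer does with a node after probing it.
type action string

const (
	actionRoute   action = "route traffic"
	actionHold    action = "stop routing traffic"
	actionRestart action = "restart node"
)

// decide maps the two probe results to an action: an unhealthy node is
// restarted, an unready (but healthy) node is left running and simply
// receives no traffic, and a healthy, ready node receives traffic.
func decide(healthy, ready bool) action {
	switch {
	case !healthy:
		return actionRestart
	case !ready:
		return actionHold
	default:
		return actionRoute
	}
}
```

A draining node maps to the middle case: it stays up to finish its work but is taken out of rotation.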
cc @bobvawter @nstewart @bdarnell