Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: split health endpoint into health and readiness endpoints #22911

Merged
merged 2 commits into from
Feb 28, 2018
Merged

server: split health endpoint into health and readiness endpoints #22911

merged 2 commits into from
Feb 28, 2018

Conversation

asubiotto
Copy link
Contributor

The health endpoint reverts to returning whether the node liveness is
expired or not. The readiness endpoint returns whether the node is ready
to receive SQL traffic. This is an important distinction needed by load
balancers as an unhealthy node is restarted by force and an unready node
simply does not get traffic sent to it. The new readiness endpoint is
accessed through "_admin/v1/ready"

Addresses #22424
Addresses #22468
Fixes #cockroachdb/cockroachdb-cloudformation#13

Release note (bug fix): Health endpoint has been split into a simpler
health endpoint and a readiness endpoint to integrate better with load
balancers.

cc @bobvawter @nstewart @bdarnell

@asubiotto asubiotto requested review from a-robinson and a team February 21, 2018 22:31
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@asubiotto
Copy link
Contributor Author

I am planning on cherrypicking this into 2.0 and 1.1

Copy link
Contributor

@a-robinson a-robinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -1040,14 +1041,44 @@ func (s *adminServer) Cluster(
func (s *adminServer) Health(
ctx context.Context, req *serverpb.HealthRequest,
) (*serverpb.HealthResponse, error) {
isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it's worth, comments more clearly explaining the reason for using IsLive vs IsHealthy here or in node_liveness.go would be nice. To really be sure I had to look at the commit message of bcc6d0b

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (added to node_liveness.go) let me know if this is what you had in mind.

@@ -1040,14 +1041,44 @@ func (s *adminServer) Cluster(
func (s *adminServer) Health(
ctx context.Context, req *serverpb.HealthRequest,
) (*serverpb.HealthResponse, error) {
isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())
if err != nil {
return nil, status.Errorf(codes.Internal, "node is not live")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we return "node is not live" here but the body of the error in the analogous case of the Ready() function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oversight, thanks for catching! Also changed the Errorf calls to just Error.

}
return &serverpb.HealthResponse{}, nil

// Attempt to execute a SQL query with a timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain for posterity which situations this will catch?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this to observe the serveMode on the server, which is set after startup. I think this is the best way to check for readiness since more involved checks are not necessarily what we're going for here. i.e. we're not trying to answer whether or not we can execute SQL queries at the time of the check, we're trying to answer if the server should be able to accept connections.

@bdarnell
Copy link
Contributor

Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/server/serverpb/admin.proto, line 415 at r2 (raw file):

// ReadyRequest inquires whether the addressed node is ready to receive client
// traffic.
message ReadyRequest {

I'd rather add parameters to HealthRequest than an entirely different method. Something like /health?ready=1. We may want to add additional options in the future, such as a minimum version or cluster ID (some deployment environments find these options useful to avoid problems caused by back-from-the-dead servers).


Comments from Reviewable

@asubiotto
Copy link
Contributor Author

Something I'm confused about: we have another endpoint at /health which hits the Details endpoint (different from /_admin/v1/health). This is kind of confusing and I think we might be using the wrong endpoint in our kubernetes configuration (@nstewart). Does anybody know what's up with that?


Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions.


pkg/server/serverpb/admin.proto, line 415 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd rather add parameters to HealthRequest than an entirely different method. Something like /health?ready=1. We may want to add additional options in the future, such as a minimum version or cluster ID (some deployment environments find these options useful to avoid problems caused by back-from-the-dead servers).

Done.


Comments from Reviewable

@bdarnell
Copy link
Contributor

/health is the one I thought you were changing here; that's the one we use in our LB configs. I don't know what /_admin/v1/health is used for.


Reviewed 5 of 5 files at r3.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


pkg/server/admin.go, line 1045 at r3 (raw file):

	// This is a simple health check, the client doesn't care whether or not the
	// node is ready to receive client traffic.
	if !req.Ready {

"Ready" checks are a superset of regular checks, so I'd like to see this structured as if !req.Ready { return } in the middle of this method instead of a completely separate path.


Comments from Reviewable

@a-robinson
Copy link
Contributor

/health only really tells you whether the process is running or not, which isn't particularly useful for indicating whether cockroach is able to serve SQL queries. We should change LB configs to use /_admin/v1/health?ready=1 once this is in. We use /_admin/v1/health today in our DC/OS framework.


Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


Comments from Reviewable

@bdarnell
Copy link
Contributor

I'm arguing that we should make the endpoint used by load balancers /health?ready=1. Doesn't the underscore in /_admin mean internal use only?

@asubiotto
Copy link
Contributor Author

Review status: 4 of 5 files reviewed at latest revision, 4 unresolved discussions.


pkg/server/admin.go, line 1045 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"Ready" checks are a superset of regular checks, so I'd like to see this structured as if !req.Ready { return } in the middle of this method instead of a completely separate path.

Done.


Comments from Reviewable

@a-robinson
Copy link
Contributor

a-robinson commented Feb 27, 2018

Would we then make /health in v2.0 behave like /_admin/v1/health behaves in v1.1? What's your proposal for moving these endpoints around?

@bdarnell
Copy link
Contributor

I think /health should work as it does today, and /health?ready=1 should work according to the new changes in this PR. I'm not sure what /_admin/v1/health is trying to do or how it is used; I don't see any reason to propagate the stuff that it does to the non-ready form of /health.

@a-robinson
Copy link
Contributor

a-robinson commented Feb 27, 2018

It's very weird from an API design perspective that /health?ready=1 will have more in common with /_admin/v1/health than it will with /health.

Just to make sure we're on the same page, the (*adminServer).Health method that @asubiotto has been modifying in this PR is the code for /_admin/v1/health. The code for /health is:

func (s *statusServer) Details(
ctx context.Context, req *serverpb.DetailsRequest,
) (*serverpb.DetailsResponse, error) {
ctx = s.AnnotateCtx(ctx)
nodeID, local, err := s.parseNodeID(req.NodeId)
if err != nil {
return nil, grpcstatus.Errorf(codes.InvalidArgument, err.Error())
}
if local {
resp := &serverpb.DetailsResponse{
NodeID: s.gossip.NodeID.Get(),
BuildInfo: build.GetInfo(),
}
if addr, err := s.gossip.GetNodeIDAddress(s.gossip.NodeID.Get()); err == nil {
resp.Address = *addr
}
return resp, nil
}
status, err := s.dialNode(ctx, nodeID)
if err != nil {
return nil, err
}
return status.Details(ctx, req)
}

@bdarnell
Copy link
Contributor

It's very weird from an API design perspective that /health?ready=1 will have more in common with /_admin/v1/health than it will with /health.

True, but it's very weird that we've ended up with two endpoints called "health" that do such different things. I'd rather consolidate all variants of health checking onto /health with different options.

@a-robinson
Copy link
Contributor

Alright, I can live with that. Sounds like @asubiotto has some more work to do then.

This reverts bcc6d0b, as the load
balancer endpoint used is /health which hits the Details endpoint.
Additionally, if liveness checks are failed, a node is usually
restarted. We don't want a draining node to be restarted while it is
draining.

Release note: None
The readiness endpoint (/health?ready=1) returns whether the node is
ready to receive client traffic. This is necessary for load balancers.

Addresses #22424
Addresses #22468
Fixes #cockroachdb-cloudformation/issues/13

Release note (bug fix): readiness endpoint added (/health?ready=1)
for better integration with load balancers.
@asubiotto
Copy link
Contributor Author

Added the readiness check to the Details endpoint (accessible as /health?ready=1). I left Health() as is (reverted to its previous behavior). The future of /_admin/v1/health is not something I want to tackle in this PR. PTAL. Note that liveness checks will not take into account node liveness.

@bdarnell
Copy link
Contributor

:lgtm:


Reviewed 1 of 1 files at r4, 2 of 5 files at r5, 4 of 4 files at r6.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/server/admin.go, line 1043 at r6 (raw file):

	ctx context.Context, req *serverpb.HealthRequest,
) (*serverpb.HealthResponse, error) {
	isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())

Is this change from IsHealthy to IsLive still appropriate? This change should at least be called out in the commit message.


Comments from Reviewable

@asubiotto
Copy link
Contributor Author

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.


pkg/server/admin.go, line 1043 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this change from IsHealthy to IsLive still appropriate? This change should at least be called out in the commit message.

The first commit mentions it. I think it is appropriate because the change from IsLive to IsHealthy was badly informed since I assumed this was the endpoint checked by load balancers.


Comments from Reviewable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants