server: split health endpoint into health and readiness endpoints #22911

asubiotto · 2018-02-21T22:31:16Z

The health endpoint reverts to returning whether the node liveness is
expired or not. The readiness endpoint returns whether the node is ready
to receive SQL traffic. This is an important distinction needed by load
balancers as an unhealthy node is restarted by force and an unready node
simply does not get traffic sent to it. The new readiness endpoint is
accessed through "_admin/v1/ready"

Addresses #22424
Addresses #22468
Fixes #cockroachdb/cockroachdb-cloudformation#13

Release note (bug fix): Health endpoint has been split into a simpler
health endpoint and a readiness endpoint to integrate better with load
balancers.

cc @bobvawter @nstewart @bdarnell

cockroach-teamcity · 2018-02-21T22:31:21Z

This change is

asubiotto · 2018-02-21T22:31:56Z

I am planning on cherrypicking this into 2.0 and 1.1

a-robinson

LGTM

a-robinson · 2018-02-21T22:47:53Z

pkg/server/admin.go

@@ -1040,14 +1041,44 @@ func (s *adminServer) Cluster(
 func (s *adminServer) Health(
 	ctx context.Context, req *serverpb.HealthRequest,
 ) (*serverpb.HealthResponse, error) {
+	isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())


For what it's worth, comments more clearly explaining the reason for using IsLive vs IsHealthy here or in node_liveness.go would be nice. To really be sure I had to look at the commit message of bcc6d0b

Done (added to node_liveness.go) let me know if this is what you had in mind.

a-robinson · 2018-02-21T22:49:23Z

pkg/server/admin.go

@@ -1040,14 +1041,44 @@ func (s *adminServer) Cluster(
 func (s *adminServer) Health(
 	ctx context.Context, req *serverpb.HealthRequest,
 ) (*serverpb.HealthResponse, error) {
+	isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())
+	if err != nil {
+		return nil, status.Errorf(codes.Internal, "node is not live")


Why do we return "node is not live" here but the body of the error in the analogous case of the Ready() function?

Oversight, thanks for catching! Also changed the Errorf calls to just Error.

a-robinson · 2018-02-21T22:50:19Z

pkg/server/admin.go

 	}
-	return &serverpb.HealthResponse{}, nil
+
+	// Attempt to execute a SQL query with a timeout.


Can you explain for posterity which situations this will catch?

I changed this to observe the serveMode on the server, which is set after startup. I think this is the best way to check for readiness since more involved checks are not necessarily what we're going for here. i.e. we're not trying to answer whether or not we can execute SQL queries at the time of the check, we're trying to answer if the server should be able to accept connections.

bdarnell · 2018-02-25T14:39:56Z

Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

pkg/server/serverpb/admin.proto, line 415 at r2 (raw file):

// ReadyRequest inquires whether the addressed node is ready to receive client
// traffic.
message ReadyRequest {

I'd rather add parameters to HealthRequest than an entirely different method. Something like /health?ready=1. We may want to add additional options in the future, such as a minimum version or cluster ID (some deployment environments find these options useful to avoid problems caused by back-from-the-dead servers).

Comments from Reviewable

asubiotto · 2018-02-26T18:33:57Z

Something I'm confused about: we have another endpoint at /health which hits the Details endpoint (different from /_admin/v1/health). This is kind of confusing and I think we might be using the wrong endpoint in our kubernetes configuration (@nstewart). Does anybody know what's up with that?

Review status: 0 of 5 files reviewed at latest revision, 4 unresolved discussions.

pkg/server/serverpb/admin.proto, line 415 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I'd rather add parameters to HealthRequest than an entirely different method. Something like /health?ready=1. We may want to add additional options in the future, such as a minimum version or cluster ID (some deployment environments find these options useful to avoid problems caused by back-from-the-dead servers).

Done.

Comments from Reviewable

bdarnell · 2018-02-26T22:39:31Z

/health is the one I thought you were changing here; that's the one we use in our LB configs. I don't know what /_admin/v1/health is used for.

Reviewed 5 of 5 files at r3.
Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

pkg/server/admin.go, line 1045 at r3 (raw file):

	// This is a simple health check, the client doesn't care whether or not the
	// node is ready to receive client traffic.
	if !req.Ready {

"Ready" checks are a superset of regular checks, so I'd like to see this structured as if !req.Ready { return } in the middle of this method instead of a completely separate path.

Comments from Reviewable

a-robinson · 2018-02-27T14:50:04Z

/health only really tells you whether the process is running or not, which isn't particularly useful for indicating whether cockroach is able to serve SQL queries. We should change LB configs to use /_admin/v1/health?ready=1 once this is in. We use /_admin/v1/health today in our DC/OS framework.

Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.

Comments from Reviewable

bdarnell · 2018-02-27T15:11:24Z

I'm arguing that we should make the endpoint used by load balancers /health?ready=1. Doesn't the underscore in /_admin mean internal use only?

asubiotto · 2018-02-27T16:21:07Z

Review status: 4 of 5 files reviewed at latest revision, 4 unresolved discussions.

pkg/server/admin.go, line 1045 at r3 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"Ready" checks are a superset of regular checks, so I'd like to see this structured as if !req.Ready { return } in the middle of this method instead of a completely separate path.

Done.

Comments from Reviewable

a-robinson · 2018-02-27T18:10:32Z

Would we then make /health in v2.0 behave like /_admin/v1/health behaves in v1.1? What's your proposal for moving these endpoints around?

bdarnell · 2018-02-27T18:12:18Z

I think /health should work as it does today, and /health?ready=1 should work according to the new changes in this PR. I'm not sure what /_admin/v1/health is trying to do or how it is used; I don't see any reason to propagate the stuff that it does to the non-ready form of /health.

a-robinson · 2018-02-27T18:20:30Z

It's very weird from an API design perspective that /health?ready=1 will have more in common with /_admin/v1/health than it will with /health.

Just to make sure we're on the same page, the (*adminServer).Health method that @asubiotto has been modifying in this PR is the code for /_admin/v1/health. The code for /health is:

cockroach/pkg/server/status.go

Lines 496 to 519 in e36099f

    
           func (s *statusServer) Details( 
        
           	ctx context.Context, req *serverpb.DetailsRequest, 
        
           ) (*serverpb.DetailsResponse, error) { 
        
           	ctx = s.AnnotateCtx(ctx) 
        
           	nodeID, local, err := s.parseNodeID(req.NodeId) 
        
           	if err != nil { 
        
           		return nil, grpcstatus.Errorf(codes.InvalidArgument, err.Error()) 
        
           	} 
        
           	if local { 
        
           		resp := &serverpb.DetailsResponse{ 
        
           			NodeID:    s.gossip.NodeID.Get(), 
        
           			BuildInfo: build.GetInfo(), 
        
           		} 
        
           		if addr, err := s.gossip.GetNodeIDAddress(s.gossip.NodeID.Get()); err == nil { 
        
           			resp.Address = *addr 
        
           		} 
        
           		return resp, nil 
        
           	} 
        
           	status, err := s.dialNode(ctx, nodeID) 
        
           	if err != nil { 
        
           		return nil, err 
        
           	} 
        
           	return status.Details(ctx, req) 
        
           }

bdarnell · 2018-02-27T18:28:41Z

It's very weird from an API design perspective that /health?ready=1 will have more in common with /_admin/v1/health than it will with /health.

True, but it's very weird that we've ended up with two endpoints called "health" that do such different things. I'd rather consolidate all variants of health checking onto /health with different options.

a-robinson · 2018-02-27T18:32:50Z

Alright, I can live with that. Sounds like @asubiotto has some more work to do then.

This reverts bcc6d0b, as the load balancer endpoint used is /health which hits the Details endpoint. Additionally, if liveness checks are failed, a node is usually restarted. We don't want a draining node to be restarted while it is draining. Release note: None

The readiness endpoint (/health?ready=1) returns whether the node is ready to receive client traffic. This is necessary for load balancers. Addresses #22424 Addresses #22468 Fixes #cockroachdb-cloudformation/issues/13 Release note (bug fix): readiness endpoint added (/health?ready=1) for better integration with load balancers.

asubiotto · 2018-02-27T21:46:17Z

Added the readiness check to the Details endpoint (accessible as /health?ready=1). I left Health() as is (reverted to its previous behavior). The future of /_admin/v1/health is not something I want to tackle in this PR. PTAL. Note that liveness checks will not take into account node liveness.

bdarnell · 2018-02-28T03:21:32Z

Reviewed 1 of 1 files at r4, 2 of 5 files at r5, 4 of 4 files at r6.
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.

pkg/server/admin.go, line 1043 at r6 (raw file):

	ctx context.Context, req *serverpb.HealthRequest,
) (*serverpb.HealthResponse, error) {
	isLive, err := s.server.nodeLiveness.IsLive(s.server.NodeID())

Is this change from IsHealthy to IsLive still appropriate? This change should at least be called out in the commit message.

Comments from Reviewable

asubiotto · 2018-02-28T16:11:41Z

Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.

pkg/server/admin.go, line 1043 at r6 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Is this change from IsHealthy to IsLive still appropriate? This change should at least be called out in the commit message.

The first commit mentions it. I think it is appropriate because the change from IsLive to IsHealthy was badly informed since I assumed this was the endpoint checked by load balancers.

Comments from Reviewable

asubiotto requested review from a-robinson and a team February 21, 2018 22:31

a-robinson approved these changes Feb 21, 2018

View reviewed changes

asubiotto mentioned this pull request Feb 22, 2018

QPS Drops to Zero during rolling upgrade #22424

Closed

a-robinson mentioned this pull request Feb 26, 2018

kubernetes: revisit removal of health checks #22468

Closed

a-robinson mentioned this pull request Feb 27, 2018

Production monitoring and alerting cockroachdb/docs#2564

Merged

asubiotto added 2 commits February 27, 2018 16:14

asubiotto merged commit 34c7c69 into cockroachdb:master Feb 28, 2018

This was referenced Feb 28, 2018

cherrypick-2.0: server: add readiness endpoint #23247

Merged

Flaky test: crashed pods may temporarily be "ready" on instantiation. cockroachdb/cockroachdb-cloudformation#13

Closed

asubiotto deleted the lb-checks branch May 22, 2018 16:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: split health endpoint into health and readiness endpoints #22911

server: split health endpoint into health and readiness endpoints #22911

asubiotto commented Feb 21, 2018

cockroach-teamcity commented Feb 21, 2018

asubiotto commented Feb 21, 2018

a-robinson left a comment

a-robinson Feb 21, 2018

asubiotto Feb 22, 2018

a-robinson Feb 21, 2018

asubiotto Feb 22, 2018

a-robinson Feb 21, 2018

asubiotto Feb 22, 2018

bdarnell commented Feb 25, 2018

asubiotto commented Feb 26, 2018

bdarnell commented Feb 26, 2018

a-robinson commented Feb 27, 2018

bdarnell commented Feb 27, 2018

asubiotto commented Feb 27, 2018

a-robinson commented Feb 27, 2018 •

edited

Loading

bdarnell commented Feb 27, 2018

a-robinson commented Feb 27, 2018 •

edited

Loading

bdarnell commented Feb 27, 2018

a-robinson commented Feb 27, 2018

asubiotto commented Feb 27, 2018

bdarnell commented Feb 28, 2018

asubiotto commented Feb 28, 2018

server: split health endpoint into health and readiness endpoints #22911

server: split health endpoint into health and readiness endpoints #22911

Conversation

asubiotto commented Feb 21, 2018

cockroach-teamcity commented Feb 21, 2018

asubiotto commented Feb 21, 2018

a-robinson left a comment

Choose a reason for hiding this comment

a-robinson Feb 21, 2018

Choose a reason for hiding this comment

asubiotto Feb 22, 2018

Choose a reason for hiding this comment

a-robinson Feb 21, 2018

Choose a reason for hiding this comment

asubiotto Feb 22, 2018

Choose a reason for hiding this comment

a-robinson Feb 21, 2018

Choose a reason for hiding this comment

asubiotto Feb 22, 2018

Choose a reason for hiding this comment

bdarnell commented Feb 25, 2018

asubiotto commented Feb 26, 2018

bdarnell commented Feb 26, 2018

a-robinson commented Feb 27, 2018

bdarnell commented Feb 27, 2018

asubiotto commented Feb 27, 2018

a-robinson commented Feb 27, 2018 • edited Loading

bdarnell commented Feb 27, 2018

a-robinson commented Feb 27, 2018 • edited Loading

bdarnell commented Feb 27, 2018

a-robinson commented Feb 27, 2018

asubiotto commented Feb 27, 2018

bdarnell commented Feb 28, 2018

asubiotto commented Feb 28, 2018

a-robinson commented Feb 27, 2018 •

edited

Loading

a-robinson commented Feb 27, 2018 •

edited

Loading