server: declare node ready while decommissioning

Before this patch, a decommissioning or draining, but otherwise healthy, node would return an error from the /health?ready=1 endpoint, thus declaring itself unready and signaling load balancers to send traffic away.

This patch makes it so that the decommissioning status no longer matters for the readiness determination. Draining nodes continue to declare themselves unready. Note that decommissioning nodes typically go through draining at the end of the process.

The justification is that a node can be decommissioning for an arbitrary amount of time. During that time it can continue to hold leases, etc., so steering traffic away from it is not particularly desirable. In fact, some people even want to keep a node in a decommissioning state indefinitely (by restarting a decommissioning node without recommissioning it). Also, the server.shutdown.drain_wait cluster setting is there to give load balancers ample time to find out about a draining node.

Also, tactically, the code is simplified.

Release note (general change): A node no longer declares itself unready through the /health?ready=1 endpoint while it is in the process of decommissioning. It continues to declare itself unready while draining.
cockroachdb · Jan 14, 2020 · 139bd21
1 parent 811f75a
commit 139bd21
Showing 1 changed file with 4 additions and 3 deletions.
diff --git a/pkg/server/status.go b/pkg/server/status.go
@@ -50,6 +50,7 @@ import (
 	"github.com/cockroachdb/cockroach/pkg/storage"
 	"github.com/cockroachdb/cockroach/pkg/storage/storagepb"
 	"github.com/cockroachdb/cockroach/pkg/util/contextutil"
+	"github.com/cockroachdb/cockroach/pkg/util/hlc"
 	"github.com/cockroachdb/cockroach/pkg/util/httputil"
 	"github.com/cockroachdb/cockroach/pkg/util/log"
 	"github.com/cockroachdb/cockroach/pkg/util/stop"
@@ -670,9 +671,9 @@ func (s *statusServer) Details(
 	if err != nil {
 		return nil, grpcstatus.Error(codes.Internal, err.Error())
 	}
-	ls := l.LivenessStatus(s.admin.server.clock.PhysicalTime(), 0 /* threshold */)
-	isHealthy := ls == storagepb.NodeLivenessStatus_LIVE
-	if !isHealthy {
+	nowHlc := hlc.Timestamp{WallTime: s.admin.server.clock.PhysicalNow()}
+	isReady := l.IsLive(nowHlc) && !l.Draining
+	if !isReady {
 		return nil, grpcstatus.Error(codes.Unavailable, "node is not ready")
 	}