storage: Add comments about leases and liveness #36984

Merged 1 commit on Apr 24, 2019

93 changes: 79 additions & 14 deletions pkg/storage/node_liveness.go
@@ -124,13 +124,33 @@ type IsLiveCallback func(nodeID roachpb.NodeID)
// indicating that it is alive.
type HeartbeatCallback func(context.Context)

// NodeLiveness encapsulates information on node liveness and provides
// an API for querying, updating, and invalidating node
// liveness. Nodes periodically "heartbeat" the range holding the node
// liveness system table to indicate that they're available. The
// resulting liveness information is used to ignore unresponsive nodes
// while making range quiescence decisions, as well as for efficient,
// node liveness epoch-based range leases.
// NodeLiveness is a centralized failure detector that coordinates
// with the epoch-based range system to provide for leases of
// indefinite length (replacing frequent per-range lease renewals with
// heartbeats to the liveness system).
//
// It is also used as a general-purpose failure detector, but it is
// not ideal for this purpose. It is inefficient due to the use of
// replicated durable writes, and is not very sensitive (it primarily
// tests connectivity from the node to the liveness range; a node with
// a failing disk could still be considered live by this system).
//
// The persistent state of node liveness is stored in the KV layer,
// near the beginning of the keyspace. These are normal MVCC keys,
// written by CPut operations in 1PC transactions (in hindsight, the
// use of transactions and MVCC is unfortunate because it means that
// the liveness span depends on MVCC GC and can get overwhelmed if GC
// is not working; transactions were used only to piggyback on the
// transaction commit trigger). The leaseholder of the liveness range
// gossips its contents whenever they change (only the changed
// portion); other nodes rarely read from this range directly.
//
// The use of conditional puts is crucial to maintaining the
// guarantees needed by epoch-based leases. Both the Heartbeat and
// IncrementEpoch methods on this type require an expected value to
// be passed in; see the comments on those methods for more.
//
// TODO(bdarnell): Also document interaction with draining and decommissioning.
type NodeLiveness struct {
ambientCtx log.AmbientContext
clock *hlc.Clock
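For illustration, a minimal Go sketch of the validity rule the comment
above describes, using hypothetical simplified types rather than the real
Liveness proto and lease structures: an epoch-based lease stays usable for
as long as the leaseholder's liveness record keeps the granting epoch and
has not expired, so one heartbeat extends every such lease the node holds.

package livenesssketch

// Liveness and EpochLease are simplified stand-ins for the real types;
// they exist only to illustrate the validity rule.
type Liveness struct {
	Epoch      int64
	Expiration int64 // wall time, nanoseconds
}

type EpochLease struct {
	Epoch int64 // liveness epoch under which the lease was granted
}

// epochLeaseValid reports whether an epoch-based lease is usable at time
// now: the leaseholder's liveness record must still carry the epoch the
// lease was granted under and must not have expired.
func epochLeaseValid(l EpochLease, lv Liveness, now int64) bool {
	return lv.Epoch == l.Epoch && now < lv.Expiration
}
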
@@ -268,6 +288,11 @@ func (nl *NodeLiveness) SetDecommissioning(
// decommissioning commands in a tight loop on different nodes sometimes
// results in unintentional no-ops (due to the Gossip lag); this could be
// observed by users in principle, too.
//
// TODO(bdarnell): This is the one place where a node other than the
// leaseholder of the liveness range reads from that range. Should
// this read from gossip instead? (I have vague concerns about
// concurrent reads and timestamp cache pushes causing problems here.)
var oldLiveness storagepb.Liveness
if err := nl.db.GetProto(ctx, keys.NodeLivenessKey(nodeID), &oldLiveness); err != nil {
return false, errors.Wrap(err, "unable to get liveness")
@@ -507,6 +532,21 @@ var errNodeAlreadyLive = errors.New("node already live")
// Heartbeat is called to update a node's expiration timestamp. This
// method does a conditional put on the node liveness record, and if
// successful, stores the updated liveness record in the nodes map.
//
// The liveness argument is the expected previous value of this node's
// liveness.
//
// If this method returns nil, the node's liveness has been extended
// relative to the previous value. The node may or may not still be
// alive when this method returns.
//
// On failure, this method returns ErrEpochIncremented, although this
// may not necessarily mean that the epoch was actually incremented.
// TODO(bdarnell): Fix error semantics here.
//
// This method is rarely called directly; heartbeats are normally sent
// by the StartHeartbeat loop.
// TODO(bdarnell): Should we just remove this synchronous heartbeat completely?
func (nl *NodeLiveness) Heartbeat(ctx context.Context, liveness *storagepb.Liveness) error {
return nl.heartbeatInternal(ctx, liveness, false /* increment epoch */)
}
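The expected-value semantics above amount to a conditional-put pattern. A
rough sketch under simplified assumptions (hypothetical helper names; the
real method goes through heartbeatInternal and the KV client):

package livenesssketch

import "errors"

// ErrConditionFailed stands in for the KV layer's condition-failed error.
var ErrConditionFailed = errors.New("condition failed")

type Liveness struct {
	Epoch      int64
	Expiration int64
}

// heartbeatOnce extends a node's liveness record with a single conditional
// put: the write succeeds only if the stored record still equals expected.
// On failure the caller re-reads the record; if the epoch has changed, the
// failure is surfaced to the caller (ErrEpochIncremented in the real code)
// rather than retried blindly.
func heartbeatOnce(
	cput func(newVal, expected Liveness) error,
	expected Liveness,
	newExpiration int64,
) (Liveness, error) {
	updated := expected
	updated.Expiration = newExpiration
	if err := cput(updated, expected); err != nil {
		return Liveness{}, err // e.g. ErrConditionFailed: our expected value was stale
	}
	return updated, nil
}
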
@@ -575,6 +615,14 @@ func (nl *NodeLiveness) heartbeatInternal(
// considered live, treat the heartbeat as a success. This can
// happen when the periodic heartbeater races with a concurrent
// lease acquisition.
//
// TODO(bdarnell): If things are very slow, the new liveness may
// have already expired and we'd incorrectly return
// ErrEpochIncremented. Is this check even necessary? The common
// path through this method doesn't check whether the liveness
// expired while in flight, so maybe we don't have to care about
// that and only need to distinguish between same and different
// epochs in our return value.
if actual.IsLive(nl.clock.Now(), nl.clock.MaxOffset()) && !incrementEpoch {
return errNodeAlreadyLive
}
@@ -690,13 +738,30 @@ func (nl *NodeLiveness) getLivenessLocked(nodeID roachpb.NodeID) (*storagepb.Liv
return nil, ErrNoLivenessRecord
}

// IncrementEpoch is called to increment the current liveness epoch,
// thereby invalidating anything relying on the liveness of the
// previous epoch. This method does a conditional put on the node
// liveness record, and if successful, stores the updated liveness
// record in the nodes map. If this method is called on a node ID
// which is considered live according to the most recent information
// gathered through gossip, an error is returned.
// IncrementEpoch is called to attempt to revoke another node's
// current epoch, causing an expiration of all its leases. This method
// does a conditional put on the node liveness record, and if
// successful, stores the updated liveness record in the nodes map. If
// this method is called on a node ID which is considered live
// according to the most recent information gathered through gossip,
// an error is returned.
//
// The liveness argument is used as the expected value on the
// conditional put. If this method returns nil, there was a match and
// the epoch has been incremented. This means that the expiration time
// in the supplied liveness accurately reflects the time at which the
// epoch ended.
//
// If this method returns ErrEpochAlreadyIncremented, the epoch has
// already been incremented past the one in the liveness argument, but
// the conditional put did not find a match. This means that another
// node performed a successful IncrementEpoch, but we can't tell at
// what time the epoch actually ended. (Usually when multiple
// IncrementEpoch calls race, they're using the same expected value.
// But when there is a severe backlog, it's possible for one increment
// to get stuck in a queue long enough for the dead node to make
// another successful heartbeat, and a second increment to come in
// after that.)
func (nl *NodeLiveness) IncrementEpoch(ctx context.Context, liveness *storagepb.Liveness) error {
// Allow only one increment at a time.
sem := nl.sem(liveness.NodeID)
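To make the IncrementEpoch contract described above concrete, a small
sketch under the same simplified assumptions: the epoch may only be bumped
once the old record has expired, and the conditional put pins the exact
record that was read, so at most one increment can succeed per expiration.

package livenesssketch

import "errors"

var ErrNotExpired = errors.New("cannot increment epoch of a live node")

type Liveness struct {
	Epoch      int64
	Expiration int64
}

// incrementEpoch revokes a (presumed dead) node's epoch, invalidating every
// lease it held under that epoch. expected must be the expired record we
// read; if the stored record no longer matches (another node already
// incremented it, or the "dead" node squeezed in a heartbeat), the
// conditional put fails and we cannot tell when the old epoch really ended.
func incrementEpoch(
	cput func(newVal, expected Liveness) error,
	expected Liveness,
	now int64,
) error {
	if now < expected.Expiration {
		return ErrNotExpired
	}
	next := expected
	next.Epoch++
	return cput(next, expected)
}
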
52 changes: 52 additions & 0 deletions pkg/storage/replica_range_lease.go
@@ -13,6 +13,36 @@
// permissions and limitations under the License.

// This file contains replica methods related to range leases.
//
// Here be dragons: The lease system (especially for epoch-based
// leases) relies on multiple interlocking conditional puts (here and
// in NodeLiveness). Reads (to get expected values) and conditional
// puts have to happen in a certain order, leading to surprising
// dependencies at a distance (for example, there's a LeaseStatus
// object that gets plumbed most of the way through this file.
// LeaseStatus bundles the results of multiple checks with the time at
// which they were performed, so that timestamp must be used for later
// operations). The current arrangement is not perfect, and there are
// opportunities for improvement, but any changes must be made very
// carefully.
//
// NOTE(bdarnell): The biggest problem with the current code is that
// with epoch-based leases, we may do two separate slow operations
// (IncrementEpoch/Heartbeat and RequestLease/AdminTransferLease). In
// the code organization inherited from expiration-based leases, we
// prepare the arguments we're going to use for the lease operations
// before performing the liveness operations, and by the time the
// liveness operations complete those arguments may be stale.
//
// Therefore, my suggested refactoring would be to move the liveness
// operations earlier in the process, soon after the initial
// leaseStatus call. If a liveness operation is required, do it and
// start over, with a fresh leaseStatus.
//
// This could also allow the liveness operations to be coalesced per
// node instead of having each range separately queue up redundant
// liveness operations. (The InitOrJoin model predates the
// singleflight package; could we simplify things by using it?)

package storage

@@ -139,6 +169,10 @@ func (p *pendingLeaseRequest) RequestPending() (roachpb.Lease, bool) {
// replica happen either before or after a result for a pending request has
// happened.
//
// The new lease will be a successor to the one in the status
// argument, and the fields of that status will be used to fill in
// the expected values for liveness and lease operations.
//
// transfer needs to be set if the request represents a lease transfer (as
// opposed to an extension, or acquiring the lease when none is held).
//
@@ -173,6 +207,17 @@ func (p *pendingLeaseRequest) InitOrJoinRequest(
var leaseReq roachpb.Request
now := p.repl.store.Clock().Now()
reqLease := roachpb.Lease{
// It's up to us to ensure that Lease.Start is greater than the
// end time of the previous lease. This means that if status
// refers to an expired epoch lease, we must increment the epoch
// *at status.Timestamp* before we can propose this lease.
//
// Note that the server may decrease our proposed start time if it
// decides that it is safe to do so (for example, this happens
// when renewing an expiration-based lease), but it will never
// increase it (and a start timestamp that is too low is unsafe
// because it results in incorrect initialization of the timestamp
// cache on the new leaseholder).
Start: status.Timestamp,
Replica: nextLeaseHolder,
ProposedTS: &now,
Expand Down Expand Up @@ -206,6 +251,9 @@ func (p *pendingLeaseRequest) InitOrJoinRequest(
leaseReq = &roachpb.RequestLeaseRequest{
RequestHeader: reqHeader,
Lease: reqLease,
// PrevLease must match for our lease to be accepted. If another
// lease is applied between our previous call to leaseStatus and
// our lease request applying, our request will be rejected.
PrevLease: status.Lease,
MinProposedTS: &minProposedTS,
}
@@ -231,6 +279,9 @@ func (p *pendingLeaseRequest) InitOrJoinRequest(

// requestLeaseAsync sends a transfer lease or lease request to the
// specified replica. The request is sent in an async task.
//
// The status argument is used as the expected value for liveness operations.
// reqLease and leaseReq must be consistent with the LeaseStatus.
func (p *pendingLeaseRequest) requestLeaseAsync(
parentCtx context.Context,
nextLeaseHolder roachpb.ReplicaDescriptor,
@@ -339,6 +390,7 @@ func (p *pendingLeaseRequest) requestLeaseAsync(
}
// Set error for propagation to all waiters below.
if err != nil {
// TODO(bdarnell): is status.Lease really what we want to put in the NotLeaseHolderError here?
pErr = roachpb.NewError(newNotLeaseHolderError(&status.Lease, p.repl.store.StoreID(), p.repl.Desc()))
}
}
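The ordering constraints described in the comments above can be condensed
into a sketch (hypothetical names; in the real code a LeaseStatus is
threaded through InitOrJoinRequest and requestLeaseAsync): evaluate the
lease status at one timestamp, run the liveness operation against exactly
the record captured in that status, and only then propose a lease whose
Start and PrevLease come from the same status.

package leasesketch

type Lease struct {
	Start int64
	Epoch int64
}

type Liveness struct {
	Epoch      int64
	Expiration int64
}

// LeaseStatus bundles the results of the lease and liveness checks with
// the timestamp at which they were evaluated; that timestamp must be
// reused by the follow-up operations.
type LeaseStatus struct {
	Timestamp  int64
	Lease      Lease
	Liveness   Liveness
	EpochBased bool
	Expired    bool
}

// acquireLease sketches the two-step dance for taking over an expired
// epoch-based lease: first revoke the previous leaseholder's epoch as of
// status.Timestamp, then propose a lease that starts at status.Timestamp
// and names status.Lease as PrevLease, so the proposal is rejected if any
// other lease slipped in after our status check.
func acquireLease(
	status LeaseStatus,
	incrementEpoch func(expected Liveness) error,
	requestLease func(start int64, prev Lease) error,
) error {
	if status.EpochBased && status.Expired {
		// Without the increment, our Start would not be greater than the
		// end of the previous lease.
		if err := incrementEpoch(status.Liveness); err != nil {
			return err
		}
	}
	return requestLease(status.Timestamp, status.Lease)
}
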
31 changes: 23 additions & 8 deletions pkg/storage/storagepb/lease_status.pb.go

Generated file; diff not rendered by default.

23 changes: 19 additions & 4 deletions pkg/storage/storagepb/lease_status.proto
@@ -27,12 +27,27 @@ enum LeaseState {
ERROR = 0;
// VALID indicates that the lease can be used.
VALID = 1;
// STASIS indicates that the lease has not expired, but can't be used.
// STASIS indicates that the lease has not expired, but can't be
// used because it is close to expiration (a stasis period at the
// end of each lease is one of the ways we handle clock
// uncertainty). A lease in STASIS may become VALID for the same
// leaseholder after a successful RequestLease (for expiration-based
// leases) or Heartbeat (for epoch-based leases). A lease may not
// change hands while it is in stasis; would-be acquirers must wait
// for the stasis period to expire.
STASIS = 2;
// EXPIRED indicates that the lease can't be used.
// EXPIRED indicates that the lease can't be used. An expired lease
// may become VALID for the same leaseholder on RequestLease or
// Heartbeat, or it may be replaced by a new leaseholder with a
// RequestLease (for expiration-based leases) or
// IncrementEpoch+RequestLease (for epoch-based leases).
EXPIRED = 3;
// PROSCRIBED indicates that the lease's proposed timestamp is earlier than
// allowed.
// PROSCRIBED indicates that the lease's proposed timestamp is
// earlier than allowed. This is used to detect node restarts: a
// node that has restarted will see its former incarnation's leases
// as PROSCRIBED so it will renew them before using them. Note that
// the PROSCRIBED state is only visible to the leaseholder; other
// nodes will see this as a VALID lease.
PROSCRIBED = 4;
}

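A minimal sketch of how these states might be derived from a lease's
expiration, the evaluation timestamp, and the clock's maximum offset (a
hypothetical helper, not part of the proto; the real logic also derives
PROSCRIBED from the minimum proposed timestamp):

package leasesketch

type LeaseState int

const (
	Valid LeaseState = iota
	Stasis
	Expired
)

// classify returns VALID, STASIS, or EXPIRED for an expiration-style
// check. The stasis window is the final maxOffset of the lease: within it
// the lease cannot be used, which is one of the ways clock uncertainty
// across a lease handoff is handled.
func classify(now, expiration, maxOffset int64) LeaseState {
	switch {
	case now >= expiration:
		return Expired
	case now >= expiration-maxOffset:
		return Stasis
	default:
		return Valid
	}
}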