-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: add support for allocator range check via store #92367
Conversation
094f7b5
to
f9a087e
Compare
f9a087e
to
abce17f
Compare
ee0b569
to
2ebb088
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
} conf, err := confReader.GetSpanConfigForKey(ctx, desc.StartKey)
@irfansharif added you to the review to double check this span config lookup seems ok.
2ebb088
to
27b9f38
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
Previously, AlexTalks (Alex Sarkesian) wrote…
@irfansharif added you to the review to double check this span config lookup seems ok.
I've not kept up with how this API is going to get used. Are you invoking it on on stores that actually hold replicas of the descriptor in question? If so, and you just want the associated span config, it already hangs off the Replica
type here, accessible using r.SpanConfig(): https://github.com/cockroachdb/cockroach/blob/1d4f79b52b56e68969cc5b310658d3f16905d7cf/pkg/kv/kvserver/replica.go#L439-L440.
Aside: Today the conf reader holds span configs of the entire key space, but that's not a necessity. If we want to push on a much higher number WRT range count and want to reduce the O(# of ranges) data structures we have in memory, we'll want the conf reader to only hold in memory data for ranges it cares about (i.e. ones held on the local store). Do with that as you will.
Aside: I didn't quite follow why we needed CheckRangeAction
. We say it's "similar to PlanOneChange, but range-scoped rather than replica-scoped" but PlanOneChange is quite intricate, and importantly, only does planning, which is what we effectively want here. Can we refactor things to reduce the amount of code duplication? If given just a RangeDescriptor, we can always look up the replica locally held. I'll defer to you and @kvoli since I'm just a passer-by here.
Aside: I see that in #91965 we're introducing a new use of NodeLivenessStatus
. We shouldn't, it's bad (ambiguous semantics in combining a node's cluster membership with its process state) and something we've moved away from everywhere but the allocator. It'd be unfortunate to regress here again. See this comment and the linked issue:
cockroach/pkg/kv/kvserver/liveness/livenesspb/liveness.proto
Lines 107 to 120 in d17382d
// NodeLivenessStatus describes the status of a node from the perspective of the | |
// liveness system. See comment on LivenessStatus() for a description of the | |
// states. | |
// | |
// TODO(irfansharif): We should reconsider usage of NodeLivenessStatus. | |
// It's unclear if the enum is well considered. It enumerates across two | |
// distinct set of things: the "membership" status (live/active, | |
// decommissioning, decommissioned), and the node "process" status (live, | |
// unavailable, available). It's possible for two of these "states" to be true, | |
// simultaneously (consider a decommissioned, dead node). It makes for confusing | |
// semantics, and the code attempting to disambiguate across these states | |
// (kvserver.LivenessStatus() for e.g.) seem wholly arbitrary. | |
// | |
// See #50707 for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
Previously, irfansharif (irfan sharif) wrote…
I've not kept up with how this API is going to get used. Are you invoking it on on stores that actually hold replicas of the descriptor in question? If so, and you just want the associated span config, it already hangs off the
Replica
type here, accessible using r.SpanConfig(): https://github.com/cockroachdb/cockroach/blob/1d4f79b52b56e68969cc5b310658d3f16905d7cf/pkg/kv/kvserver/replica.go#L439-L440.Aside: Today the conf reader holds span configs of the entire key space, but that's not a necessity. If we want to push on a much higher number WRT range count and want to reduce the O(# of ranges) data structures we have in memory, we'll want the conf reader to only hold in memory data for ranges it cares about (i.e. ones held on the local store). Do with that as you will.
Aside: I didn't quite follow why we needed
CheckRangeAction
. We say it's "similar to PlanOneChange, but range-scoped rather than replica-scoped" but PlanOneChange is quite intricate, and importantly, only does planning, which is what we effectively want here. Can we refactor things to reduce the amount of code duplication? If given just a RangeDescriptor, we can always look up the replica locally held. I'll defer to you and @kvoli since I'm just a passer-by here.Aside: I see that in #91965 we're introducing a new use of
NodeLivenessStatus
. We shouldn't, it's bad (ambiguous semantics in combining a node's cluster membership with its process state) and something we've moved away from everywhere but the allocator. It'd be unfortunate to regress here again. See this comment and the linked issue:.cockroach/pkg/kv/kvserver/liveness/livenesspb/liveness.proto
Lines 107 to 120 in d17382d
// NodeLivenessStatus describes the status of a node from the perspective of the // liveness system. See comment on LivenessStatus() for a description of the // states. // // TODO(irfansharif): We should reconsider usage of NodeLivenessStatus. // It's unclear if the enum is well considered. It enumerates across two // distinct set of things: the "membership" status (live/active, // decommissioning, decommissioned), and the node "process" status (live, // unavailable, available). It's possible for two of these "states" to be true, // simultaneously (consider a decommissioned, dead node). It makes for confusing // semantics, and the code attempting to disambiguate across these states // (kvserver.LivenessStatus() for e.g.) seem wholly arbitrary. // // See #50707 for more details.
@irfansharif The intent is for the store to NOT need to hold a replica in order to be able to process it. This allows us to evaluate the readiness to decommission without cluster-wide RPCs, so we can simply iterate through range descriptors in meta1/2 and see if each one that has a replica on the decommissioning node(s) can be moved.
This is why the distinct code path is needed. Given this, it also means that the conf reader can read span configs over the entire key space as well - I'm not sure how to deal with potential future changes to this interface, though perhaps some method of looking up a span config (regardless of if a store has a replica to which it applies) will be necessary?
I'm not sure this is actually introducing a new usage of NodeLivenessStatus
so much as it's allowing for an override for the purposes of the allocator. This is literally the method the allocator calls to evaluate the status for its purposes, so we have simply added a way to evaluate possible states (rather than simply the current state).
This change adds methods to be to evaluate allocator actions and targets utilizing a passed-in `StorePool` object, allowing for the allocator to consider potential scenarios rather than those simply based on the current liveness. Depends on cockroachdb#91461, cockroachdb#91965. Part of cockroachdb#91570. Release note: None
Adds support for using the allocator to compute the action, if any, needed to "repair" the range and, if the action requires a replica addition (including replacements), the target of that addition. This can be checked by using the new `CheckRangeAction(..)` function as part of a store's replicate queue, and can use the default store pool or an override store pool, so that potential states can be evaluated prior to transition to those states. As such, this feature adds support for allocator action and target validation prior to decommission, in order to support decommission pre-checks. While this is similar to the replicate queue's `PlanOneChange(..)`, this new check supports evaluation based on a range descriptor, rather than an individual replica. Depends on cockroachdb#91941 Part of cockroachdb#91570. Release note: None
This change exposes support via a store for checking the allocator action and upreplication target (if applicable) for any range descriptor. The range does not need to have a replica on the given store, nor is it required to evaluate given the current state of the cluster (i.e. the store's configured `StorePool`), as a node liveness override can be provided in order to evaluate possible future states. Depends on cockroachdb#92176. Part of cockroachdb#91570. Release note: None
27b9f38
to
91701d0
Compare
This adds support for the evaluation of the decommission readiness of a node (or set of nodes), by simulating their liveness to have the DECOMMISSIONING status and utilizing the allocator to ensure that we are able to perform any actions needed to repair the range. This supports a "strict" mode, in which case we expect all ranges to only need replacement or removal due to the decommissioning status, or a more permissive "non-strict" mode, which allows for other actions needed, as long as they do not encounter errors in finding a suitable allocation target. The non-strict mode allows us to permit situations where a range may have more than one action needed to repair it, such as a range that needs to reach its replication factor before the decommissioning replica can be replaced, or a range that needs to finalize an atomic replication change. Depends on cockroachdb#92367. Part of cockroachdb#91571 Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/store.go
line 3687 at r3 (raw file):
Check out what we populate in here:
cockroach/pkg/kv/kvserver/allocation_op.go
Lines 68 to 73 in 7d1793b
// AllocationChangeReplicasOp represents an operation to execute a change | |
// replicas txn. | |
type AllocationChangeReplicasOp struct { | |
usage allocator.RangeUsageInfo | |
lhStore roachpb.StoreID | |
chgs roachpb.ReplicationChanges |
It contains the target replica when we're {adding,removing} a {voter,non-voter}, something that CheckRangeAction
seems to also now do but through a different code path. But also, as your comment outlines, not always -- not when the local store is not a leaseholder since the replicate queue only works on leaseholder replicas. I see in #93758 we're not using the target replica at all, so do we need it? I assume we did in order to bubble up errors when attempting to allocator.Allocate{Voter,NonVoter}WithStorePool. But is that something we're surfacing when evaluating this RPC for replicas on non-leaseholder stores? What's our plan for that? Not returning this unused target would make the code clearer.
Stepping back: What's the reasoning behind avoiding cluster-wide RPCs? Some envelope math suggests decommissioning happens infrequently, probably <1000 times/year for a cluster that can handle 100000x+ that in statements/year: ((1000*100000)/365)/24/60 = 190 statements/sec. So if it's for performance reasons, I suspect we can get away with cluster-wide RPCs given this infrequency. Is there another reason? For cluster-wide RPCs, we only need ~1 round of them, right? Each node can inform the caller what it thinks of its leaseholder replicas that happen to have a replica on the potentially decommissioning node. The caller is free to aggregate across the responses. If replica/lease membership has changed during this fanout, we have a smaller set of ranges to re-evaluate on the leaseholder nodes. We can reduce the fanout by only targeting nodes that hold the same ranges as the decommissioning node.
though perhaps some method of looking up a span config (regardless of if a store has a replica to which it applies) will be necessary?
This will need an RPC BTW.
Replaced by #94024. |
This change exposes support via a store for checking the allocator
action and upreplication target (if applicable) for any range descriptor.
The range does not need to have a replica on the given store, nor is it
required to evaluate given the current state of the cluster (i.e. the
store's configured
StorePool
), as a node liveness override can beprovided in order to evaluate possible future states.
Depends on #92176.
Part of #91570.
Release note: None