allocator2: preliminary interface definition #104334
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 09cfdbe to 1454a39.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 33 at r1 (raw file):
// AddNodeID informs the allocator of a new node.
AddNodeID(nodeID roachpb.NodeID) error
Not an important point -
Would AddStore fail if the store descriptor's node ID wasn't previously added with AddNodeID(..)?
Likewise, would ChangeStore fail if the store wasn't previously added via AddStore(..)?
I'm wondering what the separation gives us over, say, SetStore (in place of AddStore and AddNodeID), which handles the node initialization as well.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 59 at r1 (raw file):
// and local synthesis:
//
// - storeLoadMsg.storeRanges is the list of ranges at the node, and is
The gossip info origin timestamp isn't readily available outside the internal gossip state. So we'd either need to add a timestamp field to the StoreDescriptor proto (or start using a new one), or update gossip to expose the origin TS.
cockroach/pkg/gossip/gossip.proto
Lines 140 to 141 in bc46049
// Wall time of info when generated by originating node (Unix-nanos).
int64 orig_stamp = 2;
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
// local node is the leaseholder.
//
// - storeLoadMsg.topKRanges could be locally synthesized by estimating the
+1 This seems doable. What we are missing is follower non-raft CPU. The other follower load dimensions (writeBytes, size, and raft CPU) can be estimated.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 92 at r1 (raw file):
AdjustPendingChangesDisposition(changes []pendingReplicaChange, success bool) error
// ComputeChanges is called periodically and frequently, say every 10s.
Did you have any thoughts on the distributed allocator's responsibility for rebalancing?
The options I've been thinking of are:
1. Each allocator is responsible for keeping the store it's on out of overload/above means.
2. Each allocator is responsible for keeping every store in the cluster out of overload/above means.
Then the legal changes an allocator could issue:
A. Allocator can issue changes for all ranges for which it holds the range lease.
B. Allocator can issue changes for all ranges for which it has a replica.
C. Allocator can issue changes for all ranges.
The store rebalancer does 1A currently.
The replicate queue (inc. rebalancing) does 2A currently.
In 1A, stores which have no leaseholders won't be dealt with effectively (happens today).
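To make the 1A/2A shorthand used below concrete, here is a purely illustrative Go sketch of that responsibility/scope matrix; all of the type and constant names are hypothetical and not part of the proposed interface.

```go
package allocator2

// Hypothetical names for the responsibility/scope matrix above; illustrative
// only, not part of the proposed allocator2 interface.

// rebalanceResponsibility: which stores an allocator tries to keep out of
// overload / above the means.
type rebalanceResponsibility int

const (
	// 1: only the store the allocator is running on.
	responsibilityLocalStore rebalanceResponsibility = 1
	// 2: every store in the cluster.
	responsibilityAllStores rebalanceResponsibility = 2
)

// changeScope: which ranges an allocator may issue changes for.
type changeScope int

const (
	// A: ranges for which a local store holds the lease.
	scopeLeaseholderRanges changeScope = iota
	// B: ranges for which a local store has a replica.
	scopeReplicaRanges
	// C: all ranges.
	scopeAllRanges
)

// Today's store rebalancer is (responsibilityLocalStore, scopeLeaseholderRanges),
// i.e. 1A, and the replicate queue's rebalancing is 2A.
```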
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 112 at r1 (raw file):
should we use load?
I'm not sure if there are any callers which would benefit.
StoreRebalancer always provides the final LH.
MergeQueue doesn't provide the final LH, but probably doesn't matter since the RHS LH will go away soon.
I don't know about any other uses in the code.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 119 at r1 (raw file):
//
// Note for reviewer: the intention here is to subsume some of the code in
// replica_send.go, and this method will be called from
ack. Is this supposed to be replica_command.go?
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
// Replica.relocateReplicas (ReplocateOne code is no longer needed).
//
// TODO(sumeer): is leaseholderStore needed. Presumably we are doing this at
relocateRanges issues DB commands like AdminTransferLease and AdminChangeReplicas instead of local leaseholder variants like changeReplicasImpl.
I think we do need to know the current leaseholder and we don't need to keep the leaseholder on the local store until the end.
pkg/kv/kvserver/allocator/allocator2/messages.go
line 89 at r1 (raw file):
// complete load even if the sender asked for a diff.
//
// TODO(sumeer): the diff story is not properly fleshed out for the local
ack - I've been struggling a bit with this.
TFTR!
Also added a summary of our synchronous discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 33 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Not an important point -
Would AddStore fail if the store descriptor's node ID wasn't previously added with AddNodeID(..)?
Likewise, would ChangeStore fail if the store wasn't previously added via AddStore(..)?
I'm wondering what the separation gives us over, say, SetStore (in place of AddStore and AddNodeID), which handles the node initialization as well.
Good point. I was mirroring what was initially done in clusterState, and looking at it again, there isn't any data-structural benefit. I've cleaned this up.
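For concreteness, a minimal sketch of what the merged call could look like, assuming the existing roachpb.StoreDescriptor (which already carries the node descriptor); the interface and method names here are illustrative, not necessarily what the PR ends up with.

```go
package allocator2

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Illustrative sketch only, not the final API: one possible shape after
// folding AddNodeID into the store-level call. SetStore adds the store if it
// is unknown, implicitly registering its node (store.Node.NodeID), and
// otherwise updates the existing entry, so separate AddNodeID/ChangeStore
// calls are not needed.
type clusterMembership interface {
	SetStore(store roachpb.StoreDescriptor) error
}
```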
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 59 at r1 (raw file):
Previously, kvoli (Austen) wrote…
The gossip info origin timestamp isn't readily available outside the internal gossip state. So we'd either need to add a timestamp field to the StoreDescriptor proto (or start using a new one), or update gossip to expose the origin TS.
cockroach/pkg/gossip/gossip.proto
Lines 140 to 141 in bc46049
// Wall time of info when generated by originating node (Unix-nanos).
int64 orig_stamp = 2;
Since it's already there in the gossip proto, I think we should try to expose it. It's all our code after all.
(from sync discussion)
- Default gossip with up-to-date load information is every 10s.
- There is additional gossip when the lease count and range count have changed by more than some threshold. This has stale load information and up-to-date range and leaseholder information, so this message is partially stale. It makes the problem described in kvserver: timestamp and invalidate storepool estimates #93532 worse. The main benefit of this additional gossip is that it prevents a thundering herd of lease transfers to a store. We tentatively decided that allocator2 will ignore such partially stale messages, and that we will handle the thundering herd problem of lease transfers to a store due to distributed allocators via a distributed throttling mechanism.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
What we are missing is follower non-raft CPU
Can we start with 0? I assume we don't have this in the current allocator either.
My understanding is that StoreDescriptor is being gossiped (looking at StorePool.storeGossipUpdate), and I don't see any per-range information in the contained StoreCapacity. Is my understanding correct?
I updated the code comment here.
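A rough sketch of the local synthesis being discussed, under the assumption that follower non-raft CPU simply starts at 0; the struct, field, and function names below are placeholders, not the actual message types in this PR.

```go
package allocator2

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// rangeLoadEstimate is a placeholder for the per-range load the leaseholder
// could synthesize for a follower store; the struct and field names are
// hypothetical.
type rangeLoadEstimate struct {
	RangeID    roachpb.RangeID
	WriteBytes int64 // estimated from leaseholder-observed write bytes
	SizeBytes  int64 // estimated from the range's size
	RaftCPU    int64 // estimated from raft work that followers also perform
	NonRaftCPU int64 // not measurable for followers today
}

// synthesizeFollowerLoad fills in the dimensions the leaseholder can
// estimate and starts follower non-raft CPU at 0, per the discussion above.
func synthesizeFollowerLoad(
	rangeID roachpb.RangeID, writeBytes, sizeBytes, raftCPU int64,
) rangeLoadEstimate {
	return rangeLoadEstimate{
		RangeID:    rangeID,
		WriteBytes: writeBytes,
		SizeBytes:  sizeBytes,
		RaftCPU:    raftCPU,
		NonRaftCPU: 0,
	}
}
```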
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 92 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Did you have any thoughts on the distributed allocator's responsibility for rebalancing?
The options I've been thinking of are:
1. Each allocator is responsible for keeping the store it's on out of overload/above means.
2. Each allocator is responsible for keeping every store in the cluster out of overload/above means.
Then the legal changes an allocator could issue:
A. Allocator can issue changes for all ranges for which it holds the range lease.
B. Allocator can issue changes for all ranges for which it has a replica.
C. Allocator can issue changes for all ranges.
The store rebalancer does 1A currently.
The replicate queue (inc. rebalancing) does 2A currently.
In 1A, stores which have no leaseholders won't be dealt with effectively (happens today).
(from sync discussion)
- We will do both 1A and 2A, and the replicateQueue will not do rebalancing when allocator2 is configured.
- Some more details about replicateQueue, related to rebalancing:
  - It can take 10min or longer to make a pass through all ranges and call shouldQueue. With a large number of ranges (common in large production clusters), this can be significantly longer than 10min. If no recovery action is needed, it will fall into the AllocatorConsiderRebalance path, in which case it will do a dry run and, if it gets a target, it will queue.
  - Consequently, replicateQueue can lag significantly in rebalancing to relieve overload on other stores, and allocator2 should be able to react faster.
  - The distributed throttling scheme mentioned earlier for lease transfers will also be used here to reduce the probability of a thundering herd of transfers.
Updated the code comment.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 112 at r1 (raw file):
Previously, kvoli (Austen) wrote…
should we use load?
I'm not sure if there are any callers which would benefit.
StoreRebalancer always provides the final LH.
MergeQueue doesn't provide the final LH, but probably doesn't matter since the RHS LH will go away soon. I don't know about any other uses in the code.
Updated to say that load is not used.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 119 at r1 (raw file):
Previously, kvoli (Austen) wrote…
ack. Is this supposed to be replica_command.go?
Ah yes. Fixed.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
relocateRanges issues DB commands like AdminTransferLease and AdminChangeReplicas instead of local leaseholder variants like changeReplicasImpl.
Does allocator2 need to be involved in AdminTransferLease -- I am guessing not since there isn't any decision to be made here other than doing the thing?
Regarding AdminChangeReplicasRequest, it seems to be called from 2 places: (a) internally, as the enactment steps for AdminRelocateRange, and (b) externally by sql.relocateRange. Does (b) need any interaction with allocator2 or has the caller specified the exact sequence of changes?
I think we do need to know the current leaseholder and we don't need to keep the leaseholder on the local store until the end.
I updated the comment to say that a store on this node does not need to be the leaseholder. And added a SpanConfig parameter for the same reason.
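To illustrate the shape of that change, a hypothetical signature sketch follows; the interface name, method name, and parameter list are placeholders meant only to show the span config and the current leaseholder being passed explicitly, rather than requiring a local store to hold the lease.

```go
package allocator2

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// pendingReplicaChange stands in for the type of the same name in this PR.
type pendingReplicaChange struct{}

// Hypothetical signature, for illustration only: the relocate-range entry
// point takes the span config and the current leaseholder explicitly, so a
// store on the local node does not need to hold the lease.
type relocator interface {
	ComputeChangesForRelocateRange(
		spanConfig roachpb.SpanConfig,
		currentLeaseholder roachpb.StoreID,
		voterTargets, nonVoterTargets []roachpb.ReplicationTarget,
	) ([]pendingReplicaChange, error)
}
```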
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 151 at r1 (raw file):
// TODO(sumeer):
// - understand what is really happening as a consequence of
//   ScorerOptionsForScatter.
Let's discuss this in our next synchronous session.
pkg/kv/kvserver/allocator/allocator2/messages.go
line 89 at r1 (raw file):
Previously, kvoli (Austen) wrote…
ack - I've been struggling a bit with this.
Ack. I think it will be easy to make such improvements once we have something working end-to-end.
For some reason GitHub is not picking up the latest commit in the branch to show in this PR. I'll wait for some time to see if it fixes itself.
Force-pushed from 135796e to b1b7ea2.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
Can we start with 0?
Yup sgtm.
My understanding is that StoreDescriptor is being gossiped (looking at StorePool.storeGossipUpdate), and I don't see any per-range information in the contained StoreCapacity. Is my understanding correct?
Understanding is correct. No per-range information in StoreCapacity. The current system never knows about a range's non-leaseholder replica load.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
Does allocator2 need to be involved in AdminTransferLease -- I am guessing not since there isn't any decision to be made here other than doing the thing?
It doesn't need to be involved.
[..] (b) externally by sql.relocateRange. Does (b) need any interaction with allocator2 or has the caller specified the exact sequence of changes?
No interaction necessary with allocator2. changeReplicas does have a dependency on the storepool (currently) for determining live and dead replicas used here.
The caller does not specify the exact sequence of changes, though; it is enacted in a priority symmetric to the allocator action (replicate queue) priority.
cockroach/pkg/kv/kvserver/replica_command.go
Lines 1038 to 1043 in fdf04ff
// We choose to execute changes in the following order:
// 1. Promotions / demotions / swaps between voters and non-voters
// 2. Voter additions
// 3. Voter removals
// 4. Non-voter additions
// 5. Non-voter removals
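As a purely illustrative aside, here is a small Go sketch of ordering a batch of changes into those buckets; the types below are hypothetical and are not the replica_command.go implementation.

```go
package allocator2

import "sort"

// changeKind is a hypothetical classification mirroring the ordering quoted
// above from replica_command.go; illustrative only.
type changeKind int

const (
	promoteDemoteOrSwap changeKind = iota // 1. promotions / demotions / swaps
	voterAddition                         // 2. voter additions
	voterRemoval                          // 3. voter removals
	nonVoterAddition                      // 4. non-voter additions
	nonVoterRemoval                       // 5. non-voter removals
)

// replicaChangeOp is a placeholder for a single change in a batch.
type replicaChangeOp struct {
	kind changeKind
	// target, range, etc. elided.
}

// orderChanges sorts a batch into the execution order described in the
// excerpt; a stable sort keeps the relative order within each bucket.
func orderChanges(changes []replicaChangeOp) {
	sort.SliceStable(changes, func(i, j int) bool {
		return changes[i].kind < changes[j].kind
	})
}
```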
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 151 at r1 (raw file):
Previously, sumeerbhola wrote…
Let's discuss this in our next synchronous session.
Sounds good.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 43 at r2 (raw file):
// stores.
//
// TODO(sumeer): do we need a RemoveStore(store roachpb.StoreDescriptor)
I imagine the only time a store would be removed is when the membership record for the node it's on transitions from decommissioning -> decommissioned.
It isn't possible to decommission just a store afaik, only nodes.
Informs cockroachdb#103320
Epic: CRDB-25222
Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Can we start with 0?
Yup sgtm.
My understanding is that StoreDescriptor is being gossiped (looking at StorePool.storeGossipUpdate), and I don't see any per-range information in the contained StoreCapacity. Is my understanding correct?
Understanding is correct. No per-range information in StoreCapacity. The current system never knows about a range's non-leaseholder replica load.
Ack. Left the comment in place.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Does allocator2 need to be involved in AdminTransferLease -- I am guessing not since there isn't any decision to be made here other than doing the thing?
It doesn't need to be involved.
[..] (b) externally by sql.relocateRange. Does (b) need any interaction with allocator2 or has the caller specified the exact sequence of changes?
No interaction necessary with allocator2. changeReplicas does have a dependency on the storepool (currently) for determining live and dead replicas used here.
The caller does not specify the exact sequence of changes, though; it is enacted in a priority symmetric to the allocator action (replicate queue) priority.
cockroach/pkg/kv/kvserver/replica_command.go
Lines 1038 to 1043 in fdf04ff
// We choose to execute changes in the following order:
// 1. Promotions / demotions / swaps between voters and non-voters
// 2. Voter additions
// 3. Voter removals
// 4. Non-voter additions
// 5. Non-voter removals
Thanks for the pointers. IIUC, that issue is the main reason we still have some value-add from AdminRelocateRange, and can't simply pass all that information to a single AdminChangeReplicasRequest.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 43 at r2 (raw file):
Previously, kvoli (Austen) wrote…
I imagine the only time a store would be removed is when the membership record for the node it's on transitions from decommissioning -> decommissioned.
It isn't possible to decommission just a store afaik, only nodes.
Ack. Removed the todo.
bors r=kvoli
Build failed (retrying...):
Build succeeded:
Informs #103320
Epic: CRDB-25222
Release note: None