allocator2: preliminary interface definition #104334
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from 09cfdbe to 1454a39.
Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 33 at r1 (raw file):
// AddNodeID informs the allocator of a new node.
AddNodeID(nodeID roachpb.NodeID) error
Not an important point -
Would AddStore fail if the store descriptor's node ID wasn't previously added with AddNodeID(..)?
Likewise, would ChangeStore fail if the store wasn't previously added via AddStore(..)?
I'm wondering what the separation gives us over, say, SetStore (in place of AddStore and AddNodeID), which handles the node initialization as well.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 59 at r1 (raw file):
// and local synthesis:
//
// - storeLoadMsg.storeRanges is the list of ranges at the node, and is
The gossip info origin timestamp isn't readily available outside the internal gossip state. So we'd either need to add a timestamp field to the StoreDescriptor proto (or start using a new one), or update gossip to expose the origin TS.
cockroach/pkg/gossip/gossip.proto
Lines 140 to 141 in bc46049
// Wall time of info when generated by originating node (Unix-nanos).
int64 orig_stamp = 2;
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
// local node is the leaseholder.
//
// - storeLoadMsg.topKRanges could be locally synthesized by estimating the
+1 This seems doable. What we are missing is follower non-raft CPU. The other follower load dimensions (writeBytes, size, and raft CPU) can be estimated.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 92 at r1 (raw file):
AdjustPendingChangesDisposition(changes []pendingReplicaChange, success bool) error
// ComputeChanges is called periodically and frequently, say every 10s.
Did you have any thoughts on the distributed allocator's responsibility for rebalancing?
The options I've been thinking of are:
1. Each allocator is responsible for keeping the store it's on out of overload/above means.
2. Each allocator is responsible for keeping every store in the cluster out of overload/above means.
Then the legal changes an allocator could issue:
A. Allocator can issue changes for all ranges for which it holds the range lease.
B. Allocator can issue changes for all ranges for which it has a replica.
C. Allocator can issue changes for all ranges.
The store rebalancer does 1A currently.
The replicate queue (inc. rebalancing) does 2A currently.
In 1A, stores which have no leaseholders won't be dealt with effectively (happens today).
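To make the 1A/2A shorthand used below concrete, here is a purely illustrative Go sketch of that responsibility/scope matrix; all of the type and constant names are hypothetical and not part of the proposed interface.

```go
package allocator2

// Hypothetical names for the responsibility/scope matrix above; illustrative
// only, not part of the proposed allocator2 interface.

// rebalanceResponsibility: which stores an allocator tries to keep out of
// overload / above the means.
type rebalanceResponsibility int

const (
	// 1: only the store the allocator is running on.
	responsibilityLocalStore rebalanceResponsibility = 1
	// 2: every store in the cluster.
	responsibilityAllStores rebalanceResponsibility = 2
)

// changeScope: which ranges an allocator may issue changes for.
type changeScope int

const (
	// A: ranges for which a local store holds the lease.
	scopeLeaseholderRanges changeScope = iota
	// B: ranges for which a local store has a replica.
	scopeReplicaRanges
	// C: all ranges.
	scopeAllRanges
)

// Today's store rebalancer is (responsibilityLocalStore, scopeLeaseholderRanges),
// i.e. 1A, and the replicate queue's rebalancing is 2A.
```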
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 112 at r1 (raw file):
should we use load?
I'm not sure if there are any callers which would benefit.
StoreRebalancer always provides the final LH.
MergeQueue doesn't provide the final LH, but probably doesn't matter since the RHS LH will go away soon.
I don't know about any other uses in the code.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 119 at r1 (raw file):
//
// Note for reviewer: the intention here is to subsume some of the code in
// replica_send.go, and this method will be called from
ack. Is this supposed to be replica_command.go?
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
// Replica.relocateReplicas (ReplocateOne code is no longer needed).
//
// TODO(sumeer): is leaseholderStore needed. Presumably we are doing this at
relocateRanges issues DB commands like AdminTransferLease and AdminChangeReplicas instead of local leaseholder variants like changeReplicasImpl.
I think we do need to know the current leaseholder and we don't need to keep the leaseholder on the local store until the end.
pkg/kv/kvserver/allocator/allocator2/messages.go
line 89 at r1 (raw file):
// complete load even if the sender asked for a diff.
//
// TODO(sumeer): the diff story is not properly fleshed out for the local
ack - I've been struggling a bit with this.
TFTR!
Also added a summary of our synchronous discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 33 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Not an important point -
Would AddStore fail if the store descriptor's node ID wasn't previously added with AddNodeID(..)?
Likewise, would ChangeStore fail if the store wasn't previously added via AddStore(..)?
I'm wondering what the separation gives us over, say, SetStore (in place of AddStore and AddNodeID), which handles the node initialization as well.
Good point. I was mirroring what was initially done in clusterState, and looking at it again, there isn't any data-structural benefit. I've cleaned this up.
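For concreteness, a minimal sketch of what the merged call could look like, assuming the existing roachpb.StoreDescriptor (which already carries the node descriptor); the interface and method names here are illustrative, not necessarily what the PR ends up with.

```go
package allocator2

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// Illustrative sketch only, not the final API: one possible shape after
// folding AddNodeID into the store-level call. SetStore adds the store if it
// is unknown, implicitly registering its node (store.Node.NodeID), and
// otherwise updates the existing entry, so separate AddNodeID/ChangeStore
// calls are not needed.
type clusterMembership interface {
	SetStore(store roachpb.StoreDescriptor) error
}
```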
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 59 at r1 (raw file):
Previously, kvoli (Austen) wrote…
The gossip info origin timestamp isn't readily available outside the internal gossip state. So we'd either need to add a timestamp field to the StoreDescriptor proto (or start using a new one), or update gossip to expose the origin TS.
cockroach/pkg/gossip/gossip.proto
Lines 140 to 141 in bc46049
// Wall time of info when generated by originating node (Unix-nanos).
int64 orig_stamp = 2;
Since it's already there in the gossip proto, I think we should try to expose it. It's all our code after all.
(from sync discussion)
- Default gossip with up-to-date load information is every 10s.
- There is additional gossip when the lease count and range count have changed by more than some threshold. This has stale load information and up-to-date range and leaseholder information, so this message is partially stale. It makes the problem described in kvserver: timestamp and invalidate storepool estimates #93532 worse. The main benefit of this additional gossip is that it prevents a thundering herd of lease transfers to a store. We tentatively decided that allocator2 will ignore such partially stale messages, and that we will handle the thundering herd problem of lease transfers to a store due to distributed allocators via a distributed throttling mechanism.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
What we are missing is follower non-raft CPU
Can we start with 0? I assume we don't have this in the current allocator either.
My understanding is that StoreDescriptor is being gossiped (looking at StorePool.storeGossipUpdate), and I don't see any per-range information in the contained StoreCapacity. Is my understanding correct?
I updated the code comment here.
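A rough sketch of the local synthesis being discussed, under the assumption that follower non-raft CPU simply starts at 0; the struct, field, and function names below are placeholders, not the actual message types in this PR.

```go
package allocator2

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// rangeLoadEstimate is a placeholder for the per-range load the leaseholder
// could synthesize for a follower store; the struct and field names are
// hypothetical.
type rangeLoadEstimate struct {
	RangeID    roachpb.RangeID
	WriteBytes int64 // estimated from leaseholder-observed write bytes
	SizeBytes  int64 // estimated from the range's size
	RaftCPU    int64 // estimated from raft work that followers also perform
	NonRaftCPU int64 // not measurable for followers today
}

// synthesizeFollowerLoad fills in the dimensions the leaseholder can
// estimate and starts follower non-raft CPU at 0, per the discussion above.
func synthesizeFollowerLoad(
	rangeID roachpb.RangeID, writeBytes, sizeBytes, raftCPU int64,
) rangeLoadEstimate {
	return rangeLoadEstimate{
		RangeID:    rangeID,
		WriteBytes: writeBytes,
		SizeBytes:  sizeBytes,
		RaftCPU:    raftCPU,
		NonRaftCPU: 0,
	}
}
```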
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 92 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Did you have any thoughts on the distributed allocator's responsibility for rebalancing?
The options I've been thinking of are:
1. Each allocator is responsible for keeping the store it's on out of overload/above means.
2. Each allocator is responsible for keeping every store in the cluster out of overload/above means.
Then the legal changes an allocator could issue:
A. Allocator can issue changes for all ranges for which it holds the range lease.
B. Allocator can issue changes for all ranges for which it has a replica.
C. Allocator can issue changes for all ranges.
The store rebalancer does 1A currently.
The replicate queue (inc. rebalancing) does 2A currently.
In 1A, stores which have no leaseholders won't be dealt with effectively (happens today).
(from sync discussion)
- We will do both 1A and 2A, and the replicateQueue will not do rebalancing when allocator2 is configured.
- Some more details about replicateQueue, related to rebalancing:
  - It can take 10min or longer to make a pass through all ranges and call shouldQueue. With a large number of ranges (common in large production clusters), this can be significantly longer than 10min. If no recovery action is needed, it will fall into the AllocatorConsiderRebalance path, in which case it will do a dry run and, if it gets a target, it will queue.
  - Consequently, replicateQueue can lag significantly in rebalancing to relieve overload on other stores, and allocator2 should be able to react faster.
  - The distributed throttling scheme mentioned earlier for lease transfers will also be used here to reduce the probability of a thundering herd of transfers.
Updated the code comment.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 112 at r1 (raw file):
Previously, kvoli (Austen) wrote…
should we use load?
I'm not sure if there are any callers which would benefit.
StoreRebalancer always provides the final LH.
MergeQueue doesn't provide the final LH, but probably doesn't matter since the RHS LH will go away soon. I don't know about any other uses in the code.
Updated to say that load is not used.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 119 at r1 (raw file):
Previously, kvoli (Austen) wrote…
ack. Is this supposed to be replica_command.go?
Ah yes. Fixed.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
relocateRanges issues DB commands like AdminTransferLease and AdminChangeReplicas instead of local leaseholder variants like changeReplicasImpl.
Does allocator2 need to be involved in AdminTransferLease -- I am guessing not since there isn't any decision to be made here other than doing the thing?
Regarding AdminChangeReplicasRequest, it seems to be called from 2 places: (a) internally, as the enactment steps for AdminRelocateRange, and (b) externally by sql.relocateRange. Does (b) need any interaction with allocator2 or has the caller specified the exact sequence of changes?
I think we do need to know the current leaseholder and we don't need to keep the leaseholder on the local store until the end.
I updated the comment to say that a store on this node does not need to be the leaseholder. And added a SpanConfig parameter for the same reason.
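To illustrate the shape of that change, a hypothetical signature sketch follows; the interface name, method name, and parameter list are placeholders meant only to show the span config and the current leaseholder being passed explicitly, rather than requiring a local store to hold the lease.

```go
package allocator2

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// pendingReplicaChange stands in for the type of the same name in this PR.
type pendingReplicaChange struct{}

// Hypothetical signature, for illustration only: the relocate-range entry
// point takes the span config and the current leaseholder explicitly, so a
// store on the local node does not need to hold the lease.
type relocator interface {
	ComputeChangesForRelocateRange(
		spanConfig roachpb.SpanConfig,
		currentLeaseholder roachpb.StoreID,
		voterTargets, nonVoterTargets []roachpb.ReplicationTarget,
	) ([]pendingReplicaChange, error)
}
```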
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 151 at r1 (raw file):
// TODO(sumeer):
// - understand what is really happening as a consequence of
//   ScorerOptionsForScatter.
Let's discuss this in our next synchronous session.
pkg/kv/kvserver/allocator/allocator2/messages.go
line 89 at r1 (raw file):
Previously, kvoli (Austen) wrote…
ack - I've been struggling a bit with this.
Ack. I think it will be easy to make such improvements once we have something working end-to-end.
For some reason GitHub is not picking up the latest commit in the branch to show in this PR. I'll wait for some time to see if it fixes itself.
Force-pushed from 135796e to b1b7ea2.
Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
Can we start with 0?
Yup sgtm.
My understanding is that StoreDescriptor is being gossiped (looking at StorePool.storeGossipUpdate), and I don't see any per-range information in the contained StoreCapacity. Is my understanding correct?
Understanding is correct. No per-range information in StoreCapacity. The current system never knows about a range's non-leaseholder replica load.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
Does allocator2 need to be involved in AdminTransferLease -- I am guessing not since there isn't any decision to be made here other than doing the thing?
It doesn't need to be involved.
[..] (b) externally by sql.relocateRange. Does (b) need any interaction with allocator2 or has the caller specified the exact sequence of changes?
No interaction necessary with allocator2. changeReplicas does have a dependency on the storepool (currently) for determining live and dead replicas used here.
The caller does not specify the exact sequence of changes, though; it is enacted in a priority symmetric to the allocator action (replicate queue) priority.
cockroach/pkg/kv/kvserver/replica_command.go
Lines 1038 to 1043 in fdf04ff
// We choose to execute changes in the following order:
// 1. Promotions / demotions / swaps between voters and non-voters
// 2. Voter additions
// 3. Voter removals
// 4. Non-voter additions
// 5. Non-voter removals
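As a purely illustrative aside, here is a small Go sketch of ordering a batch of changes into those buckets; the types below are hypothetical and are not the replica_command.go implementation.

```go
package allocator2

import "sort"

// changeKind is a hypothetical classification mirroring the ordering quoted
// above from replica_command.go; illustrative only.
type changeKind int

const (
	promoteDemoteOrSwap changeKind = iota // 1. promotions / demotions / swaps
	voterAddition                         // 2. voter additions
	voterRemoval                          // 3. voter removals
	nonVoterAddition                      // 4. non-voter additions
	nonVoterRemoval                       // 5. non-voter removals
)

// replicaChangeOp is a placeholder for a single change in a batch.
type replicaChangeOp struct {
	kind changeKind
	// target, range, etc. elided.
}

// orderChanges sorts a batch into the execution order described in the
// excerpt; a stable sort keeps the relative order within each bucket.
func orderChanges(changes []replicaChangeOp) {
	sort.SliceStable(changes, func(i, j int) bool {
		return changes[i].kind < changes[j].kind
	})
}
```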
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 151 at r1 (raw file):
Previously, sumeerbhola wrote…
Let's discuss this in our next synchronous session.
Sounds good.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 43 at r2 (raw file):
// stores.
//
// TODO(sumeer): do we need a RemoveStore(store roachpb.StoreDescriptor)
I imagine the only time a store would be removed is when the membership record for the node it's on transitions from decommissioning -> decommissioned.
It isn't possible to decommission just a store afaik, only nodes.
Informs cockroachdb#103320
Epic: CRDB-25222
Release note: None
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @kvoli)
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 71 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Can we start with 0?
Yup sgtm.
My understanding is that StoreDescriptor is being gossiped (looking at StorePool.storeGossipUpdate), and I don't see any per-range information in the contained StoreCapacity. Is my understanding correct?
Understanding is correct. No per-range information in StoreCapacity. The current system never knows about a range's non-leaseholder replica load.
Ack. Left the comment in place.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 122 at r1 (raw file):
Previously, kvoli (Austen) wrote…
Does allocator2 need to be involved in AdminTransferLease -- I am guessing not since there isn't any decision to be made here other than doing the thing?
It doesn't need to be involved.
[..] (b) externally by sql.relocateRange. Does (b) need any interaction with allocator2 or has the caller specified the exact sequence of changes?
No interaction necessary with allocator2. changeReplicas does have a dependency on the storepool (currently) for determining live and dead replicas used here.
The caller does not specify the exact sequence of changes, though; it is enacted in a priority symmetric to the allocator action (replicate queue) priority.
cockroach/pkg/kv/kvserver/replica_command.go
Lines 1038 to 1043 in fdf04ff
// We choose to execute changes in the following order:
// 1. Promotions / demotions / swaps between voters and non-voters
// 2. Voter additions
// 3. Voter removals
// 4. Non-voter additions
// 5. Non-voter removals
Thanks for the pointers. IIUC, that issue is the main reason we still have some value-add from AdminRelocateRange, and can't simply pass all that information to a single AdminChangeReplicasRequest.
pkg/kv/kvserver/allocator/allocator2/allocator.go
line 43 at r2 (raw file):
Previously, kvoli (Austen) wrote…
I imagine the only time a store would be removed is when the membership record for the node it's on transitions from decommissioning -> decommissioned.
It isn't possible to decommission just a store afaik, only nodes.
Ack. Removed the todo.
bors r=kvoli
Build failed (retrying...):
Build succeeded:
Informs #103320
Epic: CRDB-25222
Release note: None