ccl/sqlproxyccl: invoke rebalancing logic during RUNNING pod events #81177

jaylim-crl · 2022-05-10T19:24:48Z

ccl/sqlproxyccl: invoke rebalancing logic during RUNNING pod events

This commit invokes the rebalancing logic during RUNNING pod events as part of
the pod watcher. Since the rebalancing logic depends on the tenant directory,
the pod watcher will now only emit events once the directory has been updated.
This is done for better responsiveness, i.e. the moment a new SQL pod gets
added, we would like to rebalance all connections to the tenant.

Note that the Watch endpoint on the tenant directory server currently emits
events in multiple cases: changes to load, and changes to pod (added/modified/
deleted). The plan is to update the tenant directory server to only emit events
for pod updates. The next commit will rate limit the number of times the
rebalancing logic for a given tenant can be called.

At the same time, we introduce a new test static directory server which does
not automatically spin up tenants for us (i.e. SQL pods for tenants can now
be managed manually, giving more control to tests).

ccl/sqlproxyccl: rate limit the number of rebalances per tenant

This commit rate limits the number of rebalances per tenant to once every
15 seconds (i.e. 1/2 of the rebalance loop interval). The main purpose of
this is to prevent a burst of pod events for the same tenant causing multiple
rebalances, which may move a lot of connections around.

Release note: None


jeffswenson

LGTM

pkg/ccl/sqlproxyccl/balancer/balancer.go

jeffswenson · 2022-05-24T14:29:30Z

+
+	// rebalanceDelay is the minimum amount of time that must elapse between
+	// attempts to rebalance a given tenant. Defaults to defaultRebalanceDelay.
+	rebalanceDelay time.Duration


Thought: we could probably get rid of rebalanceDelay if we improve the behavior of rebalanceRate. Currently, if we schedule two reconciles back to back, it will schedule connections * rebalanceRate transfers. If we limited the number of in-progress transfers to rebalanceRate * connections, we could remove rebalanceDelay.

Alternatively we could change the definition to something like:
We can rebalance connections * rebalanceRate of a tenant's connections every second. If rebalances would exceed that rate, they are delayed until the next poll for idle connections.
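To make the suggestion above concrete, here is a minimal sketch of capping in-progress transfers. It is not the sqlproxyccl implementation: the `transferBudget` helper, the rate of 0.5, and rounding the cap up are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// transferBudget is a hypothetical helper. It returns how many additional
// transfers may be started for a tenant so that the number of in-progress
// transfers never exceeds rebalanceRate * connections (rounded up, which is
// an assumption of this sketch).
func transferBudget(connections, inProgress int, rebalanceRate float64) int {
	limit := int(math.Ceil(float64(connections) * rebalanceRate))
	if budget := limit - inProgress; budget > 0 {
		return budget
	}
	return 0
}

func main() {
	const rate = 0.5 // assumed value, for illustration only

	// Two reconciles scheduled back to back for a tenant with 100 connections.
	first := transferBudget(100, 0, rate)      // nothing in flight yet
	second := transferBudget(100, first, rate) // first batch still in flight
	fmt.Println(first, second)                 // prints "50 0"
}
```

With a cap like this, the second back-to-back reconcile schedules nothing until earlier transfers complete, which is the property that would make a separate delay unnecessary in the case of a single burst.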


jaylim-crl

TFTR!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)

pkg/ccl/sqlproxyccl/balancer/balancer.go line 176 at r2 (raw file):
I left this as-is. Will discuss, and make another PR if necessary.

"If we limited the number of in progress transfers to rebalanceRate * connections, we could remove rebalanceDelay."

I think the tricky part of this idea is that there could be a constant stream of connections that need to be rebalanced (i.e. triggered by the pod watcher). Without the delay, we would be rebalancing those connections constantly, and staying at the cap of rebalanceRate * connections all the time. The goal of rebalanceDelay is to rate limit the pod watcher events. For example, if the operator adds 7 SQL pods, we'll emit 7 RUNNING events, each calling RebalanceTenant. RebalanceTenant does not work well if called concurrently, or repeatedly within a short period of time (when the previously queued transfers have not all finished yet). Open to other suggestions if you have any.

pkg/ccl/sqlproxyccl/balancer/balancer.go line 208 at r2 (raw file):

Previously, JeffSwenson (Jeff Swenson) wrote…

nit: Instead of checking the options for default values, assign them in the struct and allow the options to override them. That makes for cleaner code and a more intuitive interface.

E.g. setting the default in the struct allows RebalanceDelay to accept 0 as the value that disables the delay feature instead of requiring a sentinel -1.
func NewBalancer(...) {
	// Start from the defaults; options only override them.
	options := &balancerOptions{
		maxConcurrentRebalances: defaultMaxConcurrentRebalances,
		timeSource:              &timeutil.DefaultTimeSource{},
		rebalanceRate:           defaultRebalanceRate,
	}
	for _, opt := range opts {
		opt(options)
	}
	// No if conditions necessary
}
I would also fold the default values into the struct initialization instead of defining constants.

Done. I left the default values as constants as I find it easier to look for them (since they are grouped at the top of the file).

jaylim-crl · 2022-05-24T20:46:48Z

TFTR! Happy to address any follow ups in another PR, if there are any.

bors r=JeffSwenson

craig · 2022-05-24T22:08:09Z

Build succeeded:

GitHub CI (Cockroach)

jaylim-crl mentioned this pull request May 16, 2022

release-22.1: ccl/sqlproxyccl: remove the idle monitor component #81305

Merged

jaylim-crl force-pushed the jay/220510-rebalance-single-tenant branch 9 times, most recently from d4b51c2 to 1b008f3 Compare May 23, 2022 14:00

jaylim-crl marked this pull request as ready for review May 23, 2022 16:53

jaylim-crl requested review from a team as code owners May 23, 2022 16:53

jaylim-crl requested review from jeffswenson and andy-kimball and removed request for a team May 23, 2022 16:54

jaylim-crl mentioned this pull request May 24, 2022

ccl/sqlproxyccl: add --disable-connection-rebalancing flag to "mt start-proxy" #81712

Merged

jeffswenson approved these changes May 24, 2022

jaylim-crl added 2 commits May 24, 2022 15:00

jaylim-crl force-pushed the jay/220510-rebalance-single-tenant branch from 1b008f3 to 62021aa Compare May 24, 2022 19:10

jaylim-crl added the backport-22.1.x label May 24, 2022

craig bot merged commit e2c163b into cockroachdb:master May 24, 2022

blathers-crl bot mentioned this pull request May 24, 2022

release-22.1: ccl/sqlproxyccl: invoke rebalancing logic during RUNNING pod events #81790

Merged

jaylim-crl deleted the jay/220510-rebalance-single-tenant branch May 24, 2022 22:46
