-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ccl/sqlproxyccl: invoke rebalancing logic during RUNNING pod events #81177
ccl/sqlproxyccl: invoke rebalancing logic during RUNNING pod events #81177
Conversation
d4b51c2
to
1b008f3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
||
// rebalanceDelay is the minimum amount of time that must elapse between | ||
// attempts to rebalance a given tenant. Defaults to defaultRebalanceDelay. | ||
rebalanceDelay time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: we could probably get rid of rebalanceDelay if we improve the behavior of rebalanceRate. Currently if we schedule two reconciles back to back, it will schedule connections*rebalanceRate transfers. If we limited the number of in progress transfers to rebalanceRate * connections, we could remove rebalanceDelay.
Alternatively we could change the definition to something like:
We can rebalance connections * rebalanceRate of a tenant's connections every second. If rebalances would exceed that rate, they are delayed until the next poll for idle connections.
This commit invokes the rebalancing logic during RUNNING pod events as part of the pod watcher. Since the rebalancing logic depends on the tenant directory, the pod watcher will now only emit events once the directory has been updated. This is done for better responsiveness, i.e. the moment a new SQL pod gets added, we would like to rebalance all connections to the tenant. Note that the Watch endpoint on the tenant directory server currently emits events in multiple cases: changes to load, and changes to pod (added/modified/ deleted). The plan is to update the tenant directory server to only emit events for pod updates. The next commit will rate limit the number of times the rebalancing logic for a given tenant can be called. At the same time, we introduce a new test static directory server which does not automatically spin up tenants for us (i.e. SQL pods for tenants can now be managed manually, giving more control to tests). Release note: None
This commit rate limits the number of rebalances per tenant to once every 15 seconds (i.e. 1/2 of the rebalance loop interval). The main purpose of this is to prevent a burst of pod events for the same tenant causing multiple rebalances, which may move a lot of connections around. Release note: None
1b008f3
to
62021aa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)
pkg/ccl/sqlproxyccl/balancer/balancer.go
line 176 at r2 (raw file):
I left this as-is. Will discuss, and make another PR if necessary.
If we limited the number of in progress transfers to rebalanceRate * connections, we could remove rebalanceDelay.
I think the tricky part to this idea is that there could be a constant stream of connections that need to be rebalanced (i.e. triggered by the pod watcher). Without the delay, we'll be constantly rebalancing those connections, and capping the defined rate of rebalanceRate * connections all the time. The goal of rebalanceDelay is to rate limit the pod watcher events. For example, if the operator adds 7 SQL pods, we'll emit 7 RUNNING events, each calling RebalanceTenant. RebalanceTenant does not work well if called concurrently, or within a short period of time (when all the previously queued transfers have not finished yet). Open to other suggestions if you have any.
pkg/ccl/sqlproxyccl/balancer/balancer.go
line 208 at r2 (raw file):
Previously, JeffSwenson (Jeff Swenson) wrote…
nit: Instead of checking the options for default values, assign them in the struct and allow the options to override them. That makes for cleaner code and a more intuitive interface.
E.g. setting the default in the struct allows RebalanceDelay to accept 0 as the value that disables the delay feature instead of requiring a sentinel -1.
func NewBalancer(...) { options := &balancerOptions { macConcurrentRebalances: defaultMaxConcurrentRebalances, timeSource: &timeutil.DefaultTimeSource{}, rebalanceRate: defaultRebalanceRate } for _, opt := range opts { opt(options) } // No if conditions necessary }
I would also fold the default values into the struct initialization instead of defining constants.
Done. I left the default values as constants as I find it easier to look for them (since they are grouped at the top of the file).
TFTR! Happy to address any follow ups in another PR, if there are any. bors r=JeffSwenson |
Build succeeded: |
ccl/sqlproxyccl: invoke rebalancing logic during RUNNING pod events
This commit invokes the rebalancing logic during RUNNING pod events as part of
the pod watcher. Since the rebalancing logic depends on the tenant directory,
the pod watcher will now only emit events once the directory has been updated.
This is done for better responsiveness, i.e. the moment a new SQL pod gets
added, we would like to rebalance all connections to the tenant.
Note that the Watch endpoint on the tenant directory server currently emits
events in multiple cases: changes to load, and changes to pod (added/modified/
deleted). The plan is to update the tenant directory server to only emit events
for pod updates. The next commit will rate limit the number of times the
rebalancing logic for a given tenant can be called.
At the same time, we introduce a new test static directory server which does
not automatically spin up tenants for us (i.e. SQL pods for tenants can now
be managed manually, giving more control to tests).
ccl/sqlproxyccl: rate limit the number of rebalances per tenant
This commit rate limits the number of rebalances per tenant to once every
15 seconds (i.e. 1/2 of the rebalance loop interval). The main purpose of
this is to prevent a burst of pod events for the same tenant causing multiple
rebalances, which may move a lot of connections around.
Release note: None