# storage controller: tenant creation sometimes disagrees with background optimization #8969
Closed. Tracked by #8264.
VladLazar added a commit that referenced this issue on Sep 24, 2024:
## Problem

Scheduling on tenant creation uses different heuristics compared to the scheduling done during background optimizations. This results in scenarios where shards are created and then immediately migrated by the optimizer.

## Summary of changes

1. Make the scheduler aware of the type of the shard it is scheduling (attached vs. secondary). We wish to have different heuristics for each.
2. For attached shards, include the attached shard count from the context in the node score calculation. This brings initial shard scheduling in line with what the optimization passes do (see the sketch below).
3. Add a test for (2).

This looks like a bigger change than required, but the refactoring serves as the basis for AZ-aware shard scheduling, where we also need to make the distinction between attached and secondary shards.

Closes #8969
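The crux of the change is point (2): folding the tenant's existing attachment count on a node into the score used at creation time. The following is a minimal, self-contained sketch of that idea, not the actual storage controller code; `ShardTag`, `NodeContext`, `affinity_score`, and `pick_node` are all illustrative names.

```rust
// Sketch: the scheduler is told whether it is placing an attached or a secondary
// location, and only for attached placements does it fold the tenant's existing
// attachment count on each candidate node into the affinity score.

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum ShardTag {
    Attached,
    Secondary,
}

/// How many of this tenant's shards already live on a candidate node.
#[derive(Clone, Copy, Default)]
struct NodeContext {
    total_shards: usize,    // attached + secondary locations of this tenant
    attached_shards: usize, // attached locations only
}

/// Lower is a better placement target.
fn affinity_score(ctx: NodeContext, tag: ShardTag) -> usize {
    match tag {
        // Attached placements also steer away from nodes that already hold
        // attachments for this tenant, matching what the optimizer prefers.
        ShardTag::Attached => ctx.total_shards + ctx.attached_shards,
        ShardTag::Secondary => ctx.total_shards,
    }
}

fn pick_node(candidates: &[(&'static str, NodeContext)], tag: ShardTag) -> Option<&'static str> {
    let mut best: Option<(usize, &'static str)> = None;
    for &(name, ctx) in candidates {
        let score = affinity_score(ctx, tag);
        if best.map_or(true, |(s, _)| score < s) {
            best = Some((score, name));
        }
    }
    best.map(|(_, name)| name)
}

fn main() {
    let candidates = [
        // ps-1 already holds two attached shards of this tenant.
        ("ps-1", NodeContext { total_shards: 2, attached_shards: 2 }),
        // ps-2 holds two secondaries of this tenant.
        ("ps-2", NodeContext { total_shards: 2, attached_shards: 0 }),
    ];
    // Without the attached-shard term both nodes tie; with it, the new attachment
    // goes to ps-2, so the optimizer has no reason to migrate it right away.
    assert_eq!(pick_node(&candidates, ShardTag::Attached), Some("ps-2"));
    println!("{:?}", pick_node(&candidates, ShardTag::Attached));
}
```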
I think our fix isn't quite complete; I'm analyzing a log where this happened again...
There's a unit test that reproduces the unstable case in #9275.
The github-merge-queue bot pushed a commit that referenced this issue on Jan 13, 2025:
## Problem

We want to do a more robust job of scheduling tenants into their home AZ: #8264.

Closes: #8969

## Summary of changes

### Scope

This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one.

### Scheduling optimisation

`TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether and how to migrate to it:

- To decide if a new location is sufficiently "better", we generate scores using projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later.
- Score types get a `for_optimization` method so that when we compare scores, we will only do an optimisation if the scores differ by their highest-ranking attributes, not just because one pageserver is lower in utilization (illustrated in the sketch below). Eventually we _will_ want a mode that does this, but doing it here would make the scheduling logic unstable and harder to test, and doing it correctly requires knowing the size of the tenant being migrated.
- When we find a new attached location that we would like to move to, we will create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well.
- A unit test is added for #8969, which is implicitly fixed by reworking optimisation to use the same scheduling scores as initial scheduling.
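To illustrate the `for_optimization` idea, here is a rough sketch under assumed names (`NodeScore`, `OptimizationScore`, `az_mismatch`, `affinity`, `utilization_pct` are invented, not the real score types): the full score is compared lexicographically by priority, but the optimization comparison drops the low-priority utilization component so that a utilization difference alone never triggers a migration.

```rust
/// Full placement score, ordered lexicographically: home-AZ match first,
/// then affinity to the tenant's other shards, then node utilization.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct NodeScore {
    az_mismatch: u8,     // 0 = node is in the shard's home AZ
    affinity: usize,     // shards of this tenant already on the node
    utilization_pct: u8, // coarse node fullness
}

/// What the optimizer is allowed to compare: the low-priority utilization
/// component is dropped, so migrations only happen when the AZ or affinity
/// picture actually improves.
#[derive(PartialEq, Eq, PartialOrd, Ord, Debug)]
struct OptimizationScore {
    az_mismatch: u8,
    affinity: usize,
}

impl NodeScore {
    fn for_optimization(self) -> OptimizationScore {
        OptimizationScore { az_mismatch: self.az_mismatch, affinity: self.affinity }
    }
}

fn main() {
    let current = NodeScore { az_mismatch: 0, affinity: 1, utilization_pct: 70 };
    let candidate = NodeScore { az_mismatch: 0, affinity: 1, utilization_pct: 40 };

    // The scheduler would prefer the emptier node for a brand-new shard...
    assert!(candidate < current);
    // ...but the optimizer sees equal truncated scores and leaves the shard
    // where it is, avoiding migrations that only chase utilization.
    assert!(candidate.for_optimization() == current.for_optimization());
}
```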
Tenant creation uses `schedule_shard`, which chooses affinity scores based on the absolute number of shards that a tenant has on a node, treating attached and secondary locations the same. However, the `optimize_all` logic prefers to move attached locations away from nodes that have lots of existing attachments for the same tenant.

This can lead to a situation where we create a tenant and then immediately start migrating one of its shards, because several attached locations for the same tenant were already on that node.

We can fix this by modifying `schedule_shard` to know whether it is scheduling an attached or a secondary location, and, if scheduling an attached location, to include `ScheduleContext::attached_nodes` in the affinity score calculation.

This will probably get solved implicitly when implementing AZ-aware scheduling, as that will also need to distinguish attached vs. secondary locations.
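To make the mismatch concrete, here is a tiny illustrative sketch with assumed numbers and invented names (`creation_score`, `optimizer_score`, `TenantShardsOnNode` are not real storage controller APIs): the creation-time heuristic treats attached and secondary locations identically, while the optimizer-style heuristic also penalizes co-located attachments, so a placement that ties at creation time can be "improved" by the optimizer immediately afterwards.

```rust
#[derive(Clone, Copy)]
struct TenantShardsOnNode {
    attached: usize,
    secondary: usize,
}

// Pre-fix creation heuristic: only the absolute shard count matters.
fn creation_score(n: TenantShardsOnNode) -> usize {
    n.attached + n.secondary
}

// Optimizer-style heuristic: attachments on the same node weigh extra.
fn optimizer_score(n: TenantShardsOnNode) -> usize {
    n.attached * 2 + n.secondary
}

fn main() {
    // Two candidate pageservers, each already holding two of the tenant's shards.
    let ps1 = TenantShardsOnNode { attached: 2, secondary: 0 };
    let ps2 = TenantShardsOnNode { attached: 0, secondary: 2 };

    // At creation time the nodes tie, so the new attachment may land on ps1...
    assert_eq!(creation_score(ps1), creation_score(ps2));
    // ...but the optimizer immediately prefers ps2, triggering a migration.
    assert!(optimizer_score(ps2) < optimizer_score(ps1));
    println!("creation ties, optimizer prefers ps2: migration right after creation");
}
```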