
storage controller: tenant creation sometimes disagrees with background optimization #8969

Closed
Tracked by #8264
jcsp opened this issue Sep 9, 2024 · 2 comments · Fixed by #9081 or #9916
Labels
a/tech_debt (Area: related to tech debt), c/storage/controller (Component: Storage Controller)

Comments

jcsp (Contributor) commented Sep 9, 2024

Tenant creation uses schedule_shard, which computes affinity scores from the absolute number of shards that a tenant has on a node, treating attached and secondary locations the same.

However, the optimize_all logic prefers to move attached locations away from nodes that have lots of existing attachments for the same tenant.

This can lead to a situation where we create a tenant and then immediately start migrating one of its shards, because the scheduler placed several attached locations for the same tenant on the same node.

We can fix this by making schedule_shard aware of whether it is scheduling an attached or a secondary location, and, when scheduling an attached location, including ScheduleContext::attached_nodes in the affinity score calculation.

This will probably get solved implicitly when implementing AZ-aware scheduling, as that will also need to distinguish attached vs. secondary locations.
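
A minimal sketch of the proposed scoring change. Only `ScheduleContext::attached_nodes` and the attached/secondary distinction come from this issue; the types and numbers below are illustrative stand-ins, not the controller's real API:

```rust
use std::collections::HashMap;

type NodeId = u64;

#[derive(Clone, Copy)]
enum ShardKind {
    Attached,
    Secondary,
}

struct ScheduleContext {
    /// This tenant's shards per node, attached and secondary alike.
    nodes: HashMap<NodeId, usize>,
    /// This tenant's attached shards per node.
    attached_nodes: HashMap<NodeId, usize>,
}

/// Affinity score for placing a shard on `node`: lower is better.
fn affinity_score(ctx: &ScheduleContext, node: NodeId, kind: ShardKind) -> usize {
    let total = ctx.nodes.get(&node).copied().unwrap_or(0);
    match kind {
        // Weight existing attachments extra, so that creation penalizes the
        // same thing the optimizer later tries to move away from.
        ShardKind::Attached => total + ctx.attached_nodes.get(&node).copied().unwrap_or(0),
        ShardKind::Secondary => total,
    }
}

fn main() {
    let ctx = ScheduleContext {
        nodes: HashMap::from([(1, 2), (2, 2)]),
        attached_nodes: HashMap::from([(1, 2)]), // node 1 already holds two attachments
    };
    // Node 2's two shards are secondaries, so it now wins for a new attached
    // shard; previously both nodes scored 2 and the tie could land on node 1.
    assert!(affinity_score(&ctx, 2, ShardKind::Attached) < affinity_score(&ctx, 1, ShardKind::Attached));
}
```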

jcsp added the a/tech_debt and c/storage/controller labels on Sep 9, 2024
VladLazar added a commit that referenced this issue Sep 24, 2024
## Problem

Scheduling on tenant creation uses different heuristics compared to the scheduling done during background optimizations. This results in scenarios where shards are created and then immediately migrated by the optimizer.

## Summary of changes

1. Make the scheduler aware of the type of shard it is scheduling (attached vs. secondary), since we want different heuristics for each.
2. For attached shards, include the attached shard count from the context in the node score calculation. This brings initial shard scheduling in line with what the optimization passes do (a sketch follows this commit message).
3. Add a test for (2).

This looks like a bigger change than required, but the refactoring serves as the basis for AZ-aware shard scheduling, where we also need to make the distinction between attached and secondary shards.

Closes #8969
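
As a rough sketch of how point (2) could plug into node selection, building on the illustrative `ScheduleContext`, `ShardKind`, and `affinity_score` from the sketch in the issue description above; `utilization` is a hypothetical stand-in for whatever fullness metric a node reports:

```rust
/// Pick the best candidate node for a shard of the given kind.
fn schedule_shard(
    ctx: &ScheduleContext,
    candidates: &[(NodeId, u64)], // (node, utilization)
    kind: ShardKind,
) -> Option<NodeId> {
    candidates
        .iter()
        // Tuples compare lexicographically: affinity (which now includes the
        // attached shard count when placing an attached shard) dominates,
        // and utilization only breaks ties.
        .min_by_key(|&&(node, utilization)| (affinity_score(ctx, node, kind), utilization))
        .map(|&(node, _)| node)
}
```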
jcsp (Contributor, Author) commented Oct 3, 2024

I think our fix isn't quite complete; I'm analyzing a log where this happened again...

jcsp (Contributor, Author) commented Oct 4, 2024

There's a unit test that reproduces the unstable case in #9275

jcsp added a commit that referenced this issue Oct 10, 2024
github-merge-queue bot pushed a commit that referenced this issue Jan 13, 2025
## Problem

We want to do a more robust job of scheduling tenants into their home AZ: #8264.

Closes: #8969

## Summary of changes

### Scope

This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one.

### Scheduling optimisation

`TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether and how to migrate to it:
- To decide if a new location is sufficiently "better", we generate scores using projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later.
- Score types get a `for_optimization` method so that when we compare scores, we only do an optimisation if the scores differ in their highest-ranking attributes, not just because one pageserver is lower in utilization (a sketch follows this list). Eventually we _will_ want a mode that optimises on utilization too, but doing it here would make scheduling logic unstable and harder to test, and to do it correctly one needs to know the size of the tenant being migrated.
- When we find a new attached location that we would like to move to, we create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well.
- A unit test is added for #8969, which is implicitly fixed by reworking optimisation to use the same scores as the scheduler.
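
A minimal sketch of the `for_optimization` idea. Only the method name comes from this PR description; the score fields and their ranking below are invented for the example:

```rust
// Score fields ranked high to low; derived Ord compares them in declaration order.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct NodeScore {
    az_mismatch: bool, // highest-ranking: is the node outside the home AZ?
    affinity: usize,   // shards of this tenant already on the node
    utilization: u64,  // lowest-ranking: general fullness
}

impl NodeScore {
    /// Drop the low-ranking utilization component: background optimization
    /// only migrates when a higher-ranking attribute actually differs.
    fn for_optimization(self) -> (bool, usize) {
        (self.az_mismatch, self.affinity)
    }
}

fn main() {
    let current = NodeScore { az_mismatch: false, affinity: 1, utilization: 80 };
    let candidate = NodeScore { az_mismatch: false, affinity: 1, utilization: 40 };
    // A raw comparison prefers the emptier node and would trigger a migration...
    assert!(candidate < current);
    // ...but the truncated scores are equal, so the shard stays put.
    assert_eq!(candidate.for_optimization(), current.for_optimization());
}
```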