
storage controller: tenant creation sometimes disagrees with background optimization #8969

Closed
Tracked by #8264
jcsp opened this issue Sep 9, 2024 · 2 comments · Fixed by #9081 or #9916
Labels
a/tech_debt (Area: related to tech debt), c/storage/controller (Component: Storage Controller)

Comments

jcsp (Contributor) commented Sep 9, 2024

Tenant creation uses schedule_shard, which computes affinity scores from the absolute number of shards that a tenant has on a node, treating attached and secondary locations the same.

However, the optimize_all logic prefers to move attached locations away from nodes that have lots of existing attachments for the same tenant.

This can lead to a situation where we create a tenant and then immediately start migrating one of its shards, because the scheduler placed several attached locations for the same tenant on the same node.

We can fix this by making schedule_shard aware of whether it is scheduling an attached or a secondary location, and, when scheduling an attached location, including ScheduleContext::attached_nodes in the affinity score calculation.

This will probably get solved implicitly when implementing AZ-aware scheduling, as that will also need to distinguish attached vs. secondary locations.
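
A minimal sketch of the proposed scoring change. Only `ScheduleContext::attached_nodes` and the attached/secondary distinction come from this issue; the types and numbers below are illustrative stand-ins, not the controller's real API:

```rust
use std::collections::HashMap;

type NodeId = u64;

#[derive(Clone, Copy)]
enum ShardKind {
    Attached,
    Secondary,
}

struct ScheduleContext {
    /// This tenant's shards per node, attached and secondary alike.
    nodes: HashMap<NodeId, usize>,
    /// This tenant's attached shards per node.
    attached_nodes: HashMap<NodeId, usize>,
}

/// Affinity score for placing a shard on `node`: lower is better.
fn affinity_score(ctx: &ScheduleContext, node: NodeId, kind: ShardKind) -> usize {
    let total = ctx.nodes.get(&node).copied().unwrap_or(0);
    match kind {
        // Weight existing attachments extra, so that creation penalizes the
        // same thing the optimizer later tries to move away from.
        ShardKind::Attached => total + ctx.attached_nodes.get(&node).copied().unwrap_or(0),
        ShardKind::Secondary => total,
    }
}

fn main() {
    let ctx = ScheduleContext {
        nodes: HashMap::from([(1, 2), (2, 2)]),
        attached_nodes: HashMap::from([(1, 2)]), // node 1 already holds two attachments
    };
    // Node 2's two shards are secondaries, so it now wins for a new attached
    // shard; previously both nodes scored 2 and the tie could land on node 1.
    assert!(affinity_score(&ctx, 2, ShardKind::Attached) < affinity_score(&ctx, 1, ShardKind::Attached));
}
```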

jcsp added the a/tech_debt and c/storage/controller labels on Sep 9, 2024
VladLazar added a commit that referenced this issue Sep 24, 2024
## Problem

Scheduling on tenant creation uses different heuristics compared to the scheduling done during background optimizations. This results in scenarios where shards are created and then immediately migrated by the optimizer.

## Summary of changes

1. Make the scheduler aware of the type of shard it is scheduling (attached vs. secondary), since we want different heuristics for each.
2. For attached shards, include the attached shard count from the context in the node score calculation. This brings initial shard scheduling in line with what the optimization passes do (a sketch follows this commit message).
3. Add a test for (2).

This looks like a bigger change than required, but the refactoring serves as the basis for AZ-aware shard scheduling, where we also need to make the distinction between attached and secondary shards.

Closes #8969
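
As a rough sketch of how point (2) could plug into node selection, building on the illustrative `ScheduleContext`, `ShardKind`, and `affinity_score` from the sketch in the issue description above; `utilization` is a hypothetical stand-in for whatever fullness metric a node reports:

```rust
/// Pick the best candidate node for a shard of the given kind.
fn schedule_shard(
    ctx: &ScheduleContext,
    candidates: &[(NodeId, u64)], // (node, utilization)
    kind: ShardKind,
) -> Option<NodeId> {
    candidates
        .iter()
        // Tuples compare lexicographically: affinity (which now includes the
        // attached shard count when placing an attached shard) dominates,
        // and utilization only breaks ties.
        .min_by_key(|&&(node, utilization)| (affinity_score(ctx, node, kind), utilization))
        .map(|&(node, _)| node)
}
```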
jcsp (Contributor, Author) commented Oct 3, 2024

I think our fix isn't quite complete; I'm analyzing a log where this happened again...

jcsp (Contributor, Author) commented Oct 4, 2024

There's a unit test that reproduces the unstable case in #9275

jcsp added a commit that referenced this issue Oct 10, 2024
github-merge-queue bot pushed a commit that referenced this issue Jan 13, 2025
## Problem

We want to do a more robust job of scheduling tenants into their home AZ: #8264.

Closes: #8969

## Summary of changes

### Scope

This PR combines prioritizing AZ with a larger rework of how we do optimisation. The rationale is that just bumping AZ in the order of Score attributes is a very tiny change: the interesting part is lining up all the optimisation logic to respect this properly, which means rewriting it to use the same scores as the scheduler, rather than the fragile hand-crafted logic that we had before. Separating these changes out is possible, but would involve doing two rounds of test updates instead of one.

### Scheduling optimisation

`TenantShard`'s `optimize_attachment` and `optimize_secondary` methods now both use the scheduler to pick a new "favourite" location. Then there is some refined logic for whether and how to migrate to it:
- To decide if a new location is sufficiently "better", we generate scores using projected ScheduleContexts that exclude the shard under consideration, so that we avoid migrating from a node with AffinityScore(2) to a node with AffinityScore(1), only to migrate back later.
- Score types get a `for_optimization` method so that when we compare scores, we only do an optimisation if the scores differ in their highest-ranking attributes, not just because one pageserver is lower in utilization (a sketch follows this list). Eventually we _will_ want a mode that optimises on utilization too, but doing it here would make scheduling logic unstable and harder to test, and to do it correctly one needs to know the size of the tenant being migrated.
- When we find a new attached location that we would like to move to, we create a new secondary location there, even if we already had one on some other node. This handles the case where we have a home AZ A and want to migrate the attachment between pageservers in that AZ while retaining a secondary location in some other AZ as well.
- A unit test is added for #8969, which is implicitly fixed by reworking optimisation to use the same scores as the scheduler.
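
A minimal sketch of the `for_optimization` idea. Only the method name comes from this PR description; the score fields and their ranking below are invented for the example:

```rust
// Score fields ranked high to low; derived Ord compares them in declaration order.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct NodeScore {
    az_mismatch: bool, // highest-ranking: is the node outside the home AZ?
    affinity: usize,   // shards of this tenant already on the node
    utilization: u64,  // lowest-ranking: general fullness
}

impl NodeScore {
    /// Drop the low-ranking utilization component: background optimization
    /// only migrates when a higher-ranking attribute actually differs.
    fn for_optimization(self) -> (bool, usize) {
        (self.az_mismatch, self.affinity)
    }
}

fn main() {
    let current = NodeScore { az_mismatch: false, affinity: 1, utilization: 80 };
    let candidate = NodeScore { az_mismatch: false, affinity: 1, utilization: 40 };
    // A raw comparison prefers the emptier node and would trigger a migration...
    assert!(candidate < current);
    // ...but the truncated scores are equal, so the shard stays put.
    assert_eq!(candidate.for_optimization(), current.for_optimization());
}
```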