Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage controller: make chaos less disruptive to AZ locality #10438

Merged
merged 1 commit into from
Jan 20, 2025

Conversation

jcsp
Copy link
Collaborator

@jcsp jcsp commented Jan 17, 2025

Problem

Since #9916 , the chaos code is actively fighting the optimizer: tenants tend to be attached in their preferred AZ, so most chaos migrations were moving them to a non-preferred AZ.

Summary of changes

  • When picking migrations, prefer to migrate things toward their preferred AZ when possible. Then pick shards to move the other way when necessary.

The resulting behavior should be an alternating "back and forth" where the chaos code migrates thiings away from home, and then migrates them back on the next iteration.

The side effect will be that the chaos code actively helps to push things into their home AZ. That's not contrary to its purpose though: we mainly just want it to continuously migrate things to exercise migration+notification code.

@jcsp jcsp requested a review from problame January 17, 2025 15:15
@jcsp
Copy link
Collaborator Author

jcsp commented Jan 17, 2025

@problame this aims to avoid future conflicts between wanting to exercise migrations and wanting to avoid scrambling AZ locality

@jcsp jcsp marked this pull request as ready for review January 17, 2025 15:24
@jcsp jcsp requested a review from a team as a code owner January 17, 2025 15:24
@jcsp jcsp added a/tech_debt Area: related to tech debt c/storage/controller Component: Storage Controller labels Jan 17, 2025
Copy link

7348 tests run: 6971 passed, 0 failed, 377 skipped (full report)


Flaky tests (6)

Postgres 17

Postgres 16

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 33.6% (8434 of 25075 functions)
  • lines: 49.1% (70548 of 143684 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
acbfb8c at 2025-01-17T16:12:18.433Z :recycle:

Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is better than what we have now.
Some nits wrt determinism and randomness.

@jcsp
Copy link
Collaborator Author

jcsp commented Jan 20, 2025

I'm going to merge this to let the staging environment stabilize its scheduling, and do a small followup to make it a hashset etc.

@jcsp jcsp added this pull request to the merge queue Jan 20, 2025
Merged via the queue into main with commit 7d761a9 Jan 20, 2025
93 checks passed
@jcsp jcsp deleted the jcsp/storage-less-chaotic-chaos branch January 20, 2025 09:48
github-merge-queue bot pushed a commit that referenced this pull request Jan 30, 2025
## Problem

In #10438 I had got the
function for picking tenants backwards, and it was preferring to move
things _away_ from their preferred AZ.

## Summary of changes

- Fix condition in `is_attached_outside_preferred_az`
github-merge-queue bot pushed a commit that referenced this pull request Jan 30, 2025
## Problem

In #10438 it was pointed out
that it would be good to avoid picking tenants in ID order, and also to
avoid situations where we might double-select the same tenant.

There was an initial swing at this in
#10443, where Chi suggested a
simpler approach which is done in this PR

## Summary of changes

- Split total set of tenants into in and out of home AZ
- Consume out of home AZ first, and if necessary shuffle + consume from
out of home AZ
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/controller Component: Storage Controller
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants