[Autoscaler][Placement Group] Skip placed bundle when requesting resource #48924

Open
wants to merge 2 commits into
base: master

Conversation

mimiliaogo
Contributor

@mimiliaogo mimiliaogo commented Nov 25, 2024

Why are these changes needed?

Before this PR, when a node hosting part of a placement group (PG) goes down, the autoscaler requests resources for the entire PG (all bundles), including bundles that are still placed on healthy nodes. This leads to overprovisioning. Details: #40212

This PR solves this by skipping already-placed bundles (i.e., bundles with an associated node_id) when computing resource demand in the autoscaler.
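For reviewers, the idea can be illustrated with a minimal sketch. This is not the code in this PR or Ray's actual data model: the `Bundle` dataclass and `unplaced_bundle_demands` helper are hypothetical names, and the sketch assumes a bundle counts as placed whenever its `node_id` is non-empty.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Bundle:
    """Hypothetical stand-in for a placement group bundle (not Ray's real type)."""

    resources: Dict[str, float]
    node_id: Optional[str] = None  # assumption: None or "" means "not placed"


def unplaced_bundle_demands(bundles: List[Bundle]) -> List[Dict[str, float]]:
    """Return resource demands only for bundles that still need a node."""
    return [dict(b.resources) for b in bundles if not b.node_id]


# A 3-bundle PG where the node hosting the third bundle went down.
pg = [
    Bundle({"CPU": 4}, node_id="node-a"),
    Bundle({"CPU": 4}, node_id="node-b"),
    Bundle({"CPU": 4}),  # lost its node, must be rescheduled
]

demands = unplaced_bundle_demands(pg)
print(demands)  # [{'CPU': 4}]
```

With this filter, only the single unplaced bundle contributes to demand, so the autoscaler would be asked for one node's worth of resources instead of three.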

Before: Every bundle gets rescheduled


After: Only one node is scaled up


Related issue number

Closes #40212

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@mimiliaogo mimiliaogo requested review from hongchaodeng and a team as code owners November 25, 2024 18:57
@jcotant1 jcotant1 added the core Issues that should be addressed in Ray Core label Nov 25, 2024
Development

Successfully merging this pull request may close these issues.

[core][autoscaler][v1] Autoscaler overprovisions nodes when strict placement group is rescheduling
2 participants