Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GCS] Make Gcs-based actor scheduler's bookkeeping consistent #18546

Merged
merged 15 commits into from
Sep 29, 2021

Conversation

Chong-Li
Copy link
Contributor

@Chong-Li Chong-Li commented Sep 13, 2021

Why are these changes needed?

In the current version of gcs-based actor scheduler, there is an inconsistency: the destroyed actor forgets
to return resources back to gcs_resource_manager. See #16580 (comment) for details @iycheng . This PR fixes this error.

Note, this is just a temporary workaround. When we take the multi-threaded Java workers into account (in the next PR), the whole GcsActorWorkerAssignment handling will be changed.

Related issue number

#16580

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Chong-Li Chong-Li changed the title [GCS] Make Gcs-based scheduler's bookkeeping consistent [GCS] Make Gcs-based actor scheduler's bookkeeping consistent Sep 13, 2021
@fishbone fishbone self-assigned this Sep 14, 2021
@fishbone
Copy link
Contributor

@Chong-Li could you add one unit test to make sure lease - release is working correctly?

@fishbone fishbone added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 14, 2021
@fishbone
Copy link
Contributor

Btw, if we don't have resources temporarily, will it be rescheduled later when there is enough resources?

@Chong-Li
Copy link
Contributor Author

Btw, if we don't have resources temporarily, will it be rescheduled later when there is enough resources?

For the current version, SchedulePendingActors() only reacts to NodeAdd or ReportResourceUsage. I was planning to
make it more complete in the next PR with job distribution bookkeeping enabled. But to make the current version self-contained, I just (temporarily) updated the release_resources lambda function. It may be not efficient because of over reaction.

@rkooo567 rkooo567 self-assigned this Sep 24, 2021
@fishbone
Copy link
Contributor

fishbone commented Sep 29, 2021

LG! Thanks for the fixing! I'm OK with this one. But before merging, could you rebase to trunk? Recently we got some implicit merge conflict which break the trunk and it just got fixed.

@fishbone fishbone removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 29, 2021
@rkooo567
Copy link
Contributor

tests:test_client_reconnect seems to be flaky in the master

@rkooo567 rkooo567 merged commit 42744f2 into ray-project:master Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants