[gcs] Fix actor killing hang due to race condition #17634

Merged
merged 4 commits into from
Aug 10, 2021


@fishbone fishbone commented Aug 6, 2021

Why are these changes needed?

Creating an actor and killing it immediately can trigger a race condition when events happen in this order:

  • Lease
  • Canceling
  • Lease reply

This leaks the worker. This PR fixes it.
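
As a rough illustration of the first race (this is not the code from this PR), here is a minimal C++ sketch. `ToyScheduler` and every member on it are hypothetical names invented for the example. The only point is that a lease reply arriving after the cancel should hand the worker back instead of silently keeping it.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, heavily simplified stand-in for a scheduler that leases
// workers for actors. None of these names come from the Ray code base.
class ToyScheduler {
 public:
  // Step 1: a worker lease is requested for the actor.
  void RequestLease(const std::string &actor_id) {
    pending_leases_.insert(actor_id);
  }

  // Step 2: the actor is killed while the lease request is still in flight.
  void Cancel(const std::string &actor_id) { pending_leases_.erase(actor_id); }

  // Step 3: the lease reply arrives carrying a granted worker.
  void OnLeaseReply(const std::string &actor_id, const std::string &worker_id) {
    assert(!worker_id.empty());
    if (pending_leases_.erase(actor_id) == 0) {
      // The actor was cancelled in the meantime. Without this branch the
      // granted worker would have no owner and would leak.
      returned_workers_.push_back(worker_id);
      return;
    }
    leased_workers_.push_back(worker_id);
  }

  const std::vector<std::string> &leased_workers() const {
    return leased_workers_;
  }
  const std::vector<std::string> &returned_workers() const {
    return returned_workers_;
  }

 private:
  std::unordered_set<std::string> pending_leases_;
  std::vector<std::string> leased_workers_;
  std::vector<std::string> returned_workers_;
};
```

Calling `RequestLease("a1")`, `Cancel("a1")`, then `OnLeaseReply("a1", "w1")` on this toy class leaves `w1` in the returned list rather than the leased list.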

And there is another race condition:

  • Task creating
  • Canceling
  • Task creating reply received

This leaks the worker too.
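
The second race can be sketched the same way. This is an assumption-laden toy, not the PR's implementation: `ToyActorCreator`, `kNilActorId`, and the recorded kill requests are all hypothetical. It only shows one plausible handling, where a creation reply for an actor that is no longer being tracked results in a kill request to the worker.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical placeholder for "no specific actor".
static const std::string kNilActorId = "";

// Hypothetical, simplified model of tracking actors whose creation task
// has been sent to a worker but has not been acknowledged yet.
class ToyActorCreator {
 public:
  // Step 1: the creation task is pushed to a worker.
  void StartCreating(const std::string &actor_id) { creating_.insert(actor_id); }

  // Step 2: the actor is killed while creation is still in flight.
  void Cancel(const std::string &actor_id) { creating_.erase(actor_id); }

  // Step 3: the creation reply is received from the worker.
  void OnCreationReply(const std::string &actor_id,
                       const std::string &worker_address, bool ok) {
    assert(!worker_address.empty());
    if (creating_.erase(actor_id) > 0) {
      // Normal path: the actor is still wanted.
      if (ok) alive_.insert(actor_id);
      return;
    }
    // The actor was cancelled before the reply arrived. Record a kill
    // request so the worker does not keep running an unwanted actor.
    kill_requests_.emplace_back(worker_address, ok ? actor_id : kNilActorId);
  }

  bool IsAlive(const std::string &actor_id) const {
    return alive_.count(actor_id) > 0;
  }
  // Each entry is (worker address, actor id or kNilActorId).
  const std::vector<std::pair<std::string, std::string>> &kill_requests() const {
    return kill_requests_;
  }

 private:
  std::unordered_set<std::string> creating_;
  std::unordered_set<std::string> alive_;
  std::vector<std::pair<std::string, std::string>> kill_requests_;
};
```

In this toy model, `StartCreating`, `Cancel`, then `OnCreationReply` for the same actor produces exactly one kill request and never marks the actor alive.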

Related issue number

Closes #17369

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@fishbone fishbone added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Aug 8, 2021

@rkooo567 rkooo567 left a comment

@iycheng can you re-run the originally failing tests 3 times and lmk if they all pass?

<< " has been removed from creating map. Actor status "
<< actor->GetState();
auto actor_id = status.ok() ? actor->GetActorID() : ActorID::Nil();
KillActorOnWorker(worker->GetAddress(), actor_id);
Contributor

Was this another race condition?

Contributor Author

Yes

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 8, 2021

@ffbin ffbin left a comment

LGTM, just some small comments.

fishbone commented Aug 9, 2021

@rkooo567 I verified it.

@fishbone fishbone removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. tests-ok The tagger certifies test failures are unrelated and assumes personal liability. labels Aug 9, 2021
rkooo567 commented Aug 9, 2021

Retry one rllib test. Btw, is it possible to add a unit test to gcs_actor_manager_test? I'd prefer to add that in this PR, but doing it in a follow-up is also fine with me.

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 9, 2021
fishbone commented Aug 9, 2021

Retry one rllib test. Btw, is it possible to add a unit test to gcs_actor_manager_test? I'd prefer to add that in this PR, but doing it in a follow-up is also fine with me.

Let me investigate it next week. I'm feeling a little tired of this PR. :( I've put it on my schedule.

@fishbone fishbone added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Aug 9, 2021
@rkooo567

@iycheng sgtm!

@rkooo567

Can you add it to the sprint task?

@rkooo567 rkooo567 merged commit 473740b into ray-project:master Aug 10, 2021
@fishbone

Can you add it to the sprint task?

I did add that.

@fishbone fishbone deleted the actor-kill2 branch August 17, 2021 07:39
ericl pushed a commit that referenced this pull request Sep 22, 2021
fishbone added a commit that referenced this pull request Sep 22, 2021
ericl pushed a commit that referenced this pull request Sep 22, 2021
fishbone added a commit to fishbone/ray that referenced this pull request Sep 24, 2021
rkooo567 pushed a commit that referenced this pull request Sep 28, 2021
…" (#18871)

* Revert "Revert "[test] add unit test for PR #17634 (#18585)" (#18830)"

This reverts commit 8dd3057.

* up
Development

Successfully merging this pull request may close these issues.

[Core] Actor hangs after other actors are killed
5 participants