Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix TestDirectActorTaskCrossNodesFailure test #5406

Merged
merged 10 commits into from
Aug 11, 2019

Conversation

zhijunfu
Copy link
Contributor

@zhijunfu zhijunfu commented Aug 8, 2019

What do these changes do?

This fixes a commonly occuring test failure, see https://travis-ci.com/ray-project/ray/jobs/223046433

The test that is failing is TwoNodeTest.TestDirectActorTaskCrossNodesFailure

The problem is currently core worker uses PLASMA store provider for all the objects, which implements Get operation via a while loop of FetchOrReconstruct and plasma Get, in this test after actor dies, raylet tries to reconstruct the needed object when task lease expires, but it isn't able to find the task both locally and from gcs because this is a direct actor call, thus task is not known to raylet nor gcs.

The fix is to choose the appropriate store provider in ObjectInterface, in this case when an object is a return object from a direct actor call, use LOCAL_PLASMA provider instead of PLASMA provider, in that case FetchOrReconstruct is not called.

Related issue number

Linter

  • [Y] I've run scripts/format.sh to lint the changes in this PR.

@zhijunfu zhijunfu mentioned this pull request Aug 8, 2019
1 task
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16135/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16136/
Test PASSed.

@edoakes
Copy link
Contributor

edoakes commented Aug 8, 2019

Thanks for looking into this.

How do Get() calls work on any objects created by direct actor calls, then? Or is this just completely unsupported in the current code?

Also, I thought that the direct actor calls were meant to be using the in-memory store, not the plasma store, right? If that's the case, maybe we should just make that change and fix the dispatching in another PR.

Copy link
Contributor

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this looks like the right fix!

However, the code structure is confusing at the moment.

First, there is a bunch of code copy and pasted twice in CoreWorkerObjectInterface::Get (once for PLASMA and once for LOCAL_PLASMA). We should get rid of this code duplication.

Second, we should take this opportunity and make it so that object_ids coming into CoreWorkerObjectInterface::Get are made unique (i.e. remove duplicates) immediately after they come in (and put into two buckets, one for direct and one for non-direct calls), and then inside store_providers_[StoreProviderType::PLASMA]->Get etc. we can assume uniqueness, this will simplify the logic in there, see some other PRs.

A plus if you can do these changes in a way that will make it easy to add more cases in the future (not that we should... ...but it will probably make the code more readable).

@zhijunfu
Copy link
Contributor Author

zhijunfu commented Aug 9, 2019

Thanks for the comments.

There are usually 3 kinds of objects:

  • task arguments that are passed by reference
  • return objects from tasks
  • objects that are Put manually by user

For direct actor calls, by reference arguments are not supported, return objects are retrieved via LOCAL_PLASMA provider, and objects Put by user will go through normal PLAMSA provider.

Yes it's true that direct call will use in-memory store for return objects, but that requires a refactor of existing code to deal with different cases, and that touches a lot of files, I've been doing that but haven't finished yet, would prefer to do it in a separate PR when it's ready.

Thanks for looking into this.

How do Get() calls work on any objects created by direct actor calls, then? Or is this just completely unsupported in the current code?

Also, I thought that the direct actor calls were meant to be using the in-memory store, not the plasma store, right? If that's the case, maybe we should just make that change and fix the dispatching in another PR.

@zhijunfu
Copy link
Contributor Author

zhijunfu commented Aug 9, 2019

Thanks, this looks like the right fix!

However, the code structure is confusing at the moment.

First, there is a bunch of code copy and pasted twice in CoreWorkerObjectInterface::Get (once for PLASMA and once for LOCAL_PLASMA). We should get rid of this code duplication.

Thanks, will fix that.

Second, we should take this opportunity and make it so that object_ids coming into CoreWorkerObjectInterface::Get are made unique (i.e. remove duplicates) immediately after they come in (and put into two buckets, one for direct and one for non-direct calls), and then inside store_providers_[StoreProviderType::PLASMA]->Get etc. we can assume uniqueness, this will simplify the logic in there, see some other PRs.

Yes, will do the dedup as suggested. Probably the simplifications in the providers can be done in a later PR.

A plus if you can do these changes in a way that will make it easy to add more cases in the future (not that we should... ...but it will probably make the code more readable).

Yes, there's some changes needed to support other store providers (e.g. in-memory), I've been working on a PR to refactor existing code to do this in a generic way, will send out when that one is finished.

@pcmoritz
Copy link
Contributor

pcmoritz commented Aug 9, 2019

Thanks, that sounds great!

@zhijunfu
Copy link
Contributor Author

zhijunfu commented Aug 9, 2019

comments are addressed, thanks @pcmoritz @edoakes

Copy link
Contributor

@pcmoritz pcmoritz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now, thanks for doing it! Maybe one more change, the <algorithm> header should move up after the object_interface.h include and before the other includes, with newlines between them. See https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes.

@pcmoritz
Copy link
Contributor

pcmoritz commented Aug 9, 2019

There is also going to be a travis failure because of the signed/unsigned comparison in the for loop, the int should be changed to size_t (warnings are fatal since #5375).

@zhijunfu
Copy link
Contributor Author

zhijunfu commented Aug 9, 2019

thanks,updated

Looks good now, thanks for doing it! Maybe one more change, the <algorithm> header should move up after the object_interface.h include and before the other includes, with newlines between them. See https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16168/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16170/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16205/
Test PASSed.

@raulchen raulchen merged commit b1e010f into ray-project:master Aug 11, 2019
@raulchen raulchen deleted the fix-unconstrutable-error branch August 11, 2019 03:15
@zhijunfu
Copy link
Contributor Author

Yes, there's some changes needed to support other store providers (e.g. in-memory), I've been working on a PR to refactor existing code to do this in a generic way, will send out when that one is finished.

@pcmoritz
The mentioned PR is ready for review:)
#5443

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants