Fix TestDirectActorTaskCrossNodesFailure test #5406
… fix transport type in return object id
Test PASSed. |
Test PASSed. |
Thanks for looking into this. How do … Also, I thought that direct actor calls were meant to use the in-memory store, not the plasma store, right? If that's the case, maybe we should just make that change and fix the dispatching in another PR.
Thanks, this looks like the right fix!
However, the code structure is confusing at the moment.
First, there is a bunch of code copied and pasted twice in CoreWorkerObjectInterface::Get (once for PLASMA and once for LOCAL_PLASMA). We should get rid of this code duplication.
Second, we should take this opportunity to make the object_ids coming into CoreWorkerObjectInterface::Get unique (i.e. remove duplicates) immediately after they come in, and put them into two buckets, one for direct and one for non-direct calls. Then inside store_providers_[StoreProviderType::PLASMA]->Get etc. we can assume uniqueness, which will simplify the logic in there; see some other PRs.
A plus if you can do these changes in a way that will make it easy to add more cases in the future (not that we should... ...but it will probably make the code more readable).
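A minimal sketch of the suggested dedup-and-bucket step, under stated assumptions: FakeObjectId, its is_direct_call_return flag, Buckets, and DedupAndBucket are hypothetical names made up for illustration and are not the types or functions used in this PR.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an object ID; the real type is richer.
struct FakeObjectId {
  std::string value;
  bool is_direct_call_return;  // assumed flag, for illustration only
};

// The two buckets the review comment asks for.
struct Buckets {
  std::vector<FakeObjectId> direct;
  std::vector<FakeObjectId> non_direct;
};

// Removes duplicate IDs (keeping the first occurrence, in input order)
// and splits the unique IDs into direct and non-direct buckets.
Buckets DedupAndBucket(const std::vector<FakeObjectId> &ids) {
  std::unordered_set<std::string> seen;
  Buckets out;
  for (const auto &id : ids) {
    // insert() reports false in .second when the value was already present.
    if (!seen.insert(id.value).second) continue;
    (id.is_direct_call_return ? out.direct : out.non_direct).push_back(id);
  }
  return out;
}
```

With the IDs made unique up front, each provider's Get could then assume it never sees the same ID twice, which is the simplification the comment is after.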
Thanks for the comments. There are usually 3 kinds of objects:
For direct actor calls, by-reference arguments are not supported, and return objects are retrieved via …
Yes, it's true that direct calls will use the in-memory store for return objects, but that requires refactoring existing code to handle the different cases, and it touches a lot of files. I've been working on it but haven't finished yet, so I would prefer to do it in a separate PR when it's ready.
Thanks, will fix that.
Yes, will do the dedup as suggested. Probably the simplifications in the providers can be done in a later PR.
Yes, there are some changes needed to support other store providers (e.g. in-memory). I've been working on a PR to refactor the existing code to do this in a generic way, and will send it out when it is finished.
Thanks, that sounds great!
Looks good now, thanks for doing it! Maybe one more change: the <algorithm> header should move up, after the object_interface.h include and before the other includes, with newlines between them. See https://google.github.io/styleguide/cppguide.html#Names_and_Order_of_Includes.
There is also going to be a Travis failure because of the signed/unsigned comparison in the for loop; the int should be changed to size_t (warnings are fatal since #5375).
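To illustrate the signed/unsigned point: a loop written as `for (int i = 0; i < values.size(); i++)` compares a signed int against the unsigned result of size(), which GCC and Clang typically flag under -Wsign-compare and which becomes a build failure when warnings are treated as errors. The sketch below uses a hypothetical helper, CountMatches, rather than the loop from this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts entries equal to `needle`. The index is std::size_t, matching
// the unsigned type returned by std::vector::size(), so the loop
// condition compares two unsigned values.
std::size_t CountMatches(const std::vector<int> &values, int needle) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < values.size(); i++) {
    if (values[i] == needle) count++;
  }
  return count;
}
```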
Thanks, updated.
Test PASSed. |
Test PASSed. |
Test PASSed. |
What do these changes do?
This fixes a commonly occurring test failure, see https://travis-ci.com/ray-project/ray/jobs/223046433
The failing test is TwoNodeTest.TestDirectActorTaskCrossNodesFailure.
The problem is that the core worker currently uses the PLASMA store provider for all objects, which implements the Get operation via a while loop of FetchOrReconstruct and plasma Get. In this test, after the actor dies, the raylet tries to reconstruct the needed object when the task lease expires, but it cannot find the task either locally or in the GCS: because this is a direct actor call, the task is known to neither the raylet nor the GCS.
The fix is to choose the appropriate store provider in ObjectInterface. When an object is a return object from a direct actor call, the LOCAL_PLASMA provider is used instead of the PLASMA provider, so FetchOrReconstruct is not called.
Related issue number
Linter
scripts/format.sh to lint the changes in this PR.
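The provider selection described under "What do these changes do?" could be sketched roughly as follows. This is an illustration under assumptions, not the code in this PR: the enum only mirrors the two provider names mentioned above, and ObjectInfo and ChooseProvider are hypothetical.

```cpp
#include <cassert>

// Simplified stand-in that mirrors the two provider names from the PR.
enum class StoreProviderType { PLASMA, LOCAL_PLASMA };

// Hypothetical description of an object, for illustration only.
struct ObjectInfo {
  bool is_return_object;
  bool from_direct_actor_call;
};

// Return objects of direct actor calls go to LOCAL_PLASMA, whose Get
// does not call FetchOrReconstruct; everything else stays on PLASMA.
StoreProviderType ChooseProvider(const ObjectInfo &info) {
  if (info.is_return_object && info.from_direct_actor_call) {
    return StoreProviderType::LOCAL_PLASMA;
  }
  return StoreProviderType::PLASMA;
}
```

Keeping the decision in one small function is one way to leave room for further providers (for example an in-memory store) later on.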