Fix direct actor transport not treating some tasks as failed #5464

Merged
9 commits merged into ray-project:master from raulchen:fix_direct_actor on Aug 20, 2019

Conversation

@raulchen (Contributor) commented Aug 16, 2019

Why are these changes needed?

CoreWorkerTest::TestActorFailure sometimes hangs in CI. There are two reasons:

  1. When sending tasks via RPC, if the connection is already broken, client.PushTask returns an error status directly instead of triggering the callback, so we need to treat the tasks as failed in this case (see the sketch after this list).
  2. The GCS client only returns the latest state of an actor, so when receiving a DEAD notification we need to treat the pending tasks as failed as well.
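
The first fix is sketched below with simplified stand-in types (Status, the RPC call, and TreatTaskAsFailed are stubbed out here; this is not the actual core-worker code):

```cpp
// Sketch of fix 1: if the RPC send fails synchronously (for example, the
// connection is already broken), the reply callback never runs, so the
// task must be marked failed right away. Stand-in types, not Ray classes.
#include <functional>
#include <iostream>
#include <string>
#include <utility>

struct Status {
  bool ok_;
  std::string msg;
  bool ok() const { return ok_; }
  static Status OK() { return {true, ""}; }
  static Status IOError(std::string m) { return {false, std::move(m)}; }
};

enum class ErrorType { ACTOR_DIED };

void TreatTaskAsFailed(int task_id, int num_returns, ErrorType /*type*/) {
  // The real implementation stores an error object for each of the task's
  // return IDs, so the caller sees an exception instead of hanging.
  std::cout << "task " << task_id << " failed; " << num_returns
            << " return object(s) marked ACTOR_DIED\n";
}

Status PushTask(const std::function<Status()> &send_rpc, int task_id,
                int num_returns) {
  Status status = send_rpc();
  if (!status.ok()) {
    // The connection may already be broken, so the reply callback will
    // never run; fail the task here instead of waiting forever.
    // TODO: report a more precise error than "actor died" (the failure
    // could also be a transient network problem).
    TreatTaskAsFailed(task_id, num_returns, ErrorType::ACTOR_DIED);
  }
  // From the submitter's point of view the call itself succeeded; the
  // task-level failure is surfaced through the task's return objects.
  return Status::OK();
}

int main() {
  PushTask([] { return Status::IOError("connection broken"); }, 1, 1);
  PushTask([] { return Status::OK(); }, 2, 1);
}
```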

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@raulchen requested a review from zhijunfu on August 16, 2019 16:32
@@ -153,7 +154,10 @@ Status CoreWorkerDirectActorTaskSubmitter::PushTask(rpc::DirectActorClient &clie
store_provider_->Put(RayObject(data_buffer, metadata_buffer), object_id));
}
});
return status;
if (!status.ok()) {

Collaborator:

In this scenario where the status is not ok, does that always mean that the actor has died? Could it mean that the actor is overloaded and some buffer for sending messages is full or something like that?

raulchen (author):

If the buffer is full, the request should be blocked. But I guess it's possible that the network is temporarily disconnected. However, no matter which case it is, we should treat the task as failed and let the app decide what to do (retry, ignore, or error). I'll add a TODO here about making the error message more accurate, instead of just "actor died". Does that sound good to you?

Collaborator:

Ok sounds good.

@@ -117,6 +117,7 @@ void CoreWorkerDirectActorTaskSubmitter::ConnectAndSendPendingTasks(
auto status =
PushTask(*client, request, TaskID::FromBinary(request.task_spec().task_id()),
request.task_spec().num_returns());
RAY_CHECK_OK(status);

Collaborator:

Does this mean that the first time a driver tries to connect to a direct call actor, if the actor is already dead, then the whole driver will fail? That doesn't seem like the right behavior.

raulchen (author):

No. PushTask now always returns OK. If the actor is dead, we treat the task as failed and use an app-level exception to inform callers.
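
For illustration, a minimal sketch of that failure path with toy stand-in types (not the real Ray object store API): TreatTaskAsFailed marks each of the task's return objects with an ACTOR_DIED error, and the caller sees the error when it fetches a result.

```cpp
// Sketch of the "app-level exception" path: instead of crashing the worker
// or hanging, each return object of the failed task gets an ACTOR_DIED
// marker; fetching the result then raises an error in the application.
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class ErrorType { ACTOR_DIED };

// Toy stand-in for the object store: it only records error markers.
struct ObjectStore {
  std::unordered_map<std::string, ErrorType> errors;

  void PutError(const std::string &object_id, ErrorType type) {
    errors.emplace(object_id, type);
  }

  // Fetching a marked object surfaces the failure to the application.
  void Get(const std::string &object_id) const {
    auto it = errors.find(object_id);
    if (it != errors.end() && it->second == ErrorType::ACTOR_DIED) {
      throw std::runtime_error("actor died; result " + object_id +
                               " is unavailable");
    }
  }
};

// One error marker per return value of the failed task.
void TreatTaskAsFailed(ObjectStore &store, const std::string &task_id,
                       int num_returns, ErrorType type) {
  for (int i = 0; i < num_returns; i++) {
    store.PutError(task_id + ":return:" + std::to_string(i), type);
  }
}

int main() {
  ObjectStore store;
  TreatTaskAsFailed(store, "task42", 1, ErrorType::ACTOR_DIED);
  try {
    store.Get("task42:return:0");
  } catch (const std::exception &e) {
    std::cout << "caught: " << e.what() << "\n";
  }
}
```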

@robertnishihara (Collaborator):

@raulchen, do we have tests that test connecting to and pushing tasks to dead actors?

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16346/

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16347/

raulchen (author):

@raulchen, do we have tests that test connecting to and pushing tasks to dead actors?

Yeah, CoreWorkerTest::TestActorFailure is such a case. This test hangs occasionally in CI. I'm running that test multiple times to verify that this PR fixes it.

@zhijunfu (Contributor) left a comment:

Thanks. Overall looks good to me. Left a few comments.

// TODO(hchen): Should we propagate this error out of `ObjectInterface::put`?
RAY_LOG(WARNING) << "Trying to put an object that already existed in plasma: "
<< object_id << ".";
return Status::OK();

zhijunfu (Contributor):

what's the reason to return OK() in this case?

raulchen (author):

Because this could happen in normal cases. For example: 1) when reconstructing a task, an existing object could be put again; 2) when treating a task as failed, the task may actually have succeeded without our knowing, so we'd put a duplicate object.
We already use this behavior in the Python/Java workers and the raylet.

raulchen (author):

I found and fixed another issue; see the updated PR message.
Now the test passes consistently, see https://travis-ci.com/raulchen/ray/jobs/226214301.

Do you have other comments?

@AmplabJenkins: Test FAILed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16386/

// already exists.
RAY_LOG(WARNING) << "Task " << task_spec.TaskId() << " failed to put object " << id
<< " in store: " << status.message();
RAY_LOG(FATAL) << "Task " << task_spec.TaskId() << " failed to put object " << id

zhijunfu (Contributor):

It would be good to add a comment here noting that we use RAY_LOG(FATAL) for put errors, except when the object already exists.

raulchen (author):

Good point.
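
For context, a sketch of the policy being requested here, with the comment spelled out. The status predicate and helper names are hypothetical stand-ins, not the exact Ray code:

```cpp
// Sketch of the put-error policy: when putting a failed task's return
// objects, "object already exists" is expected (the task may actually have
// succeeded) and is only logged; any other put error is treated as fatal.
#include <cstdlib>
#include <iostream>
#include <string>

// Stand-in status type; the predicate name IsObjectExists is hypothetical.
struct Status {
  enum Code { kOK, kObjectExists, kIOError };
  Code code;
  bool ok() const { return code == kOK; }
  bool IsObjectExists() const { return code == kObjectExists; }
  std::string message() const { return code == kIOError ? "io error" : ""; }
};

// Called after putting one of a failed task's return objects.
void HandlePutStatus(const std::string &task_id, const std::string &object_id,
                     const Status &status) {
  if (status.ok()) {
    return;
  }
  if (status.IsObjectExists()) {
    // Expected case: the object may already be in the store, e.g. because
    // the task actually finished before we decided to treat it as failed.
    std::cerr << "[WARNING] Task " << task_id << ": object " << object_id
              << " already exists in the store\n";
    return;
  }
  // We use a fatal log for put errors, except when the object already
  // exists: any other failure means the store itself is broken.
  std::cerr << "[FATAL] Task " << task_id << " failed to put object "
            << object_id << " in store: " << status.message() << "\n";
  std::abort();
}

int main() {
  HandlePutStatus("task42", "obj0", Status{Status::kOK});
  HandlePutStatus("task42", "obj1", Status{Status::kObjectExists});
}
```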

@zhijunfu (Contributor) left a comment:

Thanks for fixing this! LGTM. Just a few nits.

@edoakes (Contributor) left a comment:

LGTM, a few small comments.

@@ -92,6 +92,19 @@ Status CoreWorkerDirectActorTaskSubmitter::SubscribeActorUpdates() {
} else {
// Remove rpc client if it's dead or being reconstructed.
rpc_clients_.erase(actor_id);
// If this actor is permanantly dead and there're pending requests, treat

edoakes (Contributor):

Suggested change
// If this actor is permanantly dead and there're pending requests, treat
// If this actor is permanently dead and there are pending requests, treat

// the pending tasks as failed.
if (actor_data.state() == ActorTableData::DEAD &&
pending_requests_.count(actor_id) > 0) {
auto &requests = pending_requests_[actor_id];

edoakes (Contributor):

Would prefer a range-based for loop followed by a call to pending_requests_.clear(); that's more concise. Also, you don't need the pending_requests_.count(actor_id) check.

raulchen (author):

Without the pending_requests_.count(actor_id) check, pending_requests_[actor_id] would construct an empty list for actors that have no pending requests.
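
A small self-contained sketch of the trade-off under discussion, using find() so that no empty entry is created for an actor without pending requests (stand-in types, not the actual submitter code):

```cpp
// Sketch of fix 2: on a DEAD notification, fail every task still queued for
// that actor. find() avoids the operator[] problem raised above.
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

enum class ErrorType { ACTOR_DIED };

struct PendingRequest {
  int task_id;
  int num_returns;
};

void TreatTaskAsFailed(int task_id, int num_returns, ErrorType /*type*/) {
  std::cout << "task " << task_id << " (" << num_returns
            << " return object(s)) marked ACTOR_DIED\n";
}

using PendingMap = std::unordered_map<std::string, std::list<PendingRequest>>;

// Called when a DEAD notification arrives for actor_id.
void FailPendingTasks(PendingMap &pending_requests,
                      const std::string &actor_id) {
  // find() instead of operator[]: indexing would insert an empty list for
  // an actor that has no pending requests at all.
  auto it = pending_requests.find(actor_id);
  if (it == pending_requests.end()) {
    return;
  }
  // Range-based loop over the queued requests, then drop the whole entry.
  for (const auto &request : it->second) {
    TreatTaskAsFailed(request.task_id, request.num_returns,
                      ErrorType::ACTOR_DIED);
  }
  pending_requests.erase(it);
}

int main() {
  PendingMap pending;
  pending["actorA"] = {{1, 1}, {2, 2}};
  FailPendingTasks(pending, "actorA");
  FailPendingTasks(pending, "actorB");  // no-op; no empty entry is created
}
```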

if (!status.ok()) {
TreatTaskAsFailed(task_id, num_returns, rpc::ErrorType::ACTOR_DIED);
}
return Status::OK();

edoakes (Contributor):

Should we be returning OK() here? If so, make sure it's clearly documented - not the behavior I would expect without context.

raulchen (author):

I'll change this to return void and update the comment.

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16409/

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16416/

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16418/

@pcmoritz pcmoritz merged commit f2b3c27 into ray-project:master Aug 20, 2019
@raulchen raulchen deleted the fix_direct_actor branch August 21, 2019 02:54