Fix direct actor transport not treating some tasks as failed #5464

Merged
9 commits merged into ray-project:master from raulchen:fix_direct_actor on Aug 20, 2019

Conversation

@raulchen (Contributor) commented Aug 16, 2019

Why are these changes needed?

CoreWorkerTest::TestActorFailure sometimes hangs in CI. There are two reasons:

  1. When sending tasks via RPC, if the connection is already broken, client.PushTask returns an error status directly instead of triggering the callback, so we need to treat the tasks as failed in this case (see the sketch after this list).
  2. The GCS client only returns the latest state of an actor, so when receiving a DEAD notification we need to treat the pending tasks as failed as well.
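
The first fix is sketched below with simplified stand-in types (Status, the RPC call, and TreatTaskAsFailed are stubbed out here; this is not the actual core-worker code):

```cpp
// Sketch of fix 1: if the RPC send fails synchronously (for example, the
// connection is already broken), the reply callback never runs, so the
// task must be marked failed right away. Stand-in types, not Ray classes.
#include <functional>
#include <iostream>
#include <string>
#include <utility>

struct Status {
  bool ok_;
  std::string msg;
  bool ok() const { return ok_; }
  static Status OK() { return {true, ""}; }
  static Status IOError(std::string m) { return {false, std::move(m)}; }
};

enum class ErrorType { ACTOR_DIED };

void TreatTaskAsFailed(int task_id, int num_returns, ErrorType /*type*/) {
  // The real implementation stores an error object for each of the task's
  // return IDs, so the caller sees an exception instead of hanging.
  std::cout << "task " << task_id << " failed; " << num_returns
            << " return object(s) marked ACTOR_DIED\n";
}

Status PushTask(const std::function<Status()> &send_rpc, int task_id,
                int num_returns) {
  Status status = send_rpc();
  if (!status.ok()) {
    // The connection may already be broken, so the reply callback will
    // never run; fail the task here instead of waiting forever.
    // TODO: report a more precise error than "actor died" (the failure
    // could also be a transient network problem).
    TreatTaskAsFailed(task_id, num_returns, ErrorType::ACTOR_DIED);
  }
  // From the submitter's point of view the call itself succeeded; the
  // task-level failure is surfaced through the task's return objects.
  return Status::OK();
}

int main() {
  PushTask([] { return Status::IOError("connection broken"); }, 1, 1);
  PushTask([] { return Status::OK(); }, 2, 1);
}
```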

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@raulchen requested a review from zhijunfu on August 16, 2019 16:32
@@ -153,7 +154,10 @@ Status CoreWorkerDirectActorTaskSubmitter::PushTask(rpc::DirectActorClient &clie
store_provider_->Put(RayObject(data_buffer, metadata_buffer), object_id));
}
});
return status;
if (!status.ok()) {

Collaborator:

In this scenario where the status is not ok, does that always mean that the actor has died? Could it mean that the actor is overloaded and some buffer for sending messages is full or something like that?

raulchen (author):

If the buffer is full, the request should be blocked. But I guess it's possible that the network is temporarily disconnected. However, no matter which case it is, we should treat the task as failed and let the app decide what to do (retry, ignore, or error). I'll add a TODO here about making the error message more accurate, instead of just "actor died". Does that sound good to you?

Collaborator:

Ok sounds good.

@@ -117,6 +117,7 @@ void CoreWorkerDirectActorTaskSubmitter::ConnectAndSendPendingTasks(
auto status =
PushTask(*client, request, TaskID::FromBinary(request.task_spec().task_id()),
request.task_spec().num_returns());
RAY_CHECK_OK(status);

Collaborator:

Does this mean that the first time a driver tries to connect to a direct call actor, if the actor is already dead, then the whole driver will fail? That doesn't seem like the right behavior.

raulchen (author):

No. PushTask now always returns OK. If the actor is dead, we treat the task as failed and use an app-level exception to inform callers.
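
For illustration, a minimal sketch of that failure path with toy stand-in types (not the real Ray object store API): TreatTaskAsFailed marks each of the task's return objects with an ACTOR_DIED error, and the caller sees the error when it fetches a result.

```cpp
// Sketch of the "app-level exception" path: instead of crashing the worker
// or hanging, each return object of the failed task gets an ACTOR_DIED
// marker; fetching the result then raises an error in the application.
#include <iostream>
#include <stdexcept>
#include <string>
#include <unordered_map>

enum class ErrorType { ACTOR_DIED };

// Toy stand-in for the object store: it only records error markers.
struct ObjectStore {
  std::unordered_map<std::string, ErrorType> errors;

  void PutError(const std::string &object_id, ErrorType type) {
    errors.emplace(object_id, type);
  }

  // Fetching a marked object surfaces the failure to the application.
  void Get(const std::string &object_id) const {
    auto it = errors.find(object_id);
    if (it != errors.end() && it->second == ErrorType::ACTOR_DIED) {
      throw std::runtime_error("actor died; result " + object_id +
                               " is unavailable");
    }
  }
};

// One error marker per return value of the failed task.
void TreatTaskAsFailed(ObjectStore &store, const std::string &task_id,
                       int num_returns, ErrorType type) {
  for (int i = 0; i < num_returns; i++) {
    store.PutError(task_id + ":return:" + std::to_string(i), type);
  }
}

int main() {
  ObjectStore store;
  TreatTaskAsFailed(store, "task42", 1, ErrorType::ACTOR_DIED);
  try {
    store.Get("task42:return:0");
  } catch (const std::exception &e) {
    std::cout << "caught: " << e.what() << "\n";
  }
}
```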

@robertnishihara (Collaborator):

@raulchen, do we have tests that test connecting to and pushing tasks to dead actors?

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16346/

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16347/

raulchen (author):

@raulchen, do we have tests that test connecting to and pushing tasks to dead actors?

Yeah, CoreWorkerTest::TestActorFailure is such a case. This test hangs occasionally in CI. I'm running that test multiple times to verify that this PR fixes it.

@zhijunfu (Contributor) left a comment:

Thanks. Overall looks good to me. Left a few comments.

// TODO(hchen): Should we propagate this error out of `ObjectInterface::put`?
RAY_LOG(WARNING) << "Trying to put an object that already existed in plasma: "
<< object_id << ".";
return Status::OK();

zhijunfu (Contributor):

what's the reason to return OK() in this case?

raulchen (author):

Because this could happen in normal cases. For example: 1) when reconstructing a task, an existing object could be put again; 2) when treating a task as failed, the task may actually have succeeded without our knowing, so we'd put a duplicate object.
We already use this behavior in the Python/Java workers and the raylet.

raulchen (author):

I found and fixed another issue; see the updated PR message.
Now the test passes consistently, see https://travis-ci.com/raulchen/ray/jobs/226214301.

Do you have other comments?

@AmplabJenkins: Test FAILed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16386/

// already exists.
RAY_LOG(WARNING) << "Task " << task_spec.TaskId() << " failed to put object " << id
<< " in store: " << status.message();
RAY_LOG(FATAL) << "Task " << task_spec.TaskId() << " failed to put object " << id

zhijunfu (Contributor):

It would be good to add a comment here noting that we use RAY_LOG(FATAL) for put errors, except when the object already exists.

raulchen (author):

Good point.
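
For context, a sketch of the policy being requested here, with the comment spelled out. The status predicate and helper names are hypothetical stand-ins, not the exact Ray code:

```cpp
// Sketch of the put-error policy: when putting a failed task's return
// objects, "object already exists" is expected (the task may actually have
// succeeded) and is only logged; any other put error is treated as fatal.
#include <cstdlib>
#include <iostream>
#include <string>

// Stand-in status type; the predicate name IsObjectExists is hypothetical.
struct Status {
  enum Code { kOK, kObjectExists, kIOError };
  Code code;
  bool ok() const { return code == kOK; }
  bool IsObjectExists() const { return code == kObjectExists; }
  std::string message() const { return code == kIOError ? "io error" : ""; }
};

// Called after putting one of a failed task's return objects.
void HandlePutStatus(const std::string &task_id, const std::string &object_id,
                     const Status &status) {
  if (status.ok()) {
    return;
  }
  if (status.IsObjectExists()) {
    // Expected case: the object may already be in the store, e.g. because
    // the task actually finished before we decided to treat it as failed.
    std::cerr << "[WARNING] Task " << task_id << ": object " << object_id
              << " already exists in the store\n";
    return;
  }
  // We use a fatal log for put errors, except when the object already
  // exists: any other failure means the store itself is broken.
  std::cerr << "[FATAL] Task " << task_id << " failed to put object "
            << object_id << " in store: " << status.message() << "\n";
  std::abort();
}

int main() {
  HandlePutStatus("task42", "obj0", Status{Status::kOK});
  HandlePutStatus("task42", "obj1", Status{Status::kObjectExists});
}
```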

@zhijunfu (Contributor) left a comment:

Thanks for fixing this! LGTM. Just a few nits.

@edoakes (Contributor) left a comment:

LGTM, a few small comments.

@@ -92,6 +92,19 @@ Status CoreWorkerDirectActorTaskSubmitter::SubscribeActorUpdates() {
} else {
// Remove rpc client if it's dead or being reconstructed.
rpc_clients_.erase(actor_id);
// If this actor is permanantly dead and there're pending requests, treat

edoakes (Contributor):

Suggested change
// If this actor is permanantly dead and there're pending requests, treat
// If this actor is permanently dead and there are pending requests, treat

// the pending tasks as failed.
if (actor_data.state() == ActorTableData::DEAD &&
pending_requests_.count(actor_id) > 0) {
auto &requests = pending_requests_[actor_id];

edoakes (Contributor):

Would prefer a range-based for loop followed by a call to pending_requests_.clear(); that's more concise. Also, you don't need the pending_requests_.count(actor_id) check.

raulchen (author):

Without the pending_requests_.count(actor_id) check, pending_requests_[actor_id] would construct an empty list for actors that have no pending requests.
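
A small self-contained sketch of the trade-off under discussion, using find() so that no empty entry is created for an actor without pending requests (stand-in types, not the actual submitter code):

```cpp
// Sketch of fix 2: on a DEAD notification, fail every task still queued for
// that actor. find() avoids the operator[] problem raised above.
#include <iostream>
#include <list>
#include <string>
#include <unordered_map>

enum class ErrorType { ACTOR_DIED };

struct PendingRequest {
  int task_id;
  int num_returns;
};

void TreatTaskAsFailed(int task_id, int num_returns, ErrorType /*type*/) {
  std::cout << "task " << task_id << " (" << num_returns
            << " return object(s)) marked ACTOR_DIED\n";
}

using PendingMap = std::unordered_map<std::string, std::list<PendingRequest>>;

// Called when a DEAD notification arrives for actor_id.
void FailPendingTasks(PendingMap &pending_requests,
                      const std::string &actor_id) {
  // find() instead of operator[]: indexing would insert an empty list for
  // an actor that has no pending requests at all.
  auto it = pending_requests.find(actor_id);
  if (it == pending_requests.end()) {
    return;
  }
  // Range-based loop over the queued requests, then drop the whole entry.
  for (const auto &request : it->second) {
    TreatTaskAsFailed(request.task_id, request.num_returns,
                      ErrorType::ACTOR_DIED);
  }
  pending_requests.erase(it);
}

int main() {
  PendingMap pending;
  pending["actorA"] = {{1, 1}, {2, 2}};
  FailPendingTasks(pending, "actorA");
  FailPendingTasks(pending, "actorB");  // no-op; no empty entry is created
}
```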

if (!status.ok()) {
TreatTaskAsFailed(task_id, num_returns, rpc::ErrorType::ACTOR_DIED);
}
return Status::OK();

edoakes (Contributor):

Should we be returning OK() here? If so, make sure it's clearly documented - not the behavior I would expect without context.

raulchen (author):

I'll change this to return void and update the comment.

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16409/

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16416/

@AmplabJenkins: Test PASSed.
Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16418/

@pcmoritz pcmoritz merged commit f2b3c27 into ray-project:master Aug 20, 2019
@raulchen raulchen deleted the fix_direct_actor branch August 21, 2019 02:54