
Throw exception for ray.get of an evicted actor object #3490

Merged 12 commits into ray-project:master from stephanie-wang:fix-actor-eviction on Dec 14, 2018

Conversation

stephanie-wang
Contributor

What do these changes do?

Since actors have state, objects created by an earlier actor method that have been evicted cannot be reconstructed without rolling back the actor. This PR treats such tasks as failed so that the frontend can catch the error, instead of hanging.

Related issue number

#3452 and potentially others that involve actors and a limited amount of object store memory.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9831/

if (task.GetTaskSpecification().IsActorTask()) {
  // Actor reconstruction is turned off by default right now.
  const ActorID actor_id = task.GetTaskSpecification().ActorId();
  auto it = actor_registry_.find(actor_id);
  RAY_CHECK(it != actor_registry_.end());
  if (it->second.IsAlive()) {
    // Only treat the task as failed if its output has been evicted.
    // Otherwise, this must be a spurious reconstruction.
    if (return_values_lost) {
Contributor

What would happen if we unconditionally treated the task as failed? It seems like it would simplify the code a lot to not have to track whether the return values were lost.

Contributor Author

The code used to do what you're saying, but it was causing spurious RayGetErrors that we fixed in #3359.

One option that I was thinking we could do is to check whether the task has already been executed using the task counters for the actor, then treat the task as failed if that check passes. To be safe, we could also check whether the return values exist on any nodes, but we wouldn't need the extra logic in this PR to check whether the return values were evicted. What do you think?

Contributor

That sounds reasonable, so it would just involve checking the locations table after the task count check?

Contributor Author

Yup.

Contributor Author

Hmm actually, it may still make sense to have the check that the object is evicted. Since GCS writes are asynchronous, the locations entry might be empty even when some node does have the object.

                           std::unordered_set<ClientID> &client_ids,
                           const std::vector<ObjectTableDataT> &location_history,
                           const ray::gcs::ClientTable &client_table) {
void UpdateObjectLocations(const std::vector<ObjectTableDataT> &location_history,
Collaborator

Can we document this function? In particular, the fact that we have output arguments.

                           const ray::gcs::ClientTable &client_table) {
void UpdateObjectLocations(const std::vector<ObjectTableDataT> &location_history,
                           const ray::gcs::ClientTable &client_table,
                           std::unordered_set<ClientID> *client_ids, bool *created) {
Collaborator

Slight preference for has_been_created over created.

UpdateObjectLocations(object_id_listener_pair->second.current_object_locations,
                      location_history, gcs_client_->client_table());
UpdateObjectLocations(location_history, gcs_client_->client_table(),
                      &it->second.current_object_locations, &it->second.created);
Collaborator

I think it would be clearer to do

std::unordered_set<ClientID> current_object_locations;
UpdateObjectLocations(location_history, gcs_client_->client_table(),
                      &current_object_locations,
                      &it->second.created);
it->second.current_object_locations = current_object_locations;

The reason is that the current implementation makes it seem like the past object locations are relevant, but the prior value of it->second.current_object_locations is completely irrelevant (even though this leads to the same output).

Alternatively, it'd be ok to call

it->second.current_object_locations.clear();

first.

Collaborator

Same with the other place where we call UpdateObjectLocations

Contributor Author

I think I'm going to move to what Eric and I discussed above about just checking if the object is lost if it's a duplicate, but just to be clear, the past object locations are relevant since UpdateObjectLocations processes a subset of the log, not necessarily the whole log.

Collaborator

Hm, ok, just for my own clarification, the location_history argument in the object notification callback to the object table subscribe function contains the full history of all object table updates for that object ID, right?

If that's true, then it looks to me like UpdateObjectLocations processes the full log; am I missing something?

Contributor Author

No, it's only the new object table updates. So the first notification will contain the full history, but subsequent notifications will only contain a subset.
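These incremental-log semantics can be sketched with a toy Python model (this is an illustration, not Ray's actual C++ implementation; the entry shape `(client_id, is_eviction)` and the function name are made up for the sketch):

```python
# Toy model of UpdateObjectLocations: each notification carries only the
# *new* object table entries, so the caller's running state (client_ids,
# has_been_created) must persist across calls.

def update_object_locations(location_history, client_ids, has_been_created):
    """Apply a batch of (client_id, is_eviction) entries to the running state.

    Returns the updated (client_ids, has_been_created) pair.
    """
    for client_id, is_eviction in location_history:
        if is_eviction:
            client_ids.discard(client_id)
        else:
            client_ids.add(client_id)
            # Seeing any addition proves the object was created at some point.
            has_been_created = True
    return client_ids, has_been_created

# First notification contains the full history so far; later notifications
# contain only the new entries, so prior state is not irrelevant.
clients, created = update_object_locations([("n1", False)], set(), False)
clients, created = update_object_locations([("n1", True)], clients, created)
# The object is "lost" only if it was created before and no node has it now.
lost = created and not clients
```

This also shows why the prior value of the client set matters: a later batch may contain only an eviction entry, which is meaningless without the additions processed earlier.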

@ericl
Contributor

ericl commented Dec 11, 2018

Isn't it also possible that the eviction message was lost? Perhaps we can look for the object for a timeout before giving up?

I would also be fine raising an error on that race condition. An object being evicted locally and available on a remote node seems unlikely.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9964/

@stephanie-wang
Contributor Author

Yeah, but eventually either the eviction notice will go through or the node will be marked as dead. Either way, reconstruction will get triggered again on a timeout and eventually the task will be marked as failed.

return np.random.rand(size)

object_store_memory = 10**8
ray.worker._init(
Collaborator

This can actually be ray.init.

Collaborator

if you remove the start_ray_local arg

// Use a shared flag to make sure that we only treat the task as failed at
// most once. This flag will get deallocated once all of the object table
// lookup callbacks are fired.
auto mark_task_failed = std::make_shared<bool>(false);
Collaborator

clearer to call it task_marked_failed

void UpdateObjectLocations(const std::vector<ObjectTableDataT> &location_history,
                           const ray::gcs::ClientTable &client_table,
                           std::unordered_set<ClientID> *client_ids,
                           bool *has_been_created) {
Collaborator

This unfortunately doesn't seem to play too nicely with #3499 because sometimes we evict the keys so they will appear to have never been created. cc @ericl

Contributor Author

Hmm, that's true... I can't really think of a foolproof way around that, except to fail the object after some number of attempts.

///
/// \param task The task to potentially fail.
/// \return Void.
void TreatLostTaskAsFailed(const Task &task);
Collaborator

Maybe TreatTaskAsFailedIfLost?

ObjectManager::ObjectManager(asio::io_service &main_service,
                             const ObjectManagerConfig &config,
                             std::unique_ptr<ObjectDirectoryInterface> od)
                             std::shared_ptr<ObjectDirectoryInterface> od)
Collaborator

I think od -> object_directory

@ericl
Contributor

ericl commented Dec 11, 2018

Ok, that makes sense then!

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9968/

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9992/

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/9997/

@@ -2142,3 +2142,45 @@ def method(self):
ray.wait([object_id])

ray.get(results)


def test_actor_eviction(shutdown_only):
Contributor

This test can be simplified as follows:

1. submit a task to an actor.
2. use `ray.internal.free` to evict the object.
3. call `ray.get` on the object and assert that an error is raised. 

Contributor Author

Thanks! I'll try that.

Contributor Author

Hmm, this actually doesn't seem to work; I think it's due to asynchrony between the raylet and the object store.

Contributor

Sorry, I forgot to mention that before step 3, we need to wait until the object is really removed from the object store. You can do that by repeatedly reading the ID from object_store_client until you get an error saying the object doesn't exist.
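The polling step can be sketched with a toy in-memory store standing in for the plasma client (a self-contained illustration; in the real test the eviction would come from ray.internal.free and the polling would go through the object store client, and ToyStore/ObjectLost are made-up names):

```python
import time

class ObjectLost(Exception):
    pass

# Toy stand-in for the object store client; the real test polls plasma.
class ToyStore:
    def __init__(self):
        self.objects = {}
    def put(self, oid, value):
        self.objects[oid] = value
    def get(self, oid):
        if oid not in self.objects:
            raise ObjectLost(oid)
        return self.objects[oid]
    def free(self, oid):
        self.objects.pop(oid, None)

def wait_until_evicted(store, oid, timeout_s=1.0, poll_s=0.01):
    """Poll until a get on oid fails, i.e. the eviction has really landed."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            store.get(oid)
        except ObjectLost:
            return True
        time.sleep(poll_s)
    return False

store = ToyStore()
store.put("oid1", b"payload")             # 1. the actor task's result
store.free("oid1")                        # 2. evict it
assert wait_until_evicted(store, "oid1")  # wait before step 3
# 3. a subsequent get should now raise instead of hanging.
```

The point of the wait is exactly the asynchrony mentioned above: the free call returns before the store has actually dropped the object.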

@raulchen
Contributor

I've also implemented this feature in our internal code base. A few comments and questions:

  1. I use a new exception type UnreconstructableException to let users know that this task finished successfully before, but its results were lost and cannot be reconstructed now. It'd be useful to distinguish this case from the case where the task failed or the case where the actor died. (In all of these cases, users currently get RayGetError.) Users may want to make different decisions (ignore/retry/fail) based on the case. (BTW, I implemented this exception by writing a special value in the object's metadata.)

  2. When I mark the objects as unreconstructable, I only check that the object doesn't exist on any node now, but don't check that the object was never created before. It seems to me that the has_been_created check isn't necessary, because when CheckDuplicateActorTask fails, the object must have been created before, right?
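The metadata-sentinel idea in point 1 can be sketched like this (a toy illustration: the sentinel value, the helper name, and deserialize_result are all made up, not Ray's API):

```python
# Hypothetical sentinel written into an object's metadata when its value
# was lost and cannot be reconstructed without rolling back the actor.
UNRECONSTRUCTABLE_SENTINEL = b"UNRECONSTRUCTABLE"

class UnreconstructableException(Exception):
    """The task ran successfully before, but its result was evicted."""

def deserialize_result(metadata, data):
    # Check the metadata before touching the payload, so a lost object
    # surfaces as a distinct, catchable error rather than a generic one.
    if metadata == UNRECONSTRUCTABLE_SENTINEL:
        raise UnreconstructableException(
            "object was evicted and cannot be reconstructed")
    return data

# A caller can then choose to ignore, retry, or fail:
try:
    deserialize_result(UNRECONSTRUCTABLE_SENTINEL, None)
except UnreconstructableException:
    recovered = "resubmit the task or recompute"
```

Writing the sentinel into metadata keeps the error in-band: the object "exists" in the store, so a get returns immediately with a typed error instead of hanging.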

@stephanie-wang
Contributor Author

Thanks for the comments @raulchen.

  1. This would be super useful to do and we've actually been meaning to do that for a while now. Would you be able to open a PR for that?
  2. Because GCS writes are asynchronous, there is actually a chance that the locations of the task's output values haven't been written to the object table yet even though the task has been executed. Although now that I think about it, there is probably a way to rely on the ordering of commands per GCS connection to make sure this doesn't happen. We may have to revisit this once GCS flushing is more stable, since evicted objects will appear to have never been created.

@raulchen
Contributor

> Thanks for the comments @raulchen.
>
>   1. This would be super useful to do and we've actually been meaning to do that for a while now. Would you be able to open a PR for that?
>   2. Because GCS writes are asynchronous, there is actually a chance that the locations of the task's output values haven't been written to the object table yet even though the task has been executed. Although now that I think about it, there is probably a way to rely on the ordering of commands per GCS connection to make sure this doesn't happen. We may have to revisit this once GCS flushing is more stable, since evicted objects will appear to have never been created.

For 1: sure, I can do this after this PR is merged.
For 2: Is it possible to make sure that a node doesn't release the task lease until the object locations are written to the GCS?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/10045/

@stephanie-wang stephanie-wang merged commit fcc3702 into ray-project:master Dec 14, 2018
@stephanie-wang stephanie-wang deleted the fix-actor-eviction branch December 14, 2018 19:41
@stephanie-wang
Contributor Author

@raulchen, for 2, that is ideally how we would do it! Technically it is possible, but it's a little involved. One way to do it would require:

  1. Remembering which tasks have completed but whose objects haven't been added to the GCS yet.
  2. Adding a callback to the GCS object table write to clear the above data structure and cancel the task lease.

Do you want to open an issue for it? We might not get around to it soon but I think it's important to do, especially if GCS latency is high.
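The two steps above could look roughly like this toy sketch (made-up names throughout; the real change would live in the raylet's lease logic, and the GCS ack would arrive as an asynchronous callback):

```python
# Toy sketch of deferring lease release until the GCS write is acknowledged.

class LeaseManager:
    def __init__(self):
        # 1. Tasks that finished but whose output locations aren't in the GCS yet.
        self.pending_gcs_writes = set()
        self.active_leases = set()

    def grant_lease(self, task_id):
        self.active_leases.add(task_id)

    def on_task_complete(self, task_id):
        # Don't cancel the lease yet: the object table write is still in flight.
        self.pending_gcs_writes.add(task_id)

    def on_gcs_write_ack(self, task_id):
        # 2. GCS write callback: now it's safe to release the lease, so no
        # other node will try to reconstruct outputs it can't yet locate.
        self.pending_gcs_writes.discard(task_id)
        self.active_leases.discard(task_id)

mgr = LeaseManager()
mgr.grant_lease("t1")
mgr.on_task_complete("t1")
assert "t1" in mgr.active_leases      # lease still held: write not acked
mgr.on_gcs_write_ack("t1")
assert "t1" not in mgr.active_leases  # released only after the ack
```

The cost is that every lease now lives at least as long as one GCS round trip, which is why high GCS latency makes this change more pressing.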
