Throw exception for `ray.get` of an evicted actor object (#3490)
Conversation
Test FAILed.
src/ray/raylet/node_manager.cc
Outdated
if (task.GetTaskSpecification().IsActorTask()) {
  // Actor reconstruction is turned off by default right now.
  const ActorID actor_id = task.GetTaskSpecification().ActorId();
  auto it = actor_registry_.find(actor_id);
  RAY_CHECK(it != actor_registry_.end());
  if (it->second.IsAlive()) {
    // Only treat the task as failed if its output has been evicted.
    // Otherwise, this must be a spurious reconstruction.
    if (return_values_lost) {
What would happen if we unconditionally treated the task as failed? It seems like it would simplify the code a lot to not have to track whether the return values were lost.
The code used to do what you're saying, but it was causing the spurious `RayGetError`s that we fixed in #3359.
One option that I was thinking we could do is to check whether the task has already been executed using the task counters for the actor, then treat the task as failed if that check passes. To be safe, we could also check whether the return values exist on any nodes, but we wouldn't need the extra logic in this PR to check whether the return values were evicted. What do you think?
That sounds reasonable, so it would just involve checking the locations table after the task count check?
Yup.
Hmm actually, it may still make sense to have the check that the object is evicted. Since GCS writes are asynchronous, the locations entry might be empty even when some node does have the object.
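The decision procedure discussed above can be sketched as follows. This is a hypothetical Python sketch, not the actual C++ raylet code; the names (`task_counter`, `executed_up_to`, etc.) are illustrative stand-ins for the actor's task counters and the GCS locations table.

```python
def should_treat_as_failed(task_counter, executed_up_to,
                           object_locations, return_values_evicted):
    """Decide whether a reconstructed actor task should be marked failed.

    1. If the task has not executed yet (per the actor's task counter),
       this is a normal (re)submission, not a failure.
    2. If it has executed but some node still holds the return values,
       the reconstruction is spurious.
    3. Only if the values were explicitly evicted do we fail the task.
       GCS writes are asynchronous, so an empty location set alone is
       not proof of loss; we also require an eviction record.
    """
    if task_counter >= executed_up_to:
        return False  # not executed yet; nothing to fail
    if object_locations:
        return False  # a node still has the output; spurious reconstruction
    return return_values_evicted
```

Note how step 3 encodes the race condition mentioned above: an empty locations entry is necessary but not sufficient, which is why the eviction check survives the simplification.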
-                           std::unordered_set<ClientID> &client_ids,
-                           const std::vector<ObjectTableDataT> &location_history,
-                           const ray::gcs::ClientTable &client_table) {
+void UpdateObjectLocations(const std::vector<ObjectTableDataT> &location_history,
Can we document this function? In particular, the fact that we have output arguments.
-                           const ray::gcs::ClientTable &client_table) {
+void UpdateObjectLocations(const std::vector<ObjectTableDataT> &location_history,
+                           const ray::gcs::ClientTable &client_table,
+                           std::unordered_set<ClientID> *client_ids, bool *created) {
Slight preference for `has_been_created` over `created`.
-  UpdateObjectLocations(object_id_listener_pair->second.current_object_locations,
-                        location_history, gcs_client_->client_table());
+  UpdateObjectLocations(location_history, gcs_client_->client_table(),
+                        &it->second.current_object_locations, &it->second.created);
I think it would be clearer to do
std::unordered_set<ClientID> current_object_locations;
UpdateObjectLocations(location_history, gcs_client_->client_table(),
                      &current_object_locations,
                      &it->second.created);
it->second.current_object_locations = current_object_locations;
The reason is that the current implementation makes it seem like the past object locations are relevant, when the prior value of `it->second.current_object_locations` is completely irrelevant (even though this leads to the same output).
Alternatively, it'd be ok to call `it->second.current_object_locations.clear();` first.
Same with the other place where we call `UpdateObjectLocations`.
I think I'm going to move to what Eric and I discussed above about just checking if the object is lost if it's a duplicate. But just to be clear, the past object locations are relevant, since `UpdateObjectLocations` processes a subset of the log, not necessarily the whole log.
Hm, ok, just for my own clarification: the `location_history` argument in the object notification callback to the object table subscribe function contains the full history of all object table updates for that object ID, right? If that's true, then it looks to me like `UpdateObjectLocations` processes the full log. Am I missing something?
No, it's only the new object table updates. So the first notification will contain the full history, but subsequent notifications will only contain a subset.
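The incremental-log behavior described here can be sketched with a toy version of the update routine. This is an illustrative Python sketch, not the actual C++ signature: each call applies only the *new* log entries, mutating the running location set and a creation flag in place (the output-argument style the review discusses).

```python
def update_object_locations(location_updates, client_ids, state):
    """Apply a batch of object table log entries to the running location set.

    `location_updates` contains only the new entries since the last
    notification (the first notification carries the full history).
    `client_ids` is an output set mutated in place; `state` records
    whether the object has ever been created anywhere.
    """
    for entry in location_updates:
        if entry["is_eviction"]:
            # An eviction entry removes that node from the location set.
            client_ids.discard(entry["client_id"])
        else:
            # An addition entry records a new holder of the object.
            client_ids.add(entry["client_id"])
            state["has_been_created"] = True
```

This is why the past object locations matter: the second notification's batch only makes sense applied on top of the set built from the first.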
Isn't it also possible that the eviction message was lost? Perhaps we can keep looking for the object until a timeout before giving up? I would also be fine raising an error on that race condition. An object being evicted locally while available on a remote node seems unlikely.
Test FAILed.
Yeah, but eventually either the eviction notice will go through or the node will be marked as dead. Either way, reconstruction will get triggered again on a timeout and eventually the task will be marked as failed.
test/actor_test.py
Outdated
    return np.random.rand(size)

object_store_memory = 10**8
ray.worker._init(
This can actually be `ray.init`.
if you remove the `start_ray_local` arg.
src/ray/raylet/node_manager.cc
Outdated
// Use a shared flag to make sure that we only treat the task as failed at
// most once. This flag will get deallocated once all of the object table
// lookup callbacks are fired.
auto mark_task_failed = std::make_shared<bool>(false);
clearer to call it task_marked_failed
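The shared-flag pattern in the snippet above can be illustrated in Python with a closure over mutable state. This is a hypothetical sketch of the mark-at-most-once idea, not Ray's actual code; the dict plays the role of the `std::shared_ptr<bool>` that outlives the individual lookup callbacks.

```python
def make_failure_marker():
    """Return a callback safe to hand to several object table lookups.

    Each lookup callback may independently discover that the object is
    lost, but the task must be treated as failed at most once. The shared
    `state` dict is captured by every callback instance, like the
    shared_ptr<bool> in the C++ version.
    """
    state = {"task_marked_failed": False, "failures": 0}

    def on_lookup_done(object_lost):
        if object_lost and not state["task_marked_failed"]:
            state["task_marked_failed"] = True
            state["failures"] += 1  # stand-in for TreatTaskAsFailed(task)

    return on_lookup_done, state
```

With this shape, firing the callback once per return value still marks the task failed exactly once.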
void UpdateObjectLocations(const std::vector<ObjectTableDataT> &location_history,
                           const ray::gcs::ClientTable &client_table,
                           std::unordered_set<ClientID> *client_ids,
                           bool *has_been_created) {
Hmm that's true...I can't really think of a foolproof way around that except to fail the object after some number of attempts.
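The "fail after some number of attempts" idea can be sketched as a bounded retry loop. This is a hypothetical illustration, not Ray's reconstruction code; `lookup_locations` stands in for an asynchronous GCS locations query, modeled here as a synchronous call per attempt.

```python
def reconstruct_with_retries(lookup_locations, max_attempts):
    """Retry the location lookup until the object shows up somewhere,
    giving up and reporting failure after `max_attempts` empty lookups.

    `lookup_locations(attempt)` returns the (possibly empty) set of nodes
    currently holding the object, per the GCS locations table.
    """
    for attempt in range(max_attempts):
        locations = lookup_locations(attempt)
        if locations:
            return "found"   # a node has the object; reconstruction is spurious
    return "failed"          # exhausted retries; treat the task as failed
```

Bounding the attempts trades a small chance of a false failure (a slow GCS write) for a guarantee that `ray.get` eventually raises instead of hanging.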
src/ray/raylet/node_manager.h
Outdated
///
/// \param task The task to potentially fail.
/// \return Void.
void TreatLostTaskAsFailed(const Task &task);
Maybe `TreatTaskAsFailedIfLost`?
 ObjectManager::ObjectManager(asio::io_service &main_service,
                              const ObjectManagerConfig &config,
-                             std::unique_ptr<ObjectDirectoryInterface> od)
+                             std::shared_ptr<ObjectDirectoryInterface> od)
I think `od` -> `object_directory`.
Ok, that makes sense then!
Test PASSed.
Test FAILed.
Test PASSed.
@@ -2142,3 +2142,45 @@ def method(self):
    ray.wait([object_id])

    ray.get(results)


def test_actor_eviction(shutdown_only):
this test can be simplified as the following:
1. submit a task to an actor.
2. use `ray.internal.free` to evict the object.
3. call `ray.get` on the object and assert that an error is raised.
Thanks! I'll try that.
Hmm this actually doesn't seem to work, I think due to asynchrony between the raylet and the object store.
Sorry, I forgot to mention that before step 3, we need to wait until the object is really removed from the object store. You can do that by repeatedly reading the ID from `object_store_client` until you get an error saying the object doesn't exist.
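The wait-for-eviction step can be sketched as a polling loop. This is a hypothetical helper, not part of Ray's test utilities; `store_contains` stands in for a plasma client presence check (e.g. a `contains()`-style call), and the timeout guards against the eviction never landing.

```python
import time

def wait_for_eviction(store_contains, object_id, timeout_s=10.0, poll_s=0.1):
    """Poll the object store until `object_id` is gone.

    Returns True once the object is absent, or False if it is still
    present after `timeout_s`. Needed because the raylet and the object
    store are asynchronous: ray.internal.free returns before the object
    is actually removed from the store.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not store_contains(object_id):
            return True  # eviction has taken effect
        time.sleep(poll_s)
    return False
```

Only after this returns True would step 3 (`ray.get` raising an error) be deterministic.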
I've also implemented this feature in our internal code base. A few comments and questions:
Thanks for the comments @raulchen.
For 1: sure, I can do this after this PR is merged.
Test FAILed.
@raulchen, for 2, that is ideally how we would do it! Technically it is possible, but it's a little involved. One way to do it would require:
Do you want to open an issue for it? We might not get around to it soon, but I think it's important to do, especially if GCS latency is high.
What do these changes do?
Since actors have state, objects created by an earlier actor method that have been evicted cannot be reconstructed without rolling back the actor. This PR treats such tasks as failed so that the frontend can catch the error, instead of hanging.
Related issue number
#3452 and potentially others that involve actors and a limited amount of object store memory.