Redis LRU eviction causes hangs with ray.wait() (but get() is fine) #3997
Comments
Note that ray.get() does not have the same problem.
Weirdly, I can see the log messages for when the raylet adds the object location to the GCS. I can also see […]. It could be that the key is getting evicted from redis right after the […].
It seems unlikely that the key is getting evicted by LRU, unless the policy is going very wrong (it would have to be nearly MRU). What's weird is that you get a hang no matter how large redis is; it just takes longer.
Hmm yeah, that's true. Also, I did find that I could look up the object table entry in the GCS afterwards. Could be some bug in how we're requesting notifications.
Could it be evicting something else important and long-lived (e.g., the subscription keys) that is not used for get?
So far I can't figure out how the behavior that I'm observing relates to eviction. When I start another Redis client manually and subscribe to the object table, that client is able to receive the notification that the raylet is somehow not getting. Also, I think the reason […]
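For reference, here is a minimal sketch of the manual debugging step described above, using the redis-py client. The host, port, and the catch-all channel pattern are assumptions; Ray's actual GCS channel names are internal:

```python
import redis

# Connect to the GCS Redis shard (address is an assumption) and
# pattern-subscribe to every channel, so that object table
# notifications can be observed directly.
r = redis.StrictRedis(host="localhost", port=6379)
p = r.pubsub()
p.psubscribe("*")

for message in p.listen():
    # Notifications show up here even when the raylet misses them,
    # which localizes the bug to the raylet's subscription, not the publish.
    print(message)
```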
Okay, just remembered that object table notifications only get sent to clients who requested notifications for a specific object. And the set of clients to notify is itself stored as a Redis entry. So most likely that key is getting evicted, so the client never receives the notification. Two options: mark those entries as unevictable, or keep track of the clients to notify in the raylet's own (C++) data structures instead of in Redis.
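A rough sketch of the second option, tracking notification subscribers in process memory rather than in an evictable Redis key. All names here are illustrative, not Ray's actual internals:

```python
from collections import defaultdict

# Hypothetical in-process bookkeeping: which clients asked to be
# notified about which object. Unlike a Redis entry, this map
# cannot be evicted by the LRU policy.
subscribers = defaultdict(set)

def request_notifications(object_id, client_id):
    subscribers[object_id].add(client_id)

def on_object_location_added(object_id, location):
    # Notify exactly the clients that subscribed to this object.
    for client_id in subscribers.pop(object_id, set()):
        send_notification(client_id, object_id, location)

def send_notification(client_id, object_id, location):
    # Hypothetical transport; a real raylet would push over its own channel.
    print(f"notify {client_id}: {object_id} is at {location}")
```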
Ah, great find :) It might be a bit hard to mark entries as unevictable (the only way I see is setting expires for all other keys, in which case redis will evict them before it evicts keys without expires, but that might make everything slower). So we should probably just keep track of the client in C++ data structures.
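The expires-based trick mentioned above relies on Redis's volatile-lru policy, under which only keys with a TTL are eviction candidates. A sketch, assuming a local Redis reachable via redis-py:

```python
import redis

r = redis.StrictRedis()
# Under volatile-lru, Redis evicts only among keys that have an expire
# set, so keys without a TTL are effectively unevictable.
r.config_set("maxmemory-policy", "volatile-lru")
r.set("ordinary-key", "value", ex=3600)   # TTL set: eviction candidate
r.set("subscription-key", "clients")      # no TTL: survives eviction
```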
Ah nice! Another possible quick fix is to move that table of clients to notify to the primary redis shard. Is that feasible?
Yeah, that should work, but it's probably not preferable as a long-term solution, since it's often on the critical path for task execution and there may be many active keys (as many as there are objects required).
@stephanie-wang when there are clients waiting to be notified, shouldn't we also make the corresponding object table key unevictable, since we know we're going to use it in the future?
Never mind, moving the table to the primary shard won't work, since that means moving the table that you're subscribing to (the object table in this case). Hmm, @robertnishihara, we could probably have the redis server touch the relevant key (if it exists) whenever a notification is requested.

Edit: Actually, in many cases I think it will probably just work out without touching the object table key. Usually when a client requests notifications from the object table, it's because the key is currently empty and the client wants to know when the key has a value (i.e., the object has a location). Not sure how this would work out if a client needs notifications for a particular object for a long time, though.
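One way to implement the touch-on-subscribe idea is Redis's TOUCH command (available since Redis 3.2.1), which refreshes a key's LRU recency without reading its value. A sketch with a hypothetical key name:

```python
import redis

r = redis.StrictRedis()

def request_notifications(object_table_key):
    # Refresh the key's LRU clock so it is not the next eviction victim
    # while a client is waiting on it. TOUCH is a no-op if the key
    # does not exist.
    r.touch(object_table_key)
    # ... then register the subscription as before ...
```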
System information
Describe the problem
Source code / logs
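The original repro script is not preserved in this capture; below is a minimal sketch of the kind of loop described, under the assumption of a Redis-backed GCS with LRU eviction (the task body and object sizes are illustrative):

```python
import numpy as np
import ray

ray.init()

@ray.remote
def f(i):
    # Return a large-ish object so the GCS fills up quickly.
    return np.zeros(10**6)

# Each iteration adds object table entries to Redis. Once LRU eviction
# kicks in, ray.wait() eventually hangs waiting for a notification whose
# subscription key was evicted.
for i in range(100000):
    oid = f.remote(i)
    ready, _ = ray.wait([oid])
```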
The above example will reproducibly hang, with the number of iterations before the hang proportional to the redis memory size.
The expected behaviour is that once the memory size is large enough, it can run forever.
Also, we shouldn't hang at all; we should throw an error if the redis capacity is too small.