
Redis LRU eviction causes hangs with ray.wait() (but get() is fine) #3997

Closed
ericl opened this issue Feb 8, 2019 · 13 comments · Fixed by #4021

ericl (Contributor) commented Feb 8, 2019

System information

  • Ray version: 0.6.4

Describe the problem

import ray

@ray.remote
def f():
    return 0

@ray.remote
def g():
    import time
    start = time.time()
    while time.time() < start + 1:
        ray.get([f.remote() for _ in range(10)])

# 10MB -> hangs after ~5 iterations
# 20MB -> hangs after ~20 iterations
# 50MB -> hangs after ~50 iterations
ray.init(redis_max_memory=1024 * 1024 * 50)

i = 0
while True:
    i += 1
    a = g.remote()
    [ok], _ = ray.wait([a])
    print("iter", i)

Source code / logs

The above example reproducibly hangs, with the number of iterations before the hang roughly proportional to the Redis memory limit.

The expected behaviour is that once the memory limit is large enough, the loop can run forever.

Also, we shouldn't hang at all: if the Redis capacity is too small, we should raise an error instead.

ericl (Contributor, Author) commented Feb 8, 2019

cc @stephanie-wang

ericl changed the title from "Redis LRU eviction causes hangs with wait on actor tasks" to "Redis LRU eviction causes hangs with ray.wait()" on Feb 8, 2019
ericl (Contributor, Author) commented Feb 8, 2019

Note that ray.get() does not have the same problem.

ericl changed the title from "Redis LRU eviction causes hangs with ray.wait()" to "Redis LRU eviction causes hangs with ray.wait() (but get() is fine)" on Feb 8, 2019
stephanie-wang self-assigned this on Feb 8, 2019
stephanie-wang (Contributor) commented:

Weirdly, I can see the log messages for when the raylet adds the object location to the GCS. Using MONITOR on the Redis instance, I can also see that the GCS adds the key and then calls PUBLISH. However, the GCS client that is waiting for the PUBLISH notification somehow never receives it.

It could be that the key is getting evicted from redis right after the PUBLISH command. I don't really understand what the guarantees are in that case.
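
One quick way to sanity-check the publish side from outside Ray is to subscribe to the shard manually and see whether the notification arrives. A minimal sketch, assuming redis-py, with placeholder values for the shard port and channel name (not Ray's real ones):

# Hypothetical debugging sketch (not Ray code): subscribe to a channel on a
# Redis shard and print whatever arrives, to verify that PUBLISH notifications
# are actually delivered to subscribers.
import redis

r = redis.Redis(host="localhost", port=6380)   # placeholder shard port
p = r.pubsub()
p.subscribe("OBJECT_TABLE_CHANNEL")            # placeholder channel name

while True:
    msg = p.get_message(timeout=1.0)
    if msg and msg["type"] == "message":
        print("got notification:", msg["data"])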

ericl (Contributor, Author) commented Feb 8, 2019

It seems unlikely that the key is getting evicted by LRU, unless the policy is going very wrong (it would have to behave almost like MRU). What's weird is that you get a hang no matter how large Redis is; it just takes longer.

stephanie-wang (Contributor) commented:

Hmm yeah, that's true. Also, I did find that I could look up the object table entry in the GCS afterwards.

Could be some bug in how we're requesting notifications.

ericl (Contributor, Author) commented Feb 8, 2019

Could it be evicting some other important, long-lived key (e.g. the subscription keys) that isn't used for get?

stephanie-wang (Contributor) commented:

So far I can't figure out how the behavior that I'm observing relates to eviction. When I start another Redis client manually and subscribe to the object table, that client is able to receive the notification that the raylet is somehow not getting.

Also, I think the reason ray.get works is that the Python worker just polls the object store until the object becomes local. It doesn't block on the GCS (unless the cluster is multi-node and the worker has to subscribe to object locations to pull the data).
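
A purely illustrative sketch of that polling pattern (not the actual worker code; is_local is a hypothetical stand-in for the local object store check):

import time

def wait_until_local(object_id, is_local, timeout_s=10.0, poll_interval_s=0.01):
    # Poll is_local(object_id) until it returns True or we give up. In real
    # Ray the check would be a query against the local plasma store; here it
    # is an injected callable so the sketch stays self-contained.
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if is_local(object_id):
            return True
        time.sleep(poll_interval_s)
    return False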

stephanie-wang (Contributor) commented:

Okay, I just remembered that object table notifications are only sent to clients that requested notifications for that specific object, and the set of clients to notify is itself stored as a Redis entry. Most likely that key is getting evicted, so the client never receives the notification (see the sketch after the options below).

Two options:

  1. Mark those types of entries as unevictable. It would be fine to never evict them since these entries are short-lived anyway.
  2. Store the clients as a C data structure instead of directly in Redis.
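
A minimal sketch of the suspected failure mode, assuming redis-py and made-up key names (NOTIFY:* and OBJECT:* are placeholders, not the GCS's actual schema):

import redis

r = redis.Redis()

object_id = "abc123"
notify_key = "NOTIFY:" + object_id    # set of clients waiting on this object
location_key = "OBJECT:" + object_id  # object table entry

# 1. A raylet asks to be notified when the object gets a location.
r.sadd(notify_key, "raylet-1")

# 2. Under memory pressure, LRU eviction can drop the notification set
#    (simulated here with an explicit DELETE).
r.delete(notify_key)

# 3. The object location is added and the publisher notifies the waiters it
#    knows about -- but the waiter set is gone, so nobody gets the message.
r.set(location_key, "node-42")
for client in r.smembers(notify_key):  # empty after the "eviction"
    r.publish("CHANNEL:" + client.decode(), object_id)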

pcmoritz (Contributor) commented Feb 9, 2019

Ah, great find :)

It might be a bit hard to mark entries as unevictable (the only way I can see is setting expires on all the other keys, in which case Redis will evict those before it evicts keys without expires, but that might make everything slower).

So we should probably just keep track of the clients in C++ data structures.
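
For reference, a rough sketch of what the expires-based workaround could look like, assuming redis-py and a volatile-lru eviction policy (under volatile-lru Redis only evicts keys that have a TTL, so keys without one are effectively unevictable); the key names are placeholders:

import redis

r = redis.Redis()
r.config_set("maxmemory", str(50 * 1024 * 1024))
r.config_set("maxmemory-policy", "volatile-lru")

# Evictable data: give it a (long) TTL so volatile-lru is allowed to reclaim it.
r.set("OBJECT:abc123", "node-42", ex=24 * 60 * 60)

# "Unevictable" data: no TTL, so volatile-lru never touches it.
r.sadd("NOTIFY:abc123", "raylet-1")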

ericl (Contributor, Author) commented Feb 9, 2019

Ah nice! Another possible quick fix would be to move that table of clients to notify to the primary Redis shard. Is that feasible?

stephanie-wang (Contributor) commented:

> Ah nice! Another possible quick fix would be to move that table of clients to notify to the primary Redis shard. Is that feasible?

Yeah, that should work, but it's probably not preferable as a long-term solution, since this table is often on the critical path for task execution and there may be many active keys (as many as there are objects required).

robertnishihara (Collaborator) commented:

@stephanie-wang when there are clients waiting to be notified, shouldn't we also make the corresponding object table key unevictable, since we know we're going to use it in the future?

stephanie-wang (Contributor) commented Feb 11, 2019

Never mind, moving the table to the primary shard won't work since that means moving the table that you're subscribing to (the object table in this case).

Hmm, @robertnishihara, we could probably have the redis server touch the relevant key (if it exists) whenever a notification is requested.

Edit: Actually, in many cases I think it will probably just work out without touching the object table key. Usually when a client requests notifications from the object table, it's because the key is currently empty and the client wants to know when the key gets a value (i.e. the object has a location). I'm not sure how this would work out if a client needs notifications for a particular object for a long time, though.
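
A rough sketch of the touch-on-request idea, assuming redis-py and placeholder key names; TOUCH just refreshes a key's last-access time, and touching a missing key is a no-op:

import redis

r = redis.Redis()

def request_notification(object_id, client_id):
    # Record that this client wants a notification for this object ...
    r.sadd("NOTIFY:" + object_id, client_id)
    # ... and refresh the object table key's LRU clock, if the key exists.
    r.touch("OBJECT:" + object_id)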
