Possible memory leak in Ape-X #3452
Comments
I'm suspecting ape-x might be leaking object store memory unintentionally, perhaps through a view on a numpy array that's allocated in shared memory. I.e. this is an application-level error that is causing weird error messages, rather than a ray bug. This also seems to happen if the replay buffer is disabled (?) |
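For context, this is the kind of pattern that can hold object store memory unintentionally. A minimal sketch (names and shapes are illustrative, not taken from the Ape-X code): because ray.get returns numpy arrays as zero-copy views backed by plasma, keeping any view of such an array alive pins the whole shared-memory buffer.
```python
import numpy as np
import ray

ray.init()

@ray.remote
def make_batch():
    # A large array that ends up in the plasma object store.
    return np.zeros((1024, 1024, 64), dtype=np.float32)

obj_id = make_batch.remote()
batch = ray.get(obj_id)  # zero-copy: `batch` is backed by shared memory

# Keeping even a tiny slice alive pins the entire shared-memory buffer,
# so the object store cannot evict or reuse that object.
retained = [batch[:1, :1, :1]]

# An explicit copy detaches the data from the shared-memory buffer and
# lets the store reclaim the object once `batch`/`obj_id` go away.
safe = batch[:1, :1, :1].copy()
```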
@ericl and I determined that the error messages like |
I'm still looking into the mysterious ArrowIOError. It's unclear if it's related to memory usage but does deterministically happen after a few hours. |
@stephanie-wang so an object that had been evicted was needed? Do we understand why that's happening? |
Oh, I think that was expected because that version was run with 2GB of object store memory.
|
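For reference, the 2GB cap above is the kind of limit set when starting Ray; a hedged sketch (the exact ray.init signature has shifted across versions):
```python
import ray

# Cap the plasma object store at 2GB. Once the store fills up, objects are
# evicted LRU-style, and a later fetch of an evicted object has to be
# reconstructed, which is consistent with an evicted object being needed again.
ray.init(object_store_memory=2 * 1024 ** 3)
```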
The above PR doesn't entirely solve the stability issue:
Note that message id 10 is NotifyUnblocked, and also note the high memory usage. Is it possible this is some slowdown related to the very high redis memory usage? |
Actually, I think that message id corresponds to FetchOrReconstruct according to the protocol in node_manager.fbs. My first guess would have been that some large data structure is getting resized, but it's not obvious from the message handler what that data structure might be. I'm guessing you didn't see anything obvious from the debug state? |
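The numeric ids in the raylet logs index into the MessageType enum declared in node_manager.fbs, so they can be mapped back to names by reading the schema. A rough, illustrative helper (it assumes the enum members are a plain comma-separated list with no explicit `= value` assignments):
```python
import re

def message_type_names(fbs_path):
    """Return the MessageType enum members of a flatbuffers schema in
    declaration order, so a numeric message id maps to names[id]."""
    text = re.sub(r"//.*", "", open(fbs_path).read())  # drop line comments
    match = re.search(r"enum\s+MessageType\s*:\s*\w+\s*\{(.*?)\}", text, re.S)
    if match is None:
        raise ValueError("MessageType enum not found in " + fbs_path)
    return [m.strip() for m in match.group(1).split(",") if m.strip()]

# Illustrative usage (path relative to a ray checkout):
# names = message_type_names("src/ray/raylet/format/node_manager.fbs")
# print(names[10])
```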
Just the hash maps that we already know are large. Could it be related to the large redis memory size?
|
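For anyone reproducing this, one way to look at the hash map sizes mentioned above is the raylet's periodic debug-state dump; the exact location under Ray's temp directory is an assumption and has moved between versions, hence the glob:
```python
import glob

# The raylet periodically writes its internal state (including container
# sizes) to a debug_state file under Ray's temp directory; search for it
# rather than hard-coding a path.
for path in glob.glob("/tmp/ray/**/debug_state*.txt", recursive=True):
    print("==>", path)
    with open(path) as f:
        print(f.read())
```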
I guess it's possible...you could try something like |
It's not high redis memory usage. I'm running with LRU eviction on, and the redis memory is steady at 500MB. Messages of type 8 seem to take longer to process over time, though:
|
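For completeness, a hedged sketch of how the LRU-bounded redis setup mentioned above was configured in Ray versions from around this time; `redis_max_memory` is assumed to be the relevant ray.init argument and may differ by version:
```python
import ray

ray.init(
    # Cap each redis shard; once the cap is reached, redis evicts entries
    # LRU-style instead of growing without bound (~500MB matches the steady
    # size reported above). Argument name is an assumption for this Ray era.
    redis_max_memory=500 * 1024 ** 2,
    object_store_memory=2 * 1024 ** 3,
)
```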
@stephanie-wang do you think it's #3470 here? For the 4.5s delay, the hash map size was around 28 million entries. That seems about right if you assume a hash map resize can go at 5-6 million entries per second. |
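As a sanity check on the arithmetic: rehashing ~28 million entries at 5-6 million entries per second takes roughly 4.5-5.5 seconds, which matches the observed delay. A small, illustrative Python micro-benchmark that makes periodic resize stalls visible (absolute numbers will differ from the raylet's C++ hash maps, and it needs a few GB of RAM):
```python
import time

# Insert into a growing dict and record the slowest single insert per chunk.
# The spikes line up with the points where the hash table grows and rehashes
# all existing entries.
d = {}
chunk = 1_000_000
for i in range(16):
    worst = 0.0
    for k in range(i * chunk, (i + 1) * chunk):
        t0 = time.perf_counter()
        d[k] = None
        dt = time.perf_counter() - t0
        if dt > worst:
            worst = dt
    print(f"{len(d):>12,} entries, slowest insert in chunk: {worst * 1e3:.1f} ms")
```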
Closing since the issue is confirmed to just be the hash map resizing. |
Note: to re-run this case for QA in the future, use these YAML files. apex.yaml:
crash.yaml:
And this commandline:
For multiple local schedulers, run: |
System information
You can run this on any 64-core CPU machine:
crash.yaml:
Describe the problem
Source code / logs
The rest of the experiment keeps running, but the particular trial fails.